1,001 cell line panel expression and methylation datasets
Expression and methylation datasets from the cell line panel are available at https://www.cancerrxgene.org/gdsc1000/ and were processed previously (Iorio et al., 2016). The available expression dataset was generated using the Robust Multi-Array Average (RMA) method (Iorio et al., 2016) and was normalized gene-wise here to enable analysis at the individual sample level and to overcome the lack of transcriptional data from matched normal samples. The probability distribution $P_g$ describing the expression of a given gene g across the cell lines was estimated using a non-parametric Gaussian kernel estimator. Each expression value $x_{g,l}$ (of gene g in cell line l) was assigned a normalized expression score equal to
where $\mathrm{CDF}_g(x)$ is the value assumed by the cumulative distribution of the expression of gene g at x.
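As an illustration of the kernel-based estimate of $\mathrm{CDF}_g(x)$, the following minimal Python sketch averages Gaussian kernel CDFs centred on the observed expression values of one gene. The function name, the example values, and the bandwidth choice (a normal-reference rule of thumb) are assumptions made for illustration; the text does not state the bandwidth used, and the sketch stops at the CDF value rather than the final normalized score.

```python
import math
import statistics


def gaussian_kde_cdf(values, x, bandwidth=None):
    """Estimate CDF_g(x) from a Gaussian kernel density fitted to `values`."""
    n = len(values)
    if bandwidth is None:
        # Assumption: normal-reference rule of thumb (1.06 * sd * n^(-1/5));
        # the bandwidth actually used in the analysis is not stated.
        bandwidth = 1.06 * statistics.stdev(values) * n ** (-1 / 5)
    # The CDF of a Gaussian KDE is the mean of the Gaussian CDFs
    # centred on each observed value.
    total = 0.0
    for v in values:
        z = (x - v) / bandwidth
        total += 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return total / n


# Hypothetical RMA expression values of one gene across six cell lines.
expression = [5.1, 5.4, 6.0, 6.2, 7.8, 9.3]
cdf_values = [gaussian_kde_cdf(expression, x) for x in expression]
```

Each entry of `cdf_values` lies strictly between 0 and 1 and increases with the expression value, so it ranks a cell line against the panel-wide distribution of that gene without requiring a matched normal sample.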
QUANTIFICATION AND STATISTICAL ANALYSIS
Statistical analysis was performed using the indicated algorithms and tests. Graphics were produced using R version 3.2.2 (R: A language and environment for statistical computing; R Foundation for Statistical Computing, Vienna, Austria).
Mutational Signatures Analysis
Mutational signatures identified in ICGC PCAWG Platinum release
The set of mutational signatures annotated across cell line and PDX datasets was extracted across 2,709 primary human cancers, as part of the Platinum version of the ICGC PCAWG release (Alexandrov et al., 2018). The 96-channel mutational catalogs of the corresponding primary cancers were provided by the PCAWG ICGC Mutational Signatures group (Table S3).
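For readers unfamiliar with the 96-channel representation, the sketch below shows how such a catalog can be assembled: six substitution classes, written from the pyrimidine of the mutated base pair, combined with the four possible 5' and four possible 3' flanking bases. The helper names and the example mutations are hypothetical and are not taken from the PCAWG pipeline.

```python
from collections import Counter
from itertools import product

# Six substitution classes, referenced to the pyrimidine of the base pair.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# 6 substitutions x 4 five-prime bases x 4 three-prime bases = 96 channels,
# labelled like "A[C>T]G".
CHANNELS = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five, three in product(BASES, repeat=2)
]


def channel(five, ref, alt, three):
    """Map a substitution with flanking bases to its pyrimidine-centred label."""
    if ref in "AG":
        # Purine reference: describe the mutation on the opposite strand
        # by taking the reverse complement of the trinucleotide context.
        five, ref, alt, three = (
            COMPLEMENT[three], COMPLEMENT[ref], COMPLEMENT[alt], COMPLEMENT[five]
        )
    return f"{five}[{ref}>{alt}]{three}"


def catalog(mutations):
    """Count mutations per channel; the result is ordered like CHANNELS."""
    counts = Counter(channel(*m) for m in mutations)
    return [counts[c] for c in CHANNELS]


# Hypothetical mutations as (5' base, reference, alternate, 3' base) tuples.
sample = [("A", "C", "T", "G"), ("C", "G", "A", "T"), ("T", "T", "G", "A")]
counts = catalog(sample)
```

In this toy input the second mutation (G>A in a CGT context) is the reverse complement of the first, so both are counted in the same channel. Stacking one such 96-element vector per sample gives a matrix with one row per mutation type and one column per sample.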
The computational framework for identification of mutational signatures across the Platinum PCAWG dataset incorporated two independent and distinct steps, termed SigProfiler (v. 2.1) and SigProfilerSingleSample (v. 1.2) (Alexandrov et al., 2018), based on previously developed methodologies (Alexandrov et al., 2015; Alexandrov et al., 2013b; Nik-Zainal et al., 2016). The code for both tools is freely available and can be downloaded from: https://www.mathworks.com/matlabcentral/fileexchange/38724-sigprofiler. The first step (SigProfiler) encompasses a hierarchical de novo extraction of mutational signatures based on somatic mutations and their immediate sequence context, while the second step (SigProfilerSingleSample) estimates the numbers of somatic mutations in an individual sample associated with a given set of mutational signatures. Numerical and graphical patterns of the 48 signatures in the Platinum PCAWG set, including 9 signatures attributed to technology-associated artifacts (termed ‘R1-9’ signatures), are provided in Figure S1 and/or Table S1. Table S3 provides the estimated numbers of somatic mutations associated with these mutational signatures in 2,709 cancer samples.
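The idea behind the second step, assigning the mutations of one sample to a fixed set of signatures, can be illustrated with a deliberately simplified Python sketch. It is not the SigProfilerSingleSample procedure: it assumes a Kullback-Leibler-type objective, holds the signatures fixed, and applies a standard multiplicative update to the per-signature mutation counts. The function name, the two four-channel signatures, and the sample counts are all hypothetical.

```python
def attribute_mutations(catalog, signatures, iterations=200):
    """Estimate mutations per signature for one sample, signatures held fixed.

    catalog: mutation counts per channel (length K).
    signatures: N signatures, each a length-K list of probabilities summing to 1.
    """
    n_channels = len(catalog)
    n_signatures = len(signatures)
    # Start from an even split of the sample's mutations across signatures.
    exposures = [sum(catalog) / n_signatures] * n_signatures
    for _ in range(iterations):
        # Catalog reconstructed from the current exposure estimates.
        fitted = [
            sum(sig[i] * e for sig, e in zip(signatures, exposures))
            for i in range(n_channels)
        ]
        # Multiplicative update for the generalized Kullback-Leibler divergence
        # (assumed objective); the denominator is 1 because each signature sums to 1.
        exposures = [
            e * sum(
                sig[i] * catalog[i] / fitted[i]
                for i in range(n_channels)
                if fitted[i] > 0
            )
            for sig, e in zip(signatures, exposures)
        ]
    return exposures


# Hypothetical toy problem with 4 channels instead of 96.
signatures = [
    [0.7, 0.1, 0.1, 0.1],  # toy signature A
    [0.1, 0.1, 0.1, 0.7],  # toy signature B
]
# Counts constructed as 30 x signature A + 70 x signature B.
sample_catalog = [28, 10, 10, 52]
exposures = attribute_mutations(sample_catalog, signatures)
```

Because the toy catalog was built as an exact mixture of the two signatures, the estimates should come back close to 30 and 70, and they always sum to the total number of mutations in the sample. A real attribution also has to decide which signatures to include for each sample, which this sketch does not attempt.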
Framework for analysis of mutational signatures on cell line and PDX datasets
Mutational signatures were annotated on cell line and PDX datasets using SigProfiler (v. 2.1) and SigProfilerSingleSample (v. 1.2), modified as described below.
SigProfiler hierarchical de novo extraction of mutational signatures
SigProfiler was first used for de novo discovery of mutational signatures across five separate datasets, including 96-channel mutational catalogs (Table S3) from (1) exome sequences from 1,001 human cancer cell lines, (2) exome sequences from 577 PDX models and 25 of the available originating tumors, (3) exome sequences from 63 cell line clones, (4) whole-genome sequences from 136 cell line clones and (5) whole-genome sequences from 36 single cells.
For a given set of mutational catalogs, the previously developed algorithm (Alexandrov et al., 2013b) was applied in a hierarchical manner to an input matrix $M \in \mathbb{R}_{+}^{K \times G}$ of non-negative natural numbers with dimension $K \times G$, where K reflects the number of mutation types and G corresponds to the number of samples. The algorithm first deciphers the minimal set of mutational signatures that optimally explains the proportion of each mutation type and then estimates the contribution of each signature across the samples. More specifically, the algorithm makes use of a well-known blind source separation technique, termed nonnegative matrix factorization (NMF). NMF identifies the matrix of mutational signatures, $P \in \mathbb{R}_{+}^{K \times N}$, and the matrix of the activities of these signatures,
$E \in \mathbb{R}_{+}^{N \times G}$. Identification of the unknown number of signatures, N, is based on the robustness of the overall solution; the methodology has been previously described (Alexandrov et al., 2013b). The identification of P and E is done by minimizing the generalized