Integrative analysis of a cancer somatic mutome
© Hernández et al. 2007
Received: 04 December 2006
Accepted: 05 February 2007
Published: 05 February 2007
The consecutive acquisition of genetic alterations characterizes neoplastic processes. As a consequence of these alterations, molecular interactions are reprogrammed in the context of highly connected and regulated cellular networks. The recent identification of the collection of somatically mutated genes in breast tumors (breast cancer somatic "mutome") allows the comprehensive study of its function and organization in complex networks.
We analyzed functional genomic data (loss of heterozygosity, copy number variation and gene expression in breast tumors) and protein binary interactions from public repositories to identify potential novel components of neoplastic processes, the functional relationships between them, and to examine their coordinated function in breast cancer pathogenesis. This analysis identified candidate tumor suppressors and oncogenes, and new genes whose expression level predicts survival rate in breast cancer patients. Mutome network modeling using different types of pathological and healthy functional relationships unveils functional modules significantly enriched in genes or proteins (genes/proteins) with related biological process Gene Ontology terms and containing known breast cancer-related genes/proteins.
This study presents a comprehensive analysis of the breast somatic mutome, highlighting those genes with a higher probability of playing a determinant role in tumorigenesis and better defining molecular interactions related to the neoplastic process.
Recent landmark work has described the genetic landscape of the breast and colorectal cancer genomes by identifying the collection of somatically mutated genes (cancer somatic mutome) that contributes to the neoplastic process in these cancer types . Most of these genes were not previously identified as linked to human cancer and some of them encode uncharacterized proteins. A larger set of "passenger" mutations or mutations present at a frequency that is too low to determine their relationship with cancer were also identified, prompting further genetic and molecular characterization.
Most biological processes involve groups of genes and proteins that behave in a coordinated way to perform a cellular function . The coordinated task of genes/proteins can be represented by different types of functional relationships (e.g. gene co-expression, genetic interactions, protein binary interactions, protein complex membership) . Network modeling has been used to predict new gene/protein functions and to define pathway components or modulators of particular processes [reviewed in [4–6]]. The application of similar approaches has also identified new genes responsible for human diseases [7, 8].
Defining biological processes at the systems-level will help to understand cancer cellular networks. The application of an integrative "omic" approach to the breast cancer somatic mutome is encouraged by the identification of uncharacterized genes/proteins and because the complete wiring diagram of functional associations has yet to be determined. The aim of this study is therefore to comprehensively describe the status of candidate breast cancer tumor suppressors and oncogenes at different molecular levels (from gene to protein), to predict new functional relationships between them and to provide new hypotheses regarding their coordinated molecular function in the neoplastic process. This study is focused on the somatic mutome described by Sjoblom et al. , which contains validated (contributing to the neoplastic process) and non-validated (i.e. harboring putative "passenger" mutations or mutations present at a frequency that is too low to determine their relationship with the neoplastic process) gene sets (total 672), combined with previously known somatically mutated breast cancer genes compiled in the COSMIC database .
Loss of heterozigosity analysis
To investigate the role of somatically mutated breast cancer genes as classical tumor suppressors or oncogenes, we first examined genomic loss of heterozigosity (LOH) using a whole-genome SNP genotyping data set . This data set has a resolution of one SNP every ~210 genomic kilo-bases and contains information from 42 breast tumors (20 non basal-like, 18 basal-like and 4 BRCA1 tumors) and matched healthy breast tissue samples.
When all breast tumors were considered, mutated genes in the validated set showed LOH ranging from 4% to a maximum of 76% (TP53)(Additional file 1). As was expected, other genes showing relatively high percentages of LOH in breast tumors were BRCA1 (52%) and MRE11A (50%). Remarkably, of the validated genes only CDH5 was previously described in detail as showing LOH , which might be explained by the unbiased approach used to identify the breast cancer somatic mutome, or by the inexistence of LOH as a second-hit genetic mechanism common to this set of genes. The detection of ~33% of LOH at the TMPRSS6 locus supports its role as a tumor suppressor suggested by a previous observation that TMPRSS6 nucleotide variants conferred a risk of breast cancer . However, LOH should be interpreted with caution as it shows a high correlation with chromosome location (e.g. complete LOH of chromosome 17). LOH results do not significantly vary between basal-like and non basal-like tumor subtypes except for the isodisomy of chromosomes 14, 17 and X .
Copy number analysis
Using the same data set described above, genes in the validated set showed copy numbers (CNs) ranging from 1.60 to 3.37 across basal-like and non basal-like tumors (Additional file 2). As expected for tumors with relatively higher levels of genomic instability, broader margins of CN variation were observed in BRCA1 tumors, ranging from 0.57 to 3.82. Examination of gene expression and critical regions with CN > 2 identified nine candidate oncogenes (Fig. 1C, CN > 2 column and up-regulated genes in tumors). Notably, one of these genes, GAB1, was previously suggested to act as an oncogene in cellular transformation . CN analysis also identified critical regions of genomic loss that were not evident in the LOH analysis, such as the SORL1 and TECTA loci that showed loss and expression down-regulation particularly in basal-like tumors (Fig. 1B and 1C). Thus, eight additional genes showed CN < 2 in a critical region and concordant down-regulation in tumors, which suggests their role as tumor suppressors (Fig. 1C, CN < 2 column and down-regulated genes in tumors).
In addition to the particular genes mentioned above, the correlation of LOH, CN and expression data identified four concordant gene clusters (i.e. close located loci). First, the amplification and over-expression of ABCB10 and NUP133 genes at chromosome 1 in basal-like and luminal A and B tumors. Remarkably, the amplification of ATP-binding cassette (ABC) transporter genes is commonly found in cancer cell lines as a probable mechanism of drug resistance  and nuclear pore (NUP) subunits have been found over-expressed in breast tumors . Second, the loss and down-regulation of COL7A1, DNASE1L3, FLNB and RNU3IP2 at chromosome 3, particularly in basal-like and luminal B tumors. Third, the loss and down-regulation of AEGP, GSN, NUP214 and SPTAN1 at chromosome 9, particularly in luminal A and B tumors. Finally, the loss and down-regulation of SORL1 and TECTA at chromosome 11, particularly in basal-like tumors. These genomic mutome clusters suggest that, in addition to point mutations, large-scale alterations of these regions might constitute a mechanism contributing to the neoplastic process.
An examination of binary protein interactions also highlights the possible need for more detailed mutational analyses of specific cellular components. Thus, an association between the breast and colorectal mutomes identified by Sjoblom et al.  is revealed by examining interactions between proteins of the extracellular matrix and cytoskeleton functional module (Fig. 4B). In this module, four out of nine proteins included were found to be mutated in breast tumors and three were found to be mutated in colorectal tumors by Sjoblom et al. .
Next, we investigated the existence of coordinated molecular tasks by examining the level of connectivity between mutome gene products in the interactome network. We compared the size (number of nodes and edges) of the largest component generated by direct interactions between mutome validated proteins and compared it to equivalent randomly selected sets of 100 proteins. The results showed that mutome gene products are highly connected, more so than expected by chance (interactions/node, empirical P value < 0.05), thus supporting the theory that they are involved in related molecular pathways or functions. However, this observation is partially dependent on the presence of p53 and BRCA1, which exhibit extremely high connectivity. Without taking into account p53 and BRCA1, the level of connectivity of the validated mutome is still moderately high with respect to equivalent, randomly selected protein sets (empirical P value < 0.15). These results suggest greater centrality of the breast somatic mutome proteins and are consistent with earlier observations involving previously known human cancer proteins .
When non-validated gene products are included in the interactome analysis, a large component with 127 edges and 94 nodes is revealed (Fig. 5B). Eight non-validated gene products occupy critical positions in this component, connecting validated and/or benchmark proteins: BCAR1 (breast cancer anti-estrogen resistance 1) links ADAM12 and GSN, therefore mediating extracellular matrix and cytoskeleton remodeling; and three gene products show a high degree of connectivity (between 5–10 interactions; PIK3R1, PLCG1 and POU2F1), which suggests a central role in the transmission of molecular information within this component. PIK3R1 and PLCG1 are involved in intracellular signaling cascades and their differential regulation is known to be involved in tumorigenesis [28, 29], while POU2F1 interacts with several known breast cancer-associated proteins (i.e. BRCA1, BARD1 and PARP1) [30, 31]. Together, these observations suggest a coordinated function between validated and non-validated gene products in the breast cancer neoplastic process.
Clustering analysis has previously proved useful for the identification of functionally related genes or proteins . To further examine the higher-level organization of the breast cancer mutome, we identified densely interconnected regions of the interactome harboring a higher proportion of mutome gene products than expected by chance. One such cluster shows enrichment in functional annotations of the TGF-beta and insulin signaling pathways as well as DNA transcriptional activity (Fig. 5C, cluster A). Another cluster shows enrichment for centrosome-related tasks and DNA transcriptional activity (Fig. 5C, cluster B). Cluster enrichment therefore points to known critical functional modules involved in breast tumorigenesis.
Mutome network modeling
Cluster analysis of this network identifies underlying molecular mechanisms of breast cancer. Analysis of densely connected sub-graphs and their GO terms identified functional modules enriched for apoptosis, cell division, cell differentiation, G-protein coupled receptor protein signaling pathway, intracellular signaling cascade, regulation of transcription, regulation of translation and signaling transduction (Fig. 6). Some benchmark genes/proteins can be located in these modules, supporting their role in the neoplastic process. These observations support the theory that the network modeled here represents a framework for a more in-depth experimental study of genes/proteins related to breast cancer somatic alterations.
Although issues of specificity and sensitivity in the detection of the mutome will probably be addressed in the future, particularly regarding germline genomic CN variation  and the likelihood of detecting sequence changes as presented by Sjoblom et al. , by examining functional genomic (LOH, CN and gene expression) data in breast tumors, this study supports newly identified tumor suppressors and oncogenes. Through the examination of protein binary interactions, this study further provides new hypotheses regarding the functional associations of these gene products. Finally, the integration of pathological and healthy functional relationships generated a mutome network model that provides a framework for studying the coordinated molecular function of mutome genes/proteins.
The apparent discrepancy between cancer genomic and expression changes for some genes, such as genomic CN > 2 and expression down-regulation, is not exceptional and has been observed previously . Autoregulation of gene expression, dosage compensation, epistatic modifications, or merely issues such as the sensitivity and specificity of LOH/CN and expression analyses can explain these apparent discrepancies. As is to be expected, the proportion of down-regulated genes is higher in CN < 2 than in CN > 2 regions, while the proportion of up-regulated genes is higher in CN > 2 than in CN < 2 regions (Fig. 1C). Nonetheless, experimental investigation of these genes/proteins is required to demonstrate their role as tumor suppressors or oncogenes.
The integrative study also serves as an indication of new prognosis markers. For the mutome genes, the integrative analysis of genomic copy number and expression data strongly indicates that DBN1 is a candidate oncogene that, when highly expressed in tumors with respect to healthy tissues, predicts poor survival in breast cancer patients (Fig. 3). Low expression ratios of ABCA3 and low or medium expression ratios of SPTAN1 may also predict poor survival. ABCA3 was previously identified as an ER-regulated gene , which supports its involvement in breast tumorigenesis, and SPTAN1 was involved in chemotherapy resistance in ovarian cancer , which makes this gene a potential target for cancer treatment. Finally, the interactome analysis of molecular pathways provides new hypotheses for the identification of genes potentially associated with survival outcome. SPTAN1 interacts with GRIND2 and SLC9A2, both of which interact with the product of the ABL1 proto-oncogene. Activated ABL1 kinase promotes invasion of breast cancer cells . Since low expression ratios of SPTAN1 predict poor survival, SPTAN1 could therefore act as a negative regulator of ABL1 activity.
The integration of omic data highlights likely functional candidates of a particular biological process with increased confidence [7, 38]. The strategy used here is applicable to other cancer types and would help to identify new tumor suppressor genes and oncogenes and the wiring diagram of functional interactions between them. The analysis of the breast cancer somatic mutome indicates that at least a few of the genes identified by Sjoblom et al.  play a key role in the breast cancer neoplastic process. These results will help to focus subsequent experimental characterizations on key gene/protein candidates.
We have presented the first comprehensive omic analysis of a cancer somatic mutome. Our analysis supports the theory that a few of these genes play a key role in the breast cancer neoplastic process. This study also provides new hypotheses for the coordinated function of these genes/proteins as tumor suppressors or oncogenes. Network modeling identifies hundreds of new potential pathological associations between the cancer genes/proteins studied. Extensive future research will be carried out by different groups focusing on each of the candidate genes highlighted by Sjoblom et al. . Our study provides a possible framework for the appropriate initial categorization of these genes.
Genomic data analysis
To analyze LOH and CN alterations in breast tumors, we used the Gene Expression Omnibus (GEO) record GSE3743 . Data were normalized and modelled using dChip software . LOH and CN were obtained after mapping genes in build 35.1 of the NCBI human genome sequence. For each gene and sample we took the closest SNPs to infer LOH and CN. If there was a mismatch in LOH calling for surrounding SNPs, the gene was left as missing for that particular sample. LOH profile correlation and confidence intervals (CI) were computed using Cohen's kappa coefficient of agreement, suitable for categorical data. We then classified genes as showing similar profiling if the lower limit of the CI was greater than 0.6. PCC was used to assess CN profile correlations, setting 0.6 as the lower cut-off. To determine the level of correlation between gene expression and genomic CN variation, we used PCC and FDR adjusted P values. All these analyses were performed using the R statistical software package .
Gene expression data analysis
Breast cancer gene expression was analyzed using two large data sets [10, 13]. Data from Richardson et al.  was down-loaded from the GEO record GSE3744 and analyzed using the limma and affy packages in R. Background correction, normalization and averaging of expression values were computed using the RMA algorithm . Differentially expressed genes were detected after computing an empirical Bayes moderated t-statistic and P values adjusted by a FDR of 5%. Data obtained from Hu et al.  was previously normalized and analyzed using the t-test. To evaluate co-expression, we used the data set of van 't Veer et al. , calculated PCCs and significance levels based on the t-distribution. A hierarchical algorithm was used to cluster genes, taking as distance the absolute value of 1-PCC. To evaluate prognosis, we used the Hu et al. data set  and fitted a Cox regression model to each gene using the overall survival information. An adjusted model taking into account tumor grade and ER status was also fitted for each gene. Likelihood ratio tests were used to evaluate the effect of gene expression on survival. For genes that appeared significant in both models, expression was categorized into tertiles using Kaplan-Meier curves. For these genes, the (non-parametric) log-rank test was calculated. The replica data set used for survival analysis was that of Chang et al. .
Human interactome network and clustering analyses
The human interactome network was built by combining three previously published data sets, which mainly represent experimentally-verified interactions [22–24]. The Gandhi et al.  data set contains compiled and filtered protein binary interactions from all currently available databases (HPRD, BIND, DIP, MINT, INTACT and MIPS). High-confidence yeast two-hybrid interactions from Rual et al.  and Stelzl et al.  were then included. After removing common interactions between the three data sets, the resulting network contained 8,174 nodes and 27,810 edges. The Molecular Complex Detection (MCODE) algorithm  was used to detect densely connected regions in the interactome network. To calculate the enrichment of mutome proteins in network clusters, a binomial distribution was used. Enrichment in GO terms was investigated using OntoExpress tools  and GENECODIS . To determine the level of connectivity between validated mutome gene products, we compared the number of nodes and interactions in the largest component generated by direct interactions between these proteins (73 of 122 were mapped in the interactome) to the number of nodes and interactions generated by 100 iterations of 73 randomly chosen proteins in the interactome.
MAP would like to offer his personal thanks to Marc Vidal, for introducing him to and developing his knowledge of the world of complex systems. This work was supported by the Fundació la Caixa (grant BM05-254-00 awarded to MAP), the Catalan Institute of Oncology (PH), the Instituto de Salud Carlos III (RCESP-C03/09 and RTICCC-C03/10) and SAF2003/5821. MAP is a Ramón y Cajal Researcher with the Spanish Ministry of Education and Science.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.