Analysis of expression along chromosomes
In each graph of Figures 2, 3, 4 and 5, we plotted, for all genes ordered by their position on the chromosome, the number of patient samples with up- or down-regulation in tumor (as a percentage of informative cases). The curves in these plots are smoothed by averaging over 50 consecutive genes.
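The smoothing step can be illustrated with a minimal sketch. The text does not state how the 50-gene average is aligned or how chromosome ends are handled, so the function below (a hypothetical name, `smooth_profile`) simply averages every run of `window` consecutive genes and returns no value where a full run is unavailable; both choices are assumptions.

```python
def smooth_profile(values, window=50):
    """Average every run of `window` consecutive per-gene values.

    `values` holds one number per gene in chromosomal order, e.g. the
    percentage of informative cases with up-regulation.
    Assumption: no padding, so the result has len(values) - window + 1 entries.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(values) < window:
        return []
    running = sum(values[:window])          # sum of the first full run
    smoothed = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]   # slide the run by one gene
        smoothed.append(running / window)
    return smoothed
```

For a chromosome with 100 genes and the default window, this yields 51 smoothed points.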
Significant deviations from average expression in a particular chromosomal region are not sufficient to infer coordinated deregulation, because they do not reveal whether all genes of the region are de-regulated in the same subset of patients or in different patients. Consider three genes G1, G2, G3 and their expression in patients A, B, C, D, where each gene is up-regulated in 50% of patients. If the genes are up-regulated in different patients (G1 in A/B, G2 in B/C, G3 in C/D), one cannot assume that there is a regional up-regulation in all patients. However, if the genes are up-regulated in the same patients (G1, G2 and G3 are all up-regulated in A and B), it is fair to assume that they have undergone coordinated regional up-regulation. Chance effects are more likely to produce non-coordinated up-regulation. To capture such a gene-versus-gene correlation structure, we performed the following for a given chromosome region:
For each pair of genes in a given chromosome region, we counted over the set of patients the number of coordinated (simultaneous) up-regulations (based on the fold changes computed above) and, separately, the number of coordinated down-regulations. These values can be represented in gray-scale plots: one plot for coordinated up-regulation and a similar one for coordinated down-regulation. Both the horizontal and the vertical axis comprise the genes of the chromosome region in their chromosomal order (see Figures 6–32). The darkness of each square represents the number of coordinated up- or down-regulations, respectively. Coordinately up-regulated regions show up as squares with high "correlation" measures along the diagonal. The resulting cross-comparison matrices can be visualized interactively for any chromosomal region on our supplementary website, along with heat maps of expression intensities, and are used in Figures 6–32. Alternatively, we applied "correlation" measures such as Pearson correlation coefficients on fold changes, mutual information, and set-theoretic coefficients such as the Dice and Jaccard coefficients on binary patterns of up-regulation and down-regulation (available only on our website).
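The pairwise counting can be sketched with the three-gene example given earlier. The code below is an illustration only: the representation of each gene as the set of patients in which it is called up-regulated, and the function names, are assumptions and not the implementation used for the figures.

```python
def co_regulation_matrix(patterns):
    """Count, for every gene pair, the patients in which both genes are called.

    `patterns` is a list of sets in chromosomal order; each set holds the
    patients in which that gene is up-regulated (or down-regulated).
    """
    n = len(patterns)
    return [[len(patterns[i] & patterns[j]) for j in range(n)] for i in range(n)]


def jaccard(a, b):
    """Jaccard coefficient of two binary patterns given as patient sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """Dice coefficient of two binary patterns given as patient sets."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# G1, G2, G3 up-regulated in different patients versus in the same patients.
scattered = [{"A", "B"}, {"B", "C"}, {"C", "D"}]
coordinated = [{"A", "B"}, {"A", "B"}, {"A", "B"}]

print(co_regulation_matrix(scattered))    # [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
print(co_regulation_matrix(coordinated))  # [[2, 2, 2], [2, 2, 2], [2, 2, 2]]
```

Every gene is up-regulated in two of four patients in both cases, yet only the coordinated case produces a uniformly dark block off the diagonal.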
Although this analysis is already instructive for the visual identification of general up- or down-regulation of a particular region, it does not reveal the precise boundaries of deregulated regions. Several software packages exist for the analysis of array CGH data that are reported to be suited for the analysis of expression data as well [42–44]. In the following, we used the ChARM software package. ChARM can be used to infer intervals of variable size with significant positive or negative signal amplitudes in ordered data, such as log(intensity) values in array CGH data and mRNA expression data. We applied the ChARM algorithm to data sets that hold the numbers of patients with coordinated up- and down-regulation of expression for all genes on the human autosomes and the X chromosome. For each chromosome, six separate data sets were prepared, corresponding to scanning window sizes of 5, 11, 21, 31, 41 and 51 genes. Within each window, all possible gene pairs (excluding self-comparisons) were considered. For each gene pair, the numbers of coordinated up-regulations (each counted as +1) and coordinated down-regulations (each counted as -1) were determined. For each window, the sum of these gene pair-specific values divided by the total number of pairs gave the cumulative misregulation score (CMS). In a sliding window approach, each gene was thus associated with a CMS value. CMS values for genes at the edges of chromosomes were calculated with reduced window sizes. The main theoretical advantage of the CMS over raw up-regulation counts or averaged expression ratios is that it captures only information from co-regulated neighboring gene pairs: noise fluctuates from gene to gene and is therefore more likely to produce spurious high expression ratios in isolated genes. In contrast, real signals of regional up- or down-regulation lead to consistent changes in the same patients for two genes.
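The CMS calculation described above can be sketched as follows. This is one reading of the description, not the original implementation: each gene pair contributes the number of patients with coordinated up-regulation minus the number with coordinated down-regulation, the window is centered on the gene, and at chromosome edges it is simply truncated. The function name `cms_profile` and the set-based input format are assumptions.

```python
from itertools import combinations


def cms_profile(up, down, window):
    """Return one cumulative misregulation score (CMS) per gene.

    `up[i]` and `down[i]` are the sets of patients in which gene i (in
    chromosomal order) is up- or down-regulated. `window` is the number of
    genes per window (5, 11, 21, 31, 41 or 51 in the text).
    Assumption: the window is centered on each gene and truncated at the ends.
    """
    half = window // 2
    n = len(up)
    scores = []
    for center in range(n):
        lo = max(0, center - half)
        hi = min(n, center + half + 1)
        pairs = list(combinations(range(lo, hi), 2))   # no self-comparisons
        if not pairs:                                  # fewer than two genes
            scores.append(0.0)
            continue
        total = sum(
            len(up[i] & up[j]) - len(down[i] & down[j])  # +1 per co-up, -1 per co-down
            for i, j in pairs
        )
        scores.append(total / len(pairs))
    return scores
```

With the earlier example, three genes all up-regulated in patients A and B give a CMS of 2.0 for every gene at window size 3, whereas the scattered pattern (A/B, B/C, C/D) gives a lower score of about 0.67 for the middle gene.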
For each window size, the CMS data sets of each chromosome were subjected to ChARM analysis. ChARM uses an expectation-maximization approach to determine the borders of regions with high signal amplitudes in ordered data, here regions of expression imbalance along a chromosome. In addition, ChARM provides different statistical estimates to judge the significance of expression deregulation in a particular chromosomal region. The identified deregulated regions were further evaluated manually using heat maps and the gene-versus-gene "correlation" plots described above (see Figures 6–32 and the accompanying website).