Non-invasive lung cancer diagnosis and prognosis based on multi-analyte liquid biopsy

Chest LDCT provides an effective approach for lung cancer screening, yet in practice it generates a large number of false positives because pulmonary lesions of indeterminate clinical significance are over-diagnosed. In this study, we performed comprehensive genetic and epigenetic profiling of cfDNA from lung cancer patients and individuals bearing benign lung lesions, using ultra-deep targeted sequencing and targeted bisulfite sequencing. We found that the cfDNA mutation profile alone has relatively limited power to distinguish malignant from benign plasma, whereas cfDNA methylation profiling classified the two groups better. Combining genetic and epigenetic features of cfDNA with a serum protein marker further improved the classification accuracy. We also identified novel methylation-based prognostic markers and showed that an integrated model combining cfDNA mutational status with methylation-based prognostic markers improved prediction of lung cancer survival. Our results highlight the potential of the multi-analyte assay for non-invasive lung cancer diagnosis and prognosis.

Lung cancer (LC) is the leading cause of death in many countries including China. The stage at which LC is diagnosed has a significant impact on prognosis. However, timely detection of LC remains difficult since patients are often asymptomatic at early stages. Low-dose computed tomography (LDCT) is currently the most widely recommended LC screening method, but it poses radiation risks and only a small fraction of the nodules detected are true lung cancers. In clinical practice, it remains a challenge to differentiate malignant tumors from benign solitary pulmonary nodules, a task that may benefit greatly from non-invasive diagnostic tools. TNM stage currently remains the most widely used prognostic tool in lung cancer. However, the variability of survival within staging groups suggests that a search for additional prognostic parameters is necessary. Molecular alterations such as cancer driver gene mutational status and expression signatures have been implicated in LC prognosis; meanwhile, there is emerging evidence supporting the prognostic value of epigenetic alterations, which remains to be fully elucidated.
Circulating tumor DNA (ctDNA) in the plasma of cancer patients provides valuable information about the cancer genome and also holds great promise for non-invasive cancer detection [1,2]. However, since ctDNA is diluted by abundant circulating cell-free DNA (cfDNA) of noncancerous origin, its detection poses significant challenges, especially during early stages of cancer when the tumor mass is small. In this study, we developed a set of experimental and computational tools to measure both genetic and epigenetic signals from plasma cfDNA of LC patients as well as patients bearing benign lung nodules (BLN) using high-throughput sequencing [3], aiming to explore the potential utility of blood-based biomarkers for LC diagnosis and prognosis.

Results and discussions
Targeted ultra-deep sequencing detected distinct mutational spectra of plasma cfDNA and WBC gDNA
A cohort of 128 LC patients representing a natural tumor stage distribution (66% of the cases were stage 0 or stage I) and 94 BLN patients was enrolled in this study (Fig. 1a and Table 1). To detect genomic sequence alterations, we performed targeted ultra-deep next-generation sequencing (NGS) on plasma cfDNA extracted from 111 LC patients and 78 BLN patients using a panel covering exons of 139 cancer driver genes selected based on the TCGA and COSMIC databases (Supplementary Tables 1 and 2, and Supplementary Fig. 1). Adaptors containing 6 bp duplex unique molecular identifiers (UMIs) were used in library preparation to enable subsequent removal of PCR duplicates and error correction based on consensus generation. A set of stringent thresholds was then applied to identify the most reliable somatic variants (see Methods for details). In total, 193 and 46 mutations were detected in the plasma cfDNA of 75 (68%) LC patients and 33 (42%) BLN patients, respectively (Supplementary Fig. 2 and 3).
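As an illustration of the consensus-based error correction mentioned above, the following minimal sketch groups reads by UMI and start position and takes a per-base majority vote. It is not the pipeline used in this study: the function name, the tuple layout of a read, and both thresholds are assumptions made for the example, and it collapses single-strand families only, without pairing the two strands of a duplex UMI.

```python
from collections import Counter, defaultdict

def umi_consensus(reads, min_family_size=2, min_agreement=0.9):
    """Collapse reads sharing (UMI, start position) into one consensus sequence.

    reads: iterable of (umi, start, sequence) tuples. Sequences within one
    family are assumed to have equal length. Both thresholds are illustrative.
    """
    families = defaultdict(list)
    for umi, start, seq in reads:
        families[(umi, start)].append(seq)

    consensus = {}
    for key, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # too few PCR copies to error-correct
        bases = []
        for column in zip(*seqs):
            base, count = Counter(column).most_common(1)[0]
            # Mask positions where the copies disagree too much.
            bases.append(base if count / len(seqs) >= min_agreement else "N")
        consensus[key] = "".join(bases)
    return consensus

reads = [
    ("ACGTAC", 100, "TTGACA"),
    ("ACGTAC", 100, "TTGACA"),
    ("ACGTAC", 100, "TTGCCA"),  # one copy disagrees at the fourth base
    ("GGATCC", 100, "TTGACA"),  # singleton family, dropped
]
strict = umi_consensus(reads)                      # default 90% agreement
lenient = umi_consensus(reads, min_agreement=0.6)  # 2 of 3 copies suffice
```

With the strict default the discordant position is masked as `N`, while the lenient setting keeps the majority base. This is the trade-off between sensitivity and error suppression that such thresholds control.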
Since some variants might derive from clonal hematopoiesis (CH) and confound the mutational analysis [4], genomic DNA (gDNA) of white blood cells (WBC) from cfDNA mutation-positive participants was also sequenced. Non-synonymous variants were detected in the WBC of 73 (97%) LC patients and 33 (100%) BLN patients, respectively (Supplementary Fig. 4 and 5). Among WBC-shared cfDNA variants, the most frequently mutated genes included TP53, CBL, APOB, and CSMD3 for LC plasma, and CBL, CSMD3, and STAT3 for BLN plasma (Supplementary Fig. 6). Moreover, allele frequencies (AFs) of variants shared by cfDNA and matched WBC samples were highly correlated (Fig. 1b), suggesting that these mutations indeed originated from WBC and should be removed before downstream analysis [5]. The percentages of cfDNA variants matching the corresponding WBC sample were 20.7% (40 out of 193) for LC cfDNA and 39.1% (18 out of 46) for BLN cfDNA, suggesting that a significant portion of cfDNA variants was derived from CH, especially in BLN plasma (p = 8.89E-03, chi-squared test). Notably, a number of these mutations were hotspot mutations of cancer driver genes (Supplementary Fig. 7), suggesting that CH variants may significantly confound cfDNA analysis if matched WBC gDNA is not analyzed in parallel.
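The WBC filtering step and the comparison of WBC-matched fractions can be sketched as follows. The variant representation (a tuple of chromosome, position, reference and alternate allele) and both function names are assumptions for illustration. The test is a Pearson chi-squared test on a 2x2 table without continuity correction, an assumption under which the counts quoted above give a p-value close to the reported one.

```python
import math

def remove_wbc_variants(cfdna_variants, wbc_variants):
    """Drop cfDNA variants that are also present in matched WBC gDNA."""
    wbc = set(wbc_variants)
    return [v for v in cfdna_variants if v not in wbc]

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = sum(
        (observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(2)
        for j in range(2)
    )
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Toy variants (hypothetical coordinates): the first is shared with WBC.
cfdna = [("chr11", 1000, "C", "T"), ("chr17", 2000, "G", "A")]
wbc = [("chr11", 1000, "C", "T")]
tumor_candidates = remove_wbc_variants(cfdna, wbc)

# WBC-matched vs. unmatched cfDNA variants: LC 40 of 193, BLN 18 of 46.
stat, p_value = chi2_2x2(40, 193 - 40, 18, 46 - 18)
```

Applying a continuity correction would give a somewhat larger p-value, so the choice should be stated whenever such a test is reported.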
Smoking is an important risk factor for lung cancer. We observed that within LC patients, smokers appeared to carry a higher mutation burden in plasma cfDNA than never-smokers (Supplementary Fig. 11). 28 mutations  Table 4), although the fraction of positive samples was much lower than in LC plasma (29.49% vs. 60.36%, p = 2.87E-05, chi-squared test). These mutations had AFs ranging from 0.05 to 1.91% (Fig. 1d). The most frequently mutated genes in BLN plasma were KRAS (5%), CSMD3 (4%), APC (3%), ATM (3%), KMT2D (3%), and TP53 (3%) (Fig. 1e), representing a mutational spectrum distinct from that of LC cfDNA. Notably, 39.3% (11 out of 28) of these were COSMIC hotspot mutations (Supplementary Table 4). These results revealed that, contrary to common belief, plasma cfDNA from BLN patients also carried genomic sequence alterations, including mutations in cancer driver genes, albeit less frequently. These alterations could have arisen from somatic clonal expansions in normal tissues [6]. In addition, some of the benign lesions included in our study are regarded as premalignant lung lesions, such as atypical adenomatous hyperplasia (AAH, 11% of BLN cases in our study). A previous study showed that AAH indeed harbors cancer driver mutations, such as those in the genes KRAS, BRAF, APC, KMT2D, and TP53, and that these mutations can be readily detected in matched plasma cfDNA [7]. Taken together, these results highlight the potential challenges of differentiating malignant from benign plasma based on the cfDNA mutation spectrum.

Classification models based on somatic mutations to distinguish LC from BLN
Next, we investigated whether LC plasma cfDNA had a higher mutational burden than that of BLN patients. To quantify the cfDNA mutational burden, we constructed a mutation score for each cfDNA sample as either a simple summation of the allele fractions of all variants identified (SUMAF), or a weighted sum of the  Fig. 1f and Supplementary Fig. 12) and the SUMAF model had a similar performance. These results showed that classification models built on the mutation score alone had limited capability to differentiate LC from BLN plasma. This contradicts some earlier studies, which suggested that mutational status could be used to distinguish LC from BLN with high specificity and modest sensitivity [8], a conclusion that may have been biased by the limited sample sizes used. Our results, obtained from a larger cohort (128 LC and 94 BLN plasma), suggest that genomic sequence alterations in cancer driver genes carried by BLN cfDNA might be more prevalent than previously thought, thereby limiting the utility of mutation-based diagnostic assays. A multi-analyte approach is more likely to improve the detection of cancer signal.
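A minimal sketch of the two mutation scores is given below. SUMAF follows the definition in the text, a plain sum of allele fractions. The weighting scheme of the weighted score is not recoverable from this section, so the per-gene weights, the default weight, and the function names are purely hypothetical placeholders.

```python
def sumaf(allele_fractions):
    """SUMAF: simple sum of the allele fractions of all variants in a sample."""
    return sum(allele_fractions)

def weighted_sumaf(variants, weights, default_weight=1.0):
    """Weighted sum of allele fractions.

    variants: list of (gene, allele_fraction) pairs.
    weights:  hypothetical per-gene weights; the study's actual weighting
              is not specified here, so these values are assumptions.
    """
    return sum(af * weights.get(gene, default_weight) for gene, af in variants)

# Toy sample with two variants (allele fractions as proportions, not percent).
sample = [("TP53", 0.012), ("KRAS", 0.004)]
plain_score = sumaf(af for _, af in sample)
weighted_score = weighted_sumaf(sample, weights={"TP53": 2.0})
```

A sample with no detected variants scores zero under both definitions, which is what makes a zero versus non-zero split a natural way to dichotomize mutational status.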

Classification of LC and BLN plasma based on cfDNA methylation data
To identify LC-specific epigenetic changes, we performed whole-genome bisulfite sequencing (WGBS) on 25 pairs of LC tissue and normal tissue adjacent to the tumor (NAT) (Fig. 2a). Three hundred fifteen differentially methylated regions (DMRs) were identified, including 293 hyper-methylated DMRs and 22 hypo-methylated DMRs (Fig. 2b; see Methods for details). There were far more hyper DMRs than hypo DMRs, consistent with the belief that genomic regulatory regions such as promoters of potential tumor suppressor genes undergo marked hypermethylation in tumorigenesis [9]. Gene ontology (GO) annotations revealed that the 293 hyper DMRs were significantly enriched for genes encoding DNA-binding domains and homeobox domains, as well as genes involved in developmental and transcriptional regulation processes (Fig. 2c). These genes are likely to be potential tumor suppressor genes, and many of them have not previously been implicated as such (for example SEC31B, ZNF274, and NXPH1). Unsupervised hierarchical clustering using the regional methylation ratio of the identified DMRs separated LC tissues from NAT with the exception of a single LC sample, highlighting the pronounced epigenetic dysregulation of LC cells (Fig. 2b).
We next performed a comprehensive analysis of the 5-mC methylation profile of cfDNA for 111 LC patients and 87 BLN patients using targeted bisulfite sequencing, covering 5.6 million CpG sites (Supplementary Table 1). Compared to BLN plasma, increased methylation levels were observed in LC plasma cfDNA for the hypermethylated DMRs identified from tissue sequencing, as expected (Supplementary Fig. 13). The DMR methylation levels of cfDNA appeared to be lower in smokers than in never-smokers, in both the LC and BLN groups (Supplementary Fig. 14).
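The regional methylation ratio used as the per-DMR feature can be illustrated with a short sketch. The definition assumed here, methylated read counts pooled over all CpG sites in a region and divided by the pooled total read counts, is a common convention and an assumption on our part; the exact definition used in this study is given in the Methods, and the function name is hypothetical.

```python
def regional_methylation_ratio(cpg_counts):
    """Pooled methylation ratio of one region.

    cpg_counts: list of (methylated_reads, total_reads), one pair per CpG site.
    Returns NaN when the region has no coverage.
    """
    methylated = sum(m for m, _ in cpg_counts)
    total = sum(t for _, t in cpg_counts)
    return methylated / total if total else float("nan")

# Toy DMR with three CpG sites at different coverage.
ratio = regional_methylation_ratio([(8, 10), (3, 10), (0, 5)])
```

Pooling counts rather than averaging per-site ratios weights each CpG by its coverage, which damps the noise from sparsely covered sites in low-input cfDNA libraries.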
To test the diagnostic value of methylation markers, we first built a random forest model with 6-fold cross-validation (CV) that classified LC from BLN plasma based on hyper DMRs, which achieved an AUC of 0.71 (Supplementary Fig. 15). A feature selection process was then carried out to minimize the number of DMR markers while maintaining performance. By selecting the features with the highest feature importance in each fold of the CV, a model consisting of 47 DMRs was obtained, achieving a similar performance with an AUC of 0.71 while reducing the feature size by 84.0%, and still outperforming models based on mutation status alone (Supplementary Fig. 16). These results indicated that LC-specific methylation changes carried by plasma cfDNA could be effective biomarkers for distinguishing lung cancers from benign lesions. Here, the difference in model performance compared with a previous study on methylation markers could be attributed to the different study populations and cfDNA analysis methods. The smaller cohort size in the previous study may have also caused over-fitting and/or over-estimation of the model performance.
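For readers less familiar with the evaluation metric, the sketch below computes the AUC directly from its rank-based definition, together with a simple 6-fold index split. It does not reproduce the random forest model itself. The round-robin fold assignment and both function names are assumptions for illustration, and a real analysis would normally shuffle and stratify the folds.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    labels: 1 for LC, 0 for BLN. Ties between a positive and a negative count 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def kfold_indices(n_samples, k=6):
    """Assign sample indices to k folds round-robin (no shuffling, for clarity)."""
    return [list(range(fold, n_samples, k)) for fold in range(k)]

# Toy held-out predictions: three LC and three BLN samples.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.4]
auc = roc_auc(labels, scores)
folds = kfold_indices(7, k=3)
```

Because the AUC depends only on the ranking of the scores, it can be compared across models whose raw outputs are on different scales, such as a mutation score and a random forest probability.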

Multi-omics analysis to differentiate LC from BLN plasma
Next, we attempted to integrate multi-omics features of the cfDNA to further improve the diagnostic power of our classification model among samples with complete measurements (Supplementary Table 1). Indeed, among 91 LC and 71 BLN cfDNA samples that underwent both genetic and epigenetic profiling, we found that models combining the 59 most informative DMRs and the wSUMAF mutation score achieved a CV-AUC of 0.77 with a sensitivity of 76.1% and a specificity of 59.2%, compared to an AUC of 0.68 achieved by the mutation score alone (DeLong p-value < 0.01) and an AUC of 0.74 achieved by methylation features alone (DeLong p-value = 0.85) (Supplementary Fig. 17).
We then tried to incorporate serum protein markers into the classification model. Among the 5 serum markers measured (CEA, CYFRA21-1, NSE, CA19-9, and CA125), only the CEA level appeared to be significantly higher in LC patients than in BLN patients (p = 0.04, Student's t-test), producing a modest classification AUC of 0.66 (Supplementary Fig. 18 and 19). The multi-omics predictive models based on the combination of the wSUMAF mutation score, the regional methylation ratio of 54 selected DMRs, and the serum CEA level achieved an AUC of 0.78, with 76.9% sensitivity and 58.3% specificity (Fig. 2d), on the set of samples with complete measurements of all three types of analytes (74 LC and 60 BLN). This result showed a further improvement in diagnostic accuracy compared to the models without CEA in the same set of samples (AUC = 0.74, DeLong p-value = 0.02) (Fig. 2e). To our knowledge, this is the first proof-of-concept study to demonstrate that genetic, epigenetic, and proteomic analytes can be combined to improve the performance of a liquid biopsy-based diagnostic assay for LC. Here, the modest performance of the final multi-omics model could be attributed to the fact that a large proportion of the LC cases (n = 85, 66%) included in the study were stage 0 or stage I, and that the majority of the cases (n = 97, 75%) were lung adenocarcinoma (LUAD), which has previously been suggested to release less ctDNA into the bloodstream than lung squamous cell carcinoma (LUSC) (n = 23, 18% in this study). A recent study that used methylation-based ctDNA markers for non-invasive detection of multiple cancer types also found low sensitivities for early-stage lung cancers [9]; another recent study that integrated multiple genomic features to develop a ctDNA-based assay for LC detection also reported modest performance for stage I and II lung cancers (the Lung-CLiP model; AUC = 0.69-0.71) [5], corroborating our findings.
Care needs to be taken when applying the findings presented in this study to cohorts with different clinicopathological characteristics, and additional studies with larger sample sizes will be necessary to validate the current findings.

cfDNA mutation burden and methylation status as prognostic factors for LC
We first tested whether mutational status (wSUMAF, < 0 vs. > 0) was associated with LC overall survival (OS). We found that a higher mutation burden was associated with a significantly worse OS (Fig. 2f). This association was also significant among stage I patients (Supplementary Fig. 20). Next, we attempted to identify potential methylation-based prognostic biomarkers (Fig. 2g) and to incorporate these markers into the prediction model for prognostic stratification of the LC patients. Multiple methylation-based prognostic classifiers have previously been reported for lung cancer; however, the reported markers were mostly inconsistent. The inconsistency could be explained by limited sample sizes, variations in study design, and the different detection methods used. We first obtained coefficients for the candidate features using penalized Cox regression in the training set and incorporated these features into the model (Supplementary Fig. 21). The methylation-based prognostic score (MPS) was then calculated for each individual as a weighted sum of the methylation levels of 12 selected DMRs (Supplementary Table 5), each multiplied by its corresponding coefficient. The MPS was then combined with the mutation score as the bi-omics prognosis score. Patients with a high mutational burden and a high MPS were categorized as the high prognosis score group, which had a significantly worse OS than the low prognosis score group in the testing set (Fig. 2h). Finally, to avoid information loss due to categorization, we modeled both the mutation score and the MPS as continuous variables in two multivariate Cox proportional hazards models, one with wSUMAF only and one combining wSUMAF with MPS. We found that the latter model achieved a higher AUC (likelihood ratio test p-value = 0.27, Supplementary Fig. 22), although the difference between the two models was not statistically significant, which may be attributed to the limited number of samples included in the testing set (n = 38).
Taken together, these results suggest that integrated genomic features have the potential to be used as better prognostic biomarkers for LC [10].
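The prognostic scoring described in this section can be sketched in a few lines. The MPS follows the weighted-sum definition in the text, and the grouping rule follows the statement that patients with both a high mutational burden and a high MPS form the high prognosis score group. The DMR names, coefficient values, and MPS cutoff below are hypothetical; the actual 12 DMRs and their coefficients are listed in Supplementary Table 5.

```python
def methylation_prognostic_score(methylation, coefficients):
    """MPS: sum over selected DMRs of (Cox coefficient x methylation level)."""
    return sum(coef * methylation[dmr] for dmr, coef in coefficients.items())

def prognosis_group(mutation_score, mps, mps_cutoff):
    """'high' only when both the mutation burden and the MPS are high.

    A positive mutation score is treated as high burden; mps_cutoff is an
    assumed threshold, for example one chosen in the training set.
    """
    return "high" if mutation_score > 0 and mps > mps_cutoff else "low"

# Hypothetical coefficients from a penalized Cox model and one patient's data.
coefficients = {"DMR_A": 0.8, "DMR_B": -0.5}
patient_methylation = {"DMR_A": 0.5, "DMR_B": 0.2}

mps = methylation_prognostic_score(patient_methylation, coefficients)
group_mutated = prognosis_group(0.02, mps, mps_cutoff=0.25)
group_unmutated = prognosis_group(0.0, mps, mps_cutoff=0.25)
```

Dichotomizing in this way is convenient for Kaplan-Meier comparison but discards information, which is why the continuous Cox models described above serve as a complementary analysis.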

Conclusion
In summary, we performed comprehensive genetic and epigenetic profiling of cfDNA from lung cancer patients and individuals bearing benign lung lesions. We found that the combination of genetic and epigenetic features of cfDNA with the serum protein marker CEA showed the best capability to differentiate malignant from benign cases. In addition, an integrated model combining cfDNA mutational status and methylation-based prognostic markers has the potential to improve prediction of lung cancer survival. As blood samples are easier to collect than imaging data across distinct clinical scenarios, our results highlight the possibility of a multi-analyte blood-based assay for non-invasive lung cancer diagnosis and prognosis.