Participant characteristics and disposition
We included 1,214 participants: 971 patients with previously untreated cancer, comprising 381 with primary liver cancer (PLC), 298 with colorectal cancer (CRC), and 292 with lung adenocarcinoma (LUAD), plus 243 healthy volunteers without cancer (Fig. 1A). The study was approved by the Ethics Committees and conducted in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki and its later amendments. Written informed consent was provided by all participants. Details of enrollment are provided in Supplementary Materials and Methods. The participants underwent WGS and fragmentomic feature extraction and were randomly split into training and test datasets at a 1:1 ratio. We used the entire training dataset to build the first-level cancer detection model and the cancer samples within the training dataset to train the second-level cancer origin model.

The workflow of model construction is described in Fig. 1B and Supplementary Materials and Methods. Briefly, we extracted five distinct features covering cfDNA fragmentation size, motif sequence, and copy number variation from the WGS data, namely Fragment Size Coverage (FSC), Fragment Size Distribution (FSD), EnD Motif (EDM), BreakPoint Motif (BPM), and Copy Number Variation (CNV). The fragmentomic features were modeled with five machine learning algorithms, including Generalized Linear Model (GLM), Gradient Boosting Machine (GBM), Random Forest, Deep Learning, and XGBoost, and the resulting models were integrated to establish the ensemble stacked model. Notably, the model was built solely on the training dataset, while the test dataset remained untouched until the model was finalized. We evaluated the cancer detection model in the test dataset and then took the true-positive cases to validate the cancer origin model. The demographics and characteristics of healthy and cancer participants (Table S1) were comparable between the training and test datasets. More importantly, the majority of cancer samples were early-stage disease [PLC: stage IA/IB 117/191 (61.3%) in the training cohort and 126/190 (66.3%) in the test cohort; CRC: stage 0/I 149/149 (100.0%); LUAD: stage IA/IB 146/146 (100.0%)].
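The two-level design described above can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: it uses scikit-learn and xgboost stand-ins for the five algorithm families, a random placeholder matrix in place of the real FSC/FSD/EDM/BPM/CNV features, and a single concatenated feature matrix rather than whatever per-feature base-learner layout the actual pipeline uses.

```python
# A minimal sketch of the two-level workflow, with scikit-learn/xgboost stand-ins
# for the five algorithm families (GLM, GBM, Random Forest, Deep Learning, XGBoost).
# The feature matrix and labels are random placeholders, not study data.
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# X: concatenated fragmentomic features (FSC, FSD, EDM, BPM, CNV) per sample.
# y: 1 = cancer, 0 = healthy. Replace with features extracted from WGS.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))
y = rng.integers(0, 2, size=400)

# 1:1 split into training and test sets; the test set stays untouched until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

base_learners = [
    ("glm", LogisticRegression(max_iter=1000)),                        # GLM
    ("gbm", GradientBoostingClassifier()),                             # GBM
    ("rf", RandomForestClassifier(n_estimators=500)),                  # Random Forest
    ("dl", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)),  # Deep Learning
    ("xgb", XGBClassifier(eval_metric="logloss")),                     # XGBoost
]

# Stacked ensemble: out-of-fold probabilities from the base learners feed a
# logistic-regression meta-learner, trained only on the training dataset.
detector = StackingClassifier(estimators=base_learners,
                              final_estimator=LogisticRegression(max_iter=1000),
                              stack_method="predict_proba", cv=10)
detector.fit(X_train, y_train)
cancer_scores = detector.predict_proba(X_test)[:, 1]  # "cancer score" per sample
```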
Differentiating cancer and non-cancer subjects by the cancer detection model
We achieved an AUC of 0.983 (95% CI: 0.975-0.992) for detecting all cancer subjects in the test dataset (Fig. 2A). The PLC group had the highest AUC (0.999), followed by the CRC (0.974, 95% CI: 0.955-0.993) and LUAD (0.973, 95% CI: 0.957-0.989) groups. Healthy subjects had lower cancer scores than cancer subjects, and the three cancer types showed similar score distributions (Fig. 2B). A cancer score cutoff of 0.39 yielded a specificity of 95.0% (95% CI: 89.5-98.2%). The corresponding sensitivities were 95.5% (95% CI: 93.2-97.1%) for all cancer subjects (Fig. 2C), and 100.0% (95% CI: 98.1-100.0%), 94.6% (95% CI: 89.7-97.7%), and 90.4% (95% CI: 84.4-94.7%) for PLC, CRC, and LUAD, respectively (Table S2). The distribution of cancer scores trended upward from early to later stages in the all-cancer, PLC, and CRC classes (Fig. S1). A propensity score matching analysis balanced age and sex between the cancer and non-cancer groups in the test dataset. The resulting subset of 113 PLC, 73 CRC, and 85 LUAD patients and 85 age- and sex-matched healthy controls retained high performance in distinguishing cancer patients from non-cancer controls (AUC: 0.988, 95% CI: 0.980-0.996; Fig. S2A). We also performed 10-fold cross-validation during training to evaluate model overfitting. The 10-fold cross-validation AUCs for all cancers and for individual cancer types were as high as those in the independent test dataset (Fig. S2B), indicating that overfitting was not a major concern.
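For readers who want the mechanics of the fixed-cutoff analysis, the sketch below shows, on assumed synthetic score distributions, how a cutoff analogous to 0.39 can be set at the 95th percentile of healthy-control scores and how sensitivity, specificity, and AUC follow from it; all numbers and variable names are illustrative, and in the study the cutoff is fixed before evaluation on the test set.

```python
# Sketch of setting a score cutoff at a target specificity and reporting the
# resulting sensitivity and AUC. Score distributions are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
healthy_scores = rng.beta(2, 8, size=120)  # controls cluster low
tumor_scores = rng.beta(8, 2, size=480)    # cancers cluster high

# Choose the cutoff so ~95% of healthy controls score below it (specificity ~95%);
# the fixed cutoff is then applied unchanged to held-out samples.
cutoff = np.quantile(healthy_scores, 0.95)

sensitivity = float(np.mean(tumor_scores >= cutoff))
specificity = float(np.mean(healthy_scores < cutoff))
labels = np.r_[np.zeros(len(healthy_scores)), np.ones(len(tumor_scores))]
scores = np.r_[healthy_scores, tumor_scores]
auc = roc_auc_score(labels, scores)
print(f"cutoff={cutoff:.2f}  sensitivity={sensitivity:.3f}  "
      f"specificity={specificity:.3f}  AUC={auc:.3f}")
```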
Our model exhibited high sensitivity in detecting cancers at various stages (Fig. 2D). The sensitivity was above 90% for stages 0 and I and rose to nearly 100% for stages II and III. Furthermore, we categorized disease subgroups by patient demographics and clinical characteristics for evaluation (Table S3 and Figs. S3-S5). The model's detection sensitivity remained consistently high even in challenging categories, such as minimally invasive adenocarcinoma (MIA) and LUAD tumors <1 cm. We assessed the model's robustness by gradually down-sampling the sequencing coverage to 1× (Fig. 2E and Table S4). Despite a slight decline, the model remained stable, with over 91.5% sensitivity for all cancers. Even for LUAD, the least detectable class, the sensitivity at 1× coverage remained above 87%.
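The down-sampling experiment can be approximated as follows. This is a toy sketch, not the study's pipeline: real analyses down-sample aligned reads (for example at the BAM level) before re-extracting features, and the fragment records, the assumed original depth of ~8×, and the helper function here are placeholders for illustration.

```python
# Toy sketch of coverage down-sampling: fragments are randomly subsampled in
# proportion to the target depth before features are re-extracted and re-scored.
import numpy as np

def downsample_fragments(fragments, original_depth, target_depth, seed=0):
    """Keep a random subset of fragments in proportion to target/original depth."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(fragments)) < (target_depth / original_depth)
    return [frag for frag, k in zip(fragments, keep) if k]

# Toy fragment records (start, end, 5' end motif) standing in for one sample's
# cfDNA fragments; the original depth of ~8x is an assumed placeholder.
fragments = [(1_000 + i, 1_000 + i + 167, "CCCA") for i in range(10_000)]
for depth in (5, 3, 1):
    subset = downsample_fragments(fragments, original_depth=8, target_depth=depth)
    # ...re-extract FSC/FSD/EDM/BPM/CNV from `subset` and re-score with the
    # frozen detection model to obtain sensitivity at this coverage...
    print(f"{depth}x: kept {len(subset)} of {len(fragments)} fragments")
```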
Furthermore, the cancer detection model was assessed in a preliminary cohort of at-risk patients and showed an overall specificity of 92.4% (Table S5; details in Supplementary Results).
Locating cancer at its origin by the cancer origin model
All patients in the test dataset correctly identified as "Cancer" by the cancer detection model were subsequently analyzed with the cancer origin model. The model correctly identified the cancer origin for 431 patients across the three cancer types (accuracy 0.931, 95% CI: 0.900-0.950) (Fig. 2F and Table S6). The sensitivities for individual cancer types were 97.4% (95% CI: 94.0-99.1%), 94.3% (95% CI: 89.1-97.5%), and 85.6% (95% CI: 78.4-91.1%) for PLC, CRC, and LUAD, respectively. We plotted the cancer origin scores of each type for all patients (Fig. 2G). In general, the top score matched the true cancer type. This consistency was strongest for the PLC patients, followed by the CRC patients, whereas the LUAD group had more erroneous CRC predictions (Fig. 2F and G). We further inspected the origin scores of the misclassified patients (Fig. S6). The score differences between the true origin and the misclassified type were minimal (≤ 0.05), leaving room for potential improvement. The cancer origin model remained robust with lower-coverage WGS data (Fig. 2H and Table S7). The accuracies for PLC, CRC, and LUAD at 1× coverage were 97.7%, 92.4%, and 90.6%, respectively, and the predictions for each patient at different sequencing coverages are shown in Fig. 2I.
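The origin call itself reduces to taking the highest per-type score, and the error analysis to the gap between the winning score and the true type's score. The sketch below illustrates this with made-up score rows; the score values and the two-patient example are not from the study.

```python
# Sketch of the second-level origin call: report the highest-scoring type and,
# for errors, the gap between the winning and true-type scores (illustrative data).
import numpy as np

classes = np.array(["PLC", "CRC", "LUAD"])
# Rows = patients already called "Cancer" by the first-level model;
# columns = per-type origin scores (made-up values).
origin_scores = np.array([
    [0.91, 0.05, 0.04],   # a confident PLC call
    [0.08, 0.48, 0.44],   # a LUAD case misread as CRC by a small margin
])
true_origin = np.array(["PLC", "LUAD"])

# The reported origin is simply the highest-scoring type.
predicted = classes[origin_scores.argmax(axis=1)]
accuracy = float(np.mean(predicted == true_origin))

# For misclassified cases, the gap between the winning score and the true
# type's score is the quantity inspected in Fig. S6 (<= 0.05 in the test set).
true_idx = np.array([int(np.where(classes == t)[0][0]) for t in true_origin])
gaps = origin_scores.max(axis=1) - origin_scores[np.arange(len(true_idx)), true_idx]
print(predicted, f"accuracy={accuracy:.2f}", gaps)
```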
Our study has several limitations. First, we performed this proof-of-concept study on liver cancer, colorectal cancer, and lung cancer because of their high prevalence. Targeting a broader population and more cancer types, including less prevalent ones, will be necessary to further develop the assay and help eliminate inequities in cancer care. Second, we are expanding the current cohort to enable independent validation and to improve estimation accuracy for relatively small subgroups (e.g., cHCC-ICC, MIA, stage IB LUAD).