Targeted fragment end profiling in cfDNA
To design an assay capable of capturing cfDNA fragment end profile shifts in cancer, we examined the available ATAC-seq datasets of 410 tumor samples from The Cancer Genome Atlas (TCGA). These data characterize chromatin accessibility in 23 cancer types, including colon adenocarcinoma (COAD) and renal cell carcinoma (RCC), with a peak resolution of 500 bp [10]. From these data, we selected 48 COAD-specific and 48 RCC-specific peaks based on their normalized scores and specificity for the corresponding cancer type (Supplementary Fig. 1). To accurately analyze cfDNA fragment end profiles (cfDNA-FEP) in genomic regions of interest, we used a modified anchored multiplex PCR approach followed by NGS [12]. Briefly, ligation of the universal adapter to cfDNA is followed by primer extension from the target primer pool such that the resulting products contain universal adapter sequences at the 3′-ends. The ligated adapters contain unique molecular identifiers (UMIs), which allow PCR duplicates to be removed during downstream analysis so that read counts converge on the number of original cfDNA molecules. Subsequent amplification is performed with universal primers to reduce PCR biases. This scheme allows for targeted amplification while preserving information about the original end coordinates of the cfDNA fragments (Fig. 1A). The distributions of relative end positions reflect cfDNA fragmentation profiles specific to each target region. We hypothesized that the cfDNA end profile pattern in open-chromatin regions might differ between healthy individuals and cancer patients.
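The role of the UMIs can be illustrated with a minimal sketch. It is not the pipeline used in this study: it assumes that reads have already been reduced to hypothetical (target, end position, UMI) tuples and it groups them by exact match only, without UMI error correction or clustering.

```python
from collections import defaultdict


def collapse_by_umi(reads):
    """Collapse reads that share target, end position, and UMI.

    `reads` is an iterable of (target, end_position, umi) tuples.
    This tuple layout is an assumption made for illustration only.
    Returns a dict mapping each unique molecule to its read count.
    """
    molecules = defaultdict(int)
    for target, end_pos, umi in reads:
        molecules[(target, end_pos, umi)] += 1
    return dict(molecules)


def molecules_per_end(molecules):
    """Count unique molecules (not reads) at each (target, end_position)."""
    counts = defaultdict(int)
    for target, end_pos, _umi in molecules:
        counts[(target, end_pos)] += 1
    return dict(counts)


# Synthetic example: four reads, two of which are PCR duplicates.
example_reads = [
    ("T1", 42, "AAC"),
    ("T1", 42, "AAC"),  # duplicate of the first read
    ("T1", 42, "GGT"),  # same end position, different molecule
    ("T1", 57, "AAC"),
]
example_molecules = collapse_by_umi(example_reads)
example_counts = molecules_per_end(example_molecules)
```

In this toy input the duplicated read contributes a single molecule, so the per-position counts track original cfDNA molecules rather than amplified copies.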
cfDNA-FEP on a clinical cohort
A cohort of 175 individuals, comprising patients with histologically confirmed CRC (n = 58) or RCC (n = 57) and age-matched healthy individuals (n = 60), was divided into two batches (n = 116 and n = 59) that were processed individually to account for potential batch effects (Fig. 1B). After library preparation and paired-end next-generation sequencing, we performed UMI-clustering to remove PCR duplicates and then aligned clustered reads to the reference genome. To generate end profiles, we retrieved only proper pairs where the second reads overlapped the target primer binding sites. The relative start positions of the corresponding first reads represent the end profile of cfDNA molecules for each region (Fig. 1C, top). We examined the densities of the distributions in each target region and found that, in at least some regions, the average density at the peaks differed between cancer and control groups (Fig. 1C, bottom). These peaks in density denote sites where cfDNA is predominantly cleaved and may vary due to nucleosome repositioning, changes in chromatin state, or aberrant nuclease activity associated with pathology [13]. To build a classifier, we selected the most prominent peaks in the range of 0–99 bp (Peak1) and 100–300 bp (Peak2) within each target region (Supplementary Fig. 2). The normalized log2 ratio of the densities in Peak1 and Peak2 served as a single-value metric of the fragmentation profile for each target region, referred to as the fragmentation score (FS). We further noted that dinucleotides at the ends of the cfDNA fragments were not evenly distributed, with CC being the most frequent motif (Fig. 1D). This is consistent with previous reports on hepatocellular carcinoma and may be a consequence of aberrant nuclease activity in cancer [14]. Therefore, we used the frequencies of sequence motifs along with FS values as predictors for the cfDNA-FEP model.
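The two kinds of predictors described above can be sketched as follows. This is an illustrative approximation rather than the exact procedure: the normalization is not specified in the text, so the sketch assumes a per-base histogram divided by the total count, takes the highest bin in each window as the peak, and adds a small pseudocount. The function names, the pseudocount, and the choice of the first two bases of each fragment as the end motif are all assumptions.

```python
import math
from collections import Counter

PEAK1_WINDOW = range(0, 100)    # 0-99 bp, as in the text
PEAK2_WINDOW = range(100, 301)  # 100-300 bp, as in the text


def end_density(rel_positions, max_pos=300):
    """Per-base density of relative fragment end positions (assumed form)."""
    counts = Counter(p for p in rel_positions if 0 <= p <= max_pos)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no fragment ends inside the analysed window")
    return {p: counts[p] / total for p in range(max_pos + 1)}


def fragmentation_score(rel_positions, pseudocount=1e-6):
    """log2 ratio of the highest density in Peak1 to that in Peak2."""
    density = end_density(rel_positions)
    peak1 = max(density[p] for p in PEAK1_WINDOW)
    peak2 = max(density[p] for p in PEAK2_WINDOW)
    return math.log2((peak1 + pseudocount) / (peak2 + pseudocount))


def end_dinucleotide_freqs(fragment_seqs):
    """Relative frequency of the first two bases of each fragment."""
    motifs = Counter(s[:2].upper() for s in fragment_seqs if len(s) >= 2)
    total = sum(motifs.values())
    return {motif: n / total for motif, n in motifs.items()}


# Synthetic end positions for one hypothetical target region.
example_positions = [50] * 4 + [160] * 2 + [10, 200]
example_fs = fragmentation_score(example_positions)

# Synthetic fragment sequences.
example_motifs = end_dinucleotide_freqs(["CCAG", "CCTT", "GATT", "ccaa"])
```

A positive score here means that ends are more concentrated in the 0–99 bp window than in the 100–300 bp window; one such value per target region, together with the motif frequencies, would form the feature vector for a sample.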
Patient sample classification
We trained support vector machine classifiers on a dedicated training dataset to select the best-performing model based on the area under the ROC curve. The final classifier was able to distinguish cancer and healthy samples on the training dataset (10 repeats of 10-fold cross-validation) with a mean AUC = 0.91 (sd = 0.09, n = 100) (Fig. 2A) and on the unseen test dataset with an AUC = 0.94 and an accuracy of 0.9 (Fig. 2C). The cfDNA-FEP classifier generates a cancer score for each cfDNA sample. This metric reflects the probability that the cfDNA sample is from a patient with cancer (Fig. 2B). For samples from the test dataset, we observed only a slight decrease in classifier performance for early-stage (I, II) cancer (AUC = 0.91, accuracy 0.87) compared with late-stage (III, IV) disease (AUC = 0.96, accuracy 0.89). The median cancer scores for healthy and stage I cancer samples were 0.275 (sd = 0.294, n = 15) and 0.831 (sd = 0.162, n = 12), respectively, making classification of even early-stage cancers feasible with a decision cutoff of 0.5 (Fig. 2B). Stage IV cancer samples (n = 12) had a median cancer score of 0.955 (sd = 0.205). The cfDNA-FEP classifier detected both cancer types in the test set with similar performance: the AUC for RCC and COAD test set samples was 0.94.
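The evaluation metrics reported above can be made concrete with a short sketch. It is not the code used in this study: the scores and labels are synthetic, the split generator is unstratified, and treating a score equal to the cutoff as cancer is an assumption. It shows how 10 repeats of 10-fold cross-validation yield 100 held-out folds, and how AUC and accuracy at the 0.5 cutoff follow from per-sample cancer scores.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (train, test) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        for fold in range(k):
            test = idx[fold::k]
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test


def roc_auc(scores, labels):
    """AUC as the probability that a cancer sample outscores a healthy one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, cutoff=0.5):
    """Fraction of samples classified correctly at the given cutoff."""
    preds = [1 if s >= cutoff else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Synthetic cancer scores (1 = cancer, 0 = healthy); not study data.
example_scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
example_labels = [1, 1, 1, 0, 0, 0]
example_auc = roc_auc(example_scores, example_labels)
example_acc = accuracy(example_scores, example_labels)

# 10 x 10-fold CV on a hypothetical training set of 50 samples.
example_splits = list(repeated_kfold(50))
```

Computing the AUC once per held-out fold gives 100 values, which is why the cross-validated AUC is summarized as a mean and standard deviation with n = 100.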