Genome-wide cell-free DNA methylation analyses improve accuracy of non-invasive diagnostic imaging for early-stage breast cancer

Early detection is crucial for improving the outcomes and survival of patients with breast cancer (BC). Mammography and ultrasound, interpreted with the Breast Imaging Reporting and Data System (BI-RADS) categorization, are widely used for early BC detection but suffer from a high false-positive rate that leads to unnecessary biopsies, especially in BI-RADS category-4 patients. Plasma cell-free DNA (cfDNA), which carries DNA methylation information, has emerged as a non-invasive approach for cancer detection. Here we present a prospective multi-center study using whole-genome bisulfite sequencing data to address the clinical utility of cfDNA methylation markers in 203 female patients with breast lesions suspected of malignancy. The cfDNA was enriched in hypo-methylated genomic regions. A practical computational framework was devised to identify optimal cfDNA-rich DNA methylation markers, which significantly improved the early diagnosis of BI-RADS category-4 patients (AUC from 0.78–0.79 to 0.93–0.94). As a proof of concept, we performed the first blood-based whole-genome DNA methylation study at single-base resolution for distinguishing early-stage breast cancer from benign tumors. The results suggest that combining liquid biopsy with traditional diagnostic imaging can improve current clinical practice by reducing the false-positive rate and avoiding unnecessary harm. Supplementary Information The online version contains supplementary material available at 10.1186/s12943-021-01330-w.


Main text
Breast cancer (BC) is the most common cancer in women worldwide [1]. Mammography and ultrasound are routinely used to detect early BC in asymptomatic women but are prone to underestimation or overdiagnosis [1,2]. The Breast Imaging Reporting and Data System (BI-RADS) categories, based on mammograms and ultrasonography, have been used to standardize the risk assessment for breast lesions [3]. However, the risk of malignancy for BI-RADS category 4 lesions varies from 3 to 94%, and this wide range might lead to unnecessary biopsies under current clinical guidelines [4].
Mutation-based circulating tumor DNA (ctDNA) analysis has been used for detecting early relapse, analyzing acquired resistance, and guiding adjuvant therapy in several cancers [5][6][7]. Moreover, combining liquid biopsy with diagnostic imaging has recently been shown to perform better than imaging alone [8]. However, the lack of multiple common mutations in BC has limited the sensitivity of mutation-based ctDNA detection [9]. In contrast, DNA methylation-based markers have been effective for the early detection of many cancer types [10,11]. Most methylation-based studies, however, were designed to detect individuals with cancer among a population of non-conditional healthy controls, using locus-specific or CpG-rich-region technologies [10,11], methylated DNA immunoprecipitation sequencing (MeDIP-seq) [12], or reduced representation bisulfite sequencing (RRBS) [13]. The genome-wide DNA methylation characteristics of cfDNA that distinguish malignant from benign tumors at single-base resolution remain largely unknown.

Study design
Herein, we recruited 210 consecutive female patients with BI-RADS category 4 breast lesions that were biopsied after mammography and ultrasonography examinations at the Cancer Hospital of the Chinese Academy of Medical Sciences and Peking Union Medical College (CHCAMS, n = 160, discovery cohort) and the Harbin Medical University Cancer Hospital (HMUCH, n = 50, validation cohort) from April 1, 2019, to August 31, 2019. This study was reviewed and approved by the ethics committee of each participating hospital, and each participant provided written informed consent. The diagnosis of each patient was based on pathology results from specimens obtained by surgical biopsy or core needle biopsy. Twenty tumor samples (10 malignant and 10 benign) were collected for whole-genome bisulfite sequencing (WGBS) from patients who underwent biopsies at CHCAMS. A median of 3 mL of plasma was collected from all participants (n = 210) before surgery (Fig. 1), and the cfDNA methylome of each participant was also measured by WGBS. A diagnostic model using the identified cfDNA methylation markers, alone or in combination with imaging findings, improved the accuracy of early-stage BC detection (Fig. 2a). Because ctDNA makes up only a small fraction of total cfDNA, we devised a computational framework to boost the detection of cfDNA methylation markers. It consisted of identification of differentially methylated regions (DMRs) from the tumor samples, cfDNA enrichment analysis, fragment size selection, fragment-based statistical inference of cfDNA malignant ratios, and a prediction model of early-stage BC (Fig. 2b; Supplementary methods).

Clinical characteristics of the participants
Six patients in the discovery cohort and one patient in the validation cohort were excluded because of low data quality. A total of 77 patients with BCs and 77 patients with benign lesions from CHCAMS constituted the discovery cohort. Forty-nine patients from HMUCH constituted the validation cohort, 24 of whom had biopsy-confirmed BCs (Fig. 1). The breast lesions were further classified into BI-RADS subcategories 4a, 4b, and 4c, depending on the probability of malignancy. Patients were followed up every 3 months for 6 months using mammograms and ultrasounds until a final diagnosis was made. All of the patients with confirmed BCs were in the early stages of the disease, and most of them were hormone receptor positive/luminal (Tables S1 and S2).

Identification of cfDNA methylation markers
The cfDNA concentrations were surveyed in each cohort, and they could not directly discriminate between malignant and benign tumors (Fig. S1A). The quality and size distribution of the plasma DNA samples were assessed with the Agilent 2100 Bioanalyzer (Agilent, USA). All distributions showed a distinct peak for the cfDNA (Fig. S1B). Furthermore, to remove genomic DNA contamination, only fragments shorter than 500 bp were retained during bead-based library purification. After library preparation and sequencing, we inferred the size profile of cfDNA from the WGBS data (n = 101 malignant and 102 benign), which showed the typical cfDNA size distribution with a prominent mode at 167 bp (Fig. S1C). The cfDNA samples together with the 19 tissue samples yielded a total of 4.0 Tb of WGBS data, covering roughly 88% of the reference genome at an average depth of 11.2× (Table S3).
We found that cfDNA fragments in the WGBS data were enriched in coding and intergenic regions and depleted in gene promoter regions (Fig. 3a). The amount of cfDNA mapped to different genomic regions was negatively correlated with CpG density and GC content (Fig. 3b and Fig. S2). A total of 57,575 DMRs (51,962 hypo-DMRs and 5613 hyper-DMRs) were identified between 9 malignant and 10 benign tumor samples (Fig. S3). As expected, CpG density was significantly higher in hyper-DMRs than in hypo-DMRs (p < 0.0001; Fig. 3c). Accordingly, the average amount of cfDNA fragments in hypo-DMRs was significantly higher than that in hyper-DMRs in all malignant and benign samples (p < 0.0001; Fig. 3d). The depletion of cfDNA fragments in CpG-rich hyper-DMRs could be due to the preferential digestion of open chromatin regions in cfDNA [14]. These findings helped us to focus the search for cfDNA methylation markers on hypomethylated regions. Aberrant hypomethylation, generally in the closed chromatin of cfDNA-rich gene bodies or intergenic regions, is a common feature of various malignant tumors [14]. To ensure that markers could be quantified from well-covered cfDNA, the hypo-DMRs were selected as candidate DNA methylation markers.
The cfDNA malignant ratio was computed for the hypo-DMRs in each patient sample (Supplementary methods). The 10 optimal hypo-DMRs for distinguishing malignant from benign plasma samples were selected using data from the discovery cohort. Most of them were in intergenic regions, and their malignant ratios were significantly higher in patients with malignant tumors than in patients with benign lesions (p < 0.05 for all; Fig. 3e). Nevertheless, the ten methylation markers included four functional genes (RYR2, RYR3, GABRB3, and DCDC2C) and two lncRNAs (AC096570.1 and LINC00923) (Table S4).
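The enrichment comparison between hypo-DMRs and hyper-DMRs can be illustrated with a small sketch. It assumes, following the Fig. 2 caption, that an enrichment score is the mean number of cfDNA fragments per DMR; the overlap rule, the half-open single-chromosome coordinates, the function names, and the toy intervals are all hypothetical and are not taken from the Supplementary methods.

```python
# Minimal sketch (assumptions: half-open [start, end) coordinates on a single
# chromosome; a fragment counts toward a DMR if the two intervals overlap).

def count_overlaps(fragments, dmr):
    """Count fragments that overlap one DMR interval."""
    d_start, d_end = dmr
    return sum(1 for f_start, f_end in fragments
               if f_start < d_end and f_end > d_start)

def enrichment_score(fragments, dmrs):
    """Mean number of overlapping fragments per DMR."""
    if not dmrs:
        raise ValueError("need at least one DMR")
    return sum(count_overlaps(fragments, d) for d in dmrs) / len(dmrs)

# Toy data, invented for illustration only.
fragments = [(100, 267), (150, 317), (400, 560), (900, 1067)]
hypo_dmrs = [(200, 300), (500, 600)]
hyper_dmrs = [(1100, 1200)]

hypo_score = enrichment_score(fragments, hypo_dmrs)
hyper_score = enrichment_score(fragments, hyper_dmrs)
print(hypo_score, hyper_score)
```

In this toy example the hypo-DMRs collect more fragments per region than the hyper-DMR, which is the direction of the difference reported in Fig. 3d. A real pipeline would work per chromosome and would probably normalize for sequencing depth and region length.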
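One way to picture a fragment-level malignant ratio is sketched below. The actual procedure is described only in the Supplementary methods, so this is an illustrative stand-in. It assumes each fragment carries binary methylation calls at the CpGs it covers, that the tissue data give one methylation probability per class for the DMR, and that CpGs are independent. The 150 bp size cutoff, the probabilities, and all names are hypothetical.

```python
import math

def log_likelihood(calls, p_meth):
    """Log-likelihood of 0/1 methylation calls under an independent
    Bernoulli model; p_meth must lie strictly between 0 and 1."""
    return sum(math.log(p_meth if c else 1.0 - p_meth) for c in calls)

def malignant_ratio(fragments, p_malignant, p_benign, max_len=150):
    """Fraction of size-selected fragments that look more malignant than benign.

    max_len is a hypothetical size cutoff; the source only states that ctDNA
    fragments tend to be shorter than non-tumor cfDNA fragments.
    """
    kept = [f for f in fragments if f["length"] <= max_len and f["calls"]]
    if not kept:
        return 0.0
    n_malignant = sum(
        1 for f in kept
        if log_likelihood(f["calls"], p_malignant)
        > log_likelihood(f["calls"], p_benign)
    )
    return n_malignant / len(kept)

# Toy hypo-DMR: low methylation in malignant tissue, high in benign tissue.
toy_fragments = [
    {"length": 140, "calls": [0, 0, 0]},     # mostly unmethylated: malignant-like
    {"length": 145, "calls": [1, 1, 0]},     # mostly methylated: benign-like
    {"length": 170, "calls": [0, 0]},        # removed by size selection
    {"length": 130, "calls": [0, 1, 0, 0]},  # malignant-like
]
ratio = malignant_ratio(toy_fragments, p_malignant=0.2, p_benign=0.8)
print(round(ratio, 3))
```

Classifying whole fragments rather than averaging single CpGs uses the co-methylation pattern within each read, which is useful when tumor-derived fragments are rare.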
Fig. 2 The workflow of data analysis and model development. a The images of standard mammography and ultrasonography were interpreted and classified according to the fifth edition of the Breast Imaging Reporting and Data System (BI-RADS) standard by two experienced radiologists independently at each center. The cfDNA methylome of each participant was measured by whole-genome bisulfite sequencing (WGBS). Additionally, machine learning was applied to identify cfDNA methylation markers of early-stage breast cancers. Finally, a diagnostic model using the identified cfDNA methylation markers in combination with radiology and ultrasound findings was assessed for its ability to improve the accuracy of early-stage breast cancer detection. DCIS, ductal carcinoma in situ. b A comprehensive framework was devised to develop the classifier for identifying early-stage breast cancer based on cfDNA methylation markers from massive cfDNA WGBS fragments. It includes four processes: differentially methylated region (DMR) calling, cfDNA enrichment, cfDNA origin inference, and model development. First, 5613 hyper-DMRs and 51,962 hypo-DMRs were identified from WGBS data of 10 benign and 9 malignant primary breast tissue samples. Second, the cfDNA enrichment scores were computed as the mean number of fragments in DMRs. Then, fragment size selection was conducted to reduce the effect of abundant non-tumor cfDNA in plasma, based on the observation that ctDNA fragments are shorter than non-tumor cfDNA fragments. Next, a fragment-based strategy was devised to statistically infer the origin (malignant or not) of each fragment, based on the distributions of tissue DNA methylation patterns in DMRs. Finally, a predictive score (cfMeth score) based on the cfDNA methylation ratio in each plasma sample was computed using a random forest classifier.

A prediction model for early-stage breast cancer using cfDNA methylation
Using the cfDNA malignant ratios of the markers, a predictive score of cfDNA methylation (cfMeth score) in each plasma sample in the discovery cohort was computed using a random forest classifier. To reduce overfitting, 10-fold cross-validation was used (Fig. S4).
The areas under the curve (AUCs) of the model for the discovery cohort and the validation cohort were 0.89 (95% CI, 0.84-0.94; Fig. 3f) and 0.81 (95% CI, 0.69-0.93; Fig. 3g), respectively, which were superior to the AUCs of BI-RADS findings (AUC = 0.78-0.79 for mammography and ultrasound) and of relevant tumor biomarkers (AUC = 0.57-0.70 for CA15-3 and CEA; Fig. S5). The cfMeth scores in the prediction model were positively and robustly associated with BC stage, including ductal carcinoma in situ (Fig. S6). Additionally, lower cfMeth scores were associated with lower histologic grade and lower proliferation indices (Ki-67 measured by immunohistochemistry; Figs. S6 and S7).
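For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen malignant sample scores higher than a randomly chosen benign sample, with ties counted as one half. The sketch below computes it by direct pairwise comparison. The function name and the toy scores are illustrative assumptions, not study data or the authors' software.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a win (equivalent to the Mann-Whitney U statistic
    divided by the number of pairs)."""
    if not scores_pos or not scores_neg:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy scores, invented for illustration only.
malignant_scores = [0.9, 0.8, 0.4]
benign_scores = [0.7, 0.3, 0.4]
toy_auc = auc(malignant_scores, benign_scores)
print(round(toy_auc, 3))
```

The pairwise form is quadratic in sample size, which is harmless for cohorts of a few hundred patients. Rank-based implementations give the same value faster.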

A combinational model of cfMeth scores and diagnostic imaging
A diagnostic model combining cfMeth scores with mammography and ultrasound was developed in the discovery cohort using ridge logistic regression. The combined model performed better than either of the separate approaches (AUC = 0.94, 95% CI, 0.90-0.97; Fig. 3h). The malignant and benign groups had statistically different distributions of the combined scores in both the discovery and validation cohorts (Fig. S8). The cutoff point of the combined model was chosen so that the false-negative rate was less than 2% in the discovery cohort, which gave a sensitivity of 98.7%, a specificity of 68.8%, and an accuracy of 83.8% (Tables S5 and S6).
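The cutoff rule can be sketched as follows: among candidate cutoffs, take the highest one whose false-negative rate stays below the target, since raising the cutoff can only increase specificity. The function names and toy scores are hypothetical. The patient counts in the final check (76 of 77 malignant and 53 of 77 benign correctly classified) are inferred here from the reported percentages and cohort sizes; they are not stated in the text.

```python
def confusion(scores, labels, cutoff):
    """Return (tp, fn, tn, fp), calling a sample malignant if score >= cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp, fn, tn, fp

def pick_cutoff(scores, labels, max_fnr=0.02):
    """Highest observed score usable as a cutoff with FNR below max_fnr."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one malignant sample")
    best = None
    for c in sorted(set(scores)):
        _, fn, _, _ = confusion(scores, labels, c)
        if fn / n_pos < max_fnr:
            best = c  # cutoffs are visited in ascending order
    return best

# Toy combined scores (1 = malignant, 0 = benign), invented for illustration.
toy_scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff = pick_cutoff(toy_scores, toy_labels)
tp, fn, tn, fp = confusion(toy_scores, toy_labels, cutoff)
print(cutoff, tp / (tp + fn), tn / (tn + fp))

# Consistency check of the reported discovery-cohort metrics, using
# inferred counts (an assumption, see the paragraph above).
sensitivity = round(100 * 76 / 77, 1)
specificity = round(100 * 53 / 77, 1)
accuracy = round(100 * (76 + 53) / 154, 1)
print(sensitivity, specificity, accuracy)
```

Fixing the false-negative rate first reflects the clinical priority of the setting: a missed cancer costs more than an avoidable biopsy, so specificity is maximized only among cutoffs that keep missed malignancies rare.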

Discussion
Liquid biopsies are emerging as a non-invasive adjunct or alternative to standard tumor biopsies [5-7, 10, 11]. Compared with previous locus-specific methylation analyses [10][11][12][13], this study provides an extensive data resource for uncovering cfDNA methylation characteristics at the genome-wide scale in breast cancer. Using WGBS data covering ~88% of the human genome, we have demonstrated that the amount of cfDNA in different genomic regions is negatively correlated with their CpG density.
Compared with a recent study of a cfDNA methylation-based classifier for identifying BCs among healthy female controls [15], this study demonstrated better performance with the combination strategy, which succeeded even at distinguishing malignant breast lesions from benign ones, especially in stages I and II. Clinical use of the combined approach might reduce the number of unnecessary biopsies in women with BI-RADS category 4 findings.
However, this study has some limitations. As a proof-of-concept study, the tissue sample size was relatively small. In addition, the ages at onset were not adequately matched, which might have introduced bias into the selection of methylation markers. In future work, large-scale replication studies with more tumor tissue and cfDNA samples from multiple centers will help to investigate subtype-specific methylation biomarkers and the clinical utility and stability of the combined non-invasive strategy.

Conclusions
In conclusion, we performed a blood-based whole-genome DNA methylation study at single-base resolution for detecting early-stage breast cancer, which suggests that combining liquid biopsy with traditional diagnostic imaging can improve the accuracy of early-stage breast cancer diagnoses.