Large language model produces high accurate diagnosis of cancer from end-motif profiles of cell-free DNA (2024)

Brief Bioinform. 2024 Sep; 25(5): bbae430.

Published online 2024 Sep 2. doi:10.1093/bib/bbae430

PMCID: PMC11367762

PMID: 39222060

Jilei Liu, Hongru Shen, Kexin Chen, and Xiangchun Li

Abstract

Instruction-tuned large language models (LLMs) demonstrate exceptional ability to align with human intentions. We present an LLM-based model, instruction-tuned LLM for assessment of cancer (iLLMAC), that can detect cancer using cell-free deoxyribonucleic acid (cfDNA) end-motif profiles. Developed on plasma cfDNA sequencing data from 1135 cancer patients and 1106 controls across three datasets, iLLMAC achieved an area under the receiver operating characteristic curve (AUROC) of 0.866 [95% confidence interval (CI), 0.773–0.959] for cancer diagnosis and 0.924 (95% CI, 0.841–1.0) for hepatocellular carcinoma (HCC) detection using 16 end-motifs. Performance increased with more motifs, reaching 0.886 (95% CI, 0.794–0.977) and 0.956 (95% CI, 0.89–1.0) for cancer diagnosis and HCC detection, respectively, with 64 end-motifs. On an external-testing set, iLLMAC achieved an AUROC of 0.912 (95% CI, 0.849–0.976) for cancer diagnosis and 0.938 (95% CI, 0.885–0.992) for HCC detection with 64 end-motifs, significantly outperforming benchmarked methods. Furthermore, iLLMAC achieved high classification performance on datasets with bisulfite and 5-hydroxymethylcytosine sequencing. Our study highlights the effectiveness of LLM-based instruction-tuning for cfDNA-based cancer detection.

Keywords: cell-free DNA, early cancer diagnosis, large language models

Introduction

The advent of large language models (LLMs) has revolutionized natural language understanding [1, 2]. LLMs such as Generative Pre-trained Transformer 3 (GPT-3), Large Language Model Meta AI (LLaMA), and Pathways Language Model (PaLM) are continuously setting new benchmarks in this field [3–5]. These foundation models, trained on extensive and diverse datasets, can be fine-tuned for various downstream applications to achieve state-of-the-art performance in natural language understanding tasks [6–9]. However, LLMs often misinterpret instructions and exhibit biases in their generated content [10]. To address these issues, instruction-based tuning has been developed to better align LLMs with human instructions and expectations [11]. This involves curating human demonstrations to fine-tune LLMs, improving their adherence to human instructions and extending their capabilities to tasks such as classification, summarization, and question-answering [12–14]. Despite these advancements, the application of instruction-tuned LLMs in cell-free deoxyribonucleic acid (cfDNA)-based cancer diagnosis has not yet been explored.

cfDNA is mainly released through cell apoptosis, necrosis, and active secretion [15, 16]. Studies have reported that cfDNA fragments are not randomly generated: their size distribution centers around a median of 167 bp, and they exhibit non-random fragmentation patterns [17–19]. These patterns have been exploited for early cancer diagnosis, as patients with cancer show distinctively altered fragmentation profiles compared to healthy individuals [17, 20, 21]. For instance, the size distribution of cfDNA fragments in cancer patients shows greater variability than in individuals without cancer [17]. Based on this finding, Zhang and colleagues reported superior performance of a machine learning method that exploits the size distribution of cfDNA for detecting hepatocellular carcinoma (HCC) [22]. Additionally, the flanking-end sequences (known as end-motifs) of cfDNA fragments have been demonstrated to be a promising marker for detection of HCC. Jiang and colleagues reported that patients with HCC exhibited increased motif diversity scores (MDSs) as compared with individuals without HCC [18]. Specifically, end-motifs such as CCCA, CCAG, and CCTG were found to be more prevalent in patients with HCC [18]. Recently, Jiang and colleagues revealed distinct patterns of cfDNA cleavage by factorizing the end-motif matrix, giving rise to fragmentation profiles that are associated with varying nuclease activity [23]. These fragmentation profiles show high potential for identifying individuals with HCC. Meanwhile, distinctive end-motif patterns are discernible not only in plasma but also in urine samples. Using a tailored bisulfite sequencing methodology, Zhou and colleagues observed that the jagged end index of end-motifs was notably reduced in the urine samples of patients with bladder cancer [24].

Inspired by the success of LLMs in natural language understanding [13, 14], we herein present an LLM-based model, instruction-tuned LLM for assessment of cancer (iLLMAC), that can detect cancer using cfDNA end-motif profiles. We developed this model with cfDNA sequencing data curated from 2451 individuals. The sequencing modalities include whole-genome sequencing, bisulfite sequencing, and 5-hydroxymethylcytosine (5hmC) sequencing. We evaluated the performance of the model in the diagnosis of cancer and detection of HCC with internal- and external-testing sets. We demonstrated that iLLMAC is able to achieve high detection accuracy on different modalities of cfDNA data. Besides the development of iLLMAC, our study presents a new method for cfDNA-based cancer diagnosis.

Results

An overview on the development of instruction-tuned large language model for assessment of cancer

The workflow to develop iLLMAC is depicted in Fig. 1. The whole process consists of three stages: preparation of training and testing datasets (Fig. 1A), instruction-tuning of the open-source foundation model LLaMA to obtain iLLMAC (Fig. 1B), and evaluation of iLLMAC in detection of cancer (Fig. 1C). We obtained a frequency matrix $A \in \mathbb{R}^{256 \times n}$ for the 4-mer end-motifs, where n is the number of data points. Subsequently, we ranked the end-motifs in descending order of frequency and concatenated them into an end-motif sentence. To obtain the fragmentation profile sentences, we performed non-negative matrix factorization (NMF) on matrix $A$. The fragmentation profile sentence was derived from matrix $H$ (see Materials and methods). We also calculated MDS for each data point (see Materials and methods). Finally, each data point has four attributes: end-motif sentence (denoted as e), fragmentation profile sentence (denoted as f), MDS (denoted as s), and diagnostic label (denoted as l). In the training set, we concatenated e, f, s, and l according to the template specified in Fig. 1B as training data points. In the testing set, we excluded l and concatenated only e, f, and s as input context c. We next instruction-tuned the foundation model LLaMA with this curated training set. Instruction-tuning models the conditional probability of l given the concatenated sentence of e, f, and s as context (see Materials and methods), giving rise to iLLMAC. In the evaluation stage, we provided iLLMAC with the input context c and let it generate the predicted diagnostic label (denoted as l*). By comparing l and l*, we can measure its performance.

Figure 1

A flowchart depicting the development and validation of iLLMAC. This flowchart illustrates types of fragmentome data, development of iLLMAC via instruction-tuning demonstrations and evaluation of iLLMAC.

High performance of instruction-tuned large language model for assessment of cancer

The whole-genome sequencing (WGS) testing dataset consists of 24 non-cancerous controls and 20 cancer patients diagnosed with HCC, 6 with nasopharyngeal carcinoma, 6 with lung cancer, 4 with head and neck squamous cell carcinoma, and 3 with colorectal cancer. We observed that iLLMAC outperformed benchmarked methods such as MDS and NMF by significant margins in diagnosis of cancer (Fig. 2A) and in detection of HCC (Fig. 2B). The performance of iLLMAC gradually increased as more end-motifs were included (Fig. 2). Specifically, iLLMAC achieved an AUROC value of 0.886 [95% confidence interval (CI), 0.794–0.977], accuracy of 0.857 (0.746–0.933), sensitivity of 0.846 (0.695–0.941), and specificity of 0.875 (0.676–0.973) in diagnosis of cancer, outperforming the best benchmarked method [MDS, 0.853 (0.76–0.945), 0.794 (0.673–0.885), 0.692 (0.524–0.830), and 0.958 (0.789–0.999), respectively]. In the detection of HCC, iLLMAC achieved an AUROC value of 0.956 (0.890–1.0), accuracy of 0.937 (0.845–0.982), sensitivity of 0.900 (0.683–0.988), and specificity of 0.953 (0.842–0.994), also outperforming the best benchmarked method [NMF, 0.917 (0.834–1.0), 0.905 (0.804–0.964), 0.750 (0.509–0.913), and 0.977 (0.877–0.999), respectively]. The classification metrics such as accuracy, specificity, and sensitivity are provided in Tables 1 and 2.

Figure 2

The ROC curves of iLLMAC with different numbers of end-motifs and of benchmarked baseline methods in the diagnosis of cancer (A) and detection of HCC (B). iLLMAC-16, iLLMAC-32, and iLLMAC-64 indicate that iLLMAC was evaluated with 16, 32, and 64 end-motifs, respectively.

Table 1

Classification metrics of iLLMAC and baseline methods in diagnosis of cancer on the internal-testing set

Method          Accuracy               Sensitivity            Specificity            F1
MDS             0.794 (0.673–0.885)    0.692 (0.524–0.830)    0.958 (0.789–0.999)    0.806
NMF             0.746 (0.621–0.847)    0.692 (0.524–0.830)    0.833 (0.626–0.953)    0.771
DELFI           0.762 (0.638–0.860)    0.692 (0.524–0.830)    0.875 (0.676–0.973)    0.783
iLLMAC-16 (a)   0.810 (0.691–0.898)    0.795 (0.635–0.907)    0.833 (0.626–0.953)    0.838
iLLMAC-32 (a)   0.825 (0.709–0.909)    0.821 (0.665–0.925)    0.833 (0.626–0.953)    0.853
iLLMAC-64 (a)   0.857 (0.746–0.933)    0.846 (0.695–0.941)    0.875 (0.676–0.973)    0.88

(a) iLLMAC-k: iLLMAC with k end-motifs. iLLMAC-16/-32/-64: iLLMAC with 16, 32, and 64 end-motifs, respectively. Bold values represent the highest-performing method for each metric.

Table 2

Classification metrics of iLLMAC and baseline methods in detection of HCC on the internal-testing set

Method          Accuracy               Sensitivity            Specificity            F1
MDS             0.810 (0.691–0.898)    0.500 (0.272–0.728)    0.953 (0.842–0.994)    0.625
NMF             0.905 (0.804–0.964)    0.750 (0.509–0.913)    0.977 (0.877–0.999)    0.833
DELFI           0.889 (0.784–0.954)    0.800 (0.563–0.943)    0.930 (0.809–0.985)    0.821
iLLMAC-16 (a)   0.921 (0.824–0.974)    0.900 (0.683–0.988)    0.930 (0.809–0.985)    0.878
iLLMAC-32 (a)   0.952 (0.867–0.990)    0.900 (0.683–0.988)    0.977 (0.877–0.999)    0.923
iLLMAC-64 (a)   0.937 (0.845–0.982)    0.900 (0.683–0.988)    0.953 (0.842–0.994)    0.9

(a) iLLMAC-k: iLLMAC with k end-motifs. iLLMAC-16/-32/-64: iLLMAC with 16, 32, and 64 end-motifs, respectively. Bold values represent the highest-performing method for each metric.

To further verify the performance of iLLMAC, we generated an external-testing set subjected to WGS that consists of 32 patients with HCC, 2 patients with cervical cancer, 2 patients with colorectal cancer, 3 patients with esophageal cancer, 3 patients with ovarian cancer, 1 patient with head and neck squamous cell carcinoma, 1 patient with lung cancer, and 30 individuals without cancer from Tianjin Cancer Hospital (see Materials and methods). On our external-testing set, we observed that iLLMAC achieved an AUROC of 0.912 (0.849–0.976) in diagnosis of cancer (Fig. 3A), outperforming benchmarked methods [NMF, 0.751 (0.631–0.871); MDS, 0.581 (0.438–0.724); DeLong’s test, P < 0.001]. The accuracy, sensitivity, and specificity achieved by iLLMAC were 0.865 (0.765–0.933), 1.000 (0.920–1.000), and 0.667 (0.472–0.827), respectively. In the task of HCC detection, iLLMAC achieved an AUROC value of 0.938 (0.885–0.992), which is also higher than the benchmarked methods [NMF, 0.538 (0.437–0.728); MDS, 0.571 (0.419–0.722); DeLong’s test, P < 0.001; Fig. 3B]. The accuracy, sensitivity, and specificity achieved by iLLMAC were 0.905 (0.815–0.961), 0.844 (0.672–0.947), and 0.952 (0.838–0.994), respectively. Detailed metrics for classification are listed in Table 3.

Figure 3

The ROC curves of iLLMAC, NMF, and MDS on our external-testing set in the diagnosis of cancer (A) and detection of HCC (B). iLLMAC-64 indicates that iLLMAC was evaluated with 64 end-motifs.

Table 3

Classification metrics of iLLMAC in diagnosis of cancer and detection of HCC on our external-testing set.

Task                  Method          Accuracy               Sensitivity            Specificity            F1
Diagnosis of cancer   NMF             0.784 (0.673–0.871)    0.636 (0.478–0.776)    1.000 (0.884–1.000)    0.778
Diagnosis of cancer   MDS             0.284 (0.185–0.401)    0.477 (0.325–0.633)    0.000 (0.000–0.116)    0.442
Diagnosis of cancer   iLLMAC-64 (a)   0.865 (0.765–0.933)    1.000 (0.920–1.000)    0.667 (0.472–0.827)    0.898
HCC detection         NMF             0.703 (0.585–0.803)    0.656 (0.468–0.814)    0.738 (0.580–0.861)    0.656
HCC detection         MDS             0.716 (0.599–0.815)    0.469 (0.291–0.653)    0.905 (0.774–0.973)    0.588
HCC detection         iLLMAC-64 (a)   0.905 (0.815–0.961)    0.844 (0.672–0.947)    0.952 (0.838–0.994)    0.885

(a) iLLMAC-k: iLLMAC with k end-motifs. iLLMAC-64: iLLMAC with 64 end-motifs. Bold values represent the highest-performing method for each metric.

Instruction-tuned large language model for assessment of cancer on datasets subjected to bisulfite and 5-hydroxymethylcytosine sequencing

On three internal-testing datasets including HCC, non-small cell lung cancer (NSCLC), and healthy controls subjected to bisulfite sequencing and 5hmC sequencing (see Materials and methods), iLLMAC also showed a strong ability to distinguish cancer patients from non-cancerous controls. For instance, it achieved an AUROC value of 0.993 (95% CI, 0.989–1.0) on the dataset subjected to targeted bisulfite sequencing, 0.870 (95% CI, 0.716–1.0) on the dataset subjected to whole-genome bisulfite sequencing, and 0.848 (95% CI, 0.752–0.945) on the dataset subjected to 5hmC sequencing (Fig. 4). The other classification metrics such as accuracy, specificity, and sensitivity are provided in Supplementary Table 1. This demonstrates that iLLMAC is robust to the different data modalities generated by different sequencing types.

Figure 4

ROC curves of iLLMAC on three datasets subjected to targeted-bisulfite sequencing (A), whole-genome bisulfite sequencing (B), and 5hmC sequencing (C).

Ablation test of instruction-tuned large language model for assessment of cancer

The input for iLLMAC has three components: end-motif sentence, MDS, and fragmentation profile derived from NMF. As MDS and fragmentation profile can be calculated from end-motif counts, we performed an ablation test by removing MDS and fragmentation profile from the input. On the internal-testing dataset, the performance achieved by iLLMAC in diagnosis of cancer and HCC was minimally affected (Fig. 5A and B). On our external-testing set, AUROC values achieved by iLLMAC decreased from 0.912 (0.849–0.976) to 0.864 (0.78–0.949) and from 0.938 (0.885–0.992) to 0.91 (0.824–0.996) in the diagnosis of cancer and HCC detection, respectively. Moreover, on the three additional datasets subjected to bisulfite and 5hmC sequencing, AUROC values of iLLMAC did not change significantly (Supplementary Fig. 1). For instance, on the dataset subjected to targeted bisulfite sequencing and the dataset subjected to whole-genome bisulfite sequencing, AUROC values achieved by iLLMAC remained unchanged. On the dataset subjected to 5hmC sequencing, the AUROC value achieved by iLLMAC decreased from 0.848 (0.752–0.945) to 0.809 (0.704–0.914). The classification metrics for these ablation test results are provided in Supplementary Tables 2 and 3.

Figure 5

Ablation study of iLLMAC in the diagnosis of cancer and HCC detection on WGS internal-testing set (A and B) and external-testing set (C and D).

Discussion

We presented a promising LLM for cancer diagnosis via instruction-tuning of a natural language foundation model with demonstrations compiled from 2451 plasma cfDNA samples subjected to WGS, bisulfite, and 5hmC sequencing. Comprehensive evaluation confirmed that this model achieved high classification performance in the diagnosis of cancer and HCC. This study presents a new method for cfDNA-based cancer diagnosis and offers new insights into the utilization of LLMs in translational oncology.

The fragmentation patterns buried in cfDNA have been identified as useful liquid-biopsy markers in diagnosis of cancer [15–17]. Traditional cfDNA-based diagnosis of cancer can be categorized into alignment-based machine learning methods and alignment-free statistical methods. Our method belongs to neither of these two categories and can be considered an alignment-free deep-learning-based method. DNA evaluation of fragments for early interception (DELFI), developed by Cristiano and colleagues to incorporate multiple features of cfDNA fragmentome data, is one of the most frequently used alignment-based machine learning methods. Computationally intensive steps such as read mapping and subsequent detection of copy number variations are required to use DELFI. In contrast, our method does not require these computationally intensive steps. The alignment-free statistical methods include MDS and NMF-based decomposition of the end-motif frequency matrix. These statistical methods assume independence among end-motifs, whereas our method does not require such an assumption but lets the LLM capture their co-occurrence. Meanwhile, we observed that iLLMAC achieved comparable or better performance as compared with DELFI, MDS, and NMF on the datasets examined.

iLLMAC exhibits several advantages. Firstly, it runs very fast during inference and can provide an end-to-end diagnostic result. We noted that iLLMAC can process an input on a laptop in 3 s. Secondly, iLLMAC achieved consistently higher and more stable classification performance. For example, with respect to the diagnosis of cancer and HCC detection on the internal-testing dataset, iLLMAC achieved respective AUROC values of 0.886 and 0.956, whereas these values are 0.781 and 0.924 for DELFI, 0.81 and 0.917 for NMF, and 0.853 and 0.733 for MDS. Thirdly, iLLMAC is data-agnostic; therefore, it can process different types of data in exactly the same way, whereas alignment-based methods require different data processing strategies. Fourthly, the development of iLLMAC is computationally efficient and quite simple: instruction-tuning LLaMA for one epoch on our curated demonstration training set is sufficient. This process took 5 hours on a single DGX AI machine that has 8 A100 graphics processing units (GPUs), each with 40 GB of memory.

Our method is not without limitations. Only cfDNA end-motif profiles were used to develop iLLMAC, so other cancer signals such as gene mutations and genomic copy number alterations cannot be exploited. Therefore, there is a potential risk of compromising the diagnostic capability of the model. Furthermore, clinical parameters such as age and tumor stage are not available for the collected datasets, precluding examination of iLLMAC’s performance with respect to these factors. Additionally, due to the limited computational resources of our server, we could only instruction-tune the LLaMA model with 7 billion parameters and were not able to instruction-tune larger LLaMA models, such as those with 13, 33, and 65 billion parameters. Instruction-tuning with larger model capacities is expected to increase the performance.

In the context of cancer detection, plasma-derived cfDNA is preferred over serum due to its lower background of wild-type DNA, which enhances the sensitivity for detecting tumor-specific alterations known as circulating tumor DNA [25]. Plasma allows for a clearer distinction of cancer-related genetic changes by reducing the potential masking effect caused by the release of DNA from white blood cells during the clotting process in serum [26]. Additionally, plasma offers a better limit of detection for specific variants and contributes to the standardization of methodologies, making results more comparable and reliable across different studies and clinical settings [27]. The pre-analytical stability of cfDNA in plasma is another advantage, as it can be processed within a specific time frame after blood collection without significant changes in cfDNA levels, which is essential for clinical applications where immediate processing may not be feasible.

In summary, we presented an LLM called iLLMAC for cfDNA-based cancer diagnosis and demonstrated its superior performance on different datasets. Our study opens a new avenue for exploiting LLMs in the field of liquid-biopsy-based early cancer diagnosis.

Materials and methods

Public datasets

We collected the plasma sequencing data of 2451 individuals that were subjected to WGS, bisulfite sequencing, and 5hmC from The Sequence Read Archive (SRA), National Genomics Data Center (NGDC), and European Genome-phenome Archive (EGA). The accession numbers of these datasets are EGAS00001003409 [18], PRJNA574555 [28], and PRJCA000816 [29].

‘EGAS00001003409’ contains plasma cfDNA data from 188 individuals that were subjected to WGS and whole-genome bisulfite sequencing (WGBS). The 129 individuals subjected to WGS include 34 patients with HCC, 10 with head and neck squamous cell carcinoma, 10 with CRC, 10 with lung cancer, 10 with nasopharyngeal carcinoma, and 55 individuals without cancer. In addition, 59 of these 188 individuals were subjected to whole-genome bisulfite sequencing, including 34 patients with HCC and 25 individuals without cancer. We randomly split this dataset into a training set and an internal-testing set. The WGS training set consists of 66 samples and the internal-testing set consists of 63 samples; the WGBS training set consists of 28 samples and the internal-testing set consists of 31 samples.

‘Dataset PRJNA574555’ contains 1171 patients with HCC and 959 healthy individuals that were subjected to targeted-bisulfite sequencing. We randomly split them into a training set (n = 1072) and a testing set (n = 1058).

‘Dataset PRJCA000816’ consists of 66 patients with cancer and 67 healthy controls whose plasma cfDNA was subjected to 5hmC sequencing. We randomly split them into a training set (n = 60) and a testing set (n = 73).

Information about these four datasets is provided in Supplementary Table 4. We trained iLLMAC by integrating all training sets of these datasets and subsequently evaluated its performance separately on each testing set.

Our external-testing set

We compiled a dataset of 74 individuals as the external-testing set. This dataset consisted of 32 patients with HCC, 2 patients with cervical cancer, 2 patients with colorectal cancer, 3 patients with esophageal cancer, 3 patients with ovarian cancer, 1 patient with head and neck squamous cell carcinoma, 1 patient with lung cancer, and 30 healthy controls. The plasma samples of these individuals were from Cancer Biobank of Tianjin Medical University Cancer Institute and Hospital. This study was approved by the Ethics Committees of Tianjin Cancer Hospital (Institutional Review Board approval number: EK20240105) and all participants provided written informed consent.

Sample collection, preparation, and sequencing

A 5-mL sample of whole blood was collected from each patient in an ethylenediaminetetraacetic acid (EDTA) tube and processed immediately. The plasma and cellular components were separated by centrifugation at 1600 × g for 10 min at 4°C. Plasma was further centrifuged for 10 min at 16000 × g at 4°C to remove any remaining cellular debris and then stored at −80°C. The cfDNA of plasma was extracted with a QIAamp® Circulating Nucleic Acid Kit (cat. no. 55114).

The concentration of extracted cfDNA was quantified by a Qubit Fluorometer (Thermo Fisher Scientific, USA) and the size distribution was measured using a Qsep-400 (Bioptic). The total cfDNA of each plasma sample was used as input for library preparation with the VAHTS Universal DNA Library Prep Kit for MGI (Vazyme-Tech, Nanjing, China).

cfDNA isolation and construction of the WGS library were both performed using the MGISP-960 High-Throughput Automated Sample Preparation System (MGI-Tech) according to the manufacturer’s protocol. Briefly, purified cfDNA was subjected to end-repairing, A-tailing, ligation, polymerase chain reaction (PCR) amplification, and single-strand circularization. All single-strand circular DNA libraries were sequenced on the MGISEQ-2000 platform (MGI-Tech) with paired-end reads to generate approximately 15 Gb of whole-genome data for each sample.

Preparation of instruction-tuning data

For each sample, we randomly selected 1 million reads and divided them into 10 groups, so that each group has 0.1 million reads. For each group, we counted the frequencies of the 256 4-mers at the 5′-end of sequenced cfDNA reads, giving rise to a vector a denoting the end-motif frequencies. From these vectors, we obtained an end-motif frequency matrix A for the training set.
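The counting step can be illustrated with a minimal sketch. It assumes that reads are available as plain strings in 5′-to-3′ orientation and that reads whose first four bases include an ambiguous base are skipped; the function name `end_motif_frequencies` and the toy reads are hypothetical and not taken from the authors' pipeline.

```python
from collections import Counter
from itertools import product

# All 256 possible 4-mers over the DNA alphabet, in a fixed order.
ALL_4MERS = ["".join(p) for p in product("ACGT", repeat=4)]


def end_motif_frequencies(reads):
    """Return the relative frequency of each 5'-end 4-mer across the reads."""
    counts = Counter()
    for read in reads:
        motif = read[:4].upper()
        # Assumption: skip reads that are too short or start with ambiguous bases (e.g. N).
        if len(motif) == 4 and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    return {m: (counts[m] / total if total else 0.0) for m in ALL_4MERS}


# Toy group of reads; a real group would hold 0.1 million reads.
reads = ["CCCAGGTT", "CCCATTGA", "CCAGTTTT", "NNCATTTT"]
freqs = end_motif_frequencies(reads)
```

Each call yields one frequency vector; collecting one vector per group of reads gives the columns (or rows, depending on orientation) of the frequency matrix.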

End-motif sentence (denoted as e)

To obtain the end-motif sentence, we sorted vector a in descending order. Therefore, we obtained a sorted list of end-motifs $\{m_1, m_2, \ldots, m_k\}$, where $m_i$ is the name of the ith end-motif and the frequency of $m_i \geq m_{i+1}$. We set $k$ to 64 to obtain the top-ranking 64 end-motifs and concatenated them into a sentence separated by white space.
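As an illustration of this ranking-and-concatenation step, the sketch below takes a mapping from motif name to frequency and returns the sentence. The function name `end_motif_sentence` is hypothetical, and breaking ties alphabetically is an assumption; the text does not say how ties are ordered.

```python
def end_motif_sentence(freqs, k=64):
    """Join the k most frequent end-motifs into a whitespace-separated sentence."""
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    ranked = sorted(freqs, key=lambda motif: (-freqs[motif], motif))
    return " ".join(ranked[:k])


toy_freqs = {"CCCA": 0.5, "CCAG": 0.3, "CCTG": 0.2, "AAAA": 0.0}
sentence = end_motif_sentence(toy_freqs, k=3)
print(sentence)  # CCCA CCAG CCTG
```

Note that the sentence keeps only the rank order of the motifs, not their numeric frequencies.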

MDS (denoted as s)

It describes the distribution of frequencies of cfDNA end-motifs and is defined as:

$$\mathrm{MDS} = \sum_{i=1}^{256} -P_i \log(P_i) / \log(256)$$

where $P_i$ is the frequency of the ith end-motif.
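A short sketch of the score follows, assuming the normalized Shannon entropy form of MDS introduced by Jiang and colleagues [18], in which motifs with zero frequency contribute nothing to the sum. The function name `motif_diversity_score` is hypothetical.

```python
import math


def motif_diversity_score(freqs):
    """Normalized Shannon entropy of an end-motif frequency vector."""
    # Zero-frequency motifs are skipped, following the convention 0 * log(0) = 0.
    entropy = -sum(p * math.log(p) for p in freqs if p > 0)
    return entropy / math.log(len(freqs))


uniform = [1 / 256] * 256      # every motif equally frequent: maximal diversity
single = [1.0] + [0.0] * 255   # one motif only: no diversity
print(motif_diversity_score(uniform), motif_diversity_score(single))
```

Under this definition the score ranges from 0 (a single motif) to 1 (all 256 motifs equally frequent).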

Fragmentation profile sentence (denoted as f)

To obtain the fragmentation profile, we performed NMF deconvolution for the end-motif frequency matrix A obtained from the training set. Specifically, A was factorized into two non-negative matrices W and H according to:

$$A \approx W \times H$$

According to Zhou and colleagues [30], W can be considered as the fragmentomic signatures and H as the fragmentation profile. We used the nmf algorithm implemented in the NMF R package to perform NMF decomposition of A. The optimal rank was set to 4 according to the cophenetic coefficient and sparsity score (Supplementary Fig. 2). Let $H_{:,i}$ represent the ith column of H. Suppose the fragmentation profile values are $H_{:,i} = \{Q_0, Q_1, Q_2, Q_3\}$ with $Q_1 \geq Q_0 \geq Q_2 \geq Q_3$; the fragmentation profile sentence is then represented as ‘Fragmentation profile: Q1 Q0 Q2 Q3’.
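The factorization and the sentence construction can be sketched together. The study used the NMF R package; the code below swaps in plain Lee-Seung multiplicative updates in pure Python only to keep the example dependency-free, so it is not the authors' implementation. It also assumes that the sentence lists component labels (Q0 to Q3) in descending order of their values rather than the numeric values themselves. All function names and the toy matrix are hypothetical.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frobenius_error(A, W, H):
    """Frobenius norm of A - W H."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5


def nmf(A, rank, n_iter=200, seed=0, eps=1e-12):
    """Factorize a non-negative matrix A (motifs x data points) into W and H."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(len(A))]
    H = [[rng.random() + 0.1 for _ in range(len(A[0]))] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T A) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


def fragmentation_profile_sentence(column):
    """Rank component labels Q0..Q(r-1) by descending value in one column of H."""
    order = sorted(range(len(column)), key=lambda j: (-column[j], j))
    return "Fragmentation profile: " + " ".join(f"Q{j}" for j in order)


# Toy 6 x 5 non-negative matrix standing in for the 256 x n frequency matrix.
toy_rng = random.Random(1)
A = [[toy_rng.random() for _ in range(5)] for _ in range(6)]
W, H = nmf(A, rank=4)
first_column = [row[0] for row in H]
print(fragmentation_profile_sentence(first_column))
```

For a column with values 0.30, 0.45, 0.15, and 0.10, the helper returns ‘Fragmentation profile: Q1 Q0 Q2 Q3’, matching the example in the text.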

Diagnostic label (denoted as l)

It is a sentence description of the sample. For example, ‘it is cancer’ or ‘it is normal control’.

Eventually, each data point has four attributes: end-motif sentence (denoted as e), fragmentation profile sentence (denoted as f), MDS (denoted as s), and diagnostic label (denoted as l). In the training set, we concatenated e, f, s, and l according to the template specified in Fig. 1B as training data points to develop iLLMAC. In the testing set, we excluded l but simply concatenated e, f, and s as input context c.

In the evaluation stage, we provided iLLMAC with the input context c and let it generate the predicted diagnostic label (denoted as l*). By comparing l and l*, we can measure its performance. For patient-level prediction, we use the fraction of predicted cancer labels as the final prediction score. For example, suppose that n data points are collected from a given individual and k of them are predicted to have cancer label; therefore, the predicted score is calculated as p = k/n.
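The patient-level score p = k/n can be sketched as follows. The sketch assumes that each generated label is compared with the string ‘it is cancer’ after trimming and lower-casing; how the generated text is actually parsed is not stated in the source, and the function name `patient_score` is hypothetical.

```python
def patient_score(predicted_labels, positive="it is cancer"):
    """Fraction of a patient's data points whose predicted label is the cancer label."""
    n = len(predicted_labels)
    if n == 0:
        raise ValueError("at least one prediction is required")
    # Assumption: a prediction counts as cancer only on an exact (case-insensitive) match.
    k = sum(label.strip().lower() == positive for label in predicted_labels)
    return k / n


# Ten data points from one individual: seven predicted as cancer, three as control.
labels = ["it is cancer"] * 7 + ["it is normal control"] * 3
print(patient_score(labels))  # 0.7
```

Because each sample is split into 10 groups of reads, the score can take values in steps of 0.1, which is what allows an ROC curve to be drawn from discrete generated labels.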

Development of instruction-tuned large language model for assessment of cancer

We instruction-tuned the open-source LLaMA model that has approximately 7 billion parameters. The checkpoint of this pretrained model was downloaded from https://huggingface.co/meta-llama/Llama-2-7b-hf. LLaMA is a foundation model for natural language understanding built upon the transformer decoder. It works by taking a sequence of words as input and predicting the next word, recursively generating text. Each input word is embedded into a 4096-dimensional vector and subsequently processed through 32 layers of transformer decoder blocks to learn the contextual relationships among different words. We instruction-tuned LLaMA with our demonstration data by using the Adam optimizer, a batch size of 16, a weight decay of 0.01, and an initial learning rate of 2e-5 for one epoch. The learning rate was warmed up for 3% of the steps and then decreased towards zero following a cosine schedule. This model was trained with PyTorch (version 1.7.1) and transformers (version 4.21.1) on an NVIDIA DGX A100 with 8 GPUs, each with 40 GB of memory. The input sequence length was set to 64.
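The learning-rate schedule described here can be written out as a small function. This is a sketch under the assumption that the warm-up is linear and that the cosine decay ends exactly at zero on the last step; the function name `learning_rate` and the step count of 1000 are illustrative and not taken from the source.

```python
import math


def learning_rate(step, total_steps, peak_lr=2e-5, warmup_fraction=0.03):
    """Linear warm-up to peak_lr, then cosine decay towards zero."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Assumption: the rate rises linearly during the first 3% of steps.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps in one epoch
schedule = [learning_rate(s, total) for s in range(total + 1)]
```

With 1000 steps, the rate reaches its peak of 2e-5 after 30 steps and then falls smoothly to zero by the final step.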

Deoxyribonucleic acid evaluation of fragments for early interception

The DELFI method was proposed by Cristiano and colleagues [17] to analyze genome-wide cell-free DNA fragmentation for cancer detection. We followed the pipeline curated by the authors deposited at https://github.com/Cancer-Genomics/delfi_scripts. Specifically, we used bwa (v0.7.17) for sequence alignment, samtools (v0.1.19) for sorting and deduplicating the alignment files, and bedtools (v2.30.0) for producing bed files. R packages, such as GenomicRanges (v1.50.0), GenomicAlignments (v1.34.0), and tidyverse (v2.0.0), were used to get cfDNA fragmentome and copy number features; caret (v6.0–93) was used for data preprocessing and building gradient boosting machine classifier.

Statistical analysis

We conducted our experiments with Python (v3.7.10), R (v4.2.1), ggplot2 (v3.3.6), and pROC (v1.18.0). The 95% CIs of AUROC values were calculated by using DeLong’s method implemented in pROC. We calculated accuracy, sensitivity, and specificity by using the R software package caret (v6.0–93). The 95% CIs for accuracy, sensitivity, and specificity were calculated by using the Clopper–Pearson method [31].
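The Clopper–Pearson interval can be sketched without external libraries by inverting the binomial distribution with bisection. The study computed these intervals in R; the Python below is only an illustration of the same exact method, and the function names are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target, iters=100):
    """Find p in [0, 1] with f(p) = target, for f increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a proportion k / n."""
    # Lower bound: p such that P(X >= k) = alpha / 2 (0 when k == 0).
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: p such that P(X <= k) = alpha / 2 (1 when k == n).
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


# 30 of 30 controls correctly classified, i.e. a specificity of 1.000.
low, high = clopper_pearson(30, 30)
print(round(low, 3), high)  # 0.884 1.0
```

The interval for 30 of 30 agrees with the specificity interval of 1.000 (0.884–1.000) reported for NMF on the external-testing set in Table 3.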

Key points

  • The study introduces an instruction-tuned LLM called iLLMAC, designed for cancer detection using cfDNA end-motif profiles.

  • iLLMAC demonstrates superior performance compared to existing methods like MDS, NMF, and DELFI, achieving high AUROC scores in both internal and external testing sets for cancer diagnosis and HCC detection.

  • The model maintains high classification accuracy across different sequencing types, including WGS, whole-genome/targeted bisulfite sequencing, and 5hmC.

Supplementary Material

Supplementary_figures_and_tables_bbae430

Acknowledgements

This work was supported by the Cancer Biobank of Tianjin Medical University Cancer Institute and Hospital. We are grateful to the researchers who generously made their data publicly available.

Conflicts of interest: The authors declare that they have no conflict of interest.

Contributor Information

Jilei Liu, Key Laboratory of Cancer Prevention and Therapy, Tianjin Cancer Institute, Tianjin’s Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center for Cancer, Tianjin Medical University, Tianjin, 300060, China.

Hongru Shen, Key Laboratory of Cancer Prevention and Therapy, Tianjin Cancer Institute, Tianjin’s Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center for Cancer, Tianjin Medical University, Tianjin, 300060, China.

Kexin Chen, Department of Epidemiology and Biostatistics, Key Laboratory of Molecular Cancer Epidemiology of Tianjin, Key Laboratory of Cancer Prevention and Therapy, Tianjin’s Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center for Cancer, Tianjin Medical University, Tianjin, 300060, China.

Xiangchun Li, Key Laboratory of Cancer Prevention and Therapy, Tianjin Cancer Institute, Tianjin’s Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center for Cancer, Tianjin Medical University, Tianjin, 300060, China.

Funding

This work was supported by the National Key Research and Development Program of China (Grant No. 2021YFC2500400 to K.C.), the National Natural Science Foundation of China (Grant Nos. 32270688 and 31801117 to X.L.), and the Program for Changjiang Scholars and Innovative Research Team in University in China (Grant No. IRT_14R40 to K.C.). This work was also funded by the Tianjin Key Medical Discipline (Specialty) Construction Project (TJYXZDXK-009A).

Code availability

Code and demo data are available at https://github.com/deeplearningplus/iLLMAC.

Data availability

Data are publicly available from the EGA database (accession EGAS00001003409), the Sequence Read Archive (accession PRJNA574555), and the NGDC database (accession PRJCA000816).

Author contributions

Xiangchun Li and Kexin Chen designed and supervised the study; Xiangchun Li and Jilei Liu performed data analysis and wrote the manuscript; Xiangchun Li and Jilei Liu developed the model; Xiangchun Li, Hongru Shen, and Jilei Liu collected data. Jilei Liu, Xiangchun Li, and Kexin Chen revised the manuscript.

References

1. Workshop B, Scao TL, Fan A, et al. Bloom: a 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100. 2022.

2. Ziegler DM, Stiennon N, Wu J, et al. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593. 2019.

3. Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Adv Neural Inf Process Syst 2020;33:1877–901.

4. Touvron H, Lavril T, Izacard G, et al. Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971. 2023.

5. Chowdhery A, Narang S, Devlin J, et al. Palm: scaling language modeling with pathways. J Mach Learn Res 2023;24:1–113.

6. Moor M, Banerjee O, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature 2023;616:259–65. 10.1038/s41586-023-05881-4.

7. Bommasani R, Hudson DA, Adeli E, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. 2021.

8. Devlin J, Chang M-W, Lee K, et al. Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.

9. Radford A, Narasimhan K. Improving language understanding by generative pre-training. 2018. Preprint at https://s3-us-west-2.amazonaws.com/openaiassets/research-covers/language-unsupervised/language_understanding_paper.pdf.

10. Wu J, Yang S, Zhan R, et al. A survey on llm-generated text detection: necessity, methods, and future directions. arXiv preprint arXiv:2310.14724. 2023.

11. Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst 2022;35:27730–44.

12. Peng B, Li C, He P, et al. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277. 2023.

13. Chung HW, Hou L, Longpre S, et al. Scaling instruction-finetuned language models. J Mach Learn Res 2024;25:1–53.

14. Wei J, Bosma M, Zhao VY, et al. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652. 2021.

15. Gao Q, Zeng Q, Wang Z, et al. Circulating cell-free DNA for cancer early detection. The Innovation 2022;3:100259. 10.1016/j.xinn.2022.100259.

16. Lo YMD, Han DSC, Jiang P, et al. Epigenetics, fragmentomics, and topology of cell-free DNA in liquid biopsies. Science 2021;372:eaaw3616. 10.1126/science.aaw3616.

17. Cristiano S, Leal A, Phallen J, et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 2019;570:385–9. 10.1038/s41586-019-1272-6.

18. Jiang P, Sun K, Peng W, et al. Plasma DNA end-motif profiling as a fragmentomic marker in cancer, pregnancy, and transplantation. Cancer Discov 2020;10:664–73. 10.1158/2159-8290.CD-19-0622.

19. Snyder MW, Kircher M, Hill AJ, et al. Cell-free DNA comprises an in vivo nucleosome footprint that informs its tissues-of-origin. Cell 2016;164:57–68. 10.1016/j.cell.2015.11.050.

20. Foda ZH, Annapragada AV, Boyapati K, et al. Detecting liver cancer using cell-free DNA fragmentomes. Cancer Discov 2023;13:616–31. 10.1158/2159-8290.CD-22-0659.

21. Mathios D, Johansen JS, Cristiano S, et al. Detection and characterization of lung cancer using cell-free DNA fragmentomes. Nat Commun 2021;12:5060. 10.1038/s41467-021-24994-w.

22. Zhang X, Wang Z, Tang W, et al. Ultrasensitive and affordable assay for early detection of primary liver cancer using plasma cell-free DNA fragmentomics. Hepatology 2022;76:317–29. 10.1002/hep.32308.

23. Zhou Z, Ma M-JL, Chan RWY, et al. Fragmentation landscape of cell-free DNA revealed by deconvolutional analysis of end motifs. Proc Natl Acad Sci 2023;120:e2220982120. 10.1073/pnas.2220982120.

24. Zhou Z, Cheng SH, Ding SC, et al. Jagged ends of urinary cell-free DNA: characterization and feasibility assessment in bladder cancer detection. Clin Chem 2021;67:621–30. 10.1093/clinchem/hvaa325.

25. Pittella-Silva F, Chin YM, Chan HT, et al. Plasma or serum: which is preferable for mutation detection in liquid biopsy? Clin Chem 2020;66:946–57. 10.1093/clinchem/hvaa103.

26. Chan KA, Yeung S-W, Lui W-B, et al. Effects of preanalytical factors on the molecular size of cell-free DNA in blood. Clin Chem 2005;51:781–4. 10.1373/clinchem.2004.046219.

27. Kloten V, Rüchel N, Brüchle NO, et al. Liquid biopsy in colon cancer: comparison of different circulating DNA extraction systems following absolute quantification of KRAS mutations using Intplex allele-specific PCR. Oncotarget 2017;8:86253–63. 10.18632/oncotarget.21134.

28. Xu R, Wei W, Krawczyk M, et al. Circulating tumour DNA methylation markers for diagnosis and prognosis of hepatocellular carcinoma. Nat Mater 2017;16:1155–61. 10.1038/nmat4997.

29. Hu X, Luo K, Shi H, et al. Integrated 5-hydroxymethylcytosine and fragmentation signatures as enhanced biomarkers in lung cancer. Clin Epigenetics 2022;14:15. 10.1186/s13148-022-01233-7.

30. Zhou Q, Kang G, Jiang P, et al. Epigenetic analysis of cell-free DNA by fragmentomic profiling. Proc Natl Acad Sci U S A 2022;119:e2209852119. 10.1073/pnas.2209852119.

31. Julious SA. Two-sided confidence intervals for the single proportion: comparison of seven methods by Robert G. Newcombe, Statistics in Medicine 1998; 17:857-872. Stat Med 2005;24:3383–4. 10.1002/sim.2164.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
