Cancer disrupts several layers of biological blueprints, including the order of DNA sequences and chemical tags on DNA called methylation. In cancer patients, tumor samples taken directly from places like the colon or skin contain a mixture of healthy cells, with normal levels of methylation, and cancer cells, with abnormal methylation. This mixture makes it difficult for doctors to distinguish between them and to determine which methylation signals truly come from the tumor.
In addition, directly sampling tumors often requires patients to undergo painful surgery. Some scientists have suggested sampling a patient’s blood as an alternative for initial diagnosis. However, blood samples face the same issues and often only contain tiny amounts of cancer DNA.
In the past, scientists summarized average methylation levels across many DNA fragments from a patient’s sample to estimate how much cancer DNA versus normal DNA it contained. However, this traditional method loses valuable information about rare and very subtle disruptions in the patient’s DNA. Researchers from Germany and Belgium argued that this lost information is important for detecting and diagnosing cancers early. Therefore, they introduced a new computational tool called MethylBERT to address this challenge. This tool analyzes DNA methylation one sequenced DNA fragment at a time. These individual fragments, called sequence reads, retain the subtle details that averaging erases.
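To make the loss of information concrete, here is a minimal Python sketch with made-up data. Each read is written as a hypothetical string of 1s (methylated) and 0s (unmethylated) at three sites; the function names and the two toy samples are illustrative assumptions, not part of MethylBERT.

```python
def site_averages(reads):
    """Average methylation at each site, pooled across all reads."""
    n_reads = len(reads)
    n_sites = len(reads[0])
    return [sum(int(read[i]) for read in reads) / n_reads for i in range(n_sites)]


def fully_methylated_fraction(reads):
    """Share of reads that are methylated at every site (a read-level view)."""
    return sum(all(c == "1" for c in read) for read in reads) / len(reads)


# Two toy samples (hypothetical): same per-site averages, different reads.
sample_a = ["111", "000", "111", "000"]  # half the reads are fully methylated
sample_b = ["101", "010", "110", "001"]  # no read is fully methylated

print(site_averages(sample_a), fully_methylated_fraction(sample_a))
print(site_averages(sample_b), fully_methylated_fraction(sample_b))
```

Both samples look identical when only the averages are reported, yet the individual reads tell different stories. That difference is the kind of detail a read-level approach can keep.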
The team built MethylBERT using the same technology that powers modern language models like ChatGPT, known as the transformer architecture. They repurposed this technology to understand the language of DNA and its methylation signals rather than human words. Each sequence read served as a short “sentence” that the model studied to learn the difference between tumor DNA and normal DNA.
The researchers trained MethylBERT in two stages. First, they exposed it to a template dataset of human DNA, called the human reference genome. They used this dataset to teach the model to recognize patterns in DNA sequences without any information about methylation or disease. It’s like teaching students to read using only the letters that make up words, without any additional context. The system learned to distinguish different 3-letter DNA combinations and recognized that some bases, specifically the C and G among the four DNA letters A, T, C, and G, occur in specific patterns. The researchers found that this pre-training step was essential because when they skipped it, the model failed to accurately distinguish cancerous cells from normal cells.
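As a rough illustration of what "3-letter DNA combinations" can look like to a model, the Python sketch below cuts a DNA string into overlapping 3-letter pieces and finds where a C is directly followed by a G. The overlapping window, the function names, and the example sequence are assumptions made for illustration; the summary does not specify exactly how MethylBERT splits its input.

```python
def three_letter_tokens(sequence, k=3):
    """Split a DNA string into overlapping k-letter pieces (k=3 by default)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def cg_positions(sequence):
    """Return the 0-based positions where a C is immediately followed by a G."""
    sequence = sequence.upper()
    return [i for i in range(len(sequence) - 1) if sequence[i:i + 2] == "CG"]


example = "ATCGCG"  # hypothetical short read
print(three_letter_tokens(example))  # ['ATC', 'TCG', 'CGC', 'GCG']
print(cg_positions(example))         # [2, 4]
```

Each 3-letter piece plays the role of a "word", so a read becomes a short sentence of such words.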
Then, they fine-tuned the pre-trained model using DNA sequences from actual cancer and healthy samples, teaching it to recognize known tumor-specific methylation patterns. This strategy is similar to teaching students other grammar elements, like punctuation and idioms, to add context and meaning to words. The model learned that some regions of DNA had high levels of methylation in tumors and low levels or zero methylation in normal cells, and vice versa. The researchers designed the system to output probability scores indicating the likelihood that each DNA fragment originated from a tumor or from normal tissue.
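The idea of a per-fragment probability score can be sketched in a few lines of Python. The example assumes a model that produces two raw scores per read, one for "normal" and one for "tumor", and converts them to probabilities with a standard softmax. The scores, read names, and the simple averaging step at the end are all hypothetical and are not taken from MethylBERT itself.

```python
import math


def tumor_probability(normal_score, tumor_score):
    """Turn two raw scores into the probability that a read is tumor-derived."""
    top = max(normal_score, tumor_score)  # subtract the max for numerical stability
    exp_normal = math.exp(normal_score - top)
    exp_tumor = math.exp(tumor_score - top)
    return exp_tumor / (exp_normal + exp_tumor)


# Hypothetical raw scores (normal, tumor) for four reads.
raw_scores = {
    "read_1": (0.2, 2.5),
    "read_2": (3.0, -1.0),
    "read_3": (1.0, 1.0),
    "read_4": (2.2, 0.1),
}

probabilities = {name: tumor_probability(n, t) for name, (n, t) in raw_scores.items()}
for name, p in probabilities.items():
    print(f"{name}: P(tumor) = {p:.2f}")

# A crude, assumed summary: the mean tumor probability across reads.
mean_tumor_probability = sum(probabilities.values()) / len(probabilities)
print(f"mean P(tumor) = {mean_tumor_probability:.2f}")
```

A read with equal scores lands at exactly 0.5, meaning the model cannot tell which tissue it came from, while lopsided scores push the probability toward 0 or 1.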
The team tested MethylBERT against existing methods using simulated DNA sequence data with varying levels of complexity. They demonstrated that their method was highly accurate in detecting cancer DNA even when analyzing DNA fragments from locations in the genome that have only a few sequence reads, whereas traditional methods struggled. MethylBERT also detected very low amounts of tumor DNA in blood from colorectal and pancreatic cancer patients, further demonstrating its potential use in non-surgical cancer detection.
The scientists found that the model took a long time to train on the human genome data, so they tested whether models pre-trained on mouse genomes could also analyze human cancer samples. They found that mouse-trained models performed nearly as well as human-trained versions when applied to human cancer data, producing only minor differences in probability distributions. They attributed this observation to the idea that DNA is consistently organized across mammals, allowing the model to apply knowledge learned from one organism to another.
The researchers concluded that MethylBERT can identify cancer DNA in DNA sequence data obtained from any sequencing platform, regardless of how complex the methylation signal is or how little tumor DNA the sample contains. They also noted that the current version of MethylBERT requires large computational resources for training and use, so they’ve already begun work on a more efficient version.
