All living organisms contain independent bits of DNA or other genetic material that move around within their genome or from one organism to another. Scientists call these genetic materials mobile genetic elements, which differ based on the host organism. That is, humans have different mobile genetic elements than bacteria, which have different mobile genetic elements than yeast. Two types of mobile genetic elements scientists study are microscopic infectious agents, called viruses, and circular bits of DNA from microorganisms, called plasmids.
When these elements move within an organism’s genome, they can help the organism by causing it to evolve beneficial traits, hurt it by causing mutations that lead to diseases like cancer, or remain silent with no consequences for the organism. Researchers want to capture all the mobile genetic elements in a patient’s genome so they can fully understand its characteristics.
For instance, consider a patient who has an infection of unknown cause. If clinicians can capture every potentially harmful element in the patient’s blood sample, they can more easily understand the infection and prescribe or create an effective treatment. Likewise, if scientists want to understand what makes a soil sample highly fertile at the genetic level, it helps if they can identify and understand all the genetic elements in the soil sample.
Scientists can identify the genetic sequences of every element in an environmental or clinical sample using a DNA sequencing method called metagenomic sequencing. However, the sequence data it produces are only useful if scientists have the right tools to extract information from them.
Researchers have previously analyzed metagenomic sequence data to identify mobile genetic elements, but many of the existing tools could not accurately identify both viruses and plasmids at the same time in the same sample. This drawback limited researchers’ ability to understand all the genetic elements in clinical samples. To overcome this limitation, researchers at the Lawrence Berkeley National Laboratory and Los Alamos National Laboratory developed a new computational tool, geNomad, that combines the strengths of existing methods to accurately identify and classify plasmids and viruses in metagenomic sequence data.
The researchers retrieved microbial and eukaryotic sequence data from databases like the Genome Taxonomy Database. They filtered the sequences to retain those of the highest quality, based on how complete they were and how little contamination they contained. They also removed artifacts and duplicates from the sequence data. Then, they assembled about 625,000 known plasmid and virus sequences, which they used to train and develop their geNomad tool.
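The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' pipeline: the `GenomeRecord` type, the `filter_records` function, and the threshold values of 90% completeness and 5% contamination are all hypothetical, chosen just to show the idea of keeping complete, clean, non-duplicated sequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeRecord:
    # Hypothetical record type; field names are assumptions for illustration.
    name: str
    sequence: str
    completeness: float   # estimated percent of the genome present (0-100)
    contamination: float  # estimated percent of foreign sequence (0-100)


def filter_records(records, min_completeness=90.0, max_contamination=5.0):
    """Keep complete, low-contamination records and drop exact duplicates.

    The default thresholds are illustrative assumptions, not values
    reported by the geNomad authors.
    """
    seen_sequences = set()
    kept = []
    for rec in records:
        if rec.completeness < min_completeness:
            continue  # too incomplete
        if rec.contamination > max_contamination:
            continue  # too contaminated
        if rec.sequence in seen_sequences:
            continue  # exact duplicate of a sequence already kept
        seen_sequences.add(rec.sequence)
        kept.append(rec)
    return kept


records = [
    GenomeRecord("a", "ATGCATGC", 98.0, 1.0),
    GenomeRecord("b", "ATGCATGC", 97.0, 0.5),   # duplicate sequence of "a"
    GenomeRecord("c", "GGGTTTAA", 60.0, 0.0),   # too incomplete
    GenomeRecord("d", "CCCAAATT", 99.0, 12.0),  # too contaminated
]
kept_names = [rec.name for rec in filter_records(records)]
```

In this toy input, only record "a" passes all three checks. Real pipelines typically detect near-duplicates by sequence similarity rather than exact string matches, which this sketch does not attempt.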
Next, they built a comprehensive protein dictionary to identify the plasmids and viruses based on their protein sequences. They retrieved protein alignments from existing databases, like Pfam, and built the dictionary using statistical methods after removing duplicates and low-quality data. Finally, they used existing annotation databases to describe the functions of these proteins in geNomad, so it contained information on the name, gene function, and genetic code of the proteins.
The authors described geNomad as an AI tool that classifies mobile genetic elements by identifying, extracting, and interpreting patterns in genome data. It also classifies these elements based on the proteins they contain, by comparing them with the protein dictionary. Finally, geNomad uses features of these proteins, like sequence structure, density, and pattern frequency, to output a confidence score that tells the user how likely it is for the sequence to be a plasmid or a virus.
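To make the idea of a confidence score concrete, here is a minimal Python sketch of one common way to turn raw per-class scores into values that sum to one, using a softmax. It is not geNomad's actual model: the class names (including a "chromosome" class for sequences that are neither plasmid nor virus), the `classify` function, and the example scores are assumptions made for illustration.

```python
import math

# Assumed class labels for this sketch.
CLASSES = ("chromosome", "plasmid", "virus")


def softmax(scores):
    """Convert raw scores into positive values that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(raw_scores):
    """Return the most likely class and a confidence score per class.

    `raw_scores` maps each class name to an unnormalized score, such as
    one a trained model might produce (hypothetical input).
    """
    probabilities = softmax([raw_scores[c] for c in CLASSES])
    confidence = dict(zip(CLASSES, probabilities))
    label = max(confidence, key=confidence.get)
    return label, confidence


label, confidence = classify({"chromosome": 0.1, "plasmid": 0.3, "virus": 2.5})
```

Here the sequence would be labeled a virus because that class has the largest score, and the accompanying confidence tells the user how strongly the scores favor that label over the alternatives.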
The authors compared the effectiveness of geNomad against 10 existing tools. They found it was faster than the other tools and could identify even the smallest mobile genetic elements in any combination of samples with high precision and sensitivity. It also discovered new classes of RNA viruses and giant viruses. They reported that geNomad produced highly reproducible results and was easy to use for both technical and non-technical users. They also found geNomad was computationally efficient, which makes it suitable for large-scale surveys of publicly available sequencing data.
The authors remarked that identifying mobile genetic elements with tools like geNomad will help scientists monitor clinically relevant microorganisms and characterize sequences for antimicrobial resistance genes. They anticipate geNomad will also be valuable for plasmid and virus researchers who want to better understand the diversity of plasmids in natural environments.