Scientists from the Massachusetts Institute of Technology (MIT) and the Institut Pasteur in France have developed a technique for reconstructing whole genomes, including the human genome, on a personal computer. The technique is about a hundred times faster than current state-of-the-art approaches and uses one-fifth of the resources. The study, published on September 14 in the journal Cell Systems, enables a more compact representation of genome data, inspired by the way words, rather than letters, provide condensed building blocks for language models.
“We can quickly assemble entire genomes and metagenomes, including microbial genomes, on a modest laptop,” says Bonnie Berger, Simons Professor of Mathematics at the MIT Computer Science and AI Lab and an author of the study. “This ability is essential for assessing changes in the gut microbiome linked to disease and bacterial infections, such as sepsis, so that we can treat them faster and save lives.”
Genome assembly projects have come a long way since the Human Genome Project, which finished assembling the first complete human genome in 2003 at a cost of about $2.7 billion and more than a decade of international collaboration. But while human genome assembly projects no longer take years, they still require days and massive computing power. Third-generation sequencing technologies offer terabytes of high-quality genomic sequences tens of thousands of base pairs long, yet assembling genomes from such a large amount of data has proven difficult.
To approach genome assembly more efficiently than current techniques, which involve making pairwise comparisons between all possible pairs of reads, Berger and her colleagues turned to language models. Building on the concept of a de Bruijn graph, a simple and efficient data structure used for genome assembly, the researchers developed a minimizer-space de Bruijn graph (mdBG), which uses short sequences of nucleotides called minimizers instead of single nucleotides.
“Our minimizer-space de Bruijn graphs store only a small fraction of the total nucleotides, while preserving the overall structure of the genome, allowing them to be orders of magnitude more efficient than conventional de Bruijn graphs,” explains Berger.
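To make the idea of working in minimizer space more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' software: the function names, the hash-based selection rule, and the parameters `l` (minimizer length), `k` (minimizers per node), and `density` (fraction of l-mers kept) are all chosen here for illustration. Each read is reduced to the ordered list of its selected l-mers, and graph nodes are runs of `k` consecutive minimizers rather than runs of single nucleotides.

```python
import hashlib
from typing import Dict, List, Set, Tuple

Node = Tuple[str, ...]  # a node is k consecutive minimizers


def is_selected(lmer: str, density: float) -> bool:
    """Keep an l-mer if its hash falls in the lowest `density` fraction.

    Assumption for this sketch: selection depends only on the l-mer itself,
    so the same l-mer is kept or dropped in every read.
    """
    digest = hashlib.sha256(lmer.encode("ascii")).digest()
    value = int.from_bytes(digest[:8], "big")  # 64-bit hash value
    return value < density * 2**64


def to_minimizer_space(read: str, l: int, density: float) -> List[str]:
    """Replace a read by the ordered list of its selected l-mers."""
    lmers = (read[i:i + l] for i in range(len(read) - l + 1))
    return [m for m in lmers if is_selected(m, density)]


def build_mdbg(reads: List[str], k: int, l: int,
               density: float) -> Dict[Node, Set[Node]]:
    """Build a toy de Bruijn graph whose 'letters' are minimizers.

    Two nodes are linked when they are adjacent in a read, which means
    they overlap by k - 1 minimizers.
    """
    graph: Dict[Node, Set[Node]] = {}
    for read in reads:
        mins = to_minimizer_space(read, l, density)
        nodes = [tuple(mins[i:i + k]) for i in range(len(mins) - k + 1)]
        for node in nodes:
            graph.setdefault(node, set())
        for left, right in zip(nodes, nodes[1:]):
            graph[left].add(right)
    return graph


if __name__ == "__main__":
    # density=1.0 keeps every l-mer, which makes this tiny example predictable;
    # a practical setting would keep only a small fraction of l-mers.
    toy = build_mdbg(["ACGTAC", "GTACGG"], k=2, l=2, density=1.0)
    for node, successors in sorted(toy.items()):
        print(node, "->", sorted(successors))
```

With a density well below 1, most l-mers are dropped, so the graph holds far fewer symbols than the reads contain nucleotides. That is the kind of saving the quote above describes, although a real assembler must also handle sequencing errors and recover the nucleotide sequence from the graph, which this sketch does not attempt.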
The researchers applied their method to assemble real HiFi data (which has near-perfect single-molecule read accuracy) for Drosophila melanogaster fruit flies, as well as human genome data provided by Pacific Biosciences (PacBio). When they evaluated the resulting genomes, Berger and her colleagues found that their mdBG-based software required about 33 times less time and 8 times less random-access memory (RAM) than other genome assemblers. Their software assembled the human HiFi data 81 times faster with 18 times less memory usage than the Peregrine assembler, and 338 times faster with 19 times less memory usage than the hifiasm assembler.
Next, Berger and her colleagues used their method to construct an index for a collection of 661,406 bacterial genomes, the largest such collection to date. They found that the new technique could search the entire collection for antimicrobial resistance genes in 13 minutes, a process that took 7 hours using standard sequence alignment.
“We knew our representation was efficient, but we didn’t know it would scale so well on real data, after further code optimizations,” Berger explains.
“The whole idea works and doesn’t require some of the usually expensive pre-processing steps, like error correction, performed by most other genome assembly methods,” says Rayan Chikhi, researcher and group leader at the Institut Pasteur and author of the study.
“We can also handle sequencing data with error rates of up to 4%,” adds Berger. “With long-read sequencers of varying error rates quickly falling in price, this capability opens the door to democratizing sequencing data analysis.”
Berger notes that while the method currently works best when processing PacBio HiFi reads, which fall well below a 1% error rate, it may soon be compatible with Oxford Nanopore ultra-long reads, which currently have 5-12% error rates but may soon offer reads with 4% error rates.
“We plan to reach out to scientists in the field to help them develop rapid genomic testing sites, going beyond PCR and marker arrays that might miss important differences between genomes,” says Berger.
Cell Systems, Ekim et al.: “Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer” www.cell.com/cell-systems/full… 2405-4712 (21) 00332-X, DOI: 10.1016/j.cels.2021.08.009
Citation: Scientists can now assemble entire genomes on their personal computers in minutes (2021, September 14) retrieved September 15, 2021 from https://phys.org/news/2021-09-scientists-entire-genomes-personal-minutes.html