Achievement demonstrates feasibility of making all of life’s code easily searchable, researchers say
A tool that functions like a Google for DNA has demonstrated its promise for making all of the world’s biological sequence data cheaply and easily searchable, according to the Swiss team that developed it. In a proof of principle study, the researchers say they successfully indexed 10% of the world’s known DNA, RNA, and protein sequences—and the same method could be used to do the rest.
The advance, posted last month on bioRxiv, used a computational tool the group recently developed called MetaGraph to organize and compress publicly available sequence data into a searchable format—much as internet search engines do for web pages and their content. The resulting indexes, available for download and via a web portal, allow users to scan sequences comprising trillions of base pairs and billions of amino acids.
The research “represents a massive achievement and a landmark in our ongoing pursuit of the grand challenge of indexing all publicly available sequencing data,” says Rob Patro, a computational biologist at the University of Maryland who wasn’t involved in the pilot effort.
Such a resource could aid myriad areas of research, from identifying novel viruses to revealing disease-associated RNA sequences. Although MetaGraph isn’t the only project aiming for this goal, the team has created some of the largest indexes so far and calculates that its tool will be relatively inexpensive to use.
The need is pressing, Patro and others note. Repositories storing DNA, RNA, and protein sequence data are expanding exponentially. The Sequence Read Archive (SRA), a genetic database run by the National Institutes of Health’s National Center for Biotechnology Information (NCBI) and collaborators, already contains more than 50 thousand trillion base pairs (50 petabases) from organisms including humans and other animals, plants, and bacteria.
Current bioinformatics tools can’t scan this much data all at once, especially for sequences that haven’t yet been assembled into genomes. Researchers have to narrow down the sequence collections before they can search them. Several groups hope to solve this problem by compressing sequences from larger databases into a more organized data structure, or index, designed for easy searching in downloadable files or online portals.
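The indexing idea described above can be illustrated with a toy sketch. This is a hypothetical simplification, not the format any of these tools actually use: each k-mer (substring of length k) is mapped to the set of dataset IDs that contain it, so a query can be answered from the index alone, without rescanning the raw sequences.

```python
# Toy inverted index over sequence datasets (illustrative only; real
# tools use compressed, probabilistic structures at petabase scale).
from collections import defaultdict

def build_index(datasets, k):
    """datasets: dict mapping dataset_id -> sequence string.
    Returns a map from each k-mer to the IDs containing it."""
    index = defaultdict(set)
    for ds_id, seq in datasets.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(ds_id)
    return index

def search(index, query, k):
    """Return dataset IDs that contain every k-mer of the query."""
    hits = [index[query[i:i + k]] for i in range(len(query) - k + 1)]
    return set.intersection(*hits) if hits else set()

idx = build_index({"run1": "ACGTACGTGA", "run2": "TTGCACGTAA"}, k=5)
print(search(idx, "ACGTA", 5))  # found in both runs
print(search(idx, "GTACG", 5))  # found only in run1
```

The index can be built once, then queried many times cheaply, which is the trade-off the article describes: expensive construction, inexpensive search.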
In 2020, bioinformatician André Kahles, computer scientist Gunnar Rätsch, both at ETH Zürich, and their colleagues presented an early version of MetaGraph. The team used its tool, in which mathematical structures known as de Bruijn graphs represent overlaps between sequences, to index more than 1 million records from the SRA, totaling about 3 petabases. They have already employed MetaGraph in projects including identifying the microbial makeup of different cities.
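A de Bruijn graph of the kind MetaGraph builds on can be sketched in a few lines. In this minimal illustration (not MetaGraph's actual, heavily compressed representation), every k-mer in the input reads becomes an edge between its (k-1)-mer prefix and suffix; overlapping sequences share nodes, and a query sequence is "present" if all of its k-mer edges exist in the graph.

```python
# Minimal de Bruijn graph sketch: k-mers are edges between
# (k-1)-mer nodes. Illustrative only, not MetaGraph's representation.
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes it connects to."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def contains(graph, query, k):
    """True if every k-mer edge of the query exists in the graph."""
    edges = [(query[i:i + k - 1], query[i + 1:i + k])
             for i in range(len(query) - k + 1)]
    return bool(edges) and all(s in graph[p] for p, s in edges)

g = de_bruijn(["ACGTACGT", "GTACGA"], k=4)
print(contains(g, "CGTACG", 4))  # True: every 4-mer occurs in the reads
print(contains(g, "CGTTAC", 4))  # False
```

Because overlapping k-mers from different records collapse onto shared nodes, the graph compresses redundancy across datasets while still supporting membership queries.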
Now, the team has an improved version of MetaGraph and has harnessed it to index 5 petabases from the SRA and other databases, comprising sequences from microbes, fungi, plants, humans, and the human gut microbiome. Some indexes in the new paper reduce tens of terabases of data to about 10 gigabytes—small enough to work with on a personal computer. Although building the initial indexes is expensive—hundreds of thousands of dollars for the entire SRA, the researchers say—users can query the data sets much more cheaply than with existing techniques.
The work is “hugely exciting,” says Lesley Hoyles, a bioinformatician and microbiologist at Nottingham Trent University. With data repositories ballooning in size, “anything that can reduce the compute storage and energy costs … is a massive plus for researchers worldwide.” Such approaches could lessen barriers to genomic research for scientists in low- and middle-income countries, she adds. “Work could easily be done on cheap laptops.”
Other groups are also making progress. Last year, the Pasteur Institute won €2 million from the European Research Council to launch its IndexThePlanet project to catalog all data in the SRA. And researchers at NCBI are working on their own indexing tool, called Pebblescout. “It’s a very, very active field at the moment,” says Zamin Iqbal, a computational biologist at the University of Bath who worked on AllTheBacteria, a project that assembled bacterial sequence data to make them more easily searchable.
Patro suggests that because of MetaGraph’s index sizes, it could be slower than other tools on some particularly large tasks, such as looking up millions of sequences from a sample simultaneously. It’s also not yet clear how best to update the indexes with new sequence data, he adds. There’s also the challenge of funding the project, as well as all the computational costs that accompany it. Indeed, whether the tool ends up being widely adopted will partly depend on “addressing the social and administrative questions of how such a substantial resource should be hosted, updated, and maintained,” Patro says, adding that it seems “infeasible (and unfair) to expect an individual research group” to take on this enormous task.
Kahles and Rätsch agree, saying they hope the work will inspire other groups, and larger organizations such as NCBI or the SRA, to pick up the project and help index the remaining 90% of sequence data for use by researchers. “We show them here: ‘It’s possible—please do it,’” Rätsch says.