Using genomics to find risk factors for major diseases or to identify relatives requires analyzing large numbers of genomes, a costly and time-consuming task. A team led by a computer scientist from Johns Hopkins University has leveled the playing field by developing a cloud-based platform that gives genomics researchers easy access to one of the world’s largest genomics databases.
The new platform, known as AnVIL (Genomic Data Science Analysis, Visualization, and Informatics Lab-space), provides access to thousands of analysis tools, patient records, and more than 300,000 genomes to any researcher with an Internet connection. The study, a National Human Genome Research Institute (NHGRI) project, was published today in Cell Genomics.
“AnVIL is inverting the model of genomics data sharing, offering unprecedented new opportunities for science by connecting researchers and datasets in new ways and promising to enable exciting new discoveries,” said project co-leader Michael Schatz, Bloomberg Distinguished Professor of Computer Science and Biology at Johns Hopkins.
Typically, genomic analysis begins with researchers downloading massive amounts of data from centralized warehouses to their own data centers, a process that is not only time-consuming, inefficient, and expensive, but also makes collaboration with researchers at other institutions difficult.
“AnVIL will be transformative for institutions of all sizes, particularly smaller institutions that lack the resources to build their own data centers. It is our hope that AnVIL will level the playing field so that everyone has equal access to make discoveries,” said Schatz.
Genetic risk factors for ailments such as cancer or cardiovascular disease are often very subtle, requiring researchers to analyze thousands of patients’ genomes to discover new associations. The raw data for a single human genome comprises about 40GB, so downloading thousands of genomes can take several days to several weeks. A single genome requires about 10 DVDs’ worth of data, so transferring thousands means moving “tens of thousands of DVDs’ worth of data,” Schatz said.
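To make the scale concrete, the figures above can be checked with a rough back-of-the-envelope calculation. The sketch below is illustrative only: the 40GB-per-genome figure comes from the article, while the 4.7GB single-layer DVD capacity, the cohort size of 5,000 genomes, and the sustained 1 Gbit/s network link are assumptions chosen for the example, not numbers reported by the AnVIL team.

```python
import math

GENOME_GB = 40   # raw data per human genome, as stated in the article
DVD_GB = 4.7     # assumed capacity of a single-layer DVD


def dvds_needed(n_genomes, genome_gb=GENOME_GB, dvd_gb=DVD_GB):
    """Number of whole DVDs required to hold n_genomes of raw data."""
    return math.ceil(n_genomes * genome_gb / dvd_gb)


def transfer_days(n_genomes, link_gbit_per_s=1.0, genome_gb=GENOME_GB):
    """Days needed to download n_genomes over a link sustained at the
    given speed (hypothetical; real transfers are usually slower)."""
    total_gbit = n_genomes * genome_gb * 8   # gigabytes -> gigabits
    seconds = total_gbit / link_gbit_per_s
    return seconds / 86_400                  # seconds per day


if __name__ == "__main__":
    cohort = 5_000  # hypothetical study size
    print(f"DVDs per genome: {dvds_needed(1)}")
    print(f"DVDs for {cohort} genomes: {dvds_needed(cohort)}")
    print(f"Download time: {transfer_days(cohort):.1f} days")
```

Under these assumptions, one genome fills roughly nine discs, in line with the "about 10" cited above, and a 5,000-genome cohort comes to tens of thousands of discs and about 18.5 days of uninterrupted downloading, consistent with the range of several days to several weeks.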
Furthermore, many studies necessitate the integration of data collected at multiple institutions, which means that each institution must download its own copy while ensuring patient-data security. This challenge is expected to grow in the future as researchers embark on ever-larger studies requiring the simultaneous analysis of hundreds of thousands to millions of genomes.
“Connecting to AnVIL remotely eliminates the need for these massive downloads and saves on the overhead,” Schatz said. “Instead of painfully moving data to researchers, we allow researchers to effortlessly move to the data in the cloud. It also makes sharing datasets much easier so that data can be connected in new ways to find new associations, and it simplifies a lot of computing issues, like providing strong encryption and privacy for patient datasets.”
AnVIL also provides researchers with several major analysis tools, including Galaxy, developed in part at Johns Hopkins, along with other popular tools such as R/Bioconductor, Jupyter notebooks, WDLs, Gen3, and Dockstore to support both interactive analysis and large-scale batch computing. Collectively, these tools allow researchers to tackle even the largest studies without having to build out their own computing environments.
Researchers from around the world are currently using the platform to investigate a wide range of genetic diseases, including autism spectrum disorders, cardiovascular disease, and epilepsy. Schatz’s team, which is part of the Telomere-to-Telomere Consortium, used it to reanalyze thousands of human genomes with the new reference genome, resulting in the discovery of over 1 million new variants.
The AnVIL team has already collected petabytes of data from several of the NHGRI’s largest projects, including hundreds of thousands of genomes from the Genotype-Tissue Expression (GTEx), Centers for Mendelian Genetics (CMG), and Centers for Common Disease Genomics (CCDG) projects, with plans to host many more in the near future.