NIH makes its coronavirus genomic data publicly accessible in the cloud

Researchers can now quickly access the data for free, so long as they have an NIH award.

The National Institutes of Health is making genomic data about the coronavirus publicly accessible to researchers in the cloud for the first time.

Created by the National Center for Biotechnology Information, the Coronavirus Genome Sequence Dataset consists of researcher-submitted data, including files in normalized Sequence Read Archive (SRA) format. The SRA is a bioinformatics repository of raw sequencing data.

Researchers with active NIH awards can now quickly access the dataset at no cost via the Registry of Open Data on Amazon Web Services, and the agency plans to make it available on additional public cloud platforms.

“Containing COVID-19 outbreaks and preparing for future pandemics will require a deep understanding of the SARS-CoV-2 genome in the context of other COVID-19 patients and the broader Coronaviridae family,” said Ryan Layer, assistant professor at the University of Colorado Boulder’s BioFrontiers Institute, in a statement. “The NCBI Coronavirus Genome Sequence Dataset makes over a decade of viral genome data publicly accessible for researchers, empowering anyone in the research community to participate in the pandemic response.”

The dataset contains more than 13,000 SRA runs, NIH says. The project is part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative. STRIDES is a collaboration between NIH and AWS to use the cloud to assist researchers with active NIH awards.

The data will help researchers understand not only COVID-19 but also other pandemic diseases. Differences in genetic sequences among infected patients help researchers determine how quickly the virus is evolving, and genetics are thought to play a role in how patients react to infection. The data can also be used to fine-tune diagnostic testing.

The dataset itself consists of two buckets: one containing raw and normalized files categorized by SRA accession code and another containing accession metadata that will soon be queryable within the Amazon Athena interactive query service.
