Genomics Data Lake

The Genomics Data Lake provides various public datasets that you can access for free and integrate into your genomics analysis workflows and applications. The datasets include genome sequences, variant information, and subject/sample metadata in BAM, FASTA, VCF, and CSV file formats.

The Genomics Data Lake is hosted in the West US 2 and West Central US Azure regions. Allocating compute resources in one of these regions is recommended for affinity.

To Access the Genomics Data Lake:

COVID-19 Open Research Dataset

Full-text and metadata dataset of COVID-19 and coronavirus-related scholarly articles optimized for machine readability and made available for use by the global research community.

In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19). This dataset is a free resource of over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.

This dataset mobilizes researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease.

The corpus may be updated as new research is published in peer-reviewed publications and archival services like bioRxiv, medRxiv, and others.

To Access the Dataset:

COVID-19 Data Lake

The COVID-19 Data Lake contains COVID-19 related datasets from various sources. It covers testing and patient outcome tracking data, social distancing policy, hospital capacity, mobility, and so on.

The COVID-19 Data Lake is hosted in Azure Data Lake Storage in the East US region. For each dataset, modified versions in CSV, JSON, JSON Lines, and Parquet formats are available. The raw data is also available as ingested.

ISO 3166 subdivision codes are added where not present to simplify joining. Column names are reformatted in lowercase with underscore separators. Datasets are updated daily, with historical copies of modified and raw files also available.
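The two conventions described above (lowercase, underscore-separated column names and a shared ISO 3166 subdivision code for joining) can be illustrated with a minimal Python sketch. This is not the data lake's own transformation code: the helper names `to_snake_case` and `join_on`, the column name `iso_subdivision`, and the sample rows are all assumptions made up for illustration.

```python
import re


def to_snake_case(name: str) -> str:
    """Lowercase a column name and separate its words with underscores."""
    # Mark camelCase boundaries: "adminRegion" -> "admin_Region".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Replace each run of spaces or punctuation with a single underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def join_on(key, left_rows, right_rows):
    """Inner-join two lists of dict rows on a shared key column."""
    index = {row[key]: row for row in right_rows}
    return [{**row, **index[row[key]]} for row in left_rows if row[key] in index]


# Made-up rows; "iso_subdivision" is a hypothetical column name.
cases = [
    {"iso_subdivision": "US-WA", "confirmed": 120},
    {"iso_subdivision": "US-NY", "confirmed": 340},
]
capacity = [{"iso_subdivision": "US-WA", "hospital_beds": 900}]

joined = join_on("iso_subdivision", cases, capacity)
```

With a common subdivision code present in every dataset, a join like this needs no per-source mapping of region names, which is presumably why the codes are added where missing.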

To Access the Datasets:

How genomics will help us track Covid-19


New Zealand has found itself suddenly plunged into a dire situation: Four new cases of Covid-19 in the community with no obvious link to the border, managed isolation and quarantine facilities or overseas travel.

While contact tracing is underway to attempt to find out who infected yesterday’s four cases and search for more links to the border, it could still come up blank. Contact tracing will never be perfect and the Government is even investigating the possibility that the virus came in a cooled shipping container from overseas, rather than via a person who crossed the border on an Air New Zealand flight.

Given this, the Government has turned to a new tool in its arsenal.

“I should add that we are also doing genome sequencing on all those who have tested positive and our recent cases and current cases in managed isolation and quarantine,” Director-General of Health Ashley Bloomfield said on Wednesday.

“That will help us track where this virus may have arisen from and then gotten out into the community.”

For More Information:

Genomics Bioinformatic Analysis

Authors: Patrick C Y Woo, Yi Huang, Susanna K P Lau, Kwok-Yung Yuen

The drastic increase in the number of coronaviruses discovered and coronavirus genomes being sequenced has given us an unprecedented opportunity to perform genomics and bioinformatics analysis on this family of viruses. Coronaviruses possess the largest genomes (26.4 to 31.7 kb) among all known RNA viruses, with G + C contents varying from 32% to 43%. Variable numbers of small ORFs are present between the various conserved genes (ORF1ab, spike, envelope, membrane and nucleocapsid) and downstream of the nucleocapsid gene in different coronavirus lineages. Phylogenetically, three genera exist: Alphacoronavirus, Betacoronavirus and Gammacoronavirus, with Betacoronavirus consisting of subgroups A, B, C and D. A fourth genus, Deltacoronavirus, which includes bulbul coronavirus HKU11, thrush coronavirus HKU12 and munia coronavirus HKU13, is emerging. Molecular clock analysis using various gene loci revealed the time of the most recent common ancestor of human/civet SARS-related coronavirus to be 1999-2002, with an estimated substitution rate of 4×10⁻⁴ to 2×10⁻² substitutions per site per year. Recombination in coronaviruses was most notable between different strains of murine hepatitis virus (MHV), between different strains of infectious bronchitis virus, between MHV and bovine coronavirus, between feline coronavirus (FCoV) type I and canine coronavirus generating FCoV type II, and between the three genotypes of human coronavirus HKU1 (HCoV-HKU1). Codon usage bias in coronaviruses was observed, with HCoV-HKU1 showing the most extreme bias; cytosine deamination and selection of CpG-suppressed clones are the two major independent biological forces that shape this codon usage bias.
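The G + C content quoted above is the share of guanine and cytosine bases in a genome. As a minimal sketch of how that figure could be computed from a FASTA file, the Python below parses FASTA text and reports a percentage per record. The function names `read_fasta` and `gc_content` and the toy records are assumptions for illustration, not the authors' pipeline, and ambiguous bases such as N are simply excluded from the denominator.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} mapping."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first word after ">".
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def gc_content(sequence: str) -> float:
    """Return G + C as a percentage of the unambiguous bases (A, C, G, T, U)."""
    bases = [b for b in sequence.upper() if b in "ACGTU"]
    if not bases:
        raise ValueError("sequence has no unambiguous bases")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


# Toy records, not real coronavirus sequences.
example = """>toy_record_1 made-up sequence
ATGCATGC
AATT
>toy_record_2
GGCCNNAT
""".splitlines()

for name, seq in read_fasta(example).items():
    print(name, round(gc_content(seq), 1))
```

For real genomes in the 26.4 to 31.7 kb range the same functions apply unchanged; passing an open file handle as `lines` avoids loading the file into a single string first.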

For More Information:

Initial study on TMPRSS2 p.Val160Met genetic variant in COVID-19 patients

Authors: Laksmi Wulandari, Berliana Hamidah, Cennikon Pakpahan, Nevy Shinta Damayanti, Neneng Dewi Kurniati, Christophorus Oetama Adiatmaja, Monica Rizky Wigianita, Soedarsono, Dominicus Husada, Damayanti Tinduh, Cita Rosita Sigit Prakoeswa, Anang Endaryanto, Ni Nyoman Tri Puspaningsih, Yasuko Mori, Maria Inge Lusida, Kazufumi Shimizu & Delvac Oceandy

Coronavirus disease 2019 (COVID-19) is a global health problem that has caused millions of deaths worldwide. The clinical manifestation of COVID-19 varies widely, from asymptomatic infection to severe pneumonia and systemic inflammatory disease. It is thought that host genetic variability may affect the host’s response to the virus infection and thus the severity of the disease. The SARS-CoV-2 virus requires interaction with its receptor complex in the host cells before infection. The transmembrane protease serine 2 (TMPRSS2) has been identified as one of the key molecules involved in SARS-CoV-2 virus receptor binding and cell invasion. Therefore, in this study, we investigated the correlation between a genetic variant within the human TMPRSS2 gene and COVID-19 severity and viral load.

For More Information:

Genomics & Precision Medicine

Authors: Merline Feero, Marta Gwinn, and Muin J. Khoury, Office of Genomics and Precision Public Health, Centers for Disease Control and Prevention, Atlanta, Georgia (CDC Blogs)

Tracking the Scientific Literature on SARS-CoV-2 Variants Using the COVID-19 Genomics and Precision Health Knowledge Base

The first reports of SARS-CoV-2, the highly infectious virus causing COVID-19, swept across the globe in December 2019, prompting a burst of scientific activity. The rate of research and discovery intensified as the pandemic grew, resulting in a flood of publications in journals and on preprint servers around the world. More recently, SARS-CoV-2 variants have become a major focus of SARS-CoV-2 research in basic, clinical, and public health sciences.

CDC’s Office of Genomics and Precision Public Health established the COVID-19 Genomics and Precision Health (COVID-19 GPH) database to capture publications that reflect the influence of two broad emerging technologies: genomics (pathogen and human), and precision health (machine learning, artificial intelligence, and predictive analytics). Together, these fields are the leading edge of precision public health in COVID-19 and beyond. Data are continuously updated from PubMed, the NIH iSearch COVID-19 Portfolio, LitCovid, and media sources using an automatic retrieval and text mining strategy and manual curation by CDC staff.

For More Information:


Authors: Amanda Zrebiec

With software and molecular biology approaches developed in part at APL, researchers at Johns Hopkins are using handheld DNA sequencers to conduct immediate on-site genome sequencing of SARS-CoV-2, the virus that causes COVID-19.

“This information allows for the tracking of the evolution of the virus. It gives clinicians a sense of where the new cases coming into Baltimore could’ve originated, and insight into how long transmission may have occurred undetected. There are a lot of things we can glean from that.”

Topping that list is the ability to see how quickly the virus mutates—integral information for mapping its spread, as well as developing an effective vaccine. Influenza, for example, mutates constantly. That’s why it’s necessary to vaccinate against different strains of the flu each year.

For More Information:

SARS-CoV-2 Transmission: Analyzing Genomics

Authors: Trevor Bedford, PhD

The news of infections caused by a novel coronavirus, and the everyday use of the names SARS-CoV-2 and COVID-19, became widespread around February. But the question of how long the virus had already been present in the United States before that time has remained unanswered. Now, a team of researchers has reconstructed some of the early transmissions of the virus. By analyzing the genomic sequences of SARS-CoV-2 samples from infected patients in Washington State, they suggest that most early SARS-CoV-2 infections derive from a single introduction in late January or early February, sparking rapid community transmission of the virus that went undetected for several weeks before this community spread became evident.

For More Information:

Why Does COVID Make Some People So Sick, Ask Their Genome

Authors: Megan Molteni

SARS-CoV-2, the pandemic coronavirus that surfaced for the first time in China last year, is an equal opportunity invader. If you’re a human, it wants in. Regardless of age, race, or sex, the virus appears to infect people at the same rate. Which makes sense, given that it’s a totally new pathogen against which approximately zero humans have preexisting immunity.

But the disease it causes, Covid-19, is more mercurial in its manifestations. Only some infected people ever get sick. Those who do experience a wide range of symptoms. Some get fever and a cough. For others it’s stomach cramps and diarrhea. Some lose their appetite. Some lose their sense of smell. Some can wait it out at home with a steady diet of fluids and The Great British Baking Show. Others drown in a sea of breathing tubes futilely forcing air into their flooded lungs. Old people, those with underlying conditions, and men make up the majority of the casualties. But not always. In the US, an alarmingly high fraction of those hospitalized with severe symptoms are adults under the age of 40. Kids, and in particular infants, aren’t invincible either.

To understand what accounts for these differences, scientists have been scouring the patchy epidemiological data coming out of hotspots like China, Italy, and the US, looking for patterns in patients’ age, race, sex, socioeconomic status, behaviors, and access to health care. And now, they’re starting to dig somewhere else for clues: your DNA.

For More Information: