MPEG-G: A New and Efficient Way of Handling Massive Genomic Information

Author: Tristan Dickey

The advances in genomic sequencing are transforming clinical care and ushering in an era of precision medicine. When a person’s genome gets sequenced, every gene of that individual becomes a digital data set that is subject to further analyses.

By analyzing genomic data, researchers will be able to answer many questions related to life’s formation. For example, we can learn how a person’s genes may interact with medicines, or identify the genes that can be inherited by that person’s offspring.

These days, a single powerful sequencing machine can sequence 9,000 human genomes a year. Sequencing that many genomes, however, generates nearly 1 PB of data annually, and researchers predict that the amount of genomic data generated in the future will only keep climbing.

As a result, the IT costs of storing, processing, and transferring genomic data in bulk now exceed the cost of sequencing it. There is also no solid representation of a person’s genomic data and, to top it all off, no leading-edge technology for compressing genomic sequence data and making it easily accessible at very large scale. Because of these factors, the use of genomic data is still limited and has not been applied to large populations.

With the goal of providing the scientific community with new tools that can unlock the real potential of genomic data, the Moving Picture Experts Group (MPEG) and ISO Technical Committee 276/Working Group 5 are developing MPEG-G, an international standard for compressing, storing, processing, and transmitting genome sequencing data.

Understanding the MPEG-G standard

MPEG-G will be a new open standard for compressing, storing, transmitting, and processing sequencing data. It will provide extremely high compression levels, reducing the size of the raw data by a factor of nearly 100.
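To put the article's figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted in this article (9,000 genomes per machine per year, nearly 1 PB of raw data, and compression by a factor of nearly 100). The constant and function names are illustrative, decimal units are an assumption, and the rounded inputs mean the results are orders of magnitude rather than measurements.

```python
# Back-of-the-envelope estimate based on the approximate figures in the article.
# Assumptions: decimal units (1 PB = 1e15 bytes) and the "nearly" values
# treated as exact, so results are only indicative.

GENOMES_PER_YEAR = 9_000        # genomes sequenced by one machine per year
RAW_BYTES_PER_YEAR = 1e15       # "nearly 1 PB" of raw data per year
COMPRESSION_FACTOR = 100        # "nearly 100 times" smaller than raw data


def raw_gb_per_genome(total_bytes: float, genomes: int) -> float:
    """Average raw data per genome, in gigabytes."""
    return total_bytes / genomes / 1e9


def compressed_size(raw_size: float, factor: float) -> float:
    """Size after compression by the given factor (same unit as the input)."""
    return raw_size / factor


raw_gb = raw_gb_per_genome(RAW_BYTES_PER_YEAR, GENOMES_PER_YEAR)
compressed_gb = compressed_size(raw_gb, COMPRESSION_FACTOR)
yearly_compressed_tb = compressed_size(RAW_BYTES_PER_YEAR, COMPRESSION_FACTOR) / 1e12

print(f"Raw data per genome:        {raw_gb:.1f} GB")
print(f"Compressed data per genome: {compressed_gb:.2f} GB")
print(f"Compressed data per year:   {yearly_compressed_tb:.1f} TB")
```

Under these assumptions, one genome averages roughly 111 GB of raw data and about 1.1 GB after compression, while a machine's yearly output drops from about 1 PB to about 10 TB.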

Besides compressing, storing, processing, and transferring genome sequencing data, this standard will provide innovative functionalities including:

  • data access control and privacy protection
  • support for enhanced selective data access
  • flexible storage options and different streaming capabilities
  • standard interfaces to non-MPEG-G-compliant systems

With all these features, MPEG-G will enable new application scenarios, such as streaming genome sequencing data from a sequencing machine to remote analysis centers. As the first ISO/IEC standard to address the shortcomings of existing genomic data formats, MPEG-G will provide new tools for representing genomic information efficiently and economically and for managing DNA sequence compression.

An additional advantage of an ISO standard is longevity: ISO/IEC will remain engaged in maintaining MPEG-G, which should help keep it future-proof.

To make it compatible with a variety of genomic information, MPEG-G will offer high-level interoperability and seamless integration with present-day information processing pipelines. The new standard will support conversion to and from existing file formats such as FASTQ, SAM, and BAM.

To sum up

All in all, the MPEG-G standard will revolutionize the way genomic information is represented and transmitted. The standard is the result of an interdisciplinary effort by specialists from domains such as biology, bioinformatics, telecom, information theory, data storage, video, information security, and data compression.

This will be the first ISO standard to address the limitations and problems of current genome sequencing products and technologies. In short, it will be the first international standard to support an economical way of handling genomic information in bulk.