Constructing A Graph for Genome Comparison Swiftly
Introduction
In recent years, a group of people in the scientific community starts to look into the application of graph to represent genomes. Graph is an extremely powerful abstraction for processing various kind of data. Meanwhile, mathematically, a graph is simple and complicated at the same time. To define a graph, all we need are the sets of nodes (vertex) and links (edges). In the meantime, some theorists also think we can explain what space-time is, e.g., see what-is-spacetime-really, with graphs.
At this moment, we are not so ambitious to explain the whole universe with graphs. We are interested in applications using graph to speed up processing genomics information. Previously, one of the applications we know the best is using a graph to construct genomic sequences at long chromosome scales from the “reads” from much shorter DNA fragments (https://en.wikipedia.org/wiki/Sequence_assembly, https://www.nature.com/articles/nmeth.4035). In the meantime, the genomic community is also eager to represent small variants with a graph to help to identify the difference between a person’s genome to a reference genome (https://www.nature.com/articles/s41587-019-0199-7, https://www.nature.com/articles/nbt.4227, https://www.nature.com/articles/s41587-019-0201-4, https://www.nature.com/articles/s41588-018-0316-4). While there are many similarities between different techniques using among various applications, the subtle differences make the application of graph in genomics a bit confusing, though.
Meanwhile, it is unavoidable that there might not a single graph structure that satisfies the need for different applications. We think it will be efficient to introduce a general graph structure describing how different but similar genomes related to each other that catch large scale structure first. With such a graph for larger-scale similarity between genomes, we might be able to use it as a foundation for more effective downstream processing.
The Idea of “Sketches” of a Sequence
Given a long DNA sequence, can we find a subsequence which is much smaller than the original sequence that we can use to compare the whole sequence to other sequences? There are a couple of options to achieve that. For example, we can regularly sample the original sequence taking the first 20 bases out of every 1000 bases. Unfortunately, this simple example will fail when we want to use such a subsequence to compare with others. This simple subsampling method is too sensitive to where we start to sample to get the subsequences. Two sequences merely differ by one base offset will have no matched subsequence.
One other possible way is to randomly select a set of k-mers (short stretch of DNA bases of length k), and we construct a subsequence of from those k-mers in the set from the original sequences. We want to convert the original sequence into a series of k-mers. If two sequences are similar to each other, then we expect we will see similar k-mer lists generated from the two sequences. This is the basic strategy we used during the Baylor SV Codeathon to compare 8 major histocompatibility sequences in GRCh38. Such set of k-mer servers as “sketches” of the sequences. Instead of using the full sequences for comparisons, we use the sketches. As the size of sketches is much less than the original sequences, it will help us to construct efficient computation strategies for comparing sequences.
Finding Backbone k-mers
Now, the question becomes how we decide what set of k-mer to use. Some k-mer may happen multiple times in a sequence. We like to avoid using such k-mers as they are not as “representative” as those unique/non-repetitive k-mers, namely, the k-mers only happens once in a single sequence. One can think each of such unique k-mers represents a unique “address” for unique/non-repetitive segments of the sequences. Duplicate “addresses” will not be useful as good identifiers. In the meantime, if we want to use a set k-mer to compare different sequences, we hope those k-mers are shared among different sequences. If a k-mer only happens in one sequence, while we know it may represent a unique segment among all sequences, it does not tell us how this segment related to other sequences.
These two guidelines help us to select the set of k-mers from a set of sequences that we will like to build a backbone for comparing the sequences to each other:
(1) Each of the k-mers we want to use is unique within each sequence.
(2) The k-mers we like to use are shared by different sequences.
Let’s call a set of k-mers (roughly) satisfies these conditions the “backbone” k-mers.
The SHIMMER Sketch Graph
The k-mer approach we suggested during Baylor SV Codeathon worked to some degree, but there is still a lot of follow-up work to make this framework more useful. One issue is that the number of these “backbone” k-mers can be high. For example, with the ~5Mb regions of MHC, with a choice of k=128, we still have about a couple million backbone k-mers. One way to reduce the large number of k-mer to process is to randomly sample them to get a subset of backbone k-mer that we can process efficiently. We tried this out and got interesting results. Another possibility is to use minimizer sketches to sample the k-mer from the beginning, so we don’t have to start with a large number of k-mers.
The idea behind minimizer to summarize long stretch of a sequence by selecting a special k-mer within the sequence. We can convert each k-mer inside a sequence to a number and only keep the k-mer of the smallest number as the digest of the sequence. For a very long sequence, we will repeat picking such minimizers for every substring of length w. Thus, we can summarize a long DNA sequence as a list of such minimizers. Interestingly, we can repeat such “minimization process” on the list of minimizers too. This gives us a natural way to generate coarse-grained representation with the sparse-hierarchal minimizers (SHIMMERs) of long DNA sequences. It has been shown such coarse-grained transformation is very useful for some practical applications like assembling DNA fragments into genomes. We are interested in the general application using SHIMMERs for other applications to speed up genomic information processing. Here, we like to show how to use SHIMMERs to generate graph to represent multiple sequences and help to create some visual representation for comparing them.
It is quite straight forward. Instead of taking all k-mers that satisfy the “backbone” conditions, we can select the SHIMMERs that satisfy the requirement. We might want to start with a very sparse representation to get some larger scale “features” that represents the similarity and dissimilarity among those genomic sequences. These backbone SHIMMERs give common “addresses” for us to navigate through the common sub-sequences. The hierarchical nature of SHIMMERs may also allow us to dive into different scales for significant smaller differences between the sequences. We think that SHIMMER could be a natural data structure that can help us to explore multiple sequences efficiently in a recursive manner.
Let’s visualize the locations of the “backbone sketches” inside the MHC region of the reference GRCh38, two de novo assembled haplotype sequences, and those MHC alternative contigs in GRCh38. We also connect the same SHIMMERs in different sequences to show the relationship among the different sequences. The alternative contigs in the GRCh38 MHC region have many “gaps”. Gaps are those regions where the sequences are undetermined due to limit of sequencing technologies. They are typically represented as strings of “N” in the sequences. We also highlight the “gaps” inside the sequences by the horizontal grey bars. We can see there are many big gaps inside the alt. contigs. We miss a lot of information due to those gaps. We are hoping better DNA sequencing technologies will help to remove them so we can make a more comprehensive comparison in the future.
If we merge the same SHIMMER among different sequences, we can generate a “Sketch Genome Graph” using the “backbone sketches”. In this case, we use very sparse SHIMMERs. What is interesting in this case is, without explicitly doing any “alignment” between the SHIMMER lists from the sequences, that the SHIMMERs from different sequences are already co-linear to each other. We can imagine if we simply merge the SHIMMERs that are shared among different sequences, we can use a simple graph to represent the nine sequences at once as shown in the code from this Jupyter notebook. We use `graphviz` to layout the graph as shown below.
We can also try to use Gephi to layout the graph that is somehow visually appealing.
The graph shows above have the “backbone sketches” shown as blue nodes in the graph. The orange nodes are from those SHIMMERs that are not common in all nine sequences. They form parallel paths between the backbone sketches. While those SHIMMERs that are corresponding to the orange nodes are not shared all nine sequences, some of them may be common in a fewer number of different sequences/alternative contigs. We can also collapse those, as shown below. In those regions, we can collapse those SHIMMERs further even though not all sequences have them.
What’s Next?
The point to represent genome sequences as graphs is beyond generating beautiful visualization. Such graph representation helps us to see some complexity, which would hard to get a good intuition otherwise, for genome comparisons. To us, a graph representation with the SHIMMERs helps us to compare different sequences even without pre-describe a coordinate system from a reference genome, e.g., GRCh38. Part of my research work is on the “de novo genome assembly” in the last couple of years. In the context of the “de novo” assembly process, there is no reference. We need to construct genome sequences from scratch from short segments for DNA fragments (reads). There are many different approaches to such problems. We find that using SHIMMERs grouping read together can speed up the genome assembly process. The sparse nature of SHIMMERs allows us to remove direct all pairwise comparison efficiently, as we have shown here. We think this simple SHIMMER (minimizer) data structure will play an essential role in multi-scale genome comparisons. We hope efficient comparisons of many genomes can lead a better understanding of genome structure in populations, and such knowledge about human genomes can help us to solve some puzzling genetic diseases soon.