Why Do Contigs Break in a Genome Assembly?

7 min read · Jun 7, 2020

Human Genome Reference Remains Incomplete

When I was still a physics graduate student, I heard about the coolest thing at the time: we had finished the Human Genome Project. It made me think about how awesome it was that we could read all those little letters, A, C, G, and T, in each of our cells. There must be a lot of human health and medicine problems that we could solve with the new information about our genomes.

Nevertheless, I did not pay much attention, as I had to finish my thesis. Later, when I started to work in the field of bioinformatics, the impression that we had “finished” the Human Genome Project became more and more confusing to me.

In fact, our current human genome reference is incomplete, and there are still many essential problems that a better human genome reference would help to solve. Some scientific communities know this problem well and have published research papers to address it.

If you are interested in the implications of a better human genome reference, I don’t think I can do a better job than Vivek Das’s blog.

However, there is one thing I do know well. If we sequenced the human genome in order to complete it, why can’t we get it right? If this question interests you, I hope this short write-up provides some useful insight.

What Is The Genome Assembly Problem?

Due to the limits of technology, no one has managed to read DNA from one end of a chromosome to the other in a single piece. Instead, we read many small fragments of the chromosomes and try to stitch them together, just like solving a jigsaw puzzle, except for two things. First, in the so-called de novo assembly problem, we don’t have a reference picture. Second, we are actually solving a “one-dimensional puzzle” (as DNA sequences are just a series of A, C, G, and T letters) rather than a two-dimensional one.

(For those who are interested in a better description of how the pioneers approached the genome assembly problem during the Human Genome Project, you can read the book “The Genome War.”)

In fact, despite the advances in DNA sequencing and computing technology, the fundamental approach to solving the genome assembly problem has not changed much since the late ’90s.

First, like solving jigsaw puzzles, we compare every pair of DNA fragments to see if they match. If they do, we draw a link between the two pieces. To reconstruct a chromosome, we try to find a path that goes linearly through all the DNA fragments that come from that chromosome. Of course, in the eyes of the nerds or experts who develop genome assemblers, this is an oversimplified picture.
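To make the "compare every pair, then link" idea concrete, here is a minimal Python sketch. It is not how any real assembler works: the reads are made up, they are error-free, they all come from the same strand, and the function names (`suffix_prefix_overlap`, `overlap_graph`, `stitch`) are just names chosen for this illustration.

```python
from itertools import permutations

def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the best match is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Compare every ordered pair of reads; keep a link x -> y when x overlaps y."""
    graph = {name: {} for name in reads}
    for x, y in permutations(reads, 2):
        n = suffix_prefix_overlap(reads[x], reads[y], min_len)
        if n:
            graph[x][y] = n
    return graph

def stitch(path, reads, graph):
    """Merge reads along a path, skipping the overlapping prefix of each next read."""
    sequence = reads[path[0]]
    for prev, nxt in zip(path, path[1:]):
        sequence += reads[nxt][graph[prev][nxt]:]
    return sequence

# Three made-up reads sampled from the toy sequence GATTACAGGCTTCA.
reads = {"r1": "GATTACA", "r2": "ACAGGCT", "r3": "GCTTCA"}
graph = overlap_graph(reads)
print(graph)
print(stitch(["r1", "r2", "r3"], reads, graph))
```

With these toy reads there is only one path through the links, so stitching along it gives back the original sequence. Real data adds sequencing errors, both DNA strands, and far too many reads for an all-pairs comparison, which is where the hard engineering starts.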

An Alternative View

I, probably alone among genome assembler developers, hold a minority point of view about the de novo genome assembly problem. I like to think of genome assembly as a sorting problem, except that we have no way to compare two fragments and determine their order if they don’t overlap with each other. Still, we can get a local, partial order quickly if a set of fragments overlap with each other.

In this alternative view, to get a perfect genome assembly of a chromosome, we need to find the unique solution for the global order of the DNA fragments from that chromosome. However, this may not always be possible from the partial local orders that we are confident in.

If we take this point of view and treat assembly as the problem of finding the correct order of the fragments, then when there are ambiguities and we cannot determine the order of some fragments, we might want to let the scientists who use the output know about those ambiguities. A straightforward way to do that is to produce only “contigs” (short for contiguous sequences). Under the strictest definition, there should be no ambiguity in the order of the fragments within a contig. Wherever we are not sure of the correct order between any two fragments, we may need to break the contig.

Long repeats, where two long sequences are identical to each other but sit at different locations in a genome, can make it impossible to build a globally unique order of the DNA fragments. (There are many research papers about this already.)

For example, consider the case of three fragments A, B, and C shown in the picture below. If there is a repetitive region at the end of fragment A, then by comparing these three sequences we know that A goes before B and A goes before C. If we only look at the sequences that overlap with fragment A, we might conclude that B goes before C. Unfortunately, this is not true: because the right-hand sides of B and C are actually different, we cannot claim any ordering relationship between B and C.

The yellow highlighted part indicates where the three fragments have identical sequences. We can determine the order relationship for A-B and A-C, but not for B-C, as they have different sequences on the right-hand side.

Such inconsistencies in ordering the fragments, caused by repetitive sequences in a genome, can produce very tangled assembly graphs. Linear paths without branches in such a graph represent fragments that we can confidently order to reconstruct the genome unambiguously. Each branching point in the graph indicates that we cannot determine the order between some fragments. In some cases, we can only determine the partial order (in the mathematical sense) of a subset of the fragments, but not of all fragments.
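The A, B, C situation can be reproduced with a few made-up strings. In the sketch below, the sequences, the `overlap` helper, and the `relation` function are all invented for illustration; the fragments are assumed to be error-free and on the same strand. Both B and C start inside the repeat at the end of A, but they continue into different sequences.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b` (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def relation(x, y, min_len=3):
    """'before' if x overlaps into y, 'after' if y overlaps into x, else None."""
    if overlap(x, y, min_len):
        return "before"
    if overlap(y, x, min_len):
        return "after"
    return None  # no overlap evidence: the pair cannot be ordered

repeat = "CATGGCAATCG"            # toy repeat, present in more than one place
A = "TTGAC" + repeat              # A ends with the repeat
B = repeat[2:] + "TTTAGG"         # B starts inside the repeat, then goes one way
C = repeat[5:] + "ACCGTA"         # C starts inside the repeat, then goes another way

print(relation(A, B), relation(A, C), relation(B, C))

# Naive placement: where each fragment seems to start relative to A.
start_of_B = len(A) - overlap(A, B)
start_of_C = len(A) - overlap(A, C)
print(start_of_B, start_of_C)
```

The naive start positions relative to A suggest that B comes before C, yet `relation(B, C)` finds no consistent overlap between them. The order is defined for A-B and A-C only, which is exactly a partial order.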
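As a rough sketch of "break the contigs at every branching point", the function below walks a directed overlap graph and reports the maximal unbranched paths. The graph, the node names, and the function name `unbranched_paths` are hypothetical; a real assembly graph also has to deal with both strands, spurious edges from sequencing errors, and closed cycles, which this toy version ignores.

```python
def unbranched_paths(graph):
    """Split a directed graph {node: [successors]} into maximal unbranched paths.

    An edge u -> v is kept inside a path only when it is the sole edge
    leaving u and the sole edge entering v. Isolated cycles are not reported.
    """
    preds = {node: [] for node in graph}
    for u, successors in graph.items():
        for v in successors:
            preds[v].append(u)

    def unambiguous(u, v):
        return len(graph[u]) == 1 and len(preds[v]) == 1

    paths = []
    for node in graph:
        # Skip nodes that would be reached by extending an earlier path.
        if len(preds[node]) == 1 and unambiguous(preds[node][0], node):
            continue
        path = [node]
        while len(graph[path[-1]]) == 1 and unambiguous(path[-1], graph[path[-1]][0]):
            path.append(graph[path[-1]][0])
        paths.append(path)
    return paths

# Toy graph: fragment A ends in a repeat, so it links to both B and C.
graph = {
    "L": ["A"],
    "A": ["B", "C"],
    "B": ["B2"],
    "C": ["C2"],
    "B2": [],
    "C2": [],
}
print(unbranched_paths(graph))
```

The single branch at A is enough to split this toy graph into three paths: one ending at A, and one for each of the two possible continuations.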

Some Real and Interesting Cases

I documented a couple of interesting cases of such ambiguities a couple of years ago. (See this slide deck: “what after getting awesome contigs: challenges and opportunities”). Back then, we had to work with DNA sequencing data that had 10% to 15% errors. Recently, DNA sequencing technologies have advanced so that DNA fragments of 20,000 base pairs (= 20,000 A, C, G, or T letters) can be sequenced with an error rate below 0.5%. It becomes interesting to see whether these regions can now be resolved.

By aligning the assembled contigs to a reference, we can examine the assembly graph around where the contig breaks to understand what is going on.

One of the trouble regions that I have previously seen in the human genome is around the alpha-amylase gene on chromosome 1. Here is the assembly graph of the region from the most recent Peregrine Genome Assembler code base:

A locally tangled graph across AMY1 on chromosome 1.

We track the directions of both the forward and the reverse DNA strands. The arrows indicate the local orders/overlaps of two reads along the forward or reverse direction. While we know “ctg09F” comes before “ctg51R” on human chromosome 1, it will take additional work to see if we can convert the ambiguous order of the fragments between the two contigs into an unambiguous one.

Another interesting case is around human Chr. 6, bases 26,063,026 to 26,890,349:

An example of the “lollipop” motif created by inverted repeats that cannot be traversed unambiguously. See the following text for an explanation.

Our assembly has a break between ctg31f and ctg36r around this region, and the corresponding assembly graph has a lollipop shape. What is going on? It turns out there is a large inverted repeat.

(See https://en.wikipedia.org/wiki/Inverted_repeat for examples of smaller inverted repeats; the repeats shown here are much larger than the examples on the Wikipedia page.)

A cartoon showing how the lollipop motif is formed from inverted repeats.

Because of the repeats, there are two ways to go through the loop part. We don’t know which choice places the DNA fragments in the loop in the same order as in the genome. We can randomly choose one, but we will have about a 50% chance of being wrong.

Two equivalent ways (yellow / blue) to “walk” from point A to point B in the graph. If the repeats (around points A and B) are exactly the same, we can’t determine which choice is correct. We do not know which order of the fragments within the orange segments, relative to the other parts of the assembly, is right.

This “loop” may be opened up if (1) we have very long reads spanning the repeats or (2) the repeat fragments corresponding to the “stem” part of the lollipop are not 100% identical. If so, there is hope that a detailed analysis with a more sophisticated algorithm can separate the repeats and open up the loop to get a unique order of the fragments.
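A tiny simulation shows why the two walks are indistinguishable. The sketch below builds two toy genomes that differ only in the orientation of the loop between a perfect inverted repeat, then collects every error-free read of a given length from both strands. All sequences and names (`revcomp`, `read_set`, `genome_a`, `genome_b`) are made up for this illustration, and the repeat here is only four bases long.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def read_set(genome, k):
    """Every error-free read of length k, ignoring which strand it came from."""
    reads = set()
    for i in range(len(genome) - k + 1):
        read = genome[i:i + k]
        reads.add(min(read, revcomp(read)))  # one canonical form per strand pair
    return reads

left, repeat, loop, right = "AAC", "GTCA", "CCATT", "TGA"

# Same stem (the inverted repeat), but the loop is walked in opposite directions.
genome_a = left + repeat + loop + revcomp(repeat) + right
genome_b = left + repeat + revcomp(loop) + revcomp(repeat) + right

for k in (4, 5, 6):
    same = read_set(genome_a, k) == read_set(genome_b, k)
    print(k, "identical read sets" if same else "read sets differ")
```

In this toy case, reads of length 4 or 5 cannot reach from the left flank through the whole repeat into the loop, so both genomes produce exactly the same reads. Reads of length 6 can span the repeat, and the two read sets then differ, which is the first of the two ways to open up the loop.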

There are also other more complicated examples:

Two randomly picked examples of complicated regions that may be hard to solve.

Reducing DNA sequencing errors or implementing algorithms for correcting reads may help make some of these cases simpler, or even resolve them fully. And yes, there are many other researchers (e.g., Heng Li and Haoyu Cheng) who are eager and working hard to solve these problems to get better assemblies.

Brief Summary

While the basic principle of genome assembly is actually quite simple, complexity arises from repeat sequences and DNA sequencing errors. Many people have spent countless hours finding ways to solve these problems. I hope you find this interesting. If you are in the field of genomics, it might help you to know a little about how the s̶a̶u̶s̶a̶g̶e̶ genome reference is made.