Genome Rearrangement

A group of 11 mathematicians, smiling and posing for the camera.

Collaborators

Lora Bailey
Heather Smith Blake
Garner Cochran
Nathan Fox
Michael Levet
Reem Mahmoud
Elizabeth Matson
Inne Singgih
Grace Stadnyk
Elena Wang

First noted by Sturtevant in the study of strains of Drosophila (fruit flies), genome rearrangement is a large scale mutation which occurs when the order of genes within a genome is altered (opposed to small scale mutations which occur within a gene itself). Aligning with Occam's razor, a common genome rearrangement problem is determining a minimal sequence of mutations would transform one genome into another. We call such a minimal sequence a most parsimonious scenario.

Represent each gene by a directed edge with a distinct label, where direction identifies which strand of the DNA contains the gene coding sequence. A genome is then represented by an edge-labeled directed graph with maximum total degree (sum of in-degree and out-degree) two, such as the two genomes in the figure below. Each component in the underlying graph then corresponds to a chromosome, where paths are linear chromosomes as in eukaryotes and cycles are circular chromosomes that play a role in tumor growth.

A visual representation of the adjacency graph between two genomes. Genome one consists of two components: the first being arrows 4 and 5 joined head-to-head, and the second being arrows 1 and 3 joined head-to-head, arrows 1 and 2 joined tail-to-tail, and the head of arrow 2 joined to the tail of arrow 3 (forming a triangle). The second genome is one component, with arrows 5 and 4 joined head-to-head, the tail of 4 joined to the tail of 3, the head of 3 joined to the tail of 1, and the head of 1 joined to the head of 2. The adjacency graph is drawn below, constructed as described in the main text.

The adjacency graph accentuates the differences between Genome One and Genome Two.

The adjacency graph between two genomes, first defined by Bergeron, Mixtacki, and Stoye in A Unifying View of Genome Rearrangements, is a bipartite graph which contains a vertex for each vertex of the two genomes, with each new vertex labeled by the `head/tail' of the adjacent edge(s) in the corresponding genome. There is an edge between two vertices if there is an non-empty intersection of their labels. Notably, two genomes are the same if and only if the adjacency graph consists of trivial paths and trivial cycles.

A model for genome rearrangement specifies which types of mutations are allowable. In creating models, we must balance biological relevance with computational approachability. Unsurprisingly, the models which most accurately represent nature have been proven to be computationally intractable or still have unknown complexity. For each model, we ask for an enumeration of potential evolutionary histories. The sheer number of histories makes it infeasible to check statistics on every history. However, if a polynomial time algorithm exists for enumerating the histories, then we can sample the histories from a near-uniform distribution for hypothesis testing.

With a cohort formed by the 2022 AMS MRC Conference on Trees In Many Contexts, we have been studying the Single-Cut and Join model introduced by Bergeron, Medvedev, and Stoye in Rearrangement Models and Single-Cut Operations. This model allows three mutations:

Fission (Cut): Replace adjacency {a,b} with telomeres {a} and {b}.
Fusion (Join): Replace telomeres {a} and {b} with an adjacency {a,b}
Single-Cut and Join: Replace adjacency {a,b} and telomere {c} with adjacency {a,c} and telomere {b}.

Our goal is to count the number of most parsimonious scenarios between two genomes (that is, the number of potential evolutionary histories of minimal distance), and to determine the computational complexity of that count. Too technical to list here, we have derived formulae for many cases, such as when the adjacency graph consists of only cycles, or of only non-cycles, and we are circling a formula for full generality. The complexity of each case varies greatly. We have been able to show that, under polynomial-time Turing reductions, the number of most parsimonious scenarios in certain cases is #p-complete. In essence, this means that counting the number of most parsimonious scenarios is among the most difficult #p counting problems; i.e., every problem which asks for "the number of solutions to a problem wherein each solution can be verified (not necessarily solved for) in polynomial time" can be rephrased in polynomial time to counting most parsimonious scenarios. I am excited to say that these results are in preparation.

A visual representation of genomic rearrangement. Gene one consists of arrow 3, arrow 1, and arrow 2 all head-to-tail in that order. Gene two consists of arrow 1, arrow 2 head-to-tail, and the head of arrow 3 joined to the head of arrow 2. Gene three is the same as gene two, except arrow 1 and arrow 2 have been cut apart, forming two components.

Genome One is mutated into Genomes Two by single-cut and joining {t_1, h_3} and {h_2} with {t_1} and {h_2, h_3}. Then {h_1, t_2} is cut to create Genome 3.

Students:

If you have a strong interest in combinatorics and are interested in working with me, we could consider a variety of related problems. Determining the complexity class of various genome rearrangement models can be a very difficult problem and is likely beyond the scope of a semester project, but we can start by combinatorically counting the number of most parsimonious scenarios for a given genome rearrangement model, which is far more approachable.