Scalable genomic alignment with Progressive Cactus

Multiple genome alignment is an important method in comparative genomics and evolutionary studies: it attempts to map every region in each input genome to the corresponding segments in every other genome. Such alignments help clarify the relationships between those segments and unlock key insights into genome evolution.

With the growing number of published genomic sequences, many studies seek to analyse increasingly large sets of complex genomes. Multiple genome alignment tools therefore need to scale to ever-growing sets of input genomes.

An important class of multiple genome alignment tools is the reference-free aligners, also known as non-reference-based aligners, which do not require a reference sequence to construct the alignment. One such tool, Cactus, provides highly accurate alignment results and has been shown to outperform its peers.

The original implementation of Cactus dates back to 2012 and has since been used in many genomic projects and studies. Its runtime, however, grows quadratically with the total number of input bases, which means that it cannot, for example, be used to align more than 10 large vertebrate genomes.
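To make quadratic growth concrete, here is a toy calculation in Python. It assumes, purely for illustration, that all genomes are the same size and that cost is exactly proportional to the square of the total number of input bases. The function name is hypothetical and no real Cactus runtimes are implied.

```python
def relative_quadratic_cost(n_genomes: int, baseline_genomes: int = 1) -> float:
    """Cost of aligning n_genomes relative to aligning baseline_genomes.

    Assumption (not from the Cactus authors): equal-size genomes and
    cost proportional to (total input bases) ** 2.
    """
    if n_genomes <= 0 or baseline_genomes <= 0:
        raise ValueError("genome counts must be positive")
    return (n_genomes / baseline_genomes) ** 2


if __name__ == "__main__":
    # Compare a few input sizes against a single-genome baseline.
    for n in (2, 10, 100):
        print(f"{n} genomes: {relative_quadratic_cost(n):.0f}x the baseline cost")
```

Under this simplified model, a tenfold increase in the number of genomes costs a hundred times as much, which is why a quadratic method stops being practical as input sets grow.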

Progressive Cactus is a new extension of the Cactus aligner designed to perform well on large sets of input genomes (hundreds to thousands of large genomes). Unlike its predecessor, Progressive Cactus implements a linear-time progressive algorithm, which recursively breaks the multiple alignment problem down into smaller subproblems and then aligns the resulting sub-alignments back together to form the final alignment.
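The recursive decomposition can be pictured with a short Python sketch. This is not the Cactus code or its API. It assumes the subproblems are organised along a guide tree (written here as nested tuples with made-up species names) and it only lists which groups would be aligned, and in what order, without performing any alignment. The names `progressive_subproblems` and `anc(...)` are hypothetical.

```python
from typing import List, Tuple, Union

# A guide tree is either a leaf (genome name) or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]
Subproblem = Tuple[str, List[str]]  # (result label, labels of the inputs)


def progressive_subproblems(tree: Tree) -> Tuple[str, List[Subproblem]]:
    """Return the root label and the subproblems in the order they are solved.

    Children are handled before their parent (post-order), so every
    subproblem only combines results that already exist.
    """
    if isinstance(tree, str):
        return tree, []  # a leaf genome needs no alignment of its own

    child_labels: List[str] = []
    subproblems: List[Subproblem] = []
    for child in tree:
        label, child_subproblems = progressive_subproblems(child)
        child_labels.append(label)
        subproblems.extend(child_subproblems)

    # Placeholder for the real work: align the children and merge the result.
    merged = "anc(" + ",".join(child_labels) + ")"
    subproblems.append((merged, child_labels))
    return merged, subproblems


if __name__ == "__main__":
    guide_tree: Tree = (("human", "chimp"), ("mouse", "rat"))  # illustrative only
    root, steps = progressive_subproblems(guide_tree)
    for result, inputs in steps:
        print(result, "<-", inputs)
```

In this sketch each internal node of the tree yields exactly one small subproblem, so the number of subproblems grows with the number of genomes rather than with the square of the total input size.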

Bottom line

Genome alignment is the sine qua non of comparative genomics and evolutionary studies. As such studies grow in scale, genome alignment tools must continually improve to cope with the ever-growing complexity of multiple genome alignment problems.

By implementing a progressive alignment strategy, Progressive Cactus becomes suitable for aligning hundreds to thousands of large input genomes, providing an opportunity to uncover new insights into genome evolution and natural history.

See also

Plasmid Map Generator

A tool to generate plasmid maps from GenBank files.

WikiPathways: A Wikipedia for biological pathways

An overview of the collaboratively edited structured pathway encyclopedia.

Bioinformatics Crossword

A daily crossword puzzle for bioinformatics terms.

Awesome bioinformatics

A curated list of awesome resources on bioinformatics.

Bioinformatics pronunciation guide

A pronunciation guide for bioinformatics terms.
