wiki:Tutorial/Alignments

Alignments

Alignments are one of the most basic and important ways to measure similarity between two or more sequences. In this part of the tutorial we want to demonstrate the data structures and functions that SeqAn offers for computing local or global, pairwise, or multiple sequence alignments. There are two ways for storing alignments in SeqAn: (1) the Align data structures, and (2) the Alignment Graph.

Alignment Representation

Align Data Structure

To understand the Align data structure it is first necessary to look at the underlying Gaps data structures. This data structure is used for storing gapped sequences, i.e. the rows of an alignment.

SeqAn implements different specializations of the Gaps class: ArrayGaps, and SumlistGaps. Not going into further detail about the differences between these specializations, we here want to focus on demonstrating the concept of source (sequence space) and view (gapped sequence space) positions of a gapped sequence.

In this example, we will use the ArrayGaps specialization to show how to insert and remove gaps, and to manually construct an alignment between two sequences, and to print the source and the view positions of this alignment.

Let's start by including the header file seqan/align.h which contains the necessary data structures and functions.

#include <iostream>
#include <seqan/align.h>

using namespace seqan;
int main()
{

We define our sequence type TSequence, our alignment type TAlign, and our gapped sequence type TRow by using the metafunction Row.

    typedef String<char> TSequence;                 // sequence type
    typedef Align<TSequence,ArrayGaps> TAlign;      // align type
    typedef Row<TAlign>::Type TRow;                 // gapped sequence type

Now we define two sequences, seq1 and seq2, of type TSequence, and our alignment object align of type TAlign. The function rows returns a container of TRow elements (the correct container type can be acquired by using metafunction Rows) which we resize to 2 in order to make room for two source sequences. The function assignSource is used to set the source sequence of the first row to seq1, and the source sequence of the second row to seq2.

    TSequence seq1 = "CDFGHC";
    TSequence seq2 = "CDEFGAHC";
    TAlign align;
    resize(rows(align), 2);
    assignSource(row(align,0),seq1);
    assignSource(row(align,1),seq2);

So far the rows of the align object do not contain any gaps, they simply represent the original sequences. Let's print the 'alignment' to see what it looks like. Next we manually construct an alignment (don't worry, we will see better ways of constructing alignments later on). The function row returns an individual row of the alignment. We use a reference TRow & to each row, in order to manipulate the actual align object and not just a copy of the row. Now we insert two gaps into the first row: one gap after the 'D' (position 2 in source and in view space), and one gap after the 'G' (position 4 in sequence space, but position 5 in view space!). We can now print the nice manually constructed alignment.

    ::std::cout << align;
    TRow &row1 = row(align,0);
    TRow &row2 = row(align,1);
    insertGap(row1,2);
    insertGap(row1,5);
    ::std::cout << align;

For each view position in each of the rows, we can print the corresponding sequence position. Where the gapped sequence contains a gap character, the source position is defined to be the position of the character on the right border of the gap.

    ::std::cout << ::std::endl << "ViewToSource1: ";
    for(unsigned i = 0; i < length(row1); ++i)
        ::std::cout << toSourcePosition(row1, i) << ",";
    ::std::cout << ::std::endl << "ViewToSource2: ";
    for(unsigned i = 0; i < length(row2); ++i)
        ::std::cout << toSourcePosition(row2, i) << ",";
    ::std::cout << ::std::endl;

We can also walk through the sequences and print the view position that corresponds to each sequence position.

    ::std::cout << ::std::endl << "SourceToView1: ";
    for(unsigned i = 0; i < length(source(row1)); ++i)
        ::std::cout << toViewPosition(row1, i) << ",";
    ::std::cout << ::std::endl << "SourceToView2: ";
    for(unsigned i = 0; i < length(source(row2)); ++i)
        ::std::cout << toViewPosition(row2, i) << ",";
    ::std::cout << ::std::endl;
    return 0;
}

And here is the output of this short example program:

      0     .    
        CDFGHC--
        ||      
        CDEFGAHC
 
  
      0     .    
        CD-FG-HC
        || || ||
        CDEFGAHC
 
  
ViewToSource1: 0,1,2,2,3,4,4,5,
ViewToSource2: 0,1,2,3,4,5,6,7,
 
SourceToView1: 0,1,3,4,6,7,
SourceToView2: 0,1,2,3,4,5,6,7

In the first alignment we see that the end of the first row is artificially filled up with gaps. These gaps are only in the visualization but not explicitly stored in the underlying gapped sequence data structure. The second alignment is the one we manually constructed.

AlignmentGraph Data Structure

The Alignment Graph is a graph in which each vertex corresponds to a sequence segment, and each edge indicates an ungapped alignment between the connected vertices, or more precisely between the sequences stored in those vertices. Here is an example of such a graph:

source:trunk/seqan/core/demos/tutorial/alignments/alignment_example.png

The general usage of graphs is explained in the Graphs section of this tutorial. How to use alignment graphs will be demonstrated in the following subsection, while also introducing the pairwise alignment functions offered by seqan.

Scoring Schemes

Scoring schemes define the score for aligning two characters of a given alphabet and the score for gaps within alignments. Given an alignment between two sequences and a scoring scheme, the score of the alignment can be computed as the sum of the scores for aligned character pairs plus the sum of the scores for all gaps.

An example for a scoring scheme is the levenshtein distance, for which each mismatch between two aligned characters costs 1 and each character that is aligned with a gap costs 1. Translated into scores instead of costs, misalignments get a score of -1 and gaps a score of -1 per character. This scoring scheme is the default for Simple Score.

Character Pair Scores

SeqAn offers two kinds of scoring scheme:

Simple Score This scoring scheme differentiates between "match" (the two aligned characters are the same), "mismatch" (the two aligned characters are different), and gaps. It supports linear and affine gap costs. The functions scoreMatch and scoreMismatch access values for match and mismatch. The function scoreGap, or scoreGapExtend and scoreGapOpen access values for gaps.
Scoring matrices These scoring schemes store a score value for each pair of characters. This value can be accessed using score. Examples for this kind of scoring scheme are Pam120 and Blosum62. The class Score Matrix can be used to store arbitrary scoring matrices. Also see the HowTo on custom scoring matrices.

Gap Scores

SeqAn scoring schemes support affine gap costs: The score for a gap of length n is gap_open + (n - 1) * gap_extend. While gap_open and gap_extend are typical negative values that represent the costs for "opening a gap" and "extendig the gap". If gap_open == gap_extend (which is the default), the scoring scheme uses linear gap costs.

Pairwise Alignments

Global

Let's first look at global pairwise alignments, and later at local pairwise alignments.

SeqAn offers different global alignment algorithms, for a complete list see here.

In the following example, we want to compute a global alignment of two sequences using the Hirschberg algorithm and setting the match score to 1, and mismatch as well as gap penalty to -1. We print the alignment and the score. After that, we solve the same task using the alignment graph data structure, this time by use of the Needleman-Wunsch algorithm.

First the necessary includes and typedefs:

#include <iostream>
#include <seqan/align.h>

using namespace seqan;
int main()
{

We initialize an align object as in the previous example.

    typedef String<char>            TSequence;  // sequence type
    typedef Align<TSequence,ArrayGaps>  TAlign;     // align type

    TSequence seq1 = "CDFGHC";
    TSequence seq2 = "CDEFGAHC";
    TAlign align;
    resize(rows(align), 2);
    assignSource(row(align,0),seq1);
    assignSource(row(align,1),seq2);

Now a global alignment is computed using the function globalAlignment. The scoring parameters are handed to this function via a Score object. The tag Hirschberg makes sure the desired algorithm is used (for an overview of available alignment algorithms see Global Alignment Algorithms). The resulting alignment and score are then printed.

    int score = globalAlignment(align,Score<int>(1,-1,-1,-1), Hirschberg());
    ::std::cout << "Score = " << score << ::std::endl;
    ::std::cout << align;

Now let's make use of the Alignment Graph for the first time. The alignment graph needs to be initialized on a StringSet, so we first create the object sequences and append our sequences of type TSequence. The first template argument of the AlignmentGraph is StringSet<TSequence, Dependent<> >. We use the tag Dependent so that the sequences are not copied to the graph but only a reference to the original StringSet is held.

    StringSet<TSequence> sequences;
    appendValue(sequences,seq1);
    appendValue(sequences,seq2);
    typedef StringSet<TSequence, Dependent<> > TDepStringSet;
    typedef Graph<Alignment<TDepStringSet> > TAlignGraph;
    TAlignGraph alignG(sequences);

The function globalAlignment computes the global alignment, this time with the Needleman-Wunsch algorithm as specified by the tag NeedlemanWunsch.

    score = globalAlignment(alignG,Score<int>(1,-1,-1,-1), NeedlemanWunsch());
    ::std::cout << "Score = " << score << ::std::endl;
    ::std::cout << alignG;
    return 0;
}

And here is the output:

Score = 4
      0     .
        CD-FG-HC
        || || ||
        CDEFGAHC


Score = 4
Alignment matrix:
      0     .
        CD-FG-HC
        || || ||
        CDEFGAHC

Assignments

Task 1
Write a program that computes a global alignment of the DNA sequences "acgtacgtact" and "actactacgt" using the Align data structure. Iterate through the rows of the aligment and for each row output the view positions of the gaps.
Hint
Use the function isGap.
Solution
can be found here
Task 2
Compute a semi-global alignment of the sequences "blablablu" and "abab" using edit distance and an alignment algorithm of your choice. Print distance and alignment.
Hint
When using the Alignment Graph data structure, one can compute more sophisticated alignments, so called end-space free alignments. These alignments can be computed by using the regular global alignment algorithms, but initializing the DP matrix with zeros in the first row and/or column, and/or by searching for the best score in the last row and/or column. This can be achieved by using an AlignConfig object that specifies which end spaces are free from penalty.
Solution
can be found here

Local

Now let's look at local pairwise alignments.

Seqan offers the classical Smith-Waterman algorithm that computes the best local alignment with respect to a given scoring scheme, and the Waterman-Eggert algorithm, which computes not only the best but also suboptimal local alignments.

We are going to demonstrate the usage of both in the following example where first the best local alignment of two character strings and then all local alignments of two DNA sequences with a score greater than or equal to 4 are computed.

#include <iostream>
#include <seqan/align.h>

using namespace seqan;
int main()
{

Let's start with initializing the Align object to contain the two sequences.

    Align< String<char> > ali;
    appendValue(rows(ali), "aphilologicaltheorem");
    appendValue(rows(ali), "bizarreamphibology");

Now the best alignment given the scoring parameters is computed by the function localAlignment. The returned score value is printed directly, and the alignment itself in the next line. The functions clippedBeginPosition and clippedEndPosition can be used to retrieve the begin and end position of the matching subsequences within the original sequences.

    ::std::cout << "Score = " << localAlignment(ali, Score<int>(3,-3,-2, -2), SmithWaterman()) << ::std::endl;
    ::std::cout << ali;
    ::std::cout << "Aligns Seq1[" << clippedBeginPosition(row(ali, 0)) << ":" << (clippedEndPosition(row(ali, 0))-1) << "]";
    ::std::cout << " and Seq2[" << clippedBeginPosition(row(ali, 1)) << ":" <<  (clippedEndPosition(row(ali, 1))-1) << "]" << ::std::endl << ::std::endl;

Next, several local alignments of the two given DNA sequences are going to be computed. First, the Align object is created.

    Align< String<Dna> > ali2;
    appendValue(rows(ali2), "ataagcgtctcg");
    appendValue(rows(ali2), "tcatagagttgc");

A LocalAlignmentFinder object needs to be initialized on the Align object. In addition to the Align object and the scoring scheme, we now also pass the finder and a minimal score value, 4 in this case, to the localAlignment function. The WatermanEggert tag specifies the desired Waterman-Eggert algorithm. While the score of the local alignment satisfies the minimal score cutoff, the alignments are printed with their scores and the subsequence begin and end positions.

    LocalAlignmentFinder<> finder(ali2);
    Score<int> scoring(2, -1, -2, 0);
    while (localAlignment(ali2, finder, scoring, 4, WatermanEggert())) {
        ::std::cout << "Score = " << getScore(finder) << ::std::endl;
        ::std::cout << ali2;
        ::std::cout << "Aligns Seq1[" << clippedBeginPosition(row(ali2, 0)) << ":" << (clippedEndPosition(row(ali2, 0))-1) << "]";
        ::std::cout << " and Seq2[" << clippedBeginPosition(row(ali2, 1)) << ":" <<  (clippedEndPosition(row(ali2, 1))-1) << "]" << ::std::endl << ::std::endl;
    }
    return 0;
}

Here is the output of the first part of our example program:

Score = 19
      0     .    :
        a-philolog
        | ||| ||||
        amphibolog


Aligns Seq1[0:8] and Seq2[7:16]

The second part outputs two alignments:

Score = 9
      0     .
        ATAAGCGT
        ||| | ||
        ATA-GAGT


Aligns Seq1[0:7] and Seq2[2:8]

Score = 5
      0     .
        TC-TCG
        || | |
        TCATAG


Aligns Seq1[7:11] and Seq2[0:5]

Assignments

Task 3
Write a program which computes the 3 best local alignments of the two AminoAcid sequences "PNCFDAKQRTASRPL" and "CFDKQKNNRTATRDTA" using the following scoring parameters: match = 3, mismatch = -2, gapopen = -5, gapextension = -1.
Solution
can be found here

Multiple Alignments

Apart from pairwise alignments, also multiple sequence alignments can be computed in SeqAn. The easiest way to do this is by using the function globalMsaAlignment. This function computes a heuristic alignment based on a consistency-based progressive alignment strategy as described in  Seqan::TCoffee paper.

In the following example, we align four amino acid sequences using the Alignment Graph data structure and the Blosum62 scoring matrix with gap extension penalty -11 and gap open penalty -1. The required header for multiple sequence alignments is seqan/graph_msa.h.

#include <iostream>
#include <seqan/align.h>
#include <seqan/graph_msa.h>

using namespace seqan;
int main()
{

First, the sequence type TSequence is defined and a StringSet is declared. The four sequences to be aligned are appended to the StringSet seq.

    typedef String<AminoAcid> TSequence;
    StringSet<TSequence> seq;
    appendValue(seq,"DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE");
    appendValue(seq,"RVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK");
    appendValue(seq,"FPKKPLTPYFRFFMEKRAKYAKLHPEMSNLDLTKILSKKYKELPEKKKMKYIQDFQREKQEFERNLARFREDHPDLIQNAKK");
    appendValue(seq,"HIKKPLNAFMLYMKEMRANVVAESTLKESAAINQILGRRWHALSREEQAKYYELARKERQLHMQLYPGWSARDNYGKKKKRKREK");

Now we initialize our Alignment Graph with the sequences. The graph and the Blosum62 scoring matrix are handed to the function globalMsaAlignment which computes the desired alignment.

    Graph<Alignment<StringSet<TSequence, Dependent<> > > > aliG(seq);
    globalMsaAlignment(aliG, Blosum62(-1, -11));
    ::std::cout << aliG << ::std::endl;
    return 0;
}

And here is the output of this example program:

Alignment matrix:
      0     .    :    .    :    .    :    .    :    .    :
        DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSA
          | |   |          |       |      || ||     ||
        RVKRP---MNAFIVWSRDQRRKMALENP--RMRNSEISKQLGYQWKMLTE
          | |              | | |   |   | |    | |    | | |
        FPKKP---LTPYFRFFMEKRAKYAKLHP--EMSNLDLTKILSKKYKELPE
          |||   |        | ||                  ||      |
        HIKKP---LNAFMLYMKEMRANVVAEST--LKESAAINQILGRRWHALSR

     50     .    :    .    :    .    :    .    :
        KEKGKFEDMAKADKARYEREMKTY----------IPPKGE
         ||  |   |    |        |
        AEKWPFFQEAQKLQAMHREKYPNYKYRP---RRKAKMLPK
          |    |  |                            |
        KKKMKYIQDFQREKQEFERNLARFREDH---PDLIQNAKK
            ||      | |                        |
        EEQAKYYELARKERQLHMQLYPGWSARDNYGKKKKRKREK

Assignments

Task 4
Repeat the above example using the Align data structure and the Blosum80 scoring matrix.
Solution
can be found here

Submit a comment

If you found a mistake, or have suggestions about an improvement of this page press: submit your comment