Table of Contents
Pattern Matching
This tutorial is about searching a known string or StringSet (needle) in another string or StringSet (haystack). The searching can be exact or approximate. The latter allows for errors, which are either only mismatches or also indels. Additionally there are filtration algorithms which return potential matches, i.e. haystack segments possibly containing a pattern match. All searching is done by calling the function find, which takes at least two arguments:
- A Finder that stores all necessary information about the haystack and the last found position of the needle within the haystack.
- A Pattern that stores all information about the needle.
Some variants of find support further arguments. The Finder and Pattern classes expect the underlying haystack and needle types as first template arguments. In addition, a second template argument specifies the search algorithm.
Each call of find finds only one match (or potential match) of the needle within the haystack. The Finder can be asked for the begin and end position of the last found match. The Pattern can be asked for the number of the found sequence if the needle is a StringSet. Subsequent calls of find can be used to find more occurrences of the needle, until no more occurrences can be found and find returns false.
In general, search algorithms can be divided into algorithms that preprocess the needle (online search) or preprocess the haystack (index search).
Online Search
For all online search algorithms, the Finder is an iterator that scans over the haystack. The Pattern is a search algorithm dependent data structure preprocessed from the needle. The second template argument of the Pattern selects the search algorithm.
Exact Search
The following code snippet illustrates the usage of online search algorithms in SeqAn using the example of the Hoorspool (Horspool, 1980) algorithm. We begin by creating two strings of type char containing the haystack and the needle.
#include <iostream> #include <seqan/find.h> using namespace seqan; int main() { CharString haystack = "Simon, send more money!"; CharString needle = "mo";
We then create Finder and Pattern objects of these strings and choose Horspool as the specialization in the second template argument of Pattern.
Finder<CharString> finder(haystack); Pattern<CharString, Horspool> pattern(needle); while (find(finder, pattern)) std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl; return 0; }
Program output:
[2,4) mo [12,14) mo [17,19) mo
Currently the following exact online algorithms for searching a single sequence are implemented in Seqan:
| Specialization | Description |
| Simple Finder | Brute force algorithm |
| Horspool | Horspool's algorithm (Horspool, 1980) |
| Bfam | Backward Factor Automaton Matching |
| BndmAlgo | Backward Nondeterministic DAWG Matching |
| ShiftAnd | Exact string matching using bit parallelism |
| ShiftOr | Exact string matching using bit parallelism |
... and for multiple sequences:
| Specialization | Description |
| WuManber | Extension of Horspool |
| MultiBfam | Multiple version of Bfam, uses an automaton of reversed needles |
| SetHorspool | Another extension of Horspool using a trie of reversed needles |
| AhoCorasick | Uses a trie with suffix links to scan through the haystack (Aho, Corasick, 1975) |
| MultipleShiftAnd | Extension of ShiftAnd, should only be used if the sum of needle lengths doesn't exceed the machine word size |
Assignments
- Task 1
- Become acquainted with online search algorithms for multiple sequences and search "Simon, send more money!" simultaneously for "mo", "send", "more".
For every match output the begin and end position in the haystack and which needle has been found.
- Difficulty
- 2
- Solution
- can be found here.
Approximate Search
The approximate search can be used to find segments in the haystack that are similar to a needle allowing errors, such as mismatches or indels. Note that if only mismatches are allowed, the difference of the end and begin position of a match is the length of the found needle. However, in the case of indels this difference may vary and is only a rough estimate for the length. Therefore, to find a begin position for a certain end position the findBegin interface should be used. The usage is similar to find and is shown in the next example. We want to find all semi-global alignments of a needle "more" with a SimpleScore of at least -2 using the scoring scheme (0,-2,-1) (match,mismatch,gap).
Again, we create haystack and needle strings first:
#include <iostream> #include <seqan/find.h> using namespace seqan; int main() { CharString haystack = "Simon, send more money!"; CharString needle = "more";
We then create Finder and Pattern objects of these strings and choose DPSearch as the specialization in the second template argument of Pattern. DPSearch expects the scoring function as the first template argument which is SimpleScore in our example. The pattern is constructed using the needle as a template and our scoring object is initialized with the appropriate scores for match, mismatch and gap. As in the previous example, the main iteration uses find to iterate over all end positions with a minimum best score of -2. If such a semi-global alignment end position is found the begin position is searched via findBegin. Please note that we have to set the minimum score to the score of the match found ( getScore) in order to find the begin of a best match. We then output all begin and end positions and the corresponding haystack segment for each match found.
Finder<CharString> finder(haystack); Pattern<CharString, DPSearch<SimpleScore> > pattern(needle, SimpleScore(0, -2, -1)); while (find(finder, pattern, -2)) while (findBegin(finder, pattern, getScore(pattern))) std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl; return 0; }
Program output:
[2,4) mo [12,14) mo [12,15) mor [12,16) more [12,17) more [12,18) more m [17,19) mo [17,21) mone
| Specialization | Description |
| DPSearch | Dynamic programming algorithm for many kinds of scoring scheme |
| Myers | A fast bit-parallel DP algorithm for edit distance (Myers 1999, Ukkonen 1985) |
| Pex | A (hierarchical) filter based on the pigeonhole principle (Baeza-Yates, Navarro, 1999) |
| AbndmAlgo | Approximate Backward Nondeterministic DAWG Matching, adaption of BndmAlgo |
Assignments
- Task 2
- Modify the example above to search with the Myers algorithm for matches of "more" with an edit distance of at most 2.
- Difficulty
- 1
- Solution
- can be found here.
Index Search
Exact Search
For the index based search the Finder needs to be specialized with an Index of the haystack in the first template argument. The index itself requires two template arguments, the haystack type and a index specialization. In contrast, since the needle is not preprocessed the second template argument of the Pattern has to be omitted. The following source illustrates the usage of an index based search in SeqAn using the example of the IndexEsa index (an enhanced suffix array index), the default index specialization if no second template argument for the index is given. We begin to create an index object of our haystack "tobeornottobe" and a needle "be".
int main() { Index<CharString> index("tobeornottobe"); CharString needle = "be"; Finder<Index<CharString> > finder(index);
We proceed to create a Pattern of the needle and conduct the search in the usual way:
Pattern<CharString> pattern(needle); while (find(finder, pattern)) std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl;
Instead of creating and using a pattern solely storing the needle we can pass the needle directly to find. Please note that an Index based Finder has to be reset with clear before conducting another search.
clear(finder); while (find(finder, "be")) std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl; return 0; }
Program output:
[11,13) be [2,4) be [11,13) be [2,4) be
All indices also support StringSet texts and can therefore be used to search multiple haystacks as the following example shows. We simply exchange the CharString of the haystack with a StringSet of CharString and append some strings to it. The rest of the program remains unchanged.
int main() { typedef StringSet<CharString> THaystacks; THaystacks haystacks; appendValue(haystacks, "tobeornottobe"); appendValue(haystacks, "thebeeonthecomb"); appendValue(haystacks, "beingjohnmalkovich"); Index<THaystacks> index(haystacks); Finder<Index<THaystacks> > finder(haystacks);
Program output:
[< 0 , 11 >,< 0 , 13 >) be [< 1 , 3 >,< 1 , 5 >) be [< 2 , 0 >,< 2 , 2 >) be [< 0 , 2 >,< 0 , 4 >) be
The following index specializations support the Finder interface as described above.
| Specialization | Description |
| IndexEsa | Enhanced suffix array based index. Supports arbitrary needles. |
| IndexQGram | q-gram index. Needle mustn't exceed the size of the q-gram. |
| OpenAddressing | q-gram index with open addressing. Supports larger q-grams. Needle and q-gram must have the same size. |
Besides the find interface there is another interface for indices using suffix tree iterators to search exact needle occurrences described in the index tutorial.
Assignments
- Task 3
- Modify the example above to search with an OpenAddressing q-gram index for matches of "tobe" in "tobeornottobe".
- Difficulty
- 2
- Solution
- can be found here.
Approximate Filtration
Currently there are no indices directly supporting an approximate search. But nevertheless, there are approximate search filters available that can be used to filter out regions of the haystack that do not contain an approximate match, see Swift. The regions found by these filters potentially contain a match and must be verified afterwards. beginPosition, endPosition and infix can be used to return the boundaries or sequence of such a potential match. For more details on using filters, see the Swift howto.
Submit a comment
If you found a mistake, or have suggestions about an improvement of this page press: submit your comment
