Repository Structure
This is work in progress
The main changes are (1) adding an "extras" and "sandbox" structure, (2) changing the way CMake is used, and (3) introducing INFO files for each app/module.
This page describes the structure of the SeqAn repository.
The intended audience is:
- SeqAn users that work directly with a repository checkout
- SeqAn developers and contributors,
- Students of the FU Berlin, working inside the SeqAn repository.
The intended audience does not include SeqAn users who just want to use the Release version.
Aims
The aims of the repository structure are to
- be easy to understand,
- be similar to traditional C/C++ project structures, keeping the build files parallel to the sources (although out-of-source builds are preferred),
- have different areas for rock-solid mature code, recently contributed but stable code, and for possibly unstable code actively developed by each SeqAn contributor.
Main Repository Structure
The main repository structure, including LICENSE files and some example directories and files is as follows:
seqan
|-- LICENSE
|-- README
|-- GETTING_STARTED
|-- core
|   |-- apps
|   |-- tests
|   |-- demos
|   |-- includes
|   `-- docs
|       `-- globals.dddoc
|-- extras
|   |-- apps
|   |-- tests
|   |-- demos
|   |-- includes
|   `-- docs
|-- sandbox
|   |-- seqan_team
|   |-- contributor_a
|   |-- contributor_b
|   `-- ...
|-- build
|-- ext
|   |-- zlib
|   `-- ...
`-- util
    |-- dddoc
    |-- tpl
    |   |-- app_skel
    |   |-- test_skel
    |   |-- demo_skel
    |   |-- module_skel
    |   |-- new_app.py
    |   |-- new_test.py
    |   |-- new_demo.py
    |   `-- new_module.py
    |-- CMake
    |   |-- FindTBB
    |   `-- ...
    |-- py_lib
    |   |-- README
    |   |-- threadpool
    |   |   |-- INFO
    |   |   `-- __init__.py
    |   `-- ...
    |-- misc
    `-- ...
The directory contents are as follows:
- core
- Area for rock-solid mature code.
- extras
- Area for recently contributed but stable code.
- sandbox
- Area for code actively developed by SeqAn contributors.
- build
- Builds are meant to happen here, inside a subdirectory for each build type. See below for more info.
- ext
- External libraries bundled with SeqAn, such as zlib. Each library must have an INFO file (described below).
- util
- Tools such as dddoc, one directory for each tool or "real package".
- util/misc
- Misc things such as the logo, awk scripts, stuff that does not fit in anywhere else. No fixed structure, try to be consistent with what's already here.
- util/skel
- Code templates / skeletons.
- util/dddoc
- The documentation system.
- util/py_lib
- Reusable Python code from the SeqAn project and bundled non-standard Python modules such as threadpool.
Core Area
Code that lives here is built and tested nightly by the core team, and the results are posted to CDash. It has to pass a code review and be tested, well documented, proven to work, and useful. Stability has the highest priority here.
The code here is mostly maintained by SeqAn developers or selected contributors with write access to the core library.
The structure of this area follows the apps, test, demos, include, docs pattern:
core
|-- apps
|-- tests
|   |-- basic
|   `-- ...
|-- demos
|-- include
|   |-- seqan.h
|   `-- seqan
|       |-- basic.h
|       |-- basic
|       `-- ...
`-- docs
The directory contents are as follows:
- apps
- Applications, each app in a subdirectory.
- tests
- Tests, each test suite in a subdirectory.
- demos
- Demos and tutorials, flat C++ files.
- include
- Contains the seqan.h header and the directory seqan with the library only. This directory is added to the include path of the compiler, e.g. with the -I switch.
- docs
- The documentation is built here. This directory contains a CMakeLists.txt file for building the documentation and the file globals.dddoc (the configuration file). The documentation is built in the subdirectory html.
Extras Area
This is where stable, contributed applications and modules (with demos and tests) live. The code has to pass a rough code review. The apps, tests, and demos are built and tested nightly by the core team on all supported platforms, and the results are posted to CDash. The code here has to be tested and well documented. Its stability has a high priority.
The code here is maintained by SeqAn developers (probably more recent code, not in the library yet) or externally maintained.
The structure of this area follows the apps, test, demos, include, docs pattern:
extras
|-- apps
|   |-- razers4
|   |   `-- razers.cpp
|   `-- ...
|-- tests
|   |-- cute_module
|   |   `-- test_cute_module.cpp
|   `-- ...
|-- demos
|   |-- cute_module.cpp
|   `-- ...
|-- include
|   `-- seqan
|       |-- parsing.h
|       |-- parsing
|       |   `-- parsing_stuff.h
|       `-- ...
`-- docs
Sandbox Area
This is the location for actively developed code. Code located here will move into extras once it has proven to be stable.
Each contributor or contributing group has one subdirectory whose structure should follow the standard apps, test, demos, include, docs pattern:
sandbox
|-- fu_berlin_students
|   |-- student1
|   |   |-- apps
|   |   `-- ...
|   `-- student2
|       |-- apps
|       `-- ...
|-- seqan_team
|   |-- project1
|   |   |-- apps
|   |   |-- test
|   |   |-- demos
|   |   |-- include
|   |   `-- docs
|   `-- project 2
|       `-- ...
|-- contributor_a
|   |-- apps
|   |-- test
|   `-- include
|-- contributor_b
|   `-- ...
`-- ...
CDash builds do not have to be set up. However, contributors who aim to have their code moved into extras should set up CDash builds for at least one platform and keep the code here building and passing its tests.
Code Life Cycle
Code that lives in the SeqAn repository should eventually be made available to the whole community and thus go into extras or the SeqAn core. (One exception is code in the fu_berlin sandbox, which also holds code from MSc and PhD students that might eventually be abandoned.) Thus, development should not start in the SeqAn repository but only move there once a reasonable maturity has been reached (again, the core team takes the privilege to break this rule). Code in sandboxes can break and does not have to build continuously. However, it is strongly recommended to integrate code into the CDash build system as soon as possible so that feedback is as immediate as possible. The aim is that code here eventually goes into extras.
The code in extras is stable, usable code. To get here, it has to pass a rough code review, as it is built on the core team's servers and should not be harmful. Furthermore, the code should be understandable so that, if small changes in the library break code in extras, the core team can fix it easily. The aim here is for the maintainer to fix problems within 2-7 days. The code authors are preferred as maintainers and will be given write access to their code.
Code in the core should be as stable as possible, well documented, actively maintained, and tested. This area is only writable by core developers and "distinguished" developers, who either contributed code to extras that has moved into the core or have distinguished themselves over a span of time through good patches and other contributions. This will be decided on a case-by-case basis. Otherwise, changes to the core are made by attaching patches to the ticket tracker.
Current Modules
- Core Modules
- align
- basic
- chaining
- consensus
- file
- find
- find2
- graph_algorithms
- graph_align
- graph_decomposition
- graph_msa
- graph_types
- index
- map
- misc
- modifier
- parallel
- pipe
- platform
- random
- refinement
- score
- seeds
- seeds2
- sequence
- sequence_journaled
- store
- synopsis
- system
- Extra Modules
- blast
- find_motif
- statistics
- stream
- Sandbox Modules
Current Apps
- Core Apps
- dfi
- mason
- micro_razers
- pair_align
- rabema
- razers
- sak
- seqan_tcoffee
- seqcons
- snp_store
- splazers
- stellar
- tree_recon
- Extra Apps
- indel_simulator
- insegt ?
- param_chooser
- read_analyzer
- rep_sep
- variant_comp
- Sandbox Apps
- fiona (weese)
- hsa (bkehr)
- prob_spec (reinert)
- razers2 (emde)
- razers3 (emde)
- seqan_lagan (rausch / holtgrew ?)
- transquant (weese)
Current Demos
- Core Demos
- alignment.cpp
- alignment_local.cpp
- alignment_msa.cpp
- allocator.cpp
- alphabet.cpp
- annotation_converter.cpp
- file_format.cpp
- file_readwrite.cpp
- file_speed.cpp
- filter_sam.cpp
- find_approx.cpp
- find_exact.cpp
- find_wild.cpp
- gff2gtf.cpp
- graph_algo_bfs.cpp
- graph_algo_dfs.cpp
- graph_algo_flow_fordfulkerson.cpp
- graph_algo_his.cpp
- graph_algo_lcs.cpp
- graph_algo_lis.cpp
- graph_algo_path_allpairs.cpp
- graph_algo_path_bellmanford.cpp
- graph_algo_path_dag.cpp
- graph_algo_path_dijkstra.cpp
- graph_algo_path_floydwarshall.cpp
- graph_algo_path_transitive.cpp
- graph_algo_scc.cpp
- graph_algo_topsort.cpp
- graph_algo_tree_kruskal.cpp
- graph_algo_tree_prim.cpp
- graph_hmm.cpp
- graph_hmm_silent.cpp
- index_find.cpp
- index_find_stringset.cpp
- index_maxrepeats.cpp
- index_mummy.cpp
- index_mums.cpp
- index_node_predicate.cpp
- index_qgram_counts.cpp
- index_sufarray.cpp
- index_supermaxrepeats.cpp
- interval_tree.cpp
- iterator.cpp
- lagan.cpp, lagan1.fasta, lagan2.fasta
- modifier_modreverse.cpp
- modifier_modview.cpp
- modifier_nested.cpp
- rooted_iterator.cpp
- sam2svg.cpp
- seeds.cpp
- segmentalignment.cpp, sequence_1.fa, sequence_2.fa
- sequence_length.cpp
- swift_verification.cpp
- template_subclassing.cpp
- Extra Demos
- benchmark_stream.cpp
- blast_report.cpp, ecoln.out
- find_motif.cpp
- zscore.cpp, zscore_example_mm.3, zscore_human_mm.3
- Sandbox Demos
The INFO File Format
Each app and library module contains an INFO file. This file describes the component. The information is collected and regularly made available on a website.
It has a key/value structure:
Name: NAME
Author: AUTHOR <EMAIL>
Maintainer: MAINTAINER <EMAIL>
Copyright: YEAR, OWNER
Homepage: URL
Version: MAJOR.MINOR
Source: SOURCE
Status: STATUS
License: LICENSE
Description: SINGLE LINE
 EXTENDED DESCRIPTION
There can be multiple Author, Maintainer and Copyright fields.
- Name
- Name of the SeqAn module, or app.
- Author
- Name and email of the code's author. This field is required.
- Maintainer
- Name and email of the maintainer within SeqAn. If the INFO file describes an external library like zlib or a Python module, for example, this is the person responsible for keeping this piece of software up to date. A missing entry means the component is unmaintained. The person in the first Maintainer line is the primary contact person.
- Copyright
- Copyright entry, can be specified multiple times in case it is based on other code. This field is required.
- Homepage
- The homepage of the related project, if any.
- Version
- The version field is increased on every release.
- Status
- The status of the code. The proposed states are: stable (well-tested on all platforms), testing (e.g. tested on one platform but might not build on others), development (non-experimental, but might still undergo large changes).
- Dependencies
- Hard dependencies on other packages. Currently for informational purposes only.
- Recommends
- If another package improves this module, it is specified here. Currently for informational purposes only.
- Source
- Describes where the given package was retrieved from. This field is optional and only given for external libraries. For packages from external sources, changes should be described in the Description section. If the given package has an INFO file of its own, that file should be renamed to INFO.orig and the change should be described under Description.
- License
- The license of the package.
- Description
-
This field is followed by a single-line synopsis, which is used as the short description in listings.
It has to be the last field in the INFO file.
The extended description follows on the subsequent lines and can span multiple lines.
Text in the description field should be wrapped at 78 characters and must not exceed 80 characters per line.
The format follows that of Debian control fields:
The lines in the extended description can have these formats:
- Those starting with a single space are part of a paragraph. Successive lines of this form will be word-wrapped when displayed. The leading space will usually be stripped off.
- Those starting with two or more spaces. These will be displayed verbatim. If the display cannot be panned horizontally, the displaying program will line wrap them "hard" (i.e., without taking account of word breaks). If it can they will be allowed to trail off to the right. None, one or two initial spaces may be deleted, but the number of spaces deleted from each line will be the same (so that you can have indenting work correctly, for example).
- Those containing a single space followed by a single full stop character. These are rendered as blank lines. This is the only way to get a blank line.
- Those containing a space, a full stop and some more characters. These are for future expansion. Do not use them.
- Do not use tab characters. Their effect is not predictable.
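The line rules above can be sketched in Python. This is only an illustration under stated assumptions: the function name `render_description` is hypothetical, exactly one leading space is stripped from verbatim lines, and paragraphs are re-wrapped at 78 characters; none of this is taken from an existing SeqAn tool.

```python
import textwrap


def render_description(lines, width=78):
    """Render extended-description lines according to the rules listed above.

    `lines` are the raw lines following the 'Description:' field.
    Returns a list of output lines.
    """
    out = []        # finished output lines
    paragraph = []  # pieces of the paragraph currently being collected

    def flush():
        # Word-wrap the collected paragraph and start a new one.
        if paragraph:
            out.extend(textwrap.wrap(" ".join(paragraph), width))
            paragraph.clear()

    for line in lines:
        if "\t" in line:
            raise ValueError("tab characters are not allowed: %r" % line)
        if line == " .":
            # A space followed by a full stop renders as a blank line.
            flush()
            out.append("")
        elif line.startswith(" ."):
            # Space, full stop and more characters: reserved, do not use.
            raise ValueError("reserved for future expansion: %r" % line)
        elif line.startswith("  "):
            # Two or more leading spaces: verbatim (assumption: strip one).
            flush()
            out.append(line[1:])
        elif line.startswith(" "):
            # A single leading space: part of a word-wrapped paragraph.
            paragraph.append(line.strip())
        else:
            raise ValueError("line must start with a space: %r" % line)
    flush()
    return out
```

Successive paragraph lines are joined before wrapping, so the wrapping of the source file does not have to match the width of the display.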
The following shows an example:
Name: RazerS
Author: David Weese <first.last@example.net>
Author: Anne-Katrin Emde <first.last@example.net>
Author: Manuel Holtgrewe <first.last@example.net>
Maintainer: David Weese <first.last@example.net>
License: GPL 3 or later
Copyright: 2008-2011, FU Berlin
Homepage: http://www.seqan.de/projects/razers.html
Version: 3.0
Status: stable
Description: RazerS is a fully sensitive read mapping tool.
 RazerS is a fast read mapping tool with full sensitivity using Hamming and
 Levenshtein distance. Given a reference sequence and a set of reads, it
 guarantees to find all occurrences of the given reads in the reference
 sequence.
 .
 The program uses the implementation of the semi-global SWIFT algorithm by
 Rasmussen et al.
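A minimal sketch of how a tool might read the INFO format described above. The names `parse_info`, `MULTI_FIELDS`, `REQUIRED_FIELDS`, and the `ExtendedDescription` key are hypothetical. The sketch assumes one field per line, treats Author, Maintainer, and Copyright as repeatable, checks only the two fields this page marks as required, and takes everything after the Description line as the extended description.

```python
# Fields that may occur more than once according to this page.
MULTI_FIELDS = {"Author", "Maintainer", "Copyright"}
# Fields this page explicitly marks as required.
REQUIRED_FIELDS = {"Author", "Copyright"}


def parse_info(text):
    """Parse INFO file text into a dict; repeatable fields map to lists."""
    info = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if not line.strip():
            continue
        # Split at the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError("not a 'Key: value' line: %r" % line)
        key, value = key.strip(), value.strip()
        if key == "Description":
            # Description is the last field; the remaining lines form
            # the extended description and are kept unmodified.
            info["Description"] = value
            info["ExtendedDescription"] = lines[i + 1:]
            break
        if key in MULTI_FIELDS:
            info.setdefault(key, []).append(value)
        else:
            info[key] = value
    missing = REQUIRED_FIELDS - set(info)
    if missing:
        raise ValueError("missing required fields: " + ", ".join(sorted(missing)))
    return info
```

Called on the RazerS example, such a parser would yield three entries under `Author` and the single-line synopsis under `Description`.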
Out Of Source Builds
Our CMake files do not allow in-source builds, e.g. calling cmake in the apps directory to build all apps. Instead, the builds have to be made in a separate location; these are called out-of-source builds.
The repository contains the empty folder build for just this purpose. We recommend creating one subfolder in this directory for each build configuration. This allows you to compile debug and optimized binaries without having to clean all built files and call CMake in between. It also allows you to generate Eclipse, XCode and Visual Studio projects at the same time.
Here is an example of creating multiple Makefiles in debug and release mode and creating an XCode project. The Getting Started article from the tutorial contains detailed information for generating projects for other IDEs such as Visual Studio.
$ cd build
$ mkdir Debug
$ cd Debug
$ cmake ../.. -DCMAKE_BUILD_TYPE=Debug
$ cd ..
$ mkdir Release
$ cd Release
$ cmake ../.. -DCMAKE_BUILD_TYPE=Release
$ cd ..
$ mkdir XCode
$ cd XCode
$ cmake ../.. -G Xcode
$ cd ..
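The same steps could be scripted. The following is a hedged sketch, not part of the repository: `plan_builds`, `run_builds`, and `CONFIGS` are hypothetical names, and the script assumes it is started from the repository root with `cmake` on the PATH. By default it only prints what it would do.

```python
import os
import subprocess

# One entry per build directory: (subdirectory name, extra cmake arguments).
CONFIGS = [
    ("Debug", ["-DCMAKE_BUILD_TYPE=Debug"]),
    ("Release", ["-DCMAKE_BUILD_TYPE=Release"]),
    ("XCode", ["-G", "Xcode"]),
]


def plan_builds(build_root, configs):
    """Return (build directory, cmake command line) pairs, one per config."""
    plans = []
    for name, extra_args in configs:
        build_dir = os.path.join(build_root, name)
        # "../.." leads from build/<name> back to the repository root.
        plans.append((build_dir, ["cmake", "../.."] + list(extra_args)))
    return plans


def run_builds(build_root="build", configs=CONFIGS, dry_run=True):
    """Create each build directory and run cmake in it (or just print)."""
    for build_dir, argv in plan_builds(build_root, configs):
        print("%s: %s" % (build_dir, " ".join(argv)))
        if not dry_run:
            os.makedirs(build_dir, exist_ok=True)
            subprocess.check_call(argv, cwd=build_dir)
```

Calling `run_builds(dry_run=False)` would then configure all three build directories in one go.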
In Source Builds for Eclipse CDT 4
As described in the VTK Wiki, generating out-of-source builds for Eclipse CDT 4 has drawbacks. In-source builds should, exceptionally, be allowed here. The CMakeLists.txt allows in-source builds if the user creates the file "I_WANT_IN_SOURCE_BUILDS"; if this file is missing, the user is prompted about it when attempting an in-source build.
- Note that while the VTK Wiki page tells you to check "copy project into workspace", this will make all targets disappear in the right-hand side "Make" tab. Uncheck "copy project into workspace".
- Make sure to disable indexing (Preferences -> C++ -> Indexer) before importing the generated project and disable autobuilding (Project -> Build Automatically).
This raises a second problem: when doing in-source builds, the SeqAn applications clash with the existing directories. The binary "razers", for example, would be created at the path "apps/razers", which already is a directory.
We work around this by appending ".exe" to all binary names / names of all targets when doing in-source builds on Non-Windows platforms.
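The decision logic described in this section can be modeled as follows. The real check lives in the CMakeLists.txt files; this Python sketch only illustrates the rules. The function names are hypothetical, and it assumes that "in-source" means the source and build directories are identical.

```python
import os

# Marker file name taken from the text above.
MARKER = "I_WANT_IN_SOURCE_BUILDS"


def is_in_source(source_dir, build_dir):
    """A build is in-source if it is configured inside the source directory."""
    return os.path.realpath(source_dir) == os.path.realpath(build_dir)


def in_source_allowed(source_dir):
    """In-source builds are only allowed if the marker file exists."""
    return os.path.exists(os.path.join(source_dir, MARKER))


def binary_name(target, in_source, platform_is_windows):
    """Return the file name of the binary built for `target`.

    For in-source builds on non-Windows platforms, '.exe' is appended so
    that the binary does not clash with the directory of the same name.
    """
    if in_source and not platform_is_windows:
        return target + ".exe"
    return target
```

With these rules, an in-source build on Linux would place the binary at "apps/razers.exe" next to the directory "apps/razers".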
CMake Usage
Enrico and Manuel discussed the current CMake system. The predominant feature is that there is not one CMakeLists.txt per directory with .cpp files but only four such files: One global one, one for apps, one for tests, and one for demos.
The advantage of the current system is that a new test or app can be created with a new directory, a new .cpp file and a CMake run.
However, the drawbacks are:
- There are a few monolithic and complex CMakeLists.txt files, one of which builds all apps, for example. The handling of special cases is non-trivial and a bit overwhelming for a new developer.
- Each group's project subdirectory should have its own CMakeLists.txt so the maintainers can tweak linked libraries, for example. However, it will be a bit hard to explain all the things in the CMakeLists.txt files.
- Also, CMake generates one project file per CMakeLists.txt file. These grow very large in the current system.
We propose to change the system in the following way:
- There is one CMakeLists.txt in each subdirectory of apps, tests and one in demos.
- The CMakeLists.txt in demos uses a simple loop to add all demos to one project file.
- For each app, there is a CMakeLists.txt file that builds all binaries for this app.
- The same is true for each test.
- Inside the "apps" and "tests" directories, a CMakeLists.txt loops over all subdirectories and simply calls add_subdirectory for each.
This has the following advantages:
- Now, for each CMakeLists.txt, a new project file is generated, this has a finer granularity.
- It appears to be more "the way CMake is intended to be used," with fewer special case handlings.
- Each maintainer can write their own CMakeLists.txt, linking against external libraries when writing an adaptation module for LARA, for example.
- The old approach was generally limiting: with the new one, it is possible to create more than one binary for each app, as in the old RazerS with its param chooser. Compare bowtie and bowtie-build.
It has the following "neutral" property:
- Instead of "apps/razers", there will now be the binary "apps/razers/razers".
It has the disadvantage that
- a new CMakeLists.txt is required for each project.
This problem can be fixed by providing skeleton-creation scripts. Providing such scripts is a good idea anyway.
The root CMakeLists.txt file would add some new CMake functions, that allow the simple creation of new binaries using SeqAn. For example, they would add linking against librt on Linux.
Examples for skeleton creation:
$ ./util/tpl/new_app.py core razers4
# --> ask user for name and email or read from environment variable SEQAN_USER
# --> cp ./util/tpl/app_tpl/app.cpp -> apps/razers4/razers4.cpp
# --> cp ./util/tpl/app_tpl/app.h -> apps/razers4/razers4.h
# --> cp ./util/tpl/app_tpl/CMakeLists.txt -> apps/razers4/CMakeLists.txt
# --> replace name and email, adjust used symbol in "#ifndef ...", etc.
$ ./util/tpl/new_contrib.py foo
# --> create new entry in contrib directory "foo", copy over skeleton, adjust
$ ./util/tpl/new_app.py contrib/foo myniceprogram
# --> ...
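What such a script might do internally can be sketched as below. This is not the proposed script itself: `create_app`, the `$NAME`, `$AUTHOR`, and `$GUARD` placeholders, and the shape of the include guard are assumptions made for illustration, using Python's standard `string.Template` for the substitution.

```python
import os
import string


def create_app(skel_dir, repo_root, location, name, author, dry_run=False):
    """Copy the app skeleton to <repo_root>/<location>/apps/<name>.

    Placeholders in the skeleton files are replaced on the way.
    Returns the list of files that were (or would be) created.
    """
    target_dir = os.path.join(repo_root, location, "apps", name)
    # Hypothetical mapping from skeleton file to generated file.
    mapping = {
        "app.cpp": name + ".cpp",
        "app.h": name + ".h",
        "CMakeLists.txt": "CMakeLists.txt",
    }
    # Hypothetical placeholder values; GUARD is the "#ifndef" symbol.
    values = {"NAME": name, "AUTHOR": author, "GUARD": name.upper() + "_H_"}
    created = []
    for src_name, dst_name in sorted(mapping.items()):
        dst = os.path.join(target_dir, dst_name)
        created.append(dst)
        if dry_run:
            continue  # only report what would be written
        with open(os.path.join(skel_dir, src_name)) as f:
            template = string.Template(f.read())
        os.makedirs(target_dir, exist_ok=True)
        with open(dst, "w") as f:
            # safe_substitute leaves unknown "$..." sequences untouched,
            # which matters for CMake variables such as ${CMAKE_...}.
            f.write(template.safe_substitute(values))
    return created
```

The author string could be read from an environment variable before calling the function, and a dry run lists the target files without touching the repository.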
Example for CMakeLists.txt for a SeqAn app:
cmake_minimum_required (VERSION 2.6)
project (FUB_Sandbox_my_app)
add_seqan_executable(my_app my_app.cpp some_func.h)
add_seqan_executable(some_support some_support.cpp some_func.h)
CMake Generation Scripts
Initial proposal in attached util.zip.
$ ./util/bin/skel.py
Usage: skel.py [options] create [module|test|app|demo|repository] NAME LOCATION
       skel.py [options] info [module|test|app|demo|repository]

SeqAn code generator. The create command uses the template to code of the
given type with the given NAME in the given LOCATION. The info command
displays information on the template. The LOCATION is the repository name and
could be "core", "extras" or "sandbox/fub_students".

Options:
  -h, --help            show this help message and exit
  -s SKEL_ROOT, --skel-root=SKEL_ROOT
                        Set path to the directory where the skelletons live
                        in. Taken from environment variable SEQAN_SKELS if
                        available.
  -a AUTHOR, --author=AUTHOR
                        Set author to use. Should have the format USER
                        <EMAIL>. Taken from environment variable SEQAN_AUTHOR
                        if it exists.
  -d, --dry-run         Do not change anything, just simulate.
Attachments
- util.zip (28.1 KB), added by holtgrew 14 months ago: Initial proposal for skeleton creation scripts.
- CMakeLists.txt (1.3 KB), added by holtgrew 14 months ago: Root CMakeLists.txt file.
- FindSeqAn.cmake (19.8 KB), added by holtgrew 14 months ago: SeqAn CMake utils.
