wissen.leben | WWU Münster 


PhD: Software Components for increased Data Reuse and Reproducibility in Phylogenetics and Phylogenomics


Status: in progress
Student: Stöver, Ben
Supervisors: Müller, Kai; Quandt, Dietmar; Makalowski, Wojciech
Accepting institution: Evolution and Biodiversity of Plants, Institute for Evolution and Biodiversity, WWU Münster, Hüfferstraße 1, 48149 Münster, Germany

Abstract

Many parts of the life sciences, including phylogenetics, phylogenomics and ecology, have become data-intensive due to increasingly cheap high-throughput sequencing technologies, the digitization of large biological collections and data contributions from citizen science. A growing number of available and computationally accessible methods for downstream analysis that produce derived data (e.g., phylogenetic trees or character data automatically extracted from images) further contributes to the production of large quantities of potentially reusable data. This opens up new opportunities for big data studies, but also creates new challenges for infrastructure and method development. To fully use the potential of available data, cyberinfrastructure for sharing and maintaining scientific data is necessary, as are policies of journals and funding agencies that encourage its publication. In part, this has already been addressed by databases like Dryad and by recommendations of an increasing number of journals and funding agencies, although additional measures still need to be taken. Equally important as the availability of scientific data is its reusability. This includes the use of open and well-defined formats, as well as the semantic annotation of data with metadata of different kinds, which provides an unambiguous description and links to related information and resources. These annotations should ideally be machine-interpretable to allow reliable automated data collection for large-scale studies. In addition to its value for data reuse, proper annotation can also increase the reproducibility of studies if the methods used and the steps of workflows are documented directly in attached metadata. Ideally, this would be done by the researchers who produce the data, who, however, are often unfamiliar with the necessary annotation technologies such as the Resource Description Framework (RDF), advanced file formats such as NeXML, or biological ontologies. Therefore, software that makes this process more convenient is a key requirement in the age of big data and the semantic web.

To address these needs for phylogenetic data types, two approaches are followed in this thesis. They cater both to developers of bioinformatics software and to researchers from any discipline who deal, e.g., with multiple sequence alignments or phylogenetic trees at any step of their workflows. First, programming libraries are introduced that provide the required reusable software components. JPhyloIO allows reading and writing phylogenetic data from and to various file formats through a single memory-efficient interface, while making full use of the metadata model of each format. LibrAlign provides flexible and easily extensible GUI components for displaying and editing biological sequences and multiple sequence alignments (MSAs) closely together with any type of attached metadata. Second, these libraries form the basis of applications newly developed here that address the described needs of researchers. At the same time, new functionality exposed through these libraries is available to all developers and enables the creation or extension of software for diverse biological applications that simplify data reuse through efficient annotation.
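To illustrate why an event-based interface can be memory-efficient, the following minimal Java sketch reads a FASTA-like input as a stream of events, so a consumer never needs the whole alignment in memory. All class, method and event names are hypothetical and chosen for illustration only; they are not taken from JPhyloIO, whose actual API may differ substantially.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Illustrative only: all names here are hypothetical, not JPhyloIO's API. */
final class EventReaderSketch {
    enum EventType { SEQUENCE_START, SEQUENCE_TOKENS, SEQUENCE_END }

    static final class Event {
        final EventType type;
        final String payload; // sequence name or one chunk of residues

        Event(EventType type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    /** Pull-style reader that buffers only one line of a FASTA-like input. */
    static final class FastaEventReader {
        private final BufferedReader in;
        private String pending;     // look-ahead line, null at end of input
        private boolean inSequence; // true between SEQUENCE_START and SEQUENCE_END

        FastaEventReader(Reader source) throws IOException {
            in = new BufferedReader(source);
            pending = nextNonBlankLine();
        }

        boolean hasNext() {
            return pending != null || inSequence;
        }

        Event next() throws IOException {
            if (!hasNext()) throw new NoSuchElementException();
            if (pending == null || (pending.startsWith(">") && inSequence)) {
                inSequence = false; // close the open sequence first
                return new Event(EventType.SEQUENCE_END, "");
            }
            String line = pending;
            pending = nextNonBlankLine();
            if (line.startsWith(">")) {
                inSequence = true;
                return new Event(EventType.SEQUENCE_START, line.substring(1).trim());
            }
            return new Event(EventType.SEQUENCE_TOKENS, line.trim());
        }

        private String nextNonBlankLine() throws IOException {
            String line;
            while ((line = in.readLine()) != null && line.trim().isEmpty()) { }
            return line;
        }
    }

    /** Example consumer: computes sequence lengths without storing any sequence. */
    static Map<String, Integer> sequenceLengths(Reader source) {
        Map<String, Integer> lengths = new LinkedHashMap<>();
        try {
            FastaEventReader reader = new FastaEventReader(source);
            String current = null;
            while (reader.hasNext()) {
                Event e = reader.next();
                switch (e.type) {
                    case SEQUENCE_START:
                        current = e.payload;
                        lengths.put(current, 0);
                        break;
                    case SEQUENCE_TOKENS:
                        if (current != null) lengths.merge(current, e.payload.length(), Integer::sum);
                        break;
                    case SEQUENCE_END:
                        current = null;
                        break;
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return lengths;
    }
}
```

In this sketch, readers for other formats could emit the same event types, so a consumer such as `sequenceLengths` would not need to change when the input format does. This is one plausible reading of "a single interface" and is stated here as an assumption.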

Among the developed applications, the Taxonomic Editor of the EDIT Platform for Cybertaxonomy models taxonomic workflows and persistently links all data elements to the specimen they were derived from. This is a major advantage over the traditional approach of linking all information to a taxon, because data remains reusable and interpretable if the assignment of specimens to taxa changes in taxonomic revisions. In this thesis, the Taxonomic Editor is extended to support molecular sequence data with the help of the functionality provided by LibrAlign and JPhyloIO. The two main phylogenetic data types are addressed by PhyDE 2 and TreeGraph 2, editors for multiple sequence alignments and phylogenetic trees, respectively. PhyDE 2 is a reimplementation of the currently used version of PhyDE based on LibrAlign and JPhyloIO. Although it is currently in a proof-of-concept state and does not yet offer the full feature set of the previous version, its new codebase is much easier to maintain and extend. It significantly simplifies future development towards advanced metadata modeling and towards using the potential of the new libraries. TreeGraph 2 offers versatile formatting and editing options in a user-friendly way and models any type of metadata associated with tree nodes and branches, while offering a variety of options to visualize these annotations. It makes use of JPhyloIO to read and write phylogenetic trees and their metadata.

In addition to fostering data reuse, enabling the comparison and combination of results from alternative methods is another major goal of this thesis. It is also closely linked to metadata modeling and increased reproducibility of studies. Many alternative methods to construct MSAs or phylogenetic trees are available, and choosing among them is usually non-trivial. As a result, researchers often need to carefully check for agreements and conflicts between results from alternative approaches and possibly also present a synthesis across alternatives. AlignmentComparator implements different algorithms to visually compare alternative MSAs of the same dataset in detail and allows users to identify and annotate differently and identically aligned regions. It can also be used to track subsequent automatic or manual alignment changes in workflows. TreeGraph 2 completes the required functionality by providing an interactive comparison feature for phylogenetic trees. It also allows users to map statistical support values derived from alternative methods onto a single reference topology, thereby highlighting topological conflicts.
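The idea of mapping support values from an alternative analysis onto a reference topology can be sketched as follows. This is a generic, simplified approach under stated assumptions, not the algorithm implemented in TreeGraph 2: both trees are assumed to be rooted and to share the same leaf set, each internal node is represented by its clade (the set of leaf names below it), and all class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a simplified sketch, not TreeGraph 2's implementation. */
final class SupportMappingSketch {
    enum Kind { SUPPORT, CONFLICT, NONE }

    static final class Mapped {
        final Kind kind;
        final double value; // NaN if kind == NONE

        Mapped(Kind kind, double value) {
            this.kind = kind;
            this.value = value;
        }
    }

    /** Two rooted clades conflict if they share leaves but neither contains the other. */
    static boolean conflicts(Set<String> a, Set<String> b) {
        return !Collections.disjoint(a, b) && !a.containsAll(b) && !b.containsAll(a);
    }

    /**
     * For each reference clade, returns the support of the identical clade in the
     * alternative tree, or else the highest support among conflicting clades.
     */
    static Map<Set<String>, Mapped> mapSupport(Iterable<Set<String>> referenceClades,
                                               Map<Set<String>, Double> alternative) {
        Map<Set<String>, Mapped> result = new LinkedHashMap<>();
        for (Set<String> clade : referenceClades) {
            Double same = alternative.get(clade);
            if (same != null) {
                result.put(clade, new Mapped(Kind.SUPPORT, same));
                continue;
            }
            Mapped best = new Mapped(Kind.NONE, Double.NaN);
            for (Map.Entry<Set<String>, Double> e : alternative.entrySet()) {
                boolean stronger = best.kind == Kind.NONE || e.getValue() > best.value;
                if (conflicts(clade, e.getKey()) && stronger) {
                    best = new Mapped(Kind.CONFLICT, e.getValue());
                }
            }
            result.put(clade, best);
        }
        return result;
    }
}
```

For a reference tree ((A,B),C) and an alternative tree (A,(B,C)) whose clade {B,C} has support 0.91, the reference clade {A,B} would be annotated as a conflict with value 0.91, while the shared clade {A,B,C} would receive the alternative tree's support value directly.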

Together, the developed applications support visualizing, editing and comparing all major data types of phylogenetics and related fields. They have the potential to allow convenient and complete modeling of the necessary metadata across entire phylogenetic workflows, which then produce optimally reusable data in an easily reproducible way. Easy reuse of the developed functionality is ensured by providing key functionality in separate libraries, which simplify the development and extension of further tools that offer features for easier data reuse and increased reproducibility. All developed products are freely available at http://bioinfweb.info/Software.

Publications

Article in Journal

Kilian N, Henning T, Plitzner P, Müller A, Güntsch A, Stöver BC, Müller KF, Berendsohn WG, Borsch T: Sample data processing in an additive and reproducible taxonomic workflow by using character data persistently linked to preserved individual specimens. Database 2015, 2015:bav094

Stöver BC, Müller KF: TreeGraph 2: combining and visualizing evidence from different phylogenetic analyses. BMC Bioinformatics 2010, 11

Conference abstract/poster

Wiechers S, Müller KF, Stöver BC: Increasing data accessibility and reuse in phylogenetics by employing externally defined ontologies. 6th annual Symposium of the Münster Graduate School of Evolution; Münster, Germany; 2017

Stöver BC, Wiechers S, Müller KF: JPhyloIO - A Java library for event-based reading and writing of different alignment and tree formats through one common interface. European Conference on Computational Biology (ECCB); The Hague, The Netherlands; 2016

Wiechers S, Müller KF, Stöver BC: New comparison and annotation methods of the phylogenetic tree editor TreeGraph 2. 5th annual Münster Graduate School of Evolution Symposium; Münster, Germany; 2015

Stöver BC, Wiechers S, Müller KF: Recent development of the phylogenetic tree editor TreeGraph 2. German Conference on Bioinformatics (GCB); Dortmund, Germany; 2015

Stöver BC, Müller KF: LibrAlign - A powerful Java GUI library for MSA and attached raw and meta data. IPAM Multiple Sequence Alignment Workshop; Los Angeles, USA; 2015

Stöver BC, Müller KF: AlignmentComparator - Comparing and annotating alternative alignments of the same data set. IPAM Multiple Sequence Alignment Workshop; Los Angeles, USA; 2015

Stöver BC, Müller KF: LibrAlign - A Java library with powerful GUI components for multiple sequence alignment and attached raw and meta data. German Conference on Bioinformatics (GCB); Bielefeld, Germany; 2014

Stöver BC, Müller KF: AlignmentComparator - A GUI application to efficiently visualize and annotate differences between alternative multiple sequence alignments. 4th annual Münster Graduate School of Evolution Symposium; Münster, Germany; 2014

Stöver BC, Müller KF: LibrAlign - A GUI library for displaying and editing multiple sequence alignments and attached data. BioDivEvo 2014; Dresden, Germany; 2014

Stöver BC, Quandt D, Müller KF: Complex mutations and multiple sequence alignment - Example: Hairpin-initiated repeats (HIRs). 2nd annual Münster Graduate School of Evolution Symposium; Münster, Germany; 2012
