In influenza research, we investigate the evolutionary dynamics of human seasonal influenza A viruses using a combination of methods from statistical learning, concepts from population genetics, and phylogenetic inference. Ultimately, combining epidemiological and biological expert knowledge with suitable frameworks for automated inference of viral phylodynamics will increase prediction accuracy for problems such as predicting the predominant viral strains in the next influenza winter season, which is relevant for defining the influenza A vaccine composition. In collaboration with Dr. Ben Adams (a postdoctoral researcher in the McHardy group in 2007/2008, now holding a permanent position at the University of Bath), we investigated the impact of distinct geographic reservoirs on the global antigenic evolution of seasonal influenza A viruses using a stochastic epidemiological model with distinct geographic patches. We have furthermore developed a novel framework for visualizing the evolutionary dynamics of a population and demonstrated the use of a novel indicator for selection. Applied to the problem of influenza A virus vaccine strain prediction, we show that this approach improves on the state of the art.
Furthermore, in collaboration with Martin Beer (Friedrich Loeffler Institute, Isle of Riems), we are working on novel methods for the analysis of population-level evolutionary dynamics and the detection of linked single nucleotide polymorphisms based on 454 deep sequencing samples of avian influenza A viruses. In collaboration with Department 3 of MPII (Lengauer, Domingues, structural modeling), we used in silico analyses to formulate several evolutionary hypotheses regarding the acquisition of the phenotype of sustained human-to-human transmissibility, which the novel 2009 swine-origin H1N1 influenza A virus exhibits. Since its appearance in April 2009, this virus has initiated the first influenza pandemic of the twenty-first century. In collaboration with J. Stech (Friedrich Loeffler Institute, Isle of Riems), our hypotheses are currently being investigated in experimental studies.
Analysis of Molecular Interaction Networks
This research area covers the reconstruction, visualization and subsequent analysis of biological networks. Mario Albrecht's group within the Excellence Cluster developed several bioinformatics tools that provide substantial aid for biologists at each point of a typical wet lab data analysis. Usually, the main aim is the generation of promising targets for further wet lab studies.
The large amounts of molecular interaction data that are being accumulated worldwide are currently scattered over multiple databases, maintained by different research groups and stored using a variety of file formats and identifiers. To make the data easily accessible to the biologist at a single site, Mario Albrecht's group introduced the Distributed Annotation System for Molecular Interactions (DASMI), a framework for the dynamic exchange and assessment of molecular interactions. Its distributed architecture, consisting of data servers that provide interaction data and user clients that integrate and visualize the information, enables research groups around the world to instantly add new data resources and make them available to a wide public without any additional work. Confidence scoring servers can be used for separating high-quality from untrustworthy molecular interactions.
The Lenhof group (in cooperation with the Institute for Human Genetics of Saarland University, Meese group) and the groups of Michael Kaufmann and Oliver Kohlbacher of the University of Tübingen have developed a novel dynamic programming algorithm for detecting deregulated signaling cascades induced by pathogenic processes. The so-called FiDePa (Finding Deregulated Paths) algorithm interprets differences in the expression profiles of, e.g., tumor and normal tissues. It relies on the well-known method of gene set enrichment analysis (GSEA) and efficiently detects all paths in a given regulatory or signaling network that are significantly enriched with differentially expressed genes or proteins. To demonstrate the capabilities of the algorithm, we analyzed a glioma expression data set with respect to a graph that combined the regulatory networks of the KEGG and TRANSPATH databases. For each patient, we calculated the deregulated paths and the union of all of that patient's deregulated paths. Among others, we found most known key players in glioma and cancer-related pathogenic processes, and we were able to correlate clinically relevant features such as necrosis and metastasis with the detected paths.
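As a rough illustration of the GSEA idea that FiDePa builds on, the following sketch computes the unweighted running-sum enrichment score of a gene set within a list of genes ranked by differential expression. It is not the FiDePa algorithm itself: the function name `enrichment_score` and the toy gene labels are hypothetical, and one could imagine, as an assumption, treating the genes on a candidate path as the gene set.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted GSEA-style running-sum statistic (Kolmogorov-Smirnov-like).

    ranked_genes: genes sorted from most to least differentially expressed.
    gene_set: genes of interest (hypothetically, the genes on one path).
    Returns the running-sum value with the largest absolute deviation from 0.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    step_up = 1.0 / n_hits      # increment when a set member is encountered
    step_down = 1.0 / n_miss    # decrement for every other gene
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += step_up if is_hit else -step_down
        if abs(running) > abs(best):
            best = running
    return best


# Toy example: the set members sit at the top of the ranking.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_score = enrichment_score(ranked, {"g1", "g2"})
bottom_score = enrichment_score(ranked, {"g5", "g6"})
```

A score close to 1 indicates that the set members cluster at the top of the ranking, and a score close to -1 that they cluster at the bottom. Assessing significance would additionally require a null distribution, for example from permutations, which is omitted here.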
In the last decade, non-coding RNAs, especially microRNAs (miRNAs), have gained increasing scientific interest. While hundreds of experimentally determined or predicted targets are known for most human miRNAs, there is only limited knowledge about putative systemic effects of changes in the expression of miRNAs and their regulatory influence. The Lenhof group (in cooperation with the Institute of Human Genetics of Saarland University, Meese group) has studied the putative regulatory effects of miRNAs by investigating the enrichment of miRNA targets in KEGG and TRANSPATH pathways and in Gene Ontology categories. We refer to these enriched pathways and categories as target pathways. Our study provided a comprehensive 'miRNA-target pathway' dictionary. Furthermore, an analysis of differentially expressed genes in 13 cancer data sets extracted from the Gene Expression Omnibus showed that targets of specific miRNAs were significantly deregulated in these sets, providing further evidence that miRNAs are key players in the regulation of oncogenic processes.
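The text does not state which statistical test was used to assess the enrichment of miRNA targets in a pathway. A common choice for this kind of over-representation question is the hypergeometric test, sketched below under that assumption. The function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_pathway, n_targets, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_universe: number of genes considered in total.
    n_pathway:  number of those genes annotated to the pathway.
    n_targets:  number of targets of the miRNA among the universe.
    n_overlap:  number of targets that fall into the pathway.
    """
    total = comb(n_universe, n_targets)
    upper = min(n_pathway, n_targets)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_targets - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 10 genes, 4 in the pathway, 3 miRNA targets, 2 of them in the pathway.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
```

When many miRNA and pathway pairs are tested, as in a dictionary of this kind, the resulting p-values would normally be corrected for multiple testing.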
Jan Baumbach has developed approaches dedicated to the identification of evolutionarily conserved gene regulatory networks across different species based solely on genomic sequences. Two examples: MoRAine is an online tool for DNA sequence motif discovery; Transitivity Clustering is a novel amino acid sequence clustering method based on a combination of different approaches to Weighted Graph Cluster Editing, a long-standing NP-hard problem in computer science. It is used for the precise, large-scale detection of groups of homologous proteins.
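To make the Weighted Graph Cluster Editing objective more concrete, the sketch below evaluates the editing cost of a given partition. It assumes one common formulation in which each pair of proteins has a similarity score, pairs above a threshold count as edges, and the cost of deleting or inserting an edge is the distance of its similarity from the threshold. This is an illustration of the objective only, not the Transitivity Clustering heuristics, and all names and values are hypothetical.

```python
from itertools import combinations


def cluster_editing_cost(similarity, clusters, threshold):
    """Cost of turning the thresholded similarity graph into the given clusters.

    similarity: dict mapping unordered pairs (u, v) to a similarity score;
                missing pairs are treated as similarity 0.0 (an assumption).
    clusters:   iterable of disjoint sets of nodes.
    threshold:  pairs with similarity above it are considered edges.
    """
    label = {node: i for i, members in enumerate(clusters) for node in members}
    cost = 0.0
    for u, v in combinations(sorted(label), 2):
        s = similarity.get((u, v), similarity.get((v, u), 0.0))
        same_cluster = label[u] == label[v]
        if same_cluster and s < threshold:
            cost += threshold - s   # a missing edge has to be inserted
        elif not same_cluster and s > threshold:
            cost += s - threshold   # an existing edge has to be deleted
    return cost


sim = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.2, ("c", "d"): 0.6}
cost_one = cluster_editing_cost(sim, [{"a", "b", "c"}, {"d"}], threshold=0.5)
cost_two = cluster_editing_cost(sim, [{"a", "b"}, {"c", "d"}], threshold=0.5)
```

Finding the partition with minimum cost is the NP-hard part. The function above only scores a candidate partition, which is the building block that exact or heuristic search strategies would call repeatedly.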
Alternative splicing is an important biological mechanism for increasing protein diversity, resulting in several protein variants produced by a single gene. Microarrays can be used to monitor alternative splicing events, yet little is known about the biological effects of different protein variants. Mario Albrecht and his group have developed DomainGraph, a program that allows even inexperienced users to explore the biological effects of alternative splicing and to deal with large amounts of microarray data. DomainGraph can be used for visualizing alternative splicing events at different levels of granularity, ranging from genes to proteins, protein domains, and microRNA binding sites. Furthermore, effects on molecular interaction networks and pathways can be explored visually.
Dynamic Modeling of Biological Systems
The theory of stochastic chemical kinetics is in widespread use for the mesoscopic description of intracellular dynamics. Verena Wolf's group has developed several numerical algorithms for the analysis of the resulting stochastic model and has shown that these algorithms outperform earlier methods in terms of accuracy and efficiency. The advantage of the stochastic model over the classical deterministic model is that it provides a more accurate description of cellular processes. This is particularly important if such processes are used for probabilistic cellular decisions, which have recently gained strong interest from experimentalists. The most recent method, a stochastic hybrid solution, combines a stochastic description with a deterministic description. This makes the method applicable to large systems and allows us to use the expensive stochastic description only where necessary.
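For readers unfamiliar with stochastic chemical kinetics, the sketch below simulates a single trajectory of a simple birth-death process with the textbook direct method of Gillespie. It is meant only to illustrate the kind of stochastic model discussed here. It is not one of the numerical algorithms or the hybrid method developed by the group, and the reaction system, rate constants, and function name are assumptions.

```python
import random


def gillespie_birth_death(k_birth, k_death, x0, t_end, seed=0):
    """Simulate 0 -> X (rate k_birth) and X -> 0 (rate k_death * x).

    Returns the jump times and the molecule counts after each jump,
    starting with the initial state at time 0.
    """
    rng = random.Random(seed)
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_birth = k_birth            # propensity of the production reaction
        a_death = k_death * x        # propensity of the degradation reaction
        a_total = a_birth + a_death
        if a_total == 0.0:
            break                    # no reaction can fire any more
        t += rng.expovariate(a_total)  # exponentially distributed waiting time
        if t > t_end:
            break
        # Choose the reaction with probability proportional to its propensity.
        if rng.random() * a_total < a_birth:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


times, counts = gillespie_birth_death(k_birth=10.0, k_death=1.0, x0=0, t_end=5.0)
```

Every reaction event is simulated explicitly, which is what makes the purely stochastic description expensive for large systems and motivates treating part of the system deterministically.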
Biomarkers and Molecular Diagnostics
In close cooperation with the Institute for Human Genetics (Meese group), the Mathematical Image Analysis Group (Weickert group, RA2), the Hein group (Machine Learning) and several clinical groups, all at Saarland University, the Lenhof group has developed and tested two novel serum-based approaches for molecular diagnostics. The first approach is based on an array of human recombinant clones that allows the reactivity of serum antibodies against 1827 tumor antigens to be measured simultaneously. To analyze the resulting images, we have developed an automated image analysis procedure that calculates arrays of normalized reactivity values, the so-called autoantibody profiles. Using standard classification methods, we were able to show that these autoantibody profiles enable an accurate classification of sera of patients with various types of cancer vs. controls. Moreover, the identified sets of tumor antigens may contain many interesting candidates for future tumor vaccine studies. Finally, we are analyzing the resulting tumor antigen sets with the goal of deepening our knowledge of the immunogenicity of tumor antigens, i.e., we are seeking answers to the question of why tumor antigens are immunogenic and induce humoral immune reactions.
The second approach is based on miRNA profiles of patients' blood cells measured by microarray experiments and has been developed in cooperation with the biotech company febit GmbH in Heidelberg. Besides highly accurate classifications of sera from patients with several kinds of cancer vs. controls, we were also able to show that miRNA profiles allow for an accurate classification of sera from patients with relapsing-remitting MS (RRMS) vs. controls. The resulting sets of deregulated miRNAs form the basis for subsequent network-based studies of the corresponding pathogenic processes (see the section "Analysis of Molecular Interaction Networks").
At the MPI Informatics, an epigenetic route towards constructing and optimizing biomarkers for cancer has been taken. Here, the biomarkers consist of genomic regions whose methylation pattern is informative about tumor type and stage and/or susceptibility to certain types of chemotherapy. An initial manual construction of the biomarker for glioma was used as the starting point for the development of the software tool MethMarker, which guides the identification and optimization of epigenetic cancer biomarkers.
Visualization of Biomolecules
The three-dimensional geometries of biomolecular systems and the spatial distribution of their physico-chemical properties are key to solving many current life science problems. For instance, virtual screening techniques and molecular docking approaches have become invaluable tools in modern drug design and are routinely used by pharmaceutical companies. However, in most real-world applications, fully automated techniques are still infeasible. This is not only due to the computational complexities involved, but more importantly because much of the underlying physics and chemistry is still insufficiently understood, which necessitates interactive exploration of a molecular scenario. In a cooperation between bioinformatics and computer graphics, the groups of Prof. Slusallek, Dr. Hildebrandt, and Prof. Lenhof have integrated real-time ray tracing functionality into molecular modeling techniques by combining the RTfact ray tracing library with the molecular viewer and modeler BALLView. This first real-world application of real-time ray tracing in molecular modeling greatly simplifies the visual perception of molecular geometries, for instance through the use of advanced indirect lighting effects. Furthermore, the use of state-of-the-art volume ray tracing techniques allows sophisticated visualization of molecular properties such as electrostatic potentials. Finally, the project has explored the use of ray tracing techniques for the computation of molecular volumes and areas. The integration has been made freely available and has sparked great interest from the modeling community, pharmaceutical companies, and the press. The project has also been successfully demonstrated at several conferences and exhibitions.
Analyzing Mass Spectrometry Data
Mass spectrometry has become the de facto standard for experimental proteomics and is becoming increasingly important for metabolomics. But even today, with highly sophisticated mass spectrometers available, signal interpretation is a great challenge. Large-scale proteomics applications, for instance, routinely generate terabytes of data that have to be processed, and signal-to-noise ratios can become very low. In this project, we have worked on a wavelet-based signal processing strategy that simultaneously detects whole isotope patterns instead of isolated peaks, leading to very sensitive and stable feature detection. To enable routine application of the method, we have developed a vectorized algorithm for this approach that runs on modern graphics processing units, with a speed-up factor of approximately 200. Through Prof. Tholey (now at Kiel University), the project was able to generate its own gold-standard data sets and to have access to experimental experts in order to compare automated signal processing results with manually generated expert opinion. This project is funded by the Excellence Cluster.
In a cooperation between the groups of Prof. Hein and Prof. Hildebrandt, the project has developed a further processing technique that is ideally suited to separating highly overlapping features (publication in preparation). The methods developed in this project have also been applied to metabolomics GC-MS data generated in Prof. Heinzle's group, where they led to greatly improved detection rates. In an important external collaboration, the methods developed in this project are used at the Aebersold lab at ETH Zürich.
Data Analysis, Genome Browsers and Query Engines
The research area is undertaking several projects to bring hard-to-interpret data to the fingertips of biologists and medical researchers. The deliverables of these projects usually take the form of data repositories together with browsers and/or query engines that allow querying the data in order to reap new biological insights from them. Here we list the major projects in this respect.
(i) EpiGRAPH is a statistical browser that enables finding regions in genomic and epigenomic data that are highly unlikely to occur by chance and that thus can be attributed some biological relevance. EpiGRAPH has already been used by other groups to analyze genome-wide epigenomic datasets.
(ii) BioMyn is a database and query engine that facilitates the analysis of functional associations among proteins in the same and different species. Such a tool is a central ingredient in uncovering the molecular basis of disease. The data warehouse has already been used in projects on the computational analysis of protein interaction networks in yeast and humans, on the prediction of scaffold proteins and on the analysis of human cellular factors for the HCV infection.
(iii) Phylopythia is a system that enables the classification of genome sequences in metagenomic samples to different species families. Metagenomic samples arise from collecting genome sequences in complex biological habitats, such as the human gut, humus or sea water. Such habitats contain myriad never-before-seen microbes that cannot be cultured in the lab. The composition of these microbial communities is believed to play a central role in biological processes in general, and disease processes in particular, e.g. for the gut, in the development of colon cancer or bowel inflammation. The classification of many sequences that we cannot assign to known species poses a major challenge that has to be targeted with special phylogenetic methods. Phylopythia offers such methods and has already been used in major metagenomic sequencing projects.