Knowledge Extraction
RA5 has developed general methodologies for knowledge harvesting, and demonstrates their practical viability and usefulness by constructing and maintaining specific knowledge bases from a variety of sources. YAGO is a large collection of entities and relational facts that are automatically harvested, with high accuracy, from semistructured information in Wikipedia, WordNet, and other Web sources, and reconciled into a consistent RDF knowledge base. It contains more than 2 million entities, properly assigned to semantic classes of a taxonomic backbone, and more than 40 million relationships between entities. YAGO is publicly available, has been downloaded more than 10,000 times, and is used in many other projects worldwide, including DBpedia. The original YAGO paper from 2007 has been cited more than 300 times. The figure shows excerpts of the facts that YAGO knows about Albert Einstein, in the form of screenshots of the YAGO browser. This knowledge can be easily queried to answer, for example, a question about female politicians who are married to scientists. As the figure shows, the query returns Angela Merkel, whose husband is a chemistry professor.
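To illustrate the flavor of such queries, the following minimal sketch evaluates the "female politicians married to scientists" question over a tiny, hand-made excerpt of YAGO-like triples; the entity names, relation names, and facts are illustrative only, not the actual YAGO vocabulary.

```python
# A tiny, invented excerpt of YAGO-style (subject, property, object) facts.
facts = {
    ("Angela_Merkel", "type", "politician"),
    ("Angela_Merkel", "hasGender", "female"),
    ("Angela_Merkel", "isMarriedTo", "Joachim_Sauer"),
    ("Joachim_Sauer", "type", "scientist"),
    ("Albert_Einstein", "type", "scientist"),
}

def objects(subject, prop):
    """All objects o such that (subject, prop, o) is a known fact."""
    return {o for (s, p, o) in facts if s == subject and p == prop}

def female_politicians_married_to_scientists():
    """Join over the marriage relation with type/gender conditions."""
    answers = []
    for (s, p, o) in facts:
        if p == "isMarriedTo":
            if ("politician" in objects(s, "type")
                    and "female" in objects(s, "hasGender")
                    and "scientist" in objects(o, "type")):
                answers.append((s, o))
    return answers

print(female_politicians_married_to_scientists())
# → [('Angela_Merkel', 'Joachim_Sauer')]
```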
To further grow YAGO from arbitrary Web sources and natural-language texts while retaining its high quality, pattern-based extraction is combined with logic-based consistency checking in a unified framework named SOFIE. SOFIE has pioneered a novel way of mapping the joint reasoning on fact candidates, goodness of extraction patterns, and entity disambiguation onto a carefully designed weighted Max-Sat problem that has an efficient customized approximation algorithm. The pattern-based gathering of fact candidates has been extended to higher-arity relations, using dependency parsing and a new confidence assessment method based on pre-compiled closed-world knowledge. Along these lines, the excellence cluster has led to intensified collaboration between Uszkoreit's and Weikum's teams.
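To make the Max-Sat encoding concrete, the following toy sketch shows how fact candidates, pattern goodness, and a functionality constraint (a person has only one birthplace) can be cast as weighted clauses; the variables, weights, and the brute-force search over assignments are invented for illustration, whereas SOFIE employs a customized approximation algorithm at scale.

```python
import itertools

# A literal is (variable, polarity); a clause is (weight, [literals]).
# Weights and variables are invented toy data.
clauses = [
    (5.0, [("fact_born_ulm", True)]),                # extraction evidence
    (3.0, [("pattern_born_in_good", True)]),         # pattern goodness
    (4.0, [("fact_born_ulm", False),
           ("fact_born_berlin", False)]),            # unique-birthplace constraint
    (2.0, [("fact_born_berlin", True)]),             # weaker, conflicting evidence
]

def weight(assignment):
    """Total weight of clauses satisfied by a truth assignment."""
    return sum(w for w, lits in clauses
               if any(assignment[v] == pol for v, pol in lits))

variables = sorted({v for _, lits in clauses for v, _ in lits})
best = max((dict(zip(variables, vals)) for vals in
            itertools.product([False, True], repeat=len(variables))),
           key=weight)
```

The maximum-weight assignment accepts the strongly supported fact and rejects the conflicting one, which is exactly the joint-reasoning effect the framework exploits.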
To tap information that cannot be crawled and is accessible only via functions, the knowledge base is dynamically coupled with external Web Services. To connect entities, facts, and their terminologies in different languages, methods for multilingual knowledge harvesting have been developed. We are currently investigating, in joint work by Weikum's group and Ivan Titov's IRG, probabilistic models for learning common properties of concepts (e.g., sugar is sweet, bananas are yellow or green). The cooperative work in the excellence cluster has drawn attention to multimodal knowledge, namely photos of named entities (people, mountains, cultural landmarks, etc.). We specifically designed consistency-learning methods harnessing the YAGO base to improve the results from photo search engines in terms of precision and diversity; the methods combine evidence from the results of different query expansions and also consider the visual similarity of photos. Finally, to capture the dynamics of evolving knowledge, we have started investigating the extraction and management of time-varying relations (e.g., former and current spouses and the corresponding time periods).
Another line of research in the Open Science Web theme aims to automatically organize scholarly publications, lifting them into a semantic representation so that researchers can effectively explore and quickly find relevant knowledge about organizations, scientists, collaborations, software, and the impacts of these. Advanced natural-language processing is used to analyze scientific corpora, extract key propositions, bring them into normalized semantic representations (e.g., by unifying passive vs. active constructions, use of synonyms, etc.), and make them more precisely searchable. For example, one can easily find all papers that include experiments with particular software tools for conditional random fields (and related models).
Semantic Search and Ranking
Search engines face an increasing need to support advanced information tasks posed by knowledge workers such as researchers, students, or analysts. For example, a history student may want to know which scientists emigrated from Germany to America, and when. In health care, users may be interested in finding drugs to treat flu symptoms that can be taken during pregnancy, or that do not interfere with other medications the patient is taking. Answering such queries requires a semantic understanding of Web contents, lifting the interpretation of pages from keywords to the level of named entities and relationships between entities. The expected results are ranked lists of entities or entity pairs that satisfy the relational conditions expressed in the question. This line of research greatly benefits from large-scale knowledge bases such as YAGO, and has recently gained much attention in industry as well.
Statistical language models, LMs for short, are the state-of-the-art method in information retrieval for ranking the results of queries for documents or passages. LMs have recently been extended to deal with keyword search for entities. RA5 has developed the first method for LM-based ranking that considers both entities and relationships and that can handle the full spectrum from structured to keyword queries. We further extended the approach to cope with temporal expressions in queries and Web sources. For example, a query about German politicians in the eighties would return Helmut Kohl, Helmut Schmidt, Hans-Dietrich Genscher, Joschka Fischer, and so on. Further work on knowledge discovery includes the integration of YAGO into the faceted search engine CompleteSearch and the use of YAGO to create the semantically annotated Wikipedia corpus for the INEX benchmark competition on XML retrieval.
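The core of LM-based ranking can be sketched as query-likelihood scoring with Dirichlet smoothing, the standard formulation in IR; the entity "documents" below are invented toy descriptions, and the actual RA5 models additionally incorporate relationships and temporal expressions.

```python
import math
from collections import Counter

# Invented toy descriptions standing in for textual evidence about entities.
entities = {
    "Helmut_Kohl": "german politician chancellor germany eighties nineties",
    "Bill_Clinton": "american president united states nineties",
    "Albert_Einstein": "german physicist scientist relativity",
}

docs = {e: Counter(text.split()) for e, text in entities.items()}
collection = Counter()
for c in docs.values():
    collection.update(c)
coll_len = sum(collection.values())

def score(query, doc, mu=10.0):
    """log P(query | doc) under a Dirichlet-smoothed unigram language model."""
    dlen = sum(doc.values())
    s = 0.0
    for term in query.split():
        p_coll = collection[term] / coll_len
        s += math.log((doc[term] + mu * p_coll) / (dlen + mu))
    return s

query = "german politician"
ranked = sorted(docs, key=lambda e: score(query, docs[e]), reverse=True)
print(ranked[0])  # → Helmut_Kohl
```

Smoothing against collection-wide statistics is what lets the model rank an entity highly even when a query term is rare in its description.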
Knowledge Reasoning and Uncertain Data
For enhanced reasoning in expressive logics, the YAGO knowledge base has been translated into the Bernays-Schönfinkel Horn class with equality. A new variant of the superposition calculus has been developed, which is sound, complete, and always terminating for this class. Together with extended term-indexing data structures, the new calculus has been implemented in SPASS-YAGO, based on the SPASS theorem prover. SPASS-YAGO can prove non-trivial conjectures on the saturated and consistent clause set of more than 1 GB in less than a second.
Automatically extracted facts typically, if not inevitably, exhibit a certain degree of incompleteness, inconsistency, and uncertainty. One approach to deal with this situation is to postpone the resolution of potentially inconsistent facts until query time and employ methods for reasoning on uncertain knowledge. The URDF system, developed in RA5, features novel algorithms, based on an extended form of weighted Max-Sat solving, to infer the most likely answers to queries. This approach harnesses soft and hard rules for coping with missing knowledge and vague information, e.g., about time points. For example, a query about the doctoral advisor of Jim Gray could be answered, despite the lack of any explicit facts in the knowledge base, by the following reasoning chain: Jim Gray's first five publications have four papers co-authored by Mike Harrison, who is more senior than Gray and was a professor at UC Berkeley where Gray graduated a few years later.
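The Jim Gray example can be sketched as confidence propagation along a soft rule: a frequent early co-author who was a professor at the university where the student graduated is likely the advisor. The facts, confidences, and the 0.7 rule weight below are invented, and URDF's actual Max-Sat-based inference is considerably more involved.

```python
# Invented uncertain facts: (subject, property, object) -> confidence.
facts = {
    ("Jim_Gray", "frequentEarlyCoauthor", "Mike_Harrison"): 0.8,
    ("Mike_Harrison", "professorAt", "UC_Berkeley"): 0.95,
    ("Jim_Gray", "graduatedFrom", "UC_Berkeley"): 0.9,
}

RULE_WEIGHT = 0.7  # assumed weight of the soft advisor rule

def infer_advisor(student):
    """Most likely advisor: co-author and professor at the student's
    university, with confidences multiplied along the derivation chain."""
    best = (None, 0.0)
    for (s, p, coauthor), c1 in facts.items():
        if s == student and p == "frequentEarlyCoauthor":
            for (s2, p2, uni), c2 in facts.items():
                if s2 == coauthor and p2 == "professorAt":
                    c3 = facts.get((student, "graduatedFrom", uni), 0.0)
                    conf = RULE_WEIGHT * c1 * c2 * c3
                    if conf > best[1]:
                        best = (coauthor, conf)
    return best

print(infer_advisor("Jim_Gray"))
```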
Information History and Knowledge Evolution
Archives of evolving digital content such as blogs, news, or Web pages (e.g. by the Internet Archive) are becoming available as potential assets for journalists, market and media analysts, intellectual property experts, and many others. To exploit this potential, new forms of access need to be supported, for example, "time-travel" keyword queries, time-aware ranking, or identifying the most interesting phrases of a particular time period — all in a scalable manner with interactive response times.
We were the first team to address this complex set of technical problems. The TTIX indexing method for time-enriched inverted lists includes advanced optimizations, based on dynamic programming, to efficiently support temporal queries with bounded space overhead. The approach has been generalized for distributed settings like peer-to-peer networks or Hadoop-based cloud-computing platforms in joint work between the Weikum group and the IRG led by Ralf Schenkel.
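The basic idea of a time-enriched inverted index can be sketched as follows: each posting carries the validity interval of the document version it was extracted from, so a "time-travel" query intersects only the postings valid at the query time. The index contents are invented, and TTIX's actual contribution — optimizing how intervals are materialized across sublists — is omitted here.

```python
# Invented time-enriched inverted lists: term -> [(doc, valid_from, valid_to)].
index = {
    "chancellor": [("kohl_page", 1982, 1998), ("schroeder_page", 1998, 2005)],
    "germany": [("kohl_page", 1982, 1998), ("schroeder_page", 1998, 2005),
                ("berlin_page", 1990, 2023)],
}

def time_travel_query(terms, t):
    """Documents containing all terms in the version valid at time t."""
    result = None
    for term in terms:
        docs = {d for (d, lo, hi) in index.get(term, []) if lo <= t < hi}
        result = docs if result is None else result & docs
    return result or set()

print(time_travel_query(["chancellor", "germany"], 1995))  # → {'kohl_page'}
```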
Other work in this framework includes time-aware text analytics, such as finding interesting phrases for particular time periods (see below) and novel notions of statistical language models for temporal expressions, in order to rank search results in a time-aware manner. For example, a query about the "US president in the nineties" returns Bill Clinton even if the corresponding pages contain the wording "from 1993 to 2001" or just mention a particular event in spring 1996.
Social Networks
Online communities like Flickr, del.icio.us and LibraryThing have established themselves as popular services for publishing and searching contents, but also for identifying other users who share similar interests. In these communities, data are usually annotated with carefully selected and often semantically meaningful tags. Items like URLs, photos, videos, or books are typically retrieved by issuing queries that consist of a set of tags, returning items that have been frequently annotated with these tags. However, users often prefer a more personalized way of searching by exploiting preferences of and connections between users.
The SENSE system, developed in Ralf Schenkel's IRG, includes novel models for search result ranking with personalization along two dimensions: in the social dimension, search is focused towards items tagged by friends of the querying user, whereas in the similarity dimension, users are preferred who share preferences with the querying user. The system integrates semantic expansion of query tags. For efficiency in large-scale social networks, SENSE provides a top-k algorithm that dynamically expands the search to related users and tags, based on specifically designed index structures. This way, interactive response times can be guaranteed even for complex queries on very large datasets.
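The social dimension of such personalized ranking can be sketched as follows: an item's score for a query tag aggregates all taggings of the item, each weighted by the social proximity of the tagger to the querying user. The network, the taggings, and the proximity weights are invented; SENSE's top-k algorithm and index structures avoid the exhaustive scan shown here.

```python
# Invented friendship network and taggings: (user, item, tag).
friends = {"alice": {"bob", "carol"}, "bob": {"alice"}, "carol": {"alice"}}
taggings = [
    ("bob", "photo1", "sunset"),
    ("carol", "photo1", "sunset"),
    ("dave", "photo2", "sunset"),
]

def proximity(user, other):
    """Assumed weights: self 1.0, direct friend 0.5, anyone else 0.1."""
    if other == user:
        return 1.0
    if other in friends.get(user, set()):
        return 0.5
    return 0.1

def search(user, tag):
    """Rank items by socially weighted tagging frequency for the query tag."""
    scores = {}
    for (u, item, t) in taggings:
        if t == tag:
            scores[item] = scores.get(item, 0.0) + proximity(user, u)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search("alice", "sunset"))  # photo1 ranks above photo2 for alice
```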
Web 2.0 Streams
Web 2.0 streams, like blog postings, micro-blogging tweets, or RSS feeds from online communities, offer a wealth of latest news about real-world events and societal discussion. However, this wealth comes with great challenges regarding the scale and high dynamics of the data, and the need to avoid cognitive overload for users who want to monitor or explore information about specific entities or topics.
One line of research within this broader framework, pursued in the IRG headed by Sebastian Michel, combines users' postings with social-link structures for improved prediction of annotation tags. Another recently investigated issue is the online detection of emergent topics: unexpected bursts of postings that occur in several previously uncorrelated thematic categories or user-provided tags. A research theme that spans many of these technical issues is that of efficient approximation techniques for correlation measures and similarities among postings and tag-sets, based on techniques such as spectral Bloom filters or locality-sensitive hashing. For scalability, these techniques are extended to Hadoop-based analytics (see below).
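One of the approximation techniques mentioned above, locality-sensitive hashing, can be sketched via MinHash signatures for tag-set similarity: the fraction of agreeing signature components estimates the Jaccard overlap of two tag sets. The tag sets are invented and 64 hash functions is an arbitrary choice.

```python
import hashlib

NUM_HASHES = 64  # arbitrary signature length for this sketch

def h(i, tag):
    """The i-th hash function, derived from SHA-1 with an index prefix."""
    return int(hashlib.sha1(f"{i}:{tag}".encode()).hexdigest(), 16)

def signature(tags):
    """MinHash signature: per hash function, the minimum hash over the set."""
    return [min(h(i, t) for t in tags) for i in range(NUM_HASHES)]

def est_jaccard(sig_a, sig_b):
    """Fraction of agreeing components estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

a = {"earthquake", "japan", "tsunami", "news"}
b = {"earthquake", "japan", "nuclear", "news"}
print(est_jaccard(signature(a), signature(b)))  # ≈ true Jaccard 3/5 = 0.6
```

Because signatures are short and fixed-length, similarity can be estimated in a stream without retaining the full tag sets, which is what makes such sketches attractive at Web 2.0 scale.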
Scalable Analytics
Business-relevant knowledge is latently embedded in large text sources such as news, customer mail, and reports, and in user-interaction logs (web clicks, etc.) as well as Web 2.0 community data. These rich sources offer great potential for enhancing decision support, but exploiting them poses a tremendous challenge of scale. Analysts are often interested in examining a set of documents to identify their characteristic properties in contrast to a second set. Tag clouds and evolving taglines are examples of this kind of analytics. Going beyond this word- or tag-centric mining, we have developed new methods for the analysis of variable-length phrases. Such interesting phrases include named entities, important quotations by politicians or actors, market slogans, song lyrics, etc. Our algorithms stand out for their scalability properties: they provide interactive response times for the top-k most interesting phrases on ad-hoc subsets of huge corpora.
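The contrastive flavor of this analysis can be sketched by scoring variable-length phrases by how much more frequent they are in a target document set than in a background set (here a simple smoothed frequency ratio over invented toy documents; the actual methods use more refined interestingness measures and indexes).

```python
from collections import Counter

def phrases(text, max_len=3):
    """All word n-grams of the text up to the given length."""
    words = text.split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]

def interesting(target_docs, background_docs, k=3):
    """Top-k phrases most characteristic of the target set, scored by
    target frequency over add-one-smoothed background frequency."""
    tgt, bg = Counter(), Counter()
    for d in target_docs:
        tgt.update(phrases(d))
    for d in background_docs:
        bg.update(phrases(d))
    score = {p: c / (bg[p] + 1) for p, c in tgt.items()}
    return sorted(score, key=score.get, reverse=True)[:k]

print(interesting(["yes we can yes we can"], ["we can try"]))
```

Phrases frequent in both sets ("we can") are discounted, while phrases distinctive for the target set rise to the top.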
Our algorithms can be easily parallelized and distributed across a cloud-computing platform. This theme has been generalized by developing a full-fledged run-time system for embedding data-intensive analytics into a Map-Reduce-based framework. The system, called Hadoop++, provides great versatility by allowing arbitrary functions as plug-ins. It outperforms both the original Hadoop and recently advocated extensions.
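The execution model that such a framework plugs user code into can be sketched in a few lines: user-supplied map and reduce functions, with the framework handling the shuffle (group-by-key) phase in between. This is a single-process sketch of the Map-Reduce model itself, not of the Hadoop++ internals.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal Map-Reduce: map each record to key-value pairs, shuffle
    (sort and group by key), then reduce each key's value list."""
    intermediate = []
    for rec in records:
        intermediate.extend(map_fn(rec))            # map phase
    intermediate.sort(key=itemgetter(0))            # shuffle phase
    return [reduce_fn(key, [v for _, v in group])   # reduce phase
            for key, group in groupby(intermediate, key=itemgetter(0))]

# Example plug-ins: word counting over two toy documents.
wordcount = run_mapreduce(
    ["to be or not to be", "to do"],
    map_fn=lambda doc: [(w, 1) for w in doc.split()],
    reduce_fn=lambda word, counts: (word, sum(counts)),
)
print(wordcount)
```

The framework's value lies in executing exactly this pattern in parallel across a cluster, with fault tolerance; the plug-in functions stay as simple as shown.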
Another architectural paradigm for large-scale distributed analytics is peer-to-peer networks. In such settings, information dissemination is complicated by the autonomy and unreliability of (or potential distrust in) the individual nodes. In joint work between RA5 (Weikum), RA4 (Backes), and Maffei's IRG, a new family of protocols has been developed that supports anonymous dissemination in a censorship-resilient way for unstructured peer-to-peer networks.
Efficient RDF Data Management
A natural representation of entity-relationship-oriented facts in a knowledge base is RDF, a data model for schema-free structured information. RDF is also gaining momentum in the context of linked data on the Web, data repositories for computational life sciences, and Web 2.0 platforms.
The fine-grained nature of Subject-Property-Object triples and the absence of a schema entail major performance challenges. The work on the rdf-3x engine has paved the way for efficiently answering complex multi-join queries on large-scale RDF databases containing billions of triples (e.g. UniProt, Linked-Data collections). In published benchmarks, rdf-3x has outperformed all other systems by a very large margin.
rdf-3x is based on a "RISC-style" indexing architecture, and uses advanced query optimization techniques. These include world-leading methods for join-order optimization based on dynamic programming, novel kinds of selectivity estimation, and a lightweight run-time method of sideways information passing to accelerate scans and merge-joins. The system supports online updates with transactional consistency, and versioning with support for temporal querying ("time-travel" search). rdf-3x is available as open-source software, and is becoming popular in the research community. It is also used as a platform for the URDF work (see above). It has recently been extended to support graph-mining tasks on biological networks and social communities, as part of a collaboration between the Lenhof and Weikum groups.
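The indexing idea behind such an engine can be sketched as keeping the triple set sorted in several collation orders so that any triple pattern becomes a sorted range scan; the triples below are invented, and real rdf-3x additionally compresses the orders heavily and optimizes join orders over them.

```python
import bisect

# Invented toy triples.
triples = [
    ("Einstein", "bornIn", "Ulm"),
    ("Einstein", "type", "physicist"),
    ("Merkel", "bornIn", "Hamburg"),
    ("Merkel", "type", "politician"),
]

def perm(order):
    """The triple set sorted under a given collation order, e.g. 'POS'."""
    idx = {"S": 0, "P": 1, "O": 2}
    return sorted(tuple(t[idx[c]] for c in order) for t in triples)

SPO, POS, OSP = perm("SPO"), perm("POS"), perm("OSP")

def scan(index, prefix):
    """Range scan: all index entries whose leading components equal prefix."""
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_right(index, prefix + ("\uffff",) * (3 - len(prefix)))
    return index[lo:hi]

# Pattern (?s, bornIn, ?o): a range scan on the POS order.
print(scan(POS, ("bornIn",)))
```

Because every scan yields results in sorted order, multi-join queries can be processed largely by merge-joins over such ranges, which is key to the engine's performance.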