Clustered Results of Harvesting Entity Photos
This demonstrator is driven by the long-term objective of automatically building and maintaining comprehensive knowledge bases about entities and their semantic classes, relationships between entities, associated multimodal information such as photos, and informative cross-linkage with Web sites and Web 2.0 sources.
The demonstrator builds on earlier work in G. Weikum's group, which has developed methodologies for automatically constructing a large knowledge base from Wikipedia and other Web sources. The YAGO base has been downloaded several thousand times, and is used in many research projects worldwide. A new edition of the knowledge base, YAGO2, currently contains more than 200 million facts, including meta-facts about time, location, and provenance. The facts in knowledge bases like YAGO2 also serve as seeds for pattern-based information extraction from arbitrary natural-language texts and Web sources. This has been pursued in both RA1 (H. Uszkoreit) and RA5 (G. Weikum). The knowledge base can be leveraged to semantically annotate entities and facts in natural-language texts such as news or blogs. Entity recognition and disambiguation is also leveraged in the work of Sebastian Michel's JRG on Web 2.0 streams, as well as in joint work by Pinkal's and Weikum's groups, spanning RAs 1 and 5.
The search engine NAGA ranks search results based on a novel form of statistical language model, computed on the 500-million-page corpus ClueWeb2009 using MapReduce methods developed in Ralf Schenkel's JRG. To compute only the top-k results of a query, we have built on the algorithms jointly developed by Weikum's group and the JRGs led by Hannah Bast and Ralf Schenkel. To obtain large candidate lists with high precision and satisfactory recall when querying image search engines for entity names, our approach harnesses the knowledge base facts about the entities of interest, including salient keyphrases automatically mined from seed pages such as Wikipedia.
URDF is a powerful rule-based engine for querying and reasoning. With the rich knowledge imported from YAGO, it has been used as a backend engine for a dialog system developed at DFKI. Work on RA5 at the DFKI Language Technology Lab is aimed at novel applications that assist scientists in searching and authoring scientific publications.
Entity Annotation of News Text
In RA5, Weikum's group has developed methodologies for automatically constructing a large knowledge base from Wikipedia and other Web sources. The original YAGO base contained more than 2 million entities, properly assigned to semantic classes of a taxonomic backbone, and more than 20 million relationships between entities. These facts are represented as subject-property-object triples according to the RDF data model. The YAGO base has been downloaded several thousand times, and is used in many research projects worldwide: indicators are the large number of citations (more than 300) and the large number of textual mentions (in more than 6000 publications), according to Google Scholar.
Recently, we have completed a major revision of the YAGO architecture for automatic extraction, and built a new edition of the knowledge base, named YAGO2. The new extractors allow more facts to be harvested from Wikipedia, while retaining high accuracy. We integrated spatial and temporal facts to a much larger extent, including information from Geonames, with high-precision methods for automatic entity reconciliation and integration of semantic classes. Moreover, we harvested multilingual information from more than a hundred Wikipedia editions to capture entity and concept names as well as taxonomic relations in many languages. YAGO2 currently contains more than 200 million facts, including meta-facts about time, location, and provenance.
The facts in knowledge bases like YAGO2 also serve as seeds for pattern-based information extraction from arbitrary natural-language texts and Web sources. This has been pursued in both RA1 (Uszkoreit's group) and RA5 (Weikum's group). Our tools SOFIE and PROSPERA combine statistical evidence from pattern occurrences with logical reasoning over consistency constraints. PROSPERA uses MapReduce-style parallelism to scale up these computationally expensive procedures on distributed platforms. Its typical use case is massive-scale harvesting of new facts about known entities. However, it can also be used for on-the-fly gathering of facts about an individual entity given by the user. This way, the system can automatically construct an infobox-style fact sheet about a scientist, an organization, a cultural or natural landmark, etc. — even if the knowledge base and Wikipedia do not contain any prior information about the entity.
The knowledge base can be leveraged to semantically annotate entities and facts in natural-language texts such as news or blogs, and can enhance the automatic extraction of new relationships on the fly. This entails matching entity mentions in the text against the names of potential meanings in the knowledge base, to obtain a set of candidate entities. Next, a disambiguation method is applied to select the most likely entity for each mention. The disambiguation itself harnesses the knowledge base by comparing the textual context of a mention with the ontological context of a candidate entity (e.g., the classes to which the entity belongs, related entities, etc.) in order to compute similarities and rankings. The example demonstrates entity-level annotations in a news clip about first ladies using Twitter. Entity recognition and disambiguation is also leveraged in the work of Sebastian Michel's JRG on Web 2.0 streams and used in his demo enblogue.mmci.uni-saarland.de. Further research on disambiguation is pursued jointly by Pinkal's and Weikum's groups, spanning RAs 1 and 5.
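The two steps described above (candidate lookup by name, then ranking candidates by context similarity) can be illustrated with a minimal Python sketch. It is not the group's actual method: the toy dictionaries, the entity identifiers, and the choice of bag-of-words cosine similarity are all assumptions made for illustration.

```python
from collections import Counter
import math

# Hypothetical toy knowledge base (illustrative only):
# surface name -> candidate entities, and entity -> ontological context words
# (classes, related entities).
NAME_TO_CANDIDATES = {
    "paris": ["Paris_(France)", "Paris_Hilton"],
}
ENTITY_CONTEXT = {
    "Paris_(France)": ["city", "capital", "france", "seine", "louvre"],
    "Paris_Hilton": ["person", "celebrity", "hotel", "heiress", "television"],
}

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def disambiguate(mention, context_words):
    """Return (score, entity) pairs for a mention, best candidate first."""
    candidates = NAME_TO_CANDIDATES.get(mention.lower(), [])
    scored = [(cosine(context_words, ENTITY_CONTEXT[c]), c) for c in candidates]
    return sorted(scored, reverse=True)

ranking = disambiguate("Paris", ["the", "capital", "of", "france"])
best_entity = ranking[0][1]
```

Here the context words "capital" and "france" overlap only with the context of the city, so the city is ranked first. A realistic system would use much richer context and weighting.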
For querying the knowledge base, we offer both an API for programmed access and a UI for interactive exploration. At the API level, to answer richly structured questions, our search engine NAGA evaluates SPARQL predicates on the underlying RDF data. We also support keyword search over text-augmented RDF triples and flexible combinations of structured and keyword conditions, as well as adaptive methods for automatic relaxation of queries when their result set would be unduly small. For example, suppose a user is interested in composers from Europe who have won the Academy Award. The knowledge base has a rich class taxonomy and facts about birth places, location hierarchies, and movie awards. Further suppose that the user is particularly interested in composers who also worked with classical orchestras and on suspenseful movies like westerns or fantasies. These conditions cannot be mapped to relations of the knowledge base. By associating facts with “witness” documents where the facts occur (and from which they may be extracted), these documents form additional context for text search conditions. This way, a text-enhanced SPARQL query (with keyword conditions in curly braces) can return good matches such as Ennio Morricone or Javier Navarrete.
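The idea of attaching keyword conditions to triple patterns can be sketched in Python as follows. This is a simplified illustration, not NAGA's query processor: the fact list, the witness snippets, the relation names, and the placeholder entity Composer_X are hypothetical, and keyword matching is plain word containment without ranking or relaxation.

```python
# Hypothetical toy data: each fact is a (subject, predicate, object) triple
# paired with the text of a "witness" document in which the fact occurs.
FACTS = [
    (("Ennio_Morricone", "type", "composer"),
     "Italian composer known for western film scores"),
    (("Ennio_Morricone", "hasWonPrize", "Academy_Award"),
     "honored for film music performed by a classical orchestra"),
    (("Composer_X", "type", "composer"),
     "composer of chamber music"),
    (("Composer_X", "hasWonPrize", "Academy_Award"),
     "award for a romantic comedy soundtrack"),
]

def match(pattern, keywords, binding):
    """Yield extended variable bindings for one triple pattern.

    Terms starting with '?' are variables; every keyword must occur
    in the witness text of the matching fact.
    """
    for triple, witness in FACTS:
        new = dict(binding)
        ok = True
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                # Bind the variable, or check consistency with an earlier binding.
                if new.setdefault(p, t) != t:
                    ok = False
                    break
            elif p != t:
                ok = False
                break
        words = witness.lower().split()
        if ok and all(k in words for k in keywords):
            yield new

def query(conditions):
    """Join all (pattern, keywords) conditions and return the bindings."""
    bindings = [{}]
    for pattern, keywords in conditions:
        bindings = [b2 for b in bindings for b2 in match(pattern, keywords, b)]
    return bindings

results = query([
    (("?x", "type", "composer"), ["western"]),
    (("?x", "hasWonPrize", "Academy_Award"), ["orchestra"]),
])
```

With the toy data above, only the bindings for Ennio_Morricone satisfy both the structured and the keyword conditions. Dropping the keywords would also return the placeholder Composer_X.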
NAGA ranks search results based on a novel form of statistical language models. The parameters of this ranking model are estimated from large-scale mining of co-occurrence statistics for entity pairs, entity-keyword pairs, pairs of entities and relational patterns, and more. For NAGA this has been computed on the 500-million-page corpus ClueWeb2009 (the largest publicly available Web crawl), using MapReduce methods developed in Ralf Schenkel's JRG. To compute only the top-k results of a query, we have built on the algorithms jointly developed by Weikum's group and the JRGs headed by Hannah Bast and Ralf Schenkel.
At the UI level, NAGA provides a forms-based interface for guided entering of search conditions. This input is mapped into a set of RDF triple patterns. Alternatively, expert users can also type SPARQL queries directly. We are currently investigating ways of mapping natural-language questions into SPARQL queries, for restricted styles of question answering.
As real-world knowledge keeps changing, no knowledge base can ever be truly complete. Moreover, as we perform larger-scale fact extraction from natural-language text, the knowledge base may contain a certain fraction of low-confidence information, if not even incorrect statements. This leads to knowledge bases with uncertainty. Ranking search results alleviates these problems, but an alternative approach is to employ logical reasoning chains at query time.
URDF is a powerful rule-based engine for querying and reasoning. It uses Datalog-style rules for intensional predicates to infer results for otherwise unanswerable queries. In contrast to deductive databases, rules can be either soft (with confidence weights) or hard (e.g., for enforcing functional dependencies). The execution engine of URDF combines Datalog techniques with an approximation algorithm for Weighted MaxSat on grounded rules. This way, it can achieve interactive response times.
As an example, consider a user asking where Al Gore lives. The knowledge base does not have any information about his residence. However, we can use a soft rule stating that people usually live together with their spouses (Tipper Gore in this case). YAGO knows that Tipper Gore lives in Washington, DC, and it has also gathered the high-confidence statement that Al Gore and Tipper Gore are (still) married to each other. Then, the reasoning engine can infer that Washington, DC, is a likely answer to the user's question.
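The spouse example can be written down as a small Python sketch of a single soft rule. It deliberately replaces URDF's Datalog and Weighted MaxSat machinery with a naive product of confidences, and all numeric values (fact confidences and the rule weight) are illustrative assumptions.

```python
# Hypothetical confidences for the two facts used in the example.
FACTS = {
    ("Al_Gore", "marriedTo", "Tipper_Gore"): 0.9,
    ("Tipper_Gore", "livesIn", "Washington_DC"): 0.95,
}

# Assumed weight of the soft rule:
#   livesIn(X, C) <- marriedTo(X, Y), livesIn(Y, C)
SPOUSE_RULE_WEIGHT = 0.8

def lives_in(person):
    """Return {city: confidence} from stored facts plus the soft spouse rule."""
    answers = {}
    # Direct facts about the person's residence, if any.
    for (s, p, o), conf in FACTS.items():
        if s == person and p == "livesIn":
            answers[o] = max(answers.get(o, 0.0), conf)
    # Facts derived through the soft rule; confidence is the product of
    # the rule weight and the confidences of the two premises.
    for (s, p, spouse), c_married in FACTS.items():
        if s == person and p == "marriedTo":
            for (s2, p2, city), c_lives in FACTS.items():
                if s2 == spouse and p2 == "livesIn":
                    derived = SPOUSE_RULE_WEIGHT * c_married * c_lives
                    answers[city] = max(answers.get(city, 0.0), derived)
    return answers

al_gore = lives_in("Al_Gore")
```

Under these assumed numbers, Washington_DC is returned for Al_Gore with a confidence of roughly 0.68, lower than the stored fact for Tipper_Gore because the answer depends on a soft rule. Hard rules and conflict resolution between competing derivations are outside the scope of this sketch.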
URDF, with the rich knowledge imported from YAGO, has been used as a backend engine for a dialog system developed at DFKI.
Work on RA5 at the DFKI Language Technology Lab is aimed at novel applications that assist scientists in searching and authoring scientific publications. By using and advancing natural language processing (NLP) methods and tools, the full content of scholarly publications is lifted into a structured representation with semantic markup. The work currently concentrates on a corpus of 19,200 papers from the past 47 years in the field of computational linguistics and language technology, the ACL Anthology. The NLP aspects are largely independent of the science domain; they build on generic linguistic knowledge and machine learning, extended by automatic extraction where domain-specific resources are required.
The automatically pre-computed, normalized semantic representations of sentences contained in the processed Anthology help to structure the search space and find equivalent or related propositions even if they are expressed differently, e.g. in passive constructions, using synonyms etc. A hybrid chain of shallow and deep NLP tools computes these semantic structures in an offline step on a compute grid. Semantic similarity of domain-specific nouns and multi-word expressions can be computed automatically. This semantically augmented anthology is combined with the LT World Ontology and freely available resources such as WordNet, and exploited as an additional knowledge source to enhance the user's search experience.
The resulting application framework, begun as the Scientist's Workbench demonstrator, has now become a public service to the scientific community under the name ACL Anthology Searchbench. It supports semantic, full-text and bibliographic search in the Anthology texts. The user can search for subject-predicate-object triples in millions of sentences; predicates are also matched against their synonyms, and passives and sentence negation are taken into account. The combination with full-text and bibliographic search filters, autosuggest texts in search fields, and faceted search makes searching easy and fast. Optionally, search results are highlighted in the original PDF layout. Furthermore, it is possible to bookmark queries or send them via email as simple web links.

In order to exploit citations for monitoring and understanding the evolution of knowledge through scientific progress, new bibliometric methods were developed for qualitative analysis of citation graphs. To this end, the contexts of citations are linguistically analyzed and classified into categories like basedOn, confirmed, refuted, improved, etc. The typed citation graph is automatically computed for the papers in the anthology collection (approx. 91,000 citation sentences in approx. 6,300 papers), and an appropriate graph layout is computed for enhanced navigation in paper collections. Colored edges indicate the citation types, and the edges are clickable to allow inspection of the citation(s) that caused the classification. The graphical navigator helps users to understand the development of scientific ideas over time. This bridges the gap between citation statistics and text search.

With the proliferation of photo and video footage on the Web, a knowledge base would not be complete without multimodal data on individual entities (people, places, etc.) and important events (concerts, award ceremonies, soccer matches, etc.). While photos of celebrities are abundant on the Internet, they are much harder to retrieve for less popular entities such as notable computer scientists or regionally interesting churches. Querying the entity names in image search engines yields large candidate lists, but they often have low precision and unsatisfactory recall. Moreover, even for more prominent targets, it is desirable to have a diverse collection of photos (e.g., from different time periods), some of which might be rare and difficult to locate using search engines. In some cases, the ambiguity of the entity name dilutes the search engine results. An example is the Berkeley professor and former ACM president David Patterson. None of the top 20 image results from major search engines show him; most show the governor of New York (whose name is actually David Paterson).
Our approach to these problems is to harness the knowledge base facts about the entities of interest, including salient keyphrases automatically mined from seed pages such as Wikipedia articles or homepages of entities. We then generate expanded queries to obtain a richer pool of candidates from image search engines. Subsequently, the results from different queries are aggregated by weighted voting, also taking into consideration the proximity of fact-related text snippets in the pages that contain the photo. As the same photo may be returned with different URIs from several queries, we compare photos by their visual similarity using SIFT features. This provides us with a clustering of near-duplicate images, which in turn can be used to improve the diversity of the final result photos.
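The aggregation and diversification steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the query weights are given as plain numbers (in the described approach they would also reflect the proximity of fact-related text snippets), and the SIFT-based visual comparison is abstracted into a hypothetical `similar` function supplied by the caller.

```python
from collections import defaultdict

def aggregate_votes(query_results, query_weights):
    """Sum, per photo, the weights of all expanded queries that returned it."""
    scores = defaultdict(float)
    for query, photos in query_results.items():
        for photo in photos:
            scores[photo] += query_weights[query]
    return dict(scores)

def cluster_near_duplicates(photos, similar):
    """Greedy clustering: a photo joins the first cluster whose
    representative (first member) is similar to it."""
    clusters = []
    for photo in photos:
        for cluster in clusters:
            if similar(cluster[0], photo):
                cluster.append(photo)
                break
        else:
            clusters.append([photo])
    return clusters

def diverse_top(query_results, query_weights, similar, k):
    """Return up to k photos, one per near-duplicate cluster, best score first."""
    scores = aggregate_votes(query_results, query_weights)
    # Sort by descending score; ties are broken by name for determinism.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    clusters = cluster_near_duplicates(ranked, similar)
    return [c[0] for c in clusters[:k]]

# Hypothetical example data; file names and weights are made up.
results = {
    "q1": ["a.jpg", "b.jpg"],
    "q2": ["a_copy.jpg", "c.jpg"],
    "q3": ["a.jpg"],
}
weights = {"q1": 1.0, "q2": 0.5, "q3": 2.0}

def same_stem(x, y):
    """Stand-in for a visual similarity test: compare file name stems."""
    return x.split(".")[0].split("_")[0] == y.split(".")[0].split("_")[0]

top = diverse_top(results, weights, same_stem, k=2)
```

Because the ranked list is clustered in score order, each cluster's representative is its highest-scoring member, so near-duplicates of an already selected photo do not consume further result slots.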