RA5: Knowledge Management

Vision and Research Strategy

The Web has the potential to be the world's greatest encyclopedic source, but we are far from exploiting this potential. Valuable scientific and cultural content is mixed with huge amounts of noisy, low-quality, unstructured text and media. However, the proliferation of knowledge-sharing communities like Wikipedia and the advances in automated information extraction from Web pages open up an unprecedented opportunity: we can systematically harvest facts from the Web and compile them into a comprehensive, formally represented knowledge base about the world's entities, their semantic properties, and their relationships with each other. Ultimately, we would like to distill valuable assets from the Web in order to automatically construct and maintain a high-quality, machine-processable “Open Science Web.” As a concrete milestone, imagine a “Structured Wikipedia” that has the same scale and richness as Wikipedia itself, but offers a precise and concise representation of knowledge, e.g., in the RDF format. Such a representation enables expressive and highly precise querying, e.g., in the SPARQL language (or appropriate extensions), with additional capabilities for informative ranking of query results.

The Open Science Web should ideally know all individual entities of this world (e.g., David Patterson), their semantic classes (e.g., DavidPatterson isa ComputerScientist, DavidPatterson isa UniversityProfessor), relationships between entities (e.g., DavidPatterson worksFor UCBerkeley, DavidPatterson hasInvented RISC), as well as validity times and confidence values for the correctness of such facts (e.g., DavidPatterson hasPosition ACMpresident [2004,2006]). It can connect entities with multimodal information like photos and videos. Moreover, it comes with logical reasoning capabilities and rich support for querying. Concrete instantiations of this framework can focus on scholarly knowledge about researchers, organizations, projects, publications, software, etc., or on specific aspects of common diseases and their genetic as well as external factors.
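The fact model described above can be illustrated with a minimal sketch, under stated assumptions. The `Fact` class, the `match` function, and the in-memory list are hypothetical names chosen for illustration; they are not the project's actual data model or query engine. The source gives no confidence values for the example facts, so the default of 1.0 is only a placeholder. The validity interval is treated as an inclusive range of years.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    """One subject-predicate-object fact with optional metadata (hypothetical model)."""
    subject: str
    predicate: str
    obj: str
    valid: Optional[Tuple[int, int]] = None  # inclusive [start_year, end_year]; None = untimed
    confidence: float = 1.0                  # placeholder; the source gives no values


# Example facts taken from the text above.
FACTS: List[Fact] = [
    Fact("DavidPatterson", "isa", "ComputerScientist"),
    Fact("DavidPatterson", "isa", "UniversityProfessor"),
    Fact("DavidPatterson", "worksFor", "UCBerkeley"),
    Fact("DavidPatterson", "hasInvented", "RISC"),
    Fact("DavidPatterson", "hasPosition", "ACMpresident", valid=(2004, 2006)),
]


def match(facts: List[Fact],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None,
          year: Optional[int] = None,
          min_confidence: float = 0.0) -> List[Fact]:
    """Return facts matching a triple pattern; None acts as a wildcard.

    If `year` is given, timed facts must be valid in that year;
    untimed facts are assumed to always hold.
    """
    results = []
    for f in facts:
        if subject is not None and f.subject != subject:
            continue
        if predicate is not None and f.predicate != predicate:
            continue
        if obj is not None and f.obj != obj:
            continue
        if f.confidence < min_confidence:
            continue
        if year is not None and f.valid is not None:
            start, end = f.valid
            if not (start <= year <= end):
                continue
        results.append(f)
    return results


if __name__ == "__main__":
    # Which classes does DavidPatterson belong to?
    print([f.obj for f in match(FACTS, subject="DavidPatterson", predicate="isa")])
    # Which positions did he hold in 2005?
    print([f.obj for f in match(FACTS, predicate="hasPosition", year=2005)])
```

A pattern with wildcards loosely mirrors a single SPARQL triple pattern. A real system would presumably store such facts in RDF and add joins across patterns, reasoning, and ranking, none of which this sketch attempts.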

The benefits of building such comprehensive knowledge bases will be enormous. Potential applications include 1) a formalized machine-readable encyclopedia that can be queried with high precision like a semantic database; 2) a key asset for disambiguating entities by supporting fast and accurate mappings of textual phrases onto named entities in the knowledge base; 3) an enabler for entity-relationship-oriented semantic search on the Web, for detecting entities and relations in Web pages and reasoning about them in expressive (probabilistic) logics; 4) a backbone for natural-language question answering that would aid in dealing with entities and their relationships when answering who/where/when questions and the like; 5) a key asset for machine translation (e.g., English to German) and interpretation of spoken dialogs, where world knowledge provides essential context for disambiguation; and 6) a catalyst for acquisition of further knowledge and largely automated maintenance and growth of the knowledge base.

Our methodology towards these ambitious long-term goals is twofold: on the foundational-research side, we are developing new models and algorithms for knowledge harvesting, search, and ranking; on the system-integration and experimental-research side, we are developing software tools and building knowledge bases as demonstrators of our progress and for dissemination to the wider research community.

Research Topics and Achievements

Responsible Investigators and Personal Development

Collaborations

Prizes and Awards

Publications