Efficient search in large collections of text and XML
We are aiming at providing interactive response times for queries, even if they are to run on extremely large collections in the order of Terabytes, or on the Web. Besides developing efficient exact top-k algorithms, we have focused on approaches that quickly determine good approximations of the final results. Here, we develop promising heuristics for budget-aware algorithms that try to find the best approximate answers within a given budget of resource cost (like disk accesses) or within a given time. Our budget-aware algorithms produce results with a much higher precision than state-of-the-art budget-oblivious top-k algorithms.
In a second line of work, we exploit extensive precomputations of indexes for term pairs, trading in improved query processing time for higher disk storage requirements. With a precomputed index not exceeding the size of the indexed collection, our processing algorithm retrieves search results orders of magnitude faster than state-of-the-art algorithms, while preserving a similar result quality for reasonable numbers of results. At the same time, execution time is independent of the number of retrieved results.
Efficient search in social networks
Social networks like Facebook and MySpace, but also community platforms such as del.icio.us, YouTube, and Flickr, have become very popular for sharing items such as images or links, annotating them with tags, adding ratings or comments to existing content, and maintaining a network of friends. Users search for items by providing a set of tags. Such search tasks can be classified into three different categories: social search for items contributed by explicit friends, spiritual search for items provided by users with a similar interest profile, and global search for the most frequent items (currently the dominant paradigm on these platforms). We developed a context-aware scoring function which, in contrast to standard IR query models, can be tuned towards the different search tasks. Experiments show that this score is significantly more effective than a global score that does not take the querying user and her friends into account.
On the efficiency side, we developed the ContextMerge algorithm for efficient top-k search and ranking in social networks. It extends existing top-k threshold algorithms over inverted lists of different types with various novel techniques for the specific setting of large online communities. To leverage the "social wisdom," the algorithm incrementally explores the space of (explicit or implicit) friends and accumulates scores of items on the fly as they are encountered, limiting the expansion to a minimal number of related users. The algorithm can efficiently compute the best matches for social and spiritual search and, by falling back to a threshold algorithm on precomputed index lists, also for global search, even in huge communities with high dynamics. The SENSE system (for Socially ENhanced SEarch) provides tag-based search for a variety of social networks, including del.icio.us and Flickr.
Maintaining persistent archives of highly dynamic document collections
The World Wide Web has become a key source of knowledge, but much of the data on the Web is highly ephemeral in nature. Organizations like the Internet Archive have been working towards preserving the ever-changing Web. However, they have so far paid little attention to developing a global-scale infrastructure for collecting, archiving, and performing historical analyses on the collected data. Our Everlast approach, a scalable distributed framework for next generation Web archival and temporal text analytics over the archive, aims to close this gap. Everlast pursues two main goals: (1) efficiently collecting and maintaining an archive of the World Wide Web with good coverage, and (2) enabling historical text analytics by efficiently processing time-travel keyword queries to return a ranked list of relevant documents from a web snapshot in the past. We combine standard methods of crawling with human-assisted crawlers — archival plugins in browsers or proxies which capture Web pages of archival interest and publish them into the archive. The archive itself combines existing persistence solutions from the peer-to-peer community, a storage-metadata management layer required for Web archiving, and an indexing layer for processing time-travel queries.
Distributed Knowledge Management
In the WisNetGrid project, our goal is to make the large amount of data stored in the German science grid accessible beyond the scope of a single application domain. To achieve this, we develop federated search interfaces for integrated search over the large number of structurally heterogeneous data stores in the German science grid. In addition to this, we leverage knowledge extraction techniques developed at the MPI-INF to distill facts from information available in the grid and on domain-specific Web sources, enabling precise search for knowledge for diverse application areas such as humanities and environmental sciences. Here, the main scientific challenges are the distributed nature of the data, the need for disambiguation and record linkage, and very complex applications that seem to require more than plain RDF-style factual knowledge.