Unsupervised Semantic Parsing
Enabling a computer to understand human language is one of the ultimate goals of NLP. Arguably, such understanding is not possible without semantic parsing, i.e. mapping of text to a representation of meaning. In recent years, there has been increasing interest in statistical approaches to semantic parsing. However, most of this research has focused on supervised methods requiring large amounts of data labeled by human experts. Such annotated resources are scarce and expensive to create, motivating the need for unsupervised or semi-supervised techniques.
Recently, we introduced the first nonparametric Bayesian model for unsupervised full semantic parsing. The model defines a generative story for recursive construction of lexical items along with syntactic and semantic structures. The method achieves competitive results on the question-answering task on the biomedical domain.
Bootstrapping Semantic Analyzers from Non-Contradictory Texts
Unsupervised models of semantics have their own challenges: they are not always able to establish semantic equivalences of lexical entries or logical forms or, on the contrary, cluster semantically different or even opposite expressions. For example, when analyzing weather forecasts, it is very hard to discover without explicit supervision which of the expressions “south wind”, “wind from west", and “southerly” denote the same wind direction and which do not, as they all have a very similar distribution of their contexts.
In our work, we showed that groups of unannotated texts with overlapping and non-contradictory semantics represent a valuable form of weak supervision. We assume that each such group of texts corresponds to some hidden semantic state, and each text in the group verbalizes a subset of this full semantic state. For example, consider semantic analysis of news articles or biographies. In both cases, we can find groups of documents referring to the same events or persons, and though they will probably focus on different aspects and have different subjective passages, they are likely to agree on the core information. Simultaneous inference of the semantic state for such a group would restrict the space of compatible hypotheses, and, intuitively, "easier" texts in a group will help to analyze the "harder" ones. The improvement in accuracy demonstrated in our experiments corresponds to more than 50% relative error reduction with a fully-supervised method used as an upper bound.
Modeling Documents with Inherently Parallel Structure
Documents often have inherently parallel structure: they may consist of a text and commentaries, or an abstract and a body, or sections presenting alternative views on the same problem. Automatically revealing relations between the parts, by jointly segmenting and predicting links between the segments, would open interesting possibilities for construction of friendlier user interfaces. One example is an application which, given a user-selected fragment of the abstract, produces a summary from the aligned segment of the document body. To address this problem, we introduced an unsupervised Bayesian model for joint discourse segmentation and alignment. When applied to a lecture-discussion dataset, our unsupervised method achieves competitive results rivaling those of a previously proposed fully-supervised technique. This work can also be regarded as a way to speed up inference for the bootstrapping approach discussed above, since, given such an alignment, a hidden semantic state can be inferred independently for each of the aligned segments.
Unsupervised Aggregation in Multiclass and Structured Settings
Consider the setting where a panel of judges is repeatedly asked to rank hypotheses and to predict a graph (e.g., semantic or syntactic structure) or a category for an object, and assume that the judges' expertise depends on the hidden objects' type (or domain). Learning to aggregate their predictions with the goal of producing a better joint prediction is a fundamental problem in many areas of information retrieval and NLP, among others. However, supervised ranking data is generally hard to obtain, and data augmented with expertise information is particularly difficult. We proposed a generative framework for learning to aggregate predictors with domain-specific expertise, without supervision.
Learning Feature Representations for Syntactic and Semantic Parsing
State-of-the-art approaches to syntactic and semantic parsing problems are mostly based on statistical learning, i.e., they use annotated training data collected for the considered problem to build statistical models of semantic and syntactic structures. Even though values of model parameters are determined by learning algorithms, model designers have to perform extensive manual selection of predictive structural features before applying the learning algorithms. These models are expensive to create and difficult to port to new domains, even when they are similar. If the model designer omits an important predictive feature, the resulting model will not be able to discover the statistical dependency, and the model accuracy will be worse.
We study latent variable models which automatically induce feature representations appropriate for the task and even the dataset, and the methods we developed achieve state-of-the-art results on multiple languages with minimal feature engineering. Latent variables can also be used to address sparsity problems. For example, the FrameNet standard describes an exceedingly large number of frame elements (argument types), and a model, which treats each frame element as a distinct label, is unlikely to succeed unless a very large amount of annotated data is provided. In our recent work, we constructed a latent variable model which automatically aligns frame elements and, as a result, dramatically reduces the set of model parameters.