# Statistical Learning with Structured Data

## Project Description

Graph-based learning methods have become a central topic in machine learning. They owe their success to the fact that they require only relative similarity information rather than an absolute feature representation, and can therefore be applied to any kind of data. Moreover, graph-based learning methods are data-adaptive: the global graph structure is built up from local similarity information, which is the key principle of manifold learning.

The construction of a clustering of a graph is equivalent to the construction of a partitioning, usually defined by a certain cut criterion. Recently, we investigated the relaxation of the normalized cut criterion leading to spectral clustering, a widely used clustering method, from a new perspective. We were able to show that by replacing the graph Laplacian used in spectral clustering with the graph p-Laplacian, a nonlinear generalization of the graph Laplacian, we interpolate between relaxations of the normalized cut and the Cheeger cut. Our main result is that the partition found by optimal thresholding of the second eigenvector of the p-Laplacian converges to the optimal Cheeger cut, also known as the conductance of the graph, as p tends towards one. Moreover, p-spectral clustering is not only theoretically superior to standard spectral clustering, but also achieves better clustering results on a large number of datasets. Despite the more complicated optimization problem, we are still able to partition sparse graphs with a few hundred thousand vertices. Currently, we are applying the idea of using eigenvectors of nonlinear operators to other problems in unsupervised data analysis.
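To illustrate the thresholding step, the following sketch shows the standard p = 2 case: compute the second eigenvector of the graph Laplacian, sweep over all thresholds, and keep the cut with the smallest Cheeger ratio. This is a minimal illustration, not our actual implementation (which handles the nonlinear p-Laplacian and large sparse graphs); the function name and dense-matrix setup are our own.

```python
import numpy as np

def cheeger_cut_by_thresholding(W):
    """Bipartition a graph by optimally thresholding the second
    eigenvector of the standard (p = 2) graph Laplacian.

    W: symmetric weighted adjacency matrix (numpy array).
    Returns the best set as a boolean mask together with its Cheeger
    ratio cut(S, S^c) / min(vol(S), vol(S^c)).
    """
    d = W.sum(axis=1)
    L = np.diag(d) - W                        # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)
    v2 = vecs[:, 1]                           # second eigenvector
    best_ratio, best_set = np.inf, None
    for t in np.sort(v2)[:-1]:                # sweep over all thresholds
        S = v2 <= t
        if S.all() or not S.any():            # skip trivial partitions
            continue
        cut = W[S][:, ~S].sum()               # weight of edges leaving S
        denom = min(d[S].sum(), d[~S].sum())  # smaller side's volume
        ratio = cut / denom
        if ratio < best_ratio:
            best_ratio, best_set = ratio, S
    return best_set, best_ratio
```

On a graph consisting of two dense clusters joined by a weak edge, the sweep recovers the cluster split as the minimizer of the Cheeger ratio.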

Semi-supervised learning deals with the situation where unlabeled data is available in abundance but labeled data is scarce, as it is usually either time-consuming or very expensive to obtain labeled data. While semi-supervised classification has been heavily studied in recent years, we have tackled the problem of semi-supervised regression. Our main goal was the construction of a regularization functional for semi-supervised regression that prefers functions which vary linearly with respect to the geodesic distance on the "data manifold", and which is thus particularly suited for semi-supervised dimensionality reduction. Compared to other methods that use the graph Laplacian as a regularizer, the low-dimensional embeddings obtained via our semi-supervised regression method are very smooth and extrapolate "linearly" beyond the labeled points. Due to the semi-supervised setting, each dimension of the obtained embedding has a particular user-specified meaning, as opposed to unsupervised methods, and thus allows a direct interpretation. Finally, our method achieves state-of-the-art results in the domain of image colorization.
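For reference, the baseline we compare against — semi-supervised regression with the graph Laplacian as regularizer — can be sketched as follows. This is the standard Laplacian-regularized least-squares baseline, not our functional; the function name and parameters are our own illustration.

```python
import numpy as np

def laplacian_regularized_regression(W, labeled_idx, y, lam=1.0):
    """Baseline semi-supervised regression: minimize
        sum_{i labeled} (f_i - y_i)^2  +  lam * f^T L f
    over all vertex values f, where L is the graph Laplacian of W.
    The minimizer solves the linear system (J + lam * L) f = J y,
    with J the diagonal indicator of the labeled points.
    """
    n = W.shape[0]
    d = W.sum(axis=1)
    L = np.diag(d) - W            # unnormalized graph Laplacian
    J = np.zeros(n)               # diagonal indicator of labeled points
    b = np.zeros(n)
    J[labeled_idx] = 1.0
    b[labeled_idx] = np.asarray(y, dtype=float)
    return np.linalg.solve(np.diag(J) + lam * L, b)
```

On a chain graph with only the two endpoints labeled, the solution interpolates almost harmonically between the labels, i.e. linearly along the chain.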

### Sparse Recovery of Positive Signals with Application to Mass Spectrometry

The problem of sparse recovery has become one of the major research topics in recent years in machine learning, signal processing and applied mathematics. This is due to the crucial observation that many types of high-dimensional data and signals occurring in nature can be well described by a sparse representation, that is, only a small number of coefficients in a suitable dictionary are non-zero. This principle can be used for efficient prediction and feature selection. Another major area is the field of compressed sensing, where the sparsity of signals is exploited in the acquisition process: far fewer measurements are required than suggested, e.g., by the Shannon–Nyquist sampling theorem.
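As a small illustration of the compressed-sensing setting, the following sketch recovers a sparse signal from fewer random measurements than its ambient dimension using orthogonal matching pursuit, one of the standard greedy sparse-recovery methods (chosen here only for brevity; it is not the method discussed in this section, and all names and sizes are invented).

```python
import numpy as np

def omp(A, b, k):
    """Orthogonal matching pursuit: greedily select the k dictionary
    columns most correlated with the current residual, refitting by
    least squares on the selected support after each selection."""
    residual, support = b.astype(float).copy(), []
    for _ in range(k):
        corr = np.abs(A.T @ residual)
        corr[support] = -np.inf               # never reselect a column
        support.append(int(np.argmax(corr)))
        x_s, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support] @ x_s
    x = np.zeros(A.shape[1])
    x[support] = x_s
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 40))             # 20 measurements, dimension 40
A /= np.linalg.norm(A, axis=0)                # unit-norm dictionary columns
x_true = np.zeros(40)
x_true[[3, 17, 30]] = [2.0, -1.5, 1.0]        # 3-sparse signal
x_hat = omp(A, A @ x_true, k=3)               # recover from 20 measurements
```

The returned estimate is k-sparse by construction; under suitable incoherence conditions on the measurement matrix, greedy methods of this kind provably recover the true support.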

The reliable and efficient identification of all proteins in mass spectrograms is the basis of all further investigation in computational proteomics. In collaboration with Andreas Hildebrandt's junior research group for protein-protein interactions and computational proteomics, we have formulated the detection of the mass signatures as a high-dimensional sparse recovery problem. Exploiting the positivity of the spectrum, we use a pure fitting approach based on non-negative least squares. Compared to the Lasso, by now the standard method for sparse recovery, our estimator is unbiased and robust to the heterogeneous noise that often appears in mass spectrograms. The whole system is basically parameter-free, adapts to new spectra without the necessity of any user interaction, and, compared to previous approaches, is particularly suited for the resolution of overlapping mass signatures in MALDI as well as ESI spectra. A publication and the corresponding Bioconductor package in R are currently in preparation.
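The fitting approach can be sketched as follows: build a dictionary of candidate mass signatures and fit the observed spectrum by non-negative least squares, reading off the detected signatures from the non-zero coefficients. In this toy sketch, Gaussian peak templates stand in for the real isotope-pattern signatures, and all positions, widths and amplitudes are invented for illustration.

```python
import numpy as np
from scipy.optimize import nnls

# Illustrative only: Gaussian templates stand in for isotope patterns.
x = np.linspace(0, 100, 400)                  # m/z axis
positions = np.arange(5, 100, 5)              # candidate signature locations
D = np.exp(-(x[:, None] - positions[None, :])**2 / 2.0)  # sigma = 1 templates

rng = np.random.default_rng(0)
beta_true = np.zeros(len(positions))
beta_true[[2, 7, 12]] = [3.0, 1.5, 2.0]       # three active signatures
spectrum = D @ beta_true + 0.01 * rng.standard_normal(len(x))

beta_hat, _ = nnls(D, spectrum)               # non-negative least squares fit
support = np.where(beta_hat > 0.5)[0]         # detected signatures
```

Because the non-negativity constraint alone drives most inactive coefficients to exactly zero, no explicit sparsity penalty (and hence no regularization parameter) is needed — the source of the method's parameter-free behavior.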

Motivated by the practical success of non-negative least squares for sparse recovery of positive signals, we have recently started its theoretical investigation. Using results from random matrix theory and the theory of high-dimensional convex cones, we were able to prove that non-negative least squares recovers the original sparse positive signal under certain conditions, even if the data is corrupted by noise. Moreover, our bounds suggest that non-negative least squares outperforms the standard Lasso in certain settings. The corresponding paper is currently under review.