Statistics Technical Reports

Title:A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
Author(s):Negahban, Sahand; Ravikumar, Pradeep; Wainwright, Martin J.; Yu, Bin; 
Date issued:October 2010 (PDF)
Abstract:High-dimensional statistical inference deals with models in which the number of parameters $p$ is comparable to or larger than the sample size $n$. Since it is usually impossible to obtain consistent procedures unless $p/n \rightarrow 0$, a line of recent work has studied models with various types of low-dimensional structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized $M$-estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. This paper provides a unified framework for establishing consistency and convergence rates for such regularized $M$-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive some existing results, and also to obtain a number of new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure corresponding regularized $M$-estimators have fast convergence rates, and which are optimal in many well-studied cases.
Report ID:797
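
In standard notation (not taken verbatim from the report), the regularized $M$-estimators in question solve a convex program of the form:

```latex
\hat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p}
  \Bigl\{ \mathcal{L}(\theta; Z_1^n) + \lambda_n \, \mathcal{R}(\theta) \Bigr\}
```

where $\mathcal{L}$ is the loss over samples $Z_1^n$, $\mathcal{R}$ is the regularizer (e.g., the $\ell_1$ norm for sparse vectors, or the nuclear norm for low-rank matrices), and $\lambda_n$ is the regularization weight. Decomposability roughly requires $\mathcal{R}(\theta + \gamma) = \mathcal{R}(\theta) + \mathcal{R}(\gamma)$ for $\theta$, $\gamma$ ranging over a suitable pair of subspaces.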

Title:Trickle-down processes and their boundaries
Author(s):Evans, Steven N.; Gruebel, Rudolf; Wakolbinger, Anton; 
Date issued:October 2010 (PDF)
Abstract:It is possible to represent each of a number of Markov chains as an evolving sequence of connected subsets of a directed acyclic graph that grow in the following way: initially, all vertices of the graph are unoccupied, particles are fed in one-by-one at a distinguished source vertex, successive particles proceed along directed edges according to an appropriate stochastic mechanism, and each particle comes to rest once it encounters an unoccupied vertex. Examples include the binary and digital search tree processes, the random recursive tree process and generalizations of it arising from nested instances of Pitman's two-parameter Chinese restaurant process, tree-growth models associated with Mallows' $\phi$ model of random permutations and with Schützenberger's non-commutative $q$-binomial theorem, and a construction due to Luczak and Winkler that grows uniform random binary trees in a Markovian manner. We introduce a framework that encompasses such Markov chains, and we characterize their asymptotic behavior by analyzing in detail their Doob-Martin compactifications, Poisson boundaries and tail sigma-fields.
Report ID:796
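
The growth mechanism described in the abstract can be sketched in a few lines. The uniform left/right rule below is one concrete stochastic mechanism on the infinite binary tree (the named processes each use their own rule; this is an illustrative assumption):

```python
import random

def grow_tree(n, seed=0):
    """Grow a random tree by feeding particles in one at a time.

    Each particle starts at the source (root), moves to a uniformly
    random child while the current vertex is occupied, and comes to
    rest at the first unoccupied vertex it encounters. Vertices are
    encoded as left/right paths from the root, so the root is ().
    """
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n):
        v = ()                          # start at the source vertex
        while v in occupied:
            v = v + (rng.randint(0, 1),)  # step to a random child
        occupied.add(v)                 # particle comes to rest here
    return occupied

tree = grow_tree(100)
```

By construction the occupied set is always a connected subset of the tree containing the root, exactly as in the processes the paper studies.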

Title:Minimax-optimal rates for sparse additive models over kernel classes via convex programming
Author(s):Raskutti, Garvesh; Wainwright, Martin J.; Yu, Bin; 
Date issued:August 2010 (PDF)
Abstract:Sparse additive models are families of $d$-variate functions that have the additive decomposition \mbox{$f^* = \sum_{j \in S} f^*_j$,} where $S$ is an unknown subset of cardinality $s \ll d$. We consider the case where each component function $f^*_j$ lies in a reproducing kernel Hilbert space, and analyze a simple kernel-based convex program for estimating the unknown function $f^*$. Working within a high-dimensional framework that allows both the dimension $d$ and sparsity $s$ to scale, we derive convergence rates in the $L^2(\mathbb{P})$ and $L^2(\mathbb{P}_n)$ norms. These rates consist of two terms: a \emph{subset selection term} of the order $\frac{s \log d}{n}$, corresponding to the difficulty of finding the unknown $s$-sized subset, and an \emph{estimation error} term of the order $s \, \nu_n^2$, where $\nu_n^2$ is the optimal rate for estimating a univariate function within the RKHS. We complement these achievable rates by deriving minimax lower bounds on the $L^2(\mathbb{P})$ error, thereby showing that our method is optimal up to constant factors for sub-linear sparsity $s = o(d)$. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, finite-rank kernel classes, as well as Sobolev smoothness classes.
Report ID:795

Title:Non-existence of Markovian time dynamics for graphical models of correlated default
Author(s):Evans, Steven N.; Hening, Alexandru; 
Date issued:August 2010 (PDF)
Abstract:Filiz et al. (2008) proposed a model for the pattern of defaults seen among a group of firms at the end of a given time period. The ingredients in the model are a graph, where the vertices correspond to the firms and the edges describe the network of interdependencies between the firms, a parameter for each vertex that captures the individual propensity of that firm to default, and a parameter for each edge that captures the joint propensity of the two connected firms to default. The correlated default model can be rewritten as a standard Ising model on the graph by identifying the set of defaulting firms in the default model with the set of sites in the Ising model for which the spin is +1. We ask whether there is a suitable continuous time Markov chain taking values in the subsets of the vertex set such that the initial state of the chain is the empty set, each jump of the chain involves the inclusion of a single extra vertex, the distribution of the chain at some fixed time horizon is the one given by the default model, and the distribution of the chain for other times is described by a probability distribution in the same family as the default model. We show for three simple but financially natural special cases that this is not possible outside of the trivial case where there is complete independence between the firms.
Report ID:794

Title:Padé approximants and exact two-locus sampling distributions
Author(s):Jenkins, Paul A.; Song, Yun S.; 
Date issued:July 2010 (PDF)
Abstract:For population genetics models with recombination, obtaining an exact, analytic sampling distribution has remained a challenging open problem for several decades. Recently, a new perspective based on asymptotic series has been introduced to make progress on this problem. Specifically, closed-form expressions have been derived for the first few terms in an asymptotic expansion of the two-locus sampling distribution when the recombination rate $\rho$ is moderate to large. In this paper, a new computational technique is developed for finding the asymptotic expansion to an arbitrary order. Computation in this new approach can be automated easily. Furthermore, it is proved here that only a finite number of terms in the asymptotic expansion is needed to recover (via the method of Padé approximants) the exact two-locus sampling distribution as an analytic function of $\rho$; this function is exact for all values of $\rho$ from 0 to infinity. It is also shown that the new computational framework presented here is flexible enough to incorporate natural selection.
Report ID:793
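
The method of Padé approximants invoked in the abstract can be illustrated generically: given the first $L+M+1$ Taylor coefficients of a function, the $[L/M]$ rational approximant is recovered by a single linear solve. This is the textbook construction, not the paper's specialized algorithm for the sampling distribution:

```python
import numpy as np

def pade(c, L, M):
    """[L/M] Pade approximant from Taylor coefficients c[0..L+M].

    Returns (a, b): numerator and denominator coefficients with
    b[0] = 1, chosen so that sum_j b[j] * c[k-j] equals a[k] for
    k <= L and equals 0 for L < k <= L+M.
    """
    c = np.asarray(c, dtype=float)
    # Solve for b[1..M] from the "zero" conditions at orders L+1..L+M.
    A = np.array([[c[L + k - j] if L + k - j >= 0 else 0.0
                   for j in range(1, M + 1)] for k in range(1, M + 1)])
    b = np.concatenate(([1.0], np.linalg.solve(A, -c[L + 1:L + M + 1])))
    # Numerator coefficients follow directly by convolution.
    a = np.array([sum(b[j] * c[k - j] for j in range(0, min(k, M) + 1))
                  for k in range(L + 1)])
    return a, b

# exp(x) has coefficients (1, 1, 1/2); its [1/1] Pade approximant
# is (1 + x/2) / (1 - x/2).
a, b = pade([1.0, 1.0, 0.5], 1, 1)
```

As the abstract notes, when the target function is rational in $\rho$, finitely many series terms suffice for the approximant to be exact.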

Title:An Analysis of the Convergence of Graph Laplacians
Author(s):Ting, Daniel; Huang, Ling; Jordan, Michael; 
Date issued:July 2010 (PDF)
Abstract:Existing approaches to analyzing the asymptotics of graph Laplacians typically assume a well-behaved kernel function with smoothness assumptions. We remove the smoothness assumption and generalize the analysis of graph Laplacians to cover previously unstudied graphs, including kNN graphs. We also introduce a kernel-free framework to analyze graph constructions with shrinking neighborhoods in general and apply it to analyze locally linear embedding (LLE). We also describe how, for a given limiting Laplacian operator, desirable properties such as a convergent spectrum and sparseness can be achieved by choosing the appropriate graph construction.
Report ID:792
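
A kernel-free kNN graph construction of the kind analyzed above can be sketched as follows (a minimal illustration with a 0/1 weight and an "OR" symmetrization rule, not the paper's construction verbatim):

```python
import numpy as np

def knn_laplacian(X, k):
    """Unnormalized Laplacian L = D - W of a symmetrized kNN graph.

    W[i, j] = 1 if j is among the k nearest neighbors of i or vice
    versa; D is the diagonal degree matrix. No smooth kernel is used.
    """
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)        # no self-neighbors
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d2[i])[:k]] = 1.0
    W = np.maximum(W, W.T)              # symmetrize ("OR" rule)
    return np.diag(W.sum(1)) - W

rng = np.random.default_rng(0)
L = knn_laplacian(rng.uniform(size=(50, 2)), k=5)
```

The resulting matrix is symmetric and positive semi-definite with zero row sums, the basic properties the asymptotic analysis starts from.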

Title:Spectral clustering and the high-dimensional Stochastic Block Model
Author(s):Rohe, Karl; Chatterjee, Sourav; Yu, Bin; 
Date issued:July 2010 (PDF)
Abstract:Networks or graphs can easily represent a diverse set of data sources that are characterized by interacting units or actors. Social networks, representing people who communicate with each other, are one example. Communities or clusters of highly connected actors form an essential feature in the structure of several empirical networks. Spectral clustering is a popular and computationally feasible method to discover these communities. The Stochastic Block Model (Holland et al., 1983) is a social network model with well-defined communities; each node is a member of one community. For a network generated from the Stochastic Block Model, we bound the number of nodes "misclustered" by spectral clustering. The asymptotic results in this paper are the first clustering results that allow the number of clusters in the model to grow with the number of nodes, hence the name high-dimensional. In order to study spectral clustering under the Stochastic Block Model, we first show that under the more general latent space model, the eigenvectors of the normalized graph Laplacian asymptotically converge to the eigenvectors of a "population" normalized graph Laplacian. Aside from the implication for spectral clustering, this provides insight into a graph visualization technique. Our method of studying the eigenvectors of random matrices is original.
Report ID:791
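
The pipeline the abstract describes can be sketched in a minimal two-block form. The paper's setting allows the number of blocks to grow and runs k-means on several Laplacian eigenvectors; the sign split below is a simplified stand-in:

```python
import numpy as np

def sbm(n_per_block, p_in, p_out, rng):
    """Adjacency matrix of a two-block Stochastic Block Model."""
    n = 2 * n_per_block
    z = np.repeat([0, 1], n_per_block)          # planted communities
    P = np.where(z[:, None] == z[None, :], p_in, p_out)
    A = np.triu((rng.uniform(size=(n, n)) < P).astype(float), 1)
    return A + A.T, z

def spectral_partition(A):
    """Split nodes by the sign of the eigenvector for the second-
    smallest eigenvalue of L = I - D^{-1/2} A D^{-1/2}."""
    d = np.maximum(A.sum(1), 1e-12)
    Dinv = np.diag(1.0 / np.sqrt(d))
    Lsym = np.eye(len(A)) - Dinv @ A @ Dinv
    _, vecs = np.linalg.eigh(Lsym)              # ascending eigenvalues
    return (vecs[:, 1] > 0).astype(int)

rng = np.random.default_rng(1)
A, z = sbm(30, p_in=0.8, p_out=0.05, rng=rng)
labels = spectral_partition(A)
accuracy = max(np.mean(labels == z), np.mean(labels != z))  # label swap
```

With a clear gap between within- and between-block edge probabilities, the planted communities are recovered almost perfectly.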

Title:Measuring reproducibility of high-throughput experiments
Author(s):Li, Qunhua; Brown, James B.; Huang, Haiyan; Bickel, Peter J.; 
Date issued:May 2010 (PDF)
Abstract:Reproducibility is essential to reliable scientific discovery in high-throughput experiments. In this work, we propose a unified approach to measure the reproducibility of findings identified from replicate experiments and identify putative discoveries using reproducibility. Unlike the usual scalar measures of reproducibility, our approach creates a curve, which quantitatively assesses when the findings are no longer consistent across replicates. Our curve is fitted by a copula mixture model, from which we derive a quantitative reproducibility score, which we call the "irreproducible discovery rate" (IDR), analogous to the FDR. This score can be computed at each set of paired replicate ranks and permits the principled setting of thresholds both for assessing reproducibility and combining replicates. Since our approach permits an arbitrary scale for each replicate, it provides useful descriptive measures in a wide variety of situations. We study the performance of the algorithm using simulations and give a heuristic analysis of its theoretical properties. We demonstrate the effectiveness of our method in a ChIP-seq experiment.
Report ID:790
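
The curve described in the abstract can be approximated descriptively by a rank-based correspondence profile; the sketch below shows the raw descriptive statistic only (the copula mixture fit and the IDR score itself are omitted):

```python
import numpy as np

def correspondence_curve(x, y):
    """Fraction of the top-t findings shared between two replicates,
    for t = 1..n. Inputs are significance scores for the same
    candidates in two replicate experiments (larger = stronger).
    A perfectly reproducible pair gives a curve of all ones; the
    curve dropping off reveals where consistency breaks down.
    """
    n = len(x)
    rx = np.argsort(np.argsort(-np.asarray(x)))  # rank 0 = strongest
    ry = np.argsort(np.argsort(-np.asarray(y)))
    return np.array([np.sum((rx < t) & (ry < t)) / t
                     for t in range(1, n + 1)])
```

Because only ranks are used, the measure is invariant to the arbitrary measurement scale of each replicate, as the abstract emphasizes.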

Title:The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples
Author(s):Evans, Steven N.; Matsen, Frederick A.; 
Date issued:May 2010 (PDF)
Abstract:We employ the Kantorovich-Rubinstein (KR) metric and $L^p$ generalizations to compare probability distributions on a given phylogenetic tree. Such distributions arise in the context of metagenomics, where a sample of environmental sequences may be treated as a collection of weighted points on a reference phylogenetic tree of known sequences. In contrast to many applications of Kantorovich-Rubinstein ideas, the phylogenetic KR metric can be written in a closed form and calculated in linear time. Using Monte Carlo resampling of the data, we assign a statistical significance level to the observed distance between two distributions under a null hypothesis of no clustering. We also approximate the significance level using a functional of a suitable Gaussian process; in the $L^2$ generalized case this functional is distributed as a linear combination of $\chi_1^2$ random variables weighted by the eigenvalues of an associated matrix. We conclude with an example application using our software implementation of the KR metric and its generalizations.
Report ID:789
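
The closed form and linear-time computability mentioned above come from the fact that, on a tree, the KR (earth mover's) distance is a sum over edges of edge length times the absolute difference of the two masses below that edge. A minimal sketch, where the vertex numbering and input format are assumptions for illustration:

```python
def kr_distance(parent, length, p, q):
    """KR distance between two probability distributions on a rooted
    tree, as the sum over edges of length[v] * |p-mass below v minus
    q-mass below v|. `parent[v]` is v's parent (None for the root),
    `length[v]` is the length of the edge above v, and p, q map
    vertices to masses. Assumes children have larger indices than
    their parents, so one reverse pass accumulates subtree masses.
    """
    n = len(parent)
    sub_p = [p.get(v, 0.0) for v in range(n)]
    sub_q = [q.get(v, 0.0) for v in range(n)]
    for v in range(n - 1, 0, -1):        # bottom-up accumulation
        sub_p[parent[v]] += sub_p[v]
        sub_q[parent[v]] += sub_q[v]
    return sum(length[v] * abs(sub_p[v] - sub_q[v]) for v in range(1, n))

# Root 0 with two children 1 and 2 at unit edge lengths: moving all
# mass from vertex 1 to vertex 2 costs distance 2.
d = kr_distance(parent=[None, 0, 0], length=[0.0, 1.0, 1.0],
                p={1: 1.0}, q={2: 1.0})
```

One pass over the vertices suffices, which is the linear-time claim in the abstract.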

Title:Shape-based peak identification for ChIP-Seq
Author(s):Hower, Valerie; Evans, Steven N.; Pachter, Lior; 
Date issued:May 2010 (PDF)
Abstract:We present a new algorithm for the identification of bound regions from ChIP-seq experiments. Our method for identifying statistically significant peaks from read coverage is inspired by the notion of persistence in topological data analysis and provides a non-parametric approach that is robust to noise in experiments. Specifically, our method reduces the peak calling problem to the study of tree-based statistics derived from the data. We demonstrate the accuracy of our method on existing datasets, and we show that it can discover previously missed regions and can more clearly discriminate between multiple binding events. The software T-PIC (Tree shape Peak Identification for ChIP-Seq) is available at
Report ID:788
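
The notion of persistence that inspires the method can be illustrated on a 1-D coverage profile using union-find (a reader's sketch; the paper's tree-based statistics differ in detail):

```python
def peak_persistence(cov):
    """0-dimensional persistence of the local maxima of a 1-D
    coverage function. Sweep thresholds from high to low: each peak
    is 'born' at its height and 'dies' when its component merges
    into a taller one. Returns (birth, death) pairs, most persistent
    first; high-persistence pairs are robust peaks, low-persistence
    pairs are noise.
    """
    order = sorted(range(len(cov)), key=lambda i: -cov[i])
    comp, birth, pairs = {}, {}, []

    def find(i):                      # union-find with path halving
        while comp[i] != i:
            comp[i] = comp[comp[i]]
            i = comp[i]
        return i

    for i in order:
        comp[i] = i
        birth[i] = cov[i]
        for j in (i - 1, i + 1):
            if j in comp:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # the component born lower (younger) dies here
                    young, old = sorted((ri, rj), key=lambda r: birth[r])
                    pairs.append((birth[young], cov[i]))
                    comp[young] = old
    pairs.append((cov[order[0]], min(cov)))   # global max never dies
    return sorted([bd for bd in pairs if bd[0] > bd[1]],
                  key=lambda bd: bd[1] - bd[0])

pairs = peak_persistence([0, 2, 1, 3, 0])
```

On the toy profile above, the tall peak (born at 3, dying at the baseline 0) outranks the small shoulder peak (born at 2, dying at the saddle height 1), which is exactly the robustness-to-noise behavior the abstract describes.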

Title:Coverage statistics for sequence census methods
Author(s):Evans, Steven N.; Hower, Valerie; Pachter, Lior; 
Date issued:April 2010 (PDF)
Abstract:We study the shape of coverage functions resulting from the sequencing of random genome fragments, and show that they can be described by Galton-Watson trees. This extends standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provides a null model for detecting deviations from random coverage in high-throughput sequence census based experiments such as ChIP-Seq.
Report ID:787

Title:Second order accurate distributed eigenvector computation for extremely large matrices
Author(s):El Karoui, Noureddine; d'Aspremont, Alexandre; 
Date issued:February 2010 (PDF)
Abstract:We propose a second-order accurate method to estimate the eigenvectors of extremely large matrices, thereby addressing a problem of relevance to statisticians working in the analysis of very large datasets. More specifically, we show that averaging eigenvectors of randomly subsampled matrices efficiently approximates the true eigenvectors of the original matrix under certain conditions on the incoherence of the spectral decomposition. This incoherence assumption is typically milder than those made in matrix completion and allows eigenvectors to be sparse. We discuss applications to spectral methods in dimensionality reduction and information retrieval.
Report ID:786
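
The subsample-and-average idea can be sketched as follows: sparsify the matrix by keeping entries at random (rescaled so each copy is unbiased), take each copy's top eigenvector, fix the sign ambiguity, and average. This is a simplified illustration under assumed parameters, not the paper's precise scheme or conditions:

```python
import numpy as np

def averaged_top_eigenvector(A, n_samples=20, keep=0.5, rng=None):
    """Estimate the top eigenvector of a symmetric matrix A by
    averaging the top eigenvectors of randomly sparsified copies.

    Each copy keeps every entry with probability `keep`, rescaled by
    1/keep so the copy is unbiased; eigenvectors are sign-aligned to
    the first one before averaging, then the average is normalized.
    """
    rng = rng or np.random.default_rng(0)
    n = A.shape[0]
    acc, ref = np.zeros(n), None
    for _ in range(n_samples):
        mask = rng.uniform(size=(n, n)) < keep
        mask = np.triu(mask) | np.triu(mask, 1).T     # keep symmetry
        v = np.linalg.eigh(A * mask / keep)[1][:, -1]  # top eigenvector
        if ref is None:
            ref = v
        acc += v if v @ ref >= 0 else -v     # resolve sign ambiguity
    return acc / np.linalg.norm(acc)

# Low-rank-plus-noise test matrix with an incoherent (flat) top
# eigenvector, the favorable regime for this scheme.
n = 30
u = np.ones(n) / np.sqrt(n)
rng = np.random.default_rng(2)
noise = rng.normal(size=(n, n))
A = 10 * np.outer(u, u) + 0.1 * (noise + noise.T) / 2
v_est = averaged_top_eigenvector(A)
v_true = np.linalg.eigh(A)[1][:, -1]
```

Each subsampled eigenproblem touches only a fraction of the entries, which is what makes the scheme attractive for distributed computation on very large matrices.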