Statistics Technical Reports:Browse by year

Term(s):2009
Results:18

Title:Coalescing systems of Brownian particles on the Sierpinski gasket and stable particles on the line or circle
Author(s):Evans, Steven N.; Morris, Ben; Sen, Arnab; 
Date issued:November 2009
http://nma.berkeley.edu/ark:/28722/bk000641c58 (PDF)
Abstract:A well-known result of Arratia shows that one can make rigorous the notion of starting an independent Brownian motion at every point of an arbitrary closed subset of the real line and then building a set-valued process by requiring particles to coalesce when they collide. Arratia noted that the value of this process will almost surely be a locally finite set at all positive times, and almost surely a finite set if the initial value is compact: the key to both of these facts is the observation that, because of the topology of the real line and the continuity of Brownian sample paths, at the time when two particles collide, one or the other of them must already have collided with each particle that was initially between them. We investigate whether such instantaneous coalescence still occurs for coalescing systems of particles where either the state space of the individual particles is not locally homeomorphic to an interval or the sample paths of the individual particles are discontinuous. We show that Arratia's conclusion is valid for Brownian motions on the Sierpinski gasket and for stable processes on the real line with stable index greater than one.
Keyword note:Evans__Steven_N Morris__Ben Sen__Arnab
Report ID:785

Title:On the realized risk of high-dimensional Markowitz portfolios
Author(s):El Karoui, Noureddine; 
Date issued:October 2009
http://nma.berkeley.edu/ark:/28722/bk000641c7c (PDF)
Abstract:We study the realized risk of the Markowitz portfolio computed using parameters estimated from data, and generalizations to similar questions involving the out-of-sample risk in quadratic programs with linear equality constraints. We do so under the assumption that the data are generated according to an elliptical model, which allows us to study models with heavy tails, tail dependence, and leptokurtic marginals. We place ourselves in the setting of high-dimensional inference where the number of assets in the portfolio, $p$, is large and comparable to the number of samples, $n$, used to estimate the parameters. Our approach is based on random matrix theory. We consider the impact of estimating both the mean and the covariance. Our work shows that risk is underestimated in this setting and, further, that within the class of elliptical distributions the Gaussian case yields the least risk underestimation. The problem is more pronounced for genuinely elliptical distributions, where Gaussian computations give an overoptimistic view of the situation. We also propose a robust estimator of realized risk and investigate its performance in simulations.
Keyword note:El__Karoui__Noureddine
Report ID:784

Title:The Lasso under Heteroscedasticity
Author(s):Jia, Jinzhu; Rohe, Karl; Yu, Bin; 
Date issued:November 2009
http://nma.berkeley.edu/ark:/28722/bk000641c9g (PDF)
Abstract:Lasso is a popular method for variable selection in regression. Much theoretical understanding has recently been obtained on its model selection and sparsity recovery properties under sparse, homoscedastic linear regression models. Since these standard model assumptions are often not met in practice, it is important to understand how the Lasso behaves under nonstandard model assumptions. In this paper, we study the sign consistency of the Lasso under one such model, in which the variance of the noise scales linearly with the expectation of the observation. This sparse Poisson-like model is motivated by medical imaging. In addition to studying sign consistency, we also give sufficient conditions for $\ell_\infty$ consistency. With theoretical and simulation studies, we provide conditions for when the Lasso should not be expected to be sign consistent. One interesting finding is that the true coefficient vector $\beta^*$ cannot be spread out. Precisely, for both deterministic designs and random Gaussian designs, the sufficient conditions for the Lasso to be sign consistent require $\|\beta^*\|_2 / [\beta^*_{\min}]^2$ to be not too big, where $\beta^*_{\min}$ is the smallest nonzero element of $|\beta^*|$. By special designs of the design matrix $X$, we show that $\|\beta^*\|_2 / [\beta^*_{\min}]^2 = o(n)$ is almost necessary. For Positron Emission Tomography (PET), this suggests that when there are dense areas of the positron-emitting substance, less dense areas are not well detected by the Lasso; this is of particular concern when imaging tumors, since the periphery of the tumor will produce a much weaker signal than the center, leading to a large $\|\beta^*\|_2 / [\beta^*_{\min}]^2$. We compare the sign consistency of the Lasso under the Poisson-like model to its sign consistency under the standard model, which assumes homoscedastic noise. The comparison shows that when $\beta^*$ is spread out, the Lasso performs worse for data from the Poisson-like model than for data from the standard model, confirming our theoretical findings.
Keyword note:Jia__Jinzhu Rohe__Karl Yu__Bin
Report ID:783
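To make the heteroscedastic model concrete, here is a minimal, hypothetical Python sketch (not the authors' code): data are drawn with noise variance proportional to the absolute mean, and the Lasso's recovered sign pattern is compared with that of $\beta^*$. The design, penalty level, and constants are illustrative assumptions.

```python
# Illustrative sketch of the sparse Poisson-like model: noise variance
# scales with the mean (all constants are assumptions, not the paper's).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 200, 50, 5                  # samples, dimension, sparsity
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 3.0                   # true coefficients
mu = X @ beta_star
eps = rng.standard_normal(n) * np.sqrt(np.abs(mu))   # Var(eps_i) ~ |mu_i|
y = mu + eps

fit = Lasso(alpha=0.5).fit(X, y)
print("signs recovered:", np.array_equal(np.sign(fit.coef_), np.sign(beta_star)))
```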

Title:Confidence Intervals for Negative Binomial Random Variables of High Dispersion
Author(s):Shilane, David; Evans, Steven N.; Hubbard, Alan E.; 
Date issued:September 2009
http://nma.berkeley.edu/ark:/28722/bk000641d1k (PDF)
Abstract:This paper considers the problem of constructing confidence intervals for the mean of a Negative Binomial random variable based upon sampled data. When the sample size is large, it is a common practice to rely upon a Normal distribution approximation to construct these intervals. However, we demonstrate that the sample mean of highly dispersed Negative Binomials exhibits a slow convergence in distribution to the Normal as a function of the sample size. As a result, standard techniques (such as the Normal approximation and bootstrap) will construct confidence intervals for the mean that are typically too narrow and significantly undercover in the case of high dispersion. To address this problem, we propose methods based upon Bernstein's inequality along with the Gamma and Chi Square distributions as alternatives to the standard methods when the sample size is small and the dispersion is high. A confidence interval based upon Bernstein's inequality relies upon less stringent assumptions than those required of parametric models. Moreover, we prove a limit theorem demonstrating that the sample mean of Negative Binomials converges in distribution to a Gamma random variable under suitable hypotheses, and we use this observation to construct approximate confidence intervals. Furthermore, we investigate the applicability of the Chi Square distribution as a special case of the Gamma model. We then undertake a variety of simulation experiments to compare the proposed methods to standard techniques in terms of empirical coverage and provide concrete recommendations for the settings in which particular intervals are preferred. We also apply the proposed methods to examples arising in the serial analysis of gene expression and traffic flow in a communications network to illustrate both the strengths and weaknesses of these procedures along with those of standard techniques.
Keyword note:Shilane__David Evans__Steven_N Hubbard__Alan_E
Report ID:782
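As a quick illustration of the undercoverage phenomenon, the following hedged Python sketch (parameters are mine, not the paper's) estimates the empirical coverage of the usual Normal-approximation interval for the mean of a highly dispersed Negative Binomial:

```python
# Empirical coverage of the Normal-approximation CI for the mean of a
# highly dispersed Negative Binomial (all constants illustrative).
import numpy as np

rng = np.random.default_rng(1)
r, mean = 0.1, 5.0                    # small r => high dispersion
p = r / (r + mean)                    # numpy's (n, p) parameterization
n, reps, z = 30, 2000, 1.96
hits = 0
for _ in range(reps):
    x = rng.negative_binomial(r, p, size=n)
    half = z * x.std(ddof=1) / np.sqrt(n)
    hits += abs(x.mean() - mean) <= half
print("empirical coverage:", hits / reps)   # typically well below 0.95
```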

Title:High-dimensionality effects in the Markowitz problem and other quadratic programs with linear equality constraints: risk underestimation
Author(s):El Karoui, Noureddine; 
Date issued:August 2009
http://nma.berkeley.edu/ark:/28722/bk000641d3p (PDF)
Abstract:We study the properties of solutions of quadratic programs with linear equality constraints whose parameters are estimated from data in the high-dimensional setting where $p$, the number of variables in the problem, is of the same order of magnitude as $n$, the number of observations used to estimate the parameters. The Markowitz problem in finance is a subcase of our study. Assuming normality and independence of the observations, we relate the efficient frontier computed empirically to the ``true'' efficient frontier. Our computations show that there is a separation of the errors induced by estimating the mean of the observations and estimating the covariance matrix. In particular, the price paid for estimating the covariance matrix is an underestimation of the variance by a factor roughly equal to $1-p/n$. Therefore the risk of the optimal population solution is underestimated when we estimate it by solving a similar quadratic program with estimated parameters. We also characterize the statistical behavior of linear functionals of the empirical optimal vector and show that they are biased estimators of the corresponding population quantities. We investigate the robustness of our Gaussian results by extending the study to certain elliptical models and models where our $n$ observations are correlated (in ``time''). We show a lack of robustness of the Gaussian results, but are still able to obtain results concerning first-order properties of the quantities of interest, even in the case of relatively heavy-tailed data (we require two moments). Risk underestimation is still present in the elliptical case, and more pronounced than in the Gaussian case. We discuss properties of the non-parametric and parametric bootstrap in this context and show several results, including the interesting fact that standard applications of the bootstrap generally yield inconsistent estimates of bias. Finally, we propose some strategies to correct these problems and validate them in simulations. Throughout the paper, we assume that $p$, $n$, and $n-p$ tend to infinity, with $p<n$.
Keyword note:El__Karoui__Noureddine
Report ID:781
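The $1-p/n$ underestimation factor is easy to see in a toy simulation. The following sketch (a minimal illustration in the minimum-variance subcase, under Gaussian data with identity covariance; all choices are mine) compares the empirically attained variance with the population optimum:

```python
# Toy illustration (all choices mine): in-sample variance of the
# plug-in minimum-variance portfolio versus the population optimum.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 250                       # p/n = 0.5
X = rng.standard_normal((n, p))       # true covariance = identity
S = np.cov(X, rowvar=False)           # estimated covariance
ones = np.ones(p)
w = np.linalg.solve(S, ones)
w /= ones @ w                         # empirical minimum-variance weights
in_sample = w @ S @ w                 # variance the empirical program reports
population = 1.0 / p                  # true optimal variance under identity
print(in_sample / population)         # concentrates near 1 - p/n = 0.5
```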

Title:A zero-one law for linear transformations of Levy noise
Author(s):Evans, Steven N.; 
Date issued:August 2009
http://nma.berkeley.edu/ark:/28722/bk000641d5s (PDF)
Abstract:A L\'evy noise on $\mathbb{R}^d$ assigns a random real ``mass'' $\Pi(B)$ to each Borel subset $B$ of $\mathbb{R}^d$ with finite Lebesgue measure. The distribution of $\Pi(B)$ only depends on the Lebesgue measure of $B$, and if $B_1, \ldots, B_n$ is a finite collection of pairwise disjoint sets, then the random variables $\Pi(B_1), \ldots, \Pi(B_n)$ are independent with $\Pi(B_1 \cup \cdots \cup B_n) = \Pi(B_1) + \cdots + \Pi(B_n)$ almost surely. In particular, the distribution of $\Pi \circ g$ is the same as that of $\Pi$ when $g$ is a bijective transformation of $\mathbb{R}^d$ that preserves Lebesgue measure. It follows from the Hewitt--Savage zero--one law that any event which is almost surely invariant under the mappings $\Pi \mapsto \Pi \circ g$ for every Lebesgue measure preserving bijection $g$ of $\mathbb{R}^d$ must have probability $0$ or $1$. We investigate whether certain smaller groups of Lebesgue measure preserving bijections also possess this property. We show that if $d \ge 2$, the L\'evy noise is not purely deterministic, and the group consists of linear transformations and is closed, then the invariant events all have probability $0$ or $1$ if and only if the group is not compact.
Keyword note:Evans__Steven_N
Report ID:780

Title:Ensemble Filtering for High Dimensional Non-linear State Space Models
Author(s):Lei, Jing; Bickel, Peter; 
Date issued:August 2009
http://nma.berkeley.edu/ark:/28722/bk000641d90 (PDF)
Abstract:We consider non-linear state space models in high-dimensional situations, where the two common tools for state space models both have difficulties: the Kalman filter variants are seriously biased due to non-Gaussianity, and the particle filter suffers from the ``curse of dimensionality''. Inspired by a regression perspective on the Kalman filter, we develop a novel approach that combines the Kalman filter and the particle filter, retaining the stability of the Kalman filter in large systems as well as the accuracy of particle filters in highly non-linear systems. Its theoretical properties are justified under Gaussian linear models as an extension of the Kalman filter. Its performance is tested and compared with other methods on a simulated chaotic system that is widely used in numerical weather forecasting.
Keyword note:Lei__Jing Bickel__Peter_John
Report ID:779
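For background, here is a hedged sketch of the stochastic ensemble Kalman filter analysis step that such hybrid methods build on; this is not the authors' filter, and the linear observation model and all names are assumptions:

```python
# Stochastic ensemble Kalman filter analysis step: the Kalman update
# with gain computed from the forecast ensemble's sample covariance.
# H is a linear observation operator, R the observation noise
# covariance; all names here are illustrative.
import numpy as np

def enkf_update(ensemble, y, H, R, rng):
    """ensemble: (m, d) array of m forecast members in d dimensions."""
    P = np.cov(ensemble, rowvar=False)            # sample forecast covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)  # sample Kalman gain
    # perturbed observations keep the analysis ensemble spread correct
    y_pert = y + rng.multivariate_normal(np.zeros(len(y)), R, size=len(ensemble))
    return ensemble + (y_pert - ensemble @ H.T) @ K.T
```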

Title:On information plus noise kernel random matrices
Author(s):El Karoui, Noureddine; 
Date issued:August 2009
http://nma.berkeley.edu/ark:/28722/bk000641f13 (PDF)
Abstract:Kernel random matrices have attracted a lot of interest in recent years, from both practical and theoretical standpoints. Most of the theoretical work so far has focused on the case where the data is sampled from a low-dimensional structure. Very recently, the first results concerning kernel random matrices with high-dimensional input data were obtained, in a setting where the data was sampled from a genuinely high-dimensional structure.
Keyword note:El__Karoui__Noureddine
Report ID:778

Title:Soccer/World Football
Author(s):Brillinger, David R.; 
Date issued:July 2009
http://nma.berkeley.edu/ark:/28722/bk000641d7w (PDF)
Abstract:This paper provides a review of statistical and related methods and models that have been employed in studies of soccer, also known as association football. The sections are: 1. Data collection and descriptive analyses, 2. Stochastic modelling, 3. Ranking, 4. Tournaments and scheduling, 5. Game theory, 6. Economics and management, 7. Some remaining topics, 8. Extended reference list.
Keyword note:Brillinger__David_R
Report ID:777

Title:Modelling spatial trajectories
Author(s):Brillinger, David R.; 
Date issued:June 2009
http://nma.berkeley.edu/ark:/28722/bk000641f36 (PDF)
Abstract:The study of trajectories has been basic to science for many centuries. One can mention the motion of the planets, the meanderings of animals, and the routes of ships. More recently there has been considerable modeling and statistical analysis of biological and ecological processes of moving particles. The models may be motivated formally by difference and differential equations and by potential functions. Initially, following Leibniz and Newton, such models were described by deterministic differential equations, but variability around observed paths has led to the introduction of random variables and to the development of stochastic calculi. The results obtained from fitting such models are highly useful: they may be employed for simple description, summary, comparison, simulation, prediction, model appraisal, and bootstrapping, and for the estimation of derived quantities of interest. The potential function approach, to be presented in Section 1.3.4, has the advantage that an equation of motion is set down quite directly and that explanatories, including attractors, repellors, and time-varying fields, may be included conveniently.
Keyword note:Brillinger__David_R
Report ID:776
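A schematic form of the potential-function equation of motion referred to at the end of the abstract, under the assumption of additive noise (notation mine, not copied from the paper):

```latex
% Potential-based stochastic equation of motion for a trajectory r(t);
% U is the potential, B a Brownian motion (notation illustrative).
d\mathbf{r}(t) = -\nabla U(\mathbf{r}(t), t)\,dt + \sigma\, d\mathbf{B}(t)
```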

Title:An asymptotic sampling formula for the coalescent with recombination
Author(s):Jenkins, Paul A.; Song, Yun S.; 
Date issued:June 2009
http://nma.berkeley.edu/ark:/28722/bk000641f59 (PDF)
Abstract:The Ewens sampling formula (ESF) is a one-parameter family of probability distributions with a number of intriguing combinatorial connections. This elegant closed-form formula first arose in biology as the stationary probability distribution of a sample configuration at one locus under the infinite-alleles model of mutation. Since its discovery in the early 1970s, the ESF has been used in various biological applications, and has sparked several interesting mathematical generalizations. In the population genetics community, extending the underlying random-mating model to include recombination has received much attention in the past, but no general closed-form sampling formula is currently known even for the simplest extension, that is, a model with two loci. In this paper, we show that it is possible to obtain useful closed-form results in the case where the population-scaled recombination rate $\rho$ is large but not necessarily infinite. Specifically, we consider an asymptotic expansion of the two-locus sampling formula in inverse powers of $\rho$ and obtain closed-form expressions for the first few terms in the expansion. Our asymptotic sampling formula applies to arbitrary sample sizes and configurations.
Keyword note:Jenkins__Paul_A Song__Yun_S
Report ID:775
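Schematically, the expansion described in the abstract takes the following form, where the notation is illustrative rather than the paper's; the leading term corresponds to independence of the two loci in the $\rho \to \infty$ limit:

```latex
% Asymptotic expansion of the two-locus sampling probability q in
% inverse powers of the recombination rate rho (schematic notation).
q(\mathbf{n}; \rho) = q_0(\mathbf{n}) + \frac{q_1(\mathbf{n})}{\rho}
    + \frac{q_2(\mathbf{n})}{\rho^{2}} + O\!\left(\rho^{-3}\right)
```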

Title:Simultaneous support recovery in high dimensions: Benefits and perils of block $\ell_1/\ell_\infty$-regularization
Author(s):Negahban, S.; Wainwright, M. J.; 
Date issued:May 2009
http://nma.berkeley.edu/ark:/28722/bk0005d0b17 (PDF)
Abstract:Given a collection of $r \geq 2$ linear regression problems in $p$ dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of $\ell_{1}/\ell_{\infty}$-regularized regression for joint estimation of the $p \times r$ matrix of regression coefficients. We analyze the high-dimensional scaling of $\ell_1/\ell_\infty$-regularized quadratic programming, considering both consistency rates in $\ell_\infty$-norm, and also how the minimal sample size $n$ required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the $\ell_\infty$-error as well as sufficient conditions for exact variable selection, both for fixed design matrices and for designs drawn randomly from general Gaussian matrices. Our second set of results applies to $r = 2$ linear regression problems with standard Gaussian designs whose supports overlap in a fraction $\alpha \in [0,1]$ of their entries: for this problem class, we prove that the $\ell_{1}/\ell_{\infty}$-regularized method undergoes a phase transition---that is, a sharp change from failure to success---characterized by the rescaled sample size $\theta_{1,\infty}(n, p, s, \alpha) = n/\{(4 - 3 \alpha) s \log(p-(2-\alpha) \, s)\}$. More precisely, given sequences of problems specified by $(n, p, s, \alpha)$, for any $\delta > 0$, the probability of successfully recovering both supports converges to $1$ if $\theta_{1, \infty}(n, p, s, \alpha) > 1+\delta$, and converges to $0$ for problem sequences for which $\theta_{1,\infty}(n,p,s, \alpha) < 1 - \delta$. An implication of this threshold is that use of $\ell_1 / \ell_{\infty}$-regularization yields improved statistical efficiency if the overlap parameter is large enough ($\alpha > 2/3$), but has \emph{worse} statistical efficiency than a naive Lasso-based approach for moderate to small overlap ($\alpha < 2/3$). Empirical simulations illustrate the close agreement between these theoretical predictions and the actual behavior in practice. These results indicate that some caution needs to be exercised in the application of $\ell_1/\ell_\infty$ block regularization: if the data does not match its structure closely enough, it can impair statistical performance relative to computationally less expensive schemes.
Keyword note:Negahban__Sahand Wainwright__Martin
Report ID:774
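The rescaled sample size is stated explicitly in the abstract, so it can be transcribed directly; the example values below are mine:

```python
# Rescaled sample size theta_{1,inf}(n, p, s, alpha) from the abstract;
# success is predicted asymptotically when it exceeds 1.
import math

def theta(n, p, s, alpha):
    return n / ((4 - 3 * alpha) * s * math.log(p - (2 - alpha) * s))

print(theta(n=800, p=1000, s=20, alpha=1.0))   # identical supports
print(theta(n=800, p=1000, s=20, alpha=0.0))   # disjoint supports
```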

Title:Bayesian Functional ANOVA Modeling Using Gaussian Process Prior Distributions
Author(s):Kaufman, Cari G.; Sain, Stephan R.; 
Date issued:April 2009
http://nma.berkeley.edu/ark:/28722/bk0005d0b5f (PDF)
Abstract:Functional analysis of variance (ANOVA) models partition a functional response according to the main effects and interactions of various factors. This article develops a general framework for functional ANOVA modeling from a Bayesian viewpoint, assigning Gaussian process prior distributions to each batch of functional effects. We discuss the choices to be made in specifying such a model, advocating the treatment of levels within a given factor as dependent but exchangeable quantities, and we suggest weakly informative prior distributions for higher level parameters that may be appropriate in many situations. We discuss computationally efficient strategies for posterior sampling using Markov Chain Monte Carlo algorithms, and we emphasize useful graphical summaries based on the posterior distribution of model-based analogues of traditional ANOVA decompositions of variance. We illustrate this process of model specification, posterior sampling, and graphical posterior summaries in two examples. The first considers the effect of geographic region on the temperature profiles at weather stations in Canada. The second example examines sources of variability in the output of regional climate models from a designed experiment.
Keyword note:Kaufman__Cari Sain__Stephan_R
Report ID:773
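Schematically, a two-factor version of the model class described here can be written as follows (notation mine, not the article's):

```latex
% Two-factor functional ANOVA with Gaussian process priors on each
% batch of functional effects (schematic notation).
y_{ij}(t) = \mu(t) + \alpha_i(t) + \beta_j(t) + (\alpha\beta)_{ij}(t) + \epsilon_{ij}(t),
\qquad \mu(\cdot),\ \alpha_i(\cdot),\ \beta_j(\cdot),\ (\alpha\beta)_{ij}(\cdot) \sim \mathrm{GP}
```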

Title:Fast Approximate Spectral Clustering
Author(s):Yan, Donghui; Huang, Ling; Jordan, Michael I.; 
Date issued:March 2009
http://nma.berkeley.edu/ark:/28722/bk0005d0c95 (PDF)
Abstract:Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of O(n^3), with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nystrom method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.
Keyword note:Yan__Donghui Huang__Ling Jordan__Michael_I
Report ID:772
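A minimal sketch of the KASP recipe as described in the abstract (library choices and parameters are illustrative assumptions, not the authors' implementation):

```python
# KASP sketch: local k-means as the distortion-minimizing preprocessing
# step, spectral clustering on the centroids, labels propagated back.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def kasp(X, n_clusters, n_centroids):
    km = KMeans(n_clusters=n_centroids, n_init=10).fit(X)
    centroid_labels = SpectralClustering(n_clusters=n_clusters).fit_predict(km.cluster_centers_)
    return centroid_labels[km.labels_]   # each point inherits its centroid's label
```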

Title:Spectra of large random trees
Author(s):Bhamidi, Shankar; Evans, Steven N.; Sen, Arnab; 
Date issued:March 2009
http://nma.berkeley.edu/ark:/28722/bk0005d0d84 (PDF)
Abstract:We analyze the eigenvalues of the adjacency matrices of a wide variety of random trees. Using general, broadly applicable arguments based on the interlacing inequalities for the eigenvalues of a principal submatrix of a Hermitian matrix and a suitable notion of local weak convergence for an ensemble of random trees that we call probability fringe convergence, we show that the empirical spectral distributions for each of a number of random tree models converge to a deterministic (model-dependent) limit as the number of vertices goes to infinity. Moreover, the masses assigned by the empirical spectral distributions to individual points also converge in distribution to constants. We conclude for ensembles such as the linear preferential attachment models, random recursive trees, and the uniform random trees that the limiting spectral distribution has a set of atoms that is dense in the real line. We obtain precise asymptotics on the mass assigned to zero by the empirical spectral measures via the connection between the number of zero eigenvalues of the adjacency matrix of a tree and the cardinality of a maximal matching on the tree. In particular, we employ a simplified version of an algorithm due to Karp and Sipser to construct maximal matchings and understand their properties. Moreover, we show that the total weight of a weighted matching is asymptotically equivalent to a constant multiple of the number of vertices when the edge weights are independent, identically distributed, non-negative random variables with finite expected value. We greatly extend a celebrated result obtained by Schwenk for the uniform random trees by showing that if any ensemble converges in the probability fringe sense and a very mild further condition holds, then, with probability converging to one, the spectrum of a realization is shared by at least one other (non-isomorphic) tree. For the linear preferential attachment model with parameter $a > -1$, we show that for any fixed $k$ the $k$ largest eigenvalues jointly converge in distribution to a non-trivial limit when rescaled by $n^{1/2\gamma_a}$, where $\gamma_a = a+2$ is the Malthusian rate of growth parameter for an associated continuous time branching process.
Keyword note:Bhamidi__Shankar Evans__Steven_N Sen__Arnab
Report ID:771
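The atom at zero and the rest of the empirical spectrum are easy to examine numerically; here is a small illustrative sketch for the uniform random recursive tree (all parameters are mine):

```python
# Empirical spectral distribution of a uniform random recursive tree's
# adjacency matrix, and its mass at zero (illustrative parameters).
import numpy as np

rng = np.random.default_rng(3)
n = 500
A = np.zeros((n, n))
for v in range(1, n):
    u = rng.integers(0, v)            # attach v uniformly to an earlier vertex
    A[u, v] = A[v, u] = 1.0
eig = np.linalg.eigvalsh(A)
print("mass at zero:", float(np.mean(np.abs(eig) < 1e-8)))
```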

Title:Hierarchical Bayesian nonparametric models with applications
Author(s):Teh, Yee Whye; Jordan, Michael I.; 
Date issued:January 2009
http://nma.berkeley.edu/ark:/28722/bk0005d0f73 (PDF)
Abstract:Hierarchical modeling is a fundamental concept in Bayesian statistics. The basic idea is that parameters are endowed with distributions which may themselves introduce new parameters, and this construction recurses. A common motif in hierarchical modeling is that of the conditionally independent hierarchy, in which a set of parameters are coupled by making their distributions depend on a shared underlying parameter. These distributions are often taken to be identical, based on an assertion of exchangeability and an appeal to de Finetti's theorem. In this review we discuss a thoroughgoing exploitation of hierarchical modeling ideas in Bayesian nonparametric statistics. The basic idea is that rather than treating distributional parameters parametrically, we treat them nonparametrically. In particular, the base measure $G_0$ in the Dirichlet process can itself be viewed as a random draw from some distribution on measures---specifically it can be viewed as a draw from the Dirichlet process. This yields a natural recursion that we refer to as a hierarchical Dirichlet process. Our focus in this chapter is on nonparametric hierarchies of this kind, where the tools of Bayesian nonparametric modeling are used recursively.
Keyword note:Teh__Yee_Whye Jordan__Michael_I
Report ID:770
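A minimal truncated stick-breaking sketch of the recursion described above, with hyperparameters and the truncation level chosen purely for illustration:

```python
# Truncated stick-breaking construction of a hierarchical Dirichlet
# process: G_0 ~ DP(gamma, H), then group-level G_j ~ DP(alpha0, G_0).
import numpy as np

rng = np.random.default_rng(4)
K, gamma, alpha0 = 50, 1.0, 1.0       # truncation level, concentrations

v = rng.beta(1.0, gamma, size=K)      # stick-breaking proportions
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
atoms = rng.standard_normal(K)        # atoms drawn from base measure H = N(0,1)

pi_j = rng.dirichlet(alpha0 * beta)   # group-level weights share G_0's atoms
print(atoms[rng.choice(K, size=5, p=pi_j)])   # draws from group j
```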

Title:Hyperdeterminantal point processes
Author(s):Evans, Steven N.; Gottlieb, Alex; 
Date issued:February 2009
http://nma.berkeley.edu/ark:/28722/bk0005d0839 (PDF)
Abstract:As well as arising naturally in the study of non-intersecting random paths, random spanning trees, and eigenvalues of random matrices, determinantal point processes (sometimes also called fermionic point processes) are relatively easy to simulate and provide a quite broad class of models that exhibit repulsion between points. The fundamental ingredient used to construct a determinantal point process is a kernel giving the pairwise interactions between points: the joint distribution of any number of points then has a simple expression in terms of determinants of certain matrices defined from this kernel. In this paper we initiate the study of an analogous class of point processes that are defined in terms of a kernel giving the interaction between $2M$ points for some integer $M$. The role of matrices is now played by $2M$-dimensional ``hypercubic'' arrays, and the determinant is replaced by a suitable generalization of it to such arrays: Cayley's first hyperdeterminant. We show that some of the desirable features of determinantal point processes continue to be exhibited by this generalization.
Keyword note:Evans__Steven_N Gottlieb__Alex
Report ID:749
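One common formulation of Cayley's first hyperdeterminant, as a signed sum over tuples of permutations, can be sketched as follows; this brute-force version has factorial cost and is only meant to pin down the definition (the normalization convention here is an assumption):

```python
# Cayley's first hyperdeterminant of a hypercubic array A with an even
# number k of index slots, in one common normalization: a signed sum
# over k-tuples of permutations.  Brute force, illustration only.
import itertools, math

def perm_sign(s):
    sign, seen = 1, [False] * len(s)
    for i in range(len(s)):
        if not seen[i]:
            j, cycle = i, 0
            while not seen[j]:
                seen[j], j, cycle = True, s[j], cycle + 1
            if cycle % 2 == 0:
                sign = -sign
    return sign

def first_hyperdeterminant(A):
    n, k = A.shape[0], A.ndim
    perms = list(itertools.permutations(range(n)))
    total = 0.0
    for tup in itertools.product(perms, repeat=k):
        sign = math.prod(perm_sign(s) for s in tup)
        total += sign * math.prod(A[idx] for idx in zip(*tup))
    return total / math.factorial(n)

# Sanity check: for k = 2 this reduces to the ordinary determinant,
# e.g. first_hyperdeterminant(np.eye(3)) == 1.0.
```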

Title:Mutation-selection balance with recombination: convergence to equilibrium for polynomial selection costs
Author(s):Clayton, Aubrey; Evans, Steven N.; 
Date issued:February 2009
http://nma.berkeley.edu/ark:/28722/bk0005d0g2v (PDF)
Abstract:We study a continuous-time dynamical system that models the evolving distribution of genotypes in an infinite population where genomes may have infinitely many or even a continuum of loci, mutations accumulate along lineages without back-mutation, added mutations reduce fitness, and recombination occurs on a faster time scale than mutation and selection. Some features of the model, such as existence and uniqueness of solutions and convergence to the dynamical system of an approximating sequence of discrete time models, were presented in earlier work by Evans, Steinsaltz, and Wachter for quite general selective costs. Here we study a special case where the selective cost of a genotype with a given accumulation of ancestral mutations from a wild type ancestor is a sum of costs attributable to each individual mutation plus successive interaction contributions from each $k$-tuple of mutations for $k$ up to some finite "degree". Using ideas from complex chemical reaction networks and a novel Lyapunov function, we establish that the phenomenon of mutation-selection balance occurs for such selection costs under mild conditions. That is, we show that the dynamical system has a unique equilibrium and that it converges to this equilibrium from all initial conditions.
Keyword note:Clayton__Aubrey Evans__Steven_N
Report ID:738