Statistics Technical Reports

Title:Learning a potential function from a trajectory
Author(s):Brillinger, David R.; 
Date issued:December 2006 (PDF)
Abstract:This letter concerns the use of stochastic gradient systems in the modeling of the paths of moving particles and the consequent estimation of a potential function. The work proceeds by setting down a model for the potential function which leads to a stochastic differential equation. The method is simple, direct and flexible, being based on a linear model and least squares. The estimated potential function may be used for: simple description, summary, comparison, seeking patterns, simulation, prediction, and model appraisal. Explanatories, attractors and repellors, may be included in the potential function directly. The large sample distribution of the estimated potential function is provided. There is an example analyzing the path of an elk. There are direct extensions to: updating, sliding window, adaptive, robust and real time variants. Index Terms: Mobility model, monitoring, potential function, stochastic differential equation, stochastic gradient system, surveillance, tracking, waypoint data.
Keyword note:Brillinger__David_R
Report ID:723

Title:AdaBoost is Consistent
Author(s):Bartlett, Peter L.; Traskin, Mikhail; 
Date issued:December 2006 (PDF)
Abstract:The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after $n^{1-\varepsilon}$ iterations---for sample size $n$ and $\varepsilon \in (0,1)$---the sequence of risks of the classifiers it produces approaches the Bayes risk.
Keyword note:Bartlett__Peter Traskin__Mikhail
Report ID:722
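
The stopping rule quoted in the abstract of Report 722 can be turned into a round count with a minimal sketch like the one below. It only computes the number of iterations; it does not implement AdaBoost itself. The function name `adaboost_stopping_iterations` is hypothetical, and rounding down to a whole number of rounds is an assumption, since the abstract states only that boosting is stopped after $n^{1-\varepsilon}$ iterations.

```python
import math


def adaboost_stopping_iterations(n, epsilon):
    """Whole number of boosting rounds, taken as floor(n ** (1 - epsilon)).

    n is the sample size and epsilon lies strictly between 0 and 1.
    Rounding down is an assumption made for this sketch.
    """
    if n < 1:
        raise ValueError("sample size n must be at least 1")
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    return math.floor(n ** (1.0 - epsilon))


# Illustrative values only: 10,000 samples with epsilon = 0.25.
rounds = adaboost_stopping_iterations(n=10_000, epsilon=0.25)
```

Because the exponent is strictly below one, the number of rounds grows more slowly than the sample size, which is the regime the abstract refers to.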

Title:Probability and real trees
Author(s):Evans, Steven N.; 
Date issued:December 2006 (PDF)
Abstract:These are the notes for the lectures I gave at the Saint-Flour probability summer school in 2005.
Keyword note:Evans__Steven_N
Report ID:721

Title:Lasso-type recovery of sparse representations for high-dimensional data
Author(s):Meinshausen, Nicolai; Yu, Bin; 
Date issued:December 2006 (PDF)
Abstract:The Lasso (Tibshirani, 1996) is an attractive technique for regularization and variable selection for high-dimensional data, where the number of predictor variables p is potentially much larger than the number of samples n. However, it was recently discovered (Zhao and Yu, 2006; Zou, 2005; Meinshausen and Buhlmann, 2006) that the sparsity pattern of the Lasso estimator can only be asymptotically identical to the true sparsity pattern if the design matrix satisfies the so-called irrepresentable condition. The latter condition can easily be violated in applications due to the presence of highly correlated variables. Here we examine the behavior of the Lasso estimators if the irrepresentable condition is relaxed. Even though the Lasso cannot recover the correct sparsity pattern, we show that the estimator is still consistent in the l_2-norm sense for fixed designs under conditions on (a) the number s(n) of non-zero components of the vector beta(n) and (b) the minimal singular values of the design matrices that are induced by selecting of order s(n) variables. The results are extended to vectors beta in weak l_q-balls with 0<q<1. Our results imply that, with high probability, all important variables are selected. The set of selected variables is a useful (meaningful) reduction on the original set of variables (p(n) >n). Finally, our results are illustrated with the detection of closely adjacent frequencies, a problem encountered in astrophysics.
Keyword note:Meinshausen__Nicolai Yu__Bin
Report ID:720

Title:Hierarchical Beta Processes and the Indian Buffet Process
Author(s):Thibaux, Romain; Jordan, Michael I.; 
Date issued:November 2006 (PDF)
Abstract:We show that the beta process is the de Finetti mixing distribution underlying the Indian buffet process of Griffiths and Ghahramani (2005). This result shows that the beta process plays the role for the Indian buffet process that the Dirichlet process plays for the Chinese restaurant process, a parallel that guides us in deriving analogs for the beta process of the many known extensions of the Dirichlet process. In particular we define Bayesian hierarchies of beta processes and use the connection to the beta process to develop posterior inference algorithms for the Indian buffet process. We also present an application to document classification, exploring a relationship between the hierarchical beta process and smoothed naive Bayes models.
Keyword note:Thibaux__Romain Jordan__Michael_I
Report ID:719

Title:Probabilistic Analysis of Linear Programming Decoding
Author(s):Daskalakis, Constantinos; Dimakis, Alexandros D. G.; Karp, Richard M.; Wainwright, Martin J.; 
Date issued:October 2006 (PDF)
Abstract:We initiate the probabilistic analysis of linear programming (LP) decoding of low-density parity-check (LDPC) codes. Specifically, we show that for a random LDPC code ensemble, the linear programming decoder of Feldman et al. succeeds in correcting a constant fraction of errors with high probability. The fraction of correctable errors guaranteed by our analysis surpasses all prior non-asymptotic results for LDPC codes, and in particular exceeds the best previous finite-length result on LP decoding by a factor greater than ten. This improvement stems in part from our analysis of probabilistic bit-flipping channels, as opposed to adversarial channels. At the core of our analysis is a novel combinatorial characterization of LP decoding success, based on the notion of a generalized matching. An interesting by-product of our analysis is to establish the existence of "almost expansion" in random bipartite graphs, in which one requires only that almost every (as opposed to every) set of a certain size expands, with expansion coefficients much larger than the classical case.
Keyword note:Daskalakis__Constantinos Dimakis__Alexandros Karp__Richard_M Wainwright__Martin
Report ID:718

Title:A mutation-selection model for general genotypes with recombination
Author(s):Evans, Steven N.; Steinsaltz, David; Wachter, Kenneth W.; 
Date issued:September 2006 (PDF)
Abstract:A probability model is presented for the dynamics of mutation-selection balance in an infinite-population infinite-sites setting sufficiently general to cover mutation-driven changes in full age-specific demographic schedules. An earlier work by the same authors presented a haploid model -- without genetic recombination -- of similar scope. This work complements that model, adding genetic recombination, based on a well-known general discrete-population genetic model of N. Barton and M. Turelli. The model with recombination is a flow on Poisson intensities, substantially different from the haploid model. It is shown that the new model arises from the haploid model when recombination is added, in the limit as generations per unit time go to infinity, and selection strength and mutation per generation go to 0.
Keyword note:Evans__Steven_N Steinsaltz__David Wachter__Kenneth
Report ID:717

Title:Regularized estimation of large covariance matrices
Author(s):Bickel, Peter J.; Levina, Elizaveta; 
Date issued:September 2006 (PDF)
Abstract:This paper considers estimating a covariance matrix of p variables from n observations by either banding the sample covariance matrix or estimating a banded version of the inverse of the covariance. We show that these estimates are consistent in the operator norm as long as (log p)^2/n converges to 0, and obtain explicit rates. The results are uniform over some fairly natural well-conditioned families of covariance matrices. We also introduce an analogue of the Gaussian white noise model and show that if the population covariance is embeddable in that model and well-conditioned then the banded approximations produce consistent estimates of eigenvalues and associated eigenvectors of the covariance matrix. The results can be extended to smooth versions of banding and to non-Gaussian distributions with sufficiently short tails. A resampling approach is proposed for choosing the banding parameter in practice. This approach is illustrated numerically on both simulated and real data.
Keyword note:Bickel__Peter_John Levina__Elizaveta
Report ID:716
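
The banding operation named in the abstract of Report 716 can be illustrated with a minimal sketch. It assumes the usual definition of banding, in which entries of the sample covariance matrix lying more than k positions from the diagonal are set to zero, and it assumes a divisor of n in the sample covariance. The function names `sample_covariance` and `band`, the toy data, and the hand-picked value of k are hypothetical and not taken from the report; the resampling procedure for choosing k is not shown.

```python
def sample_covariance(rows):
    """Sample covariance (divisor n, an assumption) of n observations of p variables."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n
         for j in range(p)]
        for i in range(p)
    ]


def band(matrix, k):
    """Keep entries with |i - j| <= k and set all other entries to zero."""
    p = len(matrix)
    return [
        [matrix[i][j] if abs(i - j) <= k else 0.0 for j in range(p)]
        for i in range(p)
    ]


# Toy data: 4 observations of 3 variables (illustrative only).
data = [
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 2.5],
    [4.0, 3.0, 3.5],
]
cov = sample_covariance(data)
banded = band(cov, k=1)  # k chosen by hand for this sketch
```

With k = 1 only the main diagonal and the first off-diagonals survive, so the covariance between the first and third variables is dropped while all variances are kept.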

Title:Kernel Dimension Reduction in Regression
Author(s):Fukumizu, Kenji; Bach, Francis R.; Jordan, Michael I.; 
Date issued:September 2006 (PDF)
Abstract:We present a new methodology for sufficient dimension reduction (SDR). Our methodology derives directly from a formulation of SDR in terms of the conditional independence of the covariate $X$ from the response $Y$, given the projection of $X$ on the central subspace (Li, 1991; Cook, 1998). We show that this conditional independence assertion can be characterized in terms of conditional covariance operators on reproducing kernel Hilbert spaces and we show how this characterization leads to an M-estimator for the central subspace. The resulting estimator is shown to be consistent under weak conditions; in particular, we do not have to impose linearity or ellipticity conditions of the kinds that are generally invoked for SDR methods. We also present empirical results showing that the new methodology is competitive in practice.
Keyword note:Fukumizu__Kenji Bach__Francis_R Jordan__Michael_I
Report ID:715

Title:On Detecting Periodicity in Astronomical Point Processes
Author(s):Bickel, Peter; Kleijn, Bas; Rice, John; 
Date issued:August 2006 (PDF)
Abstract:We consider the problem of detecting periodicity in the rate function of a point process or a marked point process, motivated by the problem of detecting $\gamma$-ray pulsars. The detection problem poses both theoretical and computational challenges. On the theoretical side, there are no compelling optimality results that dictate the choice of a detection algorithm and the properties of detection procedures can be quite difficult to analyze. On the computational side, searching over a range of frequency and frequency drift can be a daunting task, even for a record consisting of only a thousand or so events. We discuss a class of detection procedures, weighted quadratic test statistics arising from likelihood expressions, whose properties we can understand and which do not impose excessive computational burdens. We show how knowledge of the point spread function associated with photon arrivals can be incorporated to improve power. We show that if a search over frequencies is conducted by discretizing a frequency band, the discretization must be very fine and we discuss the use of integration over frequency bands as an alternative. We also discuss the use of extreme value theory in conjunction with simulation in assessing statistical significance for such a search.
Keyword note:Bickel__Peter_John Kleijn__Bas Rice__John_Andrew
Report ID:714

Title:Damage segregation at fissioning may increase growth rates: A superprocess model
Author(s):Evans, Steven N.; Steinsaltz, David; 
Date issued:August 2006 (PDF)
Abstract:A fissioning organism may purge unrepairable damage by bequeathing it preferentially to one of its daughters. We propose a superprocess model, and show that when damage accumulates deterministically, optimal growth is achieved by unequal division of damage between the daughters.
Keyword note:Evans__Steven_N Steinsaltz__David
Report ID:713

Title:Measuring Similarity between Gene Expression Profiles with the Consideration of Both Shape and Magnitude
Author(s):Kim, Kyungpil; Jiang, Keni; Zhang, Shibo; Cai, Li; Lee, In-Beum; Feldman, Lewis; Huang, Haiyan; 
Date issued:June 2006 (PDF)
Abstract:Clustering methods have been widely applied to gene expression data in order to group genes sharing common or similar expression profiles into discrete functional groups. In such analyses, designing an appropriate (dis)similarity measure is critical. Motivated by the Poisson based similarity measure PoissonC designed for SAGE data (Cai et al., 2004), we explore more generally applicable similarity measures in clustering analysis that consider both shape and magnitude of the gene expression profile. Our idea is to model the shape and magnitude information separately and use the estimated shape and magnitude parameters to define a similarity measure in a new data space, wherein each dimension represents different aspects of an expression profile shape. We expect that our new measure would be more effective to detect shape changes compared to PoissonC and have necessary sensitivity to magnitude. The application results of our new measure to different types of expression data demonstrate the effectiveness of our method.
Keyword note:Kim__Ki_Mok Jiang__Keni Zhang__Shibo Cai__Li Lee__In-Beum Feldman__Lewis Huang__Haiyan
Report ID:712

Title:A statistical framework to infer functional gene associations from multiple biologically interrelated microarray experiments
Author(s):Teng, Siew-Leng Melinda; Zhou, Jasmine; Huang, Haiyan; 
Date issued:June 2006 (PDF)
Abstract:Inferring functional gene relationships is a major step in understanding biological networks. Microarray data from an increasing number of biologically interrelated experiments now allow for more complete portrayals of functional gene relationships involved in biological processes. In current studies of gene relationships, however, the existence of dependencies between gene expressions from the biologically interrelated experiments has been widely ignored. When not accounted for, these experimental dependencies can result in inaccurate inferences of functional gene relationships, and hence incorrect biological conclusions. This article proposes a statistical framework and a novel gene co-expression measure, named Knorm correlation, to address this problem. The most important aspect of the proposed model is its ability to decompose the interesting biological variations in gene expressions into two mutually independent components each arising from the genes and the experiments, in addition to variations due to random noise. As a result, the Knorm correlation can critically de-correlate the experimental dependencies before estimating the gene relationships, thus leading to improved accuracies in inferring functional gene relationships. Knorm correlation simplifies to the Pearson coefficient when experiments are uncorrelated. Using simulation studies, a yeast microarray and a human microarray dataset, we demonstrate the success of the Knorm correlation as a more accurate and reliable measure, and the adverse impact of experimental dependencies on the Pearson coefficient, in inferring functional gene relationships from interrelated and interdependent experiments.
Keyword note:Teng__Siew-Leng_Melinda Zhou__Jasmine Huang__Haiyan
Report ID:711

Title:Expectation, Conditional Expectation and Martingales in Local Fields
Author(s):Evans, Steven N.; Lidman, Tye; 
Date issued:June 2006 (PDF)
Abstract:We investigate a possible definition of expectation and conditional expectation for random variables with values in a local field such as the $p$-adic numbers. We define the expectation by analogy with the observation that for real-valued random variables in $L^2$ the expected value is the orthogonal projection onto the constants. Previous work has shown that the local field version of $L^\infty$ is the appropriate counterpart of $L^2$, and so the expected value of a local field-valued random variable is defined to be its "projection" in $L^\infty$ onto the constants. Unlike the real case, the resulting projection is not typically a single constant, but rather a ball in the metric on the local field. However, many properties of this expectation operation and the corresponding conditional expectation mirror those familiar from the real-valued case; for example, conditional expectation is, in a suitable sense, a contraction on $L^\infty$ and the tower property holds. We also define the corresponding notion of martingale, show that several standard examples of martingales (for example, sums or products of suitable independent random variables or "harmonic" functions composed with Markov chains) have local field analogues, and obtain versions of the optional sampling and martingale convergence theorems.
Keyword note:Evans__Steven_N Lidman__Tye
Report ID:710

Title:Sharp thresholds for high-dimensional and noisy recovery
Author(s):Wainwright, Martin J.; 
Date issued:June 2006 (PDF)
Abstract:The problem of consistently estimating the sparsity pattern of a vector $\beta^* \in \mathbb{R}^p$ based on observations contaminated by noise arises in various contexts, including subset selection in regression, structure estimation in graphical models, sparse approximation, and signal denoising. We analyze the behavior of $\ell_1$-constrained quadratic programming (QP), also referred to as the Lasso, for recovering the sparsity pattern. Our main result is to establish a sharp relation between the problem dimension $p$, the number $s$ of non-zero elements in $\beta^*$, and the number of observations $n$ that are required for reliable recovery. For a broad class of Gaussian ensembles satisfying mutual incoherence conditions, we establish existence and compute explicit values of thresholds $\theta_\ell$ and $\theta_u$ with the following properties: for any $\delta > 0$, if $n > 2 \, s (\theta_u + \delta) \log (p - s) + s + 1$, then the Lasso succeeds in recovering the sparsity pattern with probability converging to one for large problems, whereas if $n < 2 \, s (\theta_\ell - \delta) \log (p - s) + s + 1$, then the probability of successful recovery converges to zero. For the special case of the uniform Gaussian ensemble, we show that $\theta_\ell = \theta_u = 1$, so that the threshold is sharp and exactly determined.
Keyword note:Wainwright__Martin
Report ID:709
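
The sample-size thresholds stated in the abstract of Report 709 can be evaluated numerically with a minimal sketch like the following. The function name `lasso_sample_size_threshold` and its argument names (`p` for the problem dimension, `s` for the number of non-zero elements, `theta` for the threshold constant, `delta` for the slack) are hypothetical, the example values are illustrative, and the logarithm is assumed to be the natural logarithm.

```python
import math


def lasso_sample_size_threshold(p, s, theta, delta=0.0):
    """Return 2 * s * (theta + delta) * log(p - s) + s + 1.

    Pass a positive delta for the upper (success) bound and a
    negative delta for the lower (failure) bound.
    The natural logarithm is an assumption of this sketch.
    """
    if not 0 < s < p:
        raise ValueError("need 0 < s < p")
    return 2.0 * s * (theta + delta) * math.log(p - s) + s + 1


# Uniform Gaussian ensemble: the abstract gives both thresholds equal to 1.
p, s = 1000, 10
n_critical = lasso_sample_size_threshold(p, s, theta=1.0)
n_success = lasso_sample_size_threshold(p, s, theta=1.0, delta=0.1)
n_failure = lasso_sample_size_threshold(p, s, theta=1.0, delta=-0.1)
```

Under these assumptions, observation counts above `n_success` fall in the regime where the abstract reports recovery with probability converging to one, and counts below `n_failure` fall in the regime where it converges to zero.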

Title:On optimal quantization rules for some sequential decision problems
Author(s):Nguyen, Xuanlong; Wainwright, Martin J.; Jordan, Michael I.; 
Date issued:June 2006 (PDF)
Abstract:We consider the problem of sequential decentralized detection, a problem that entails several interdependent choices: the choice of a stopping rule (specifying the sample size), a global decision function (a choice between two competing hypotheses), and a set of quantization rules (the local decisions on the basis of which the global decision is made). In this paper we resolve an open problem concerning whether optimal local decision functions for the Bayesian formulation of sequential decentralized detection can be found within the class of stationary rules. We develop an asymptotic approximation to the optimal cost of stationary quantization rules and show how this approximation yields a negative answer to the stationarity question. We also consider the class of blockwise stationary quantizers and show that asymptotically optimal quantizers are likelihood-based threshold rules.
Keyword note:Nguyen__XuanLong Wainwright__Martin Jordan__Michael_I
Report ID:708

Title:Representation of Radon Shape Diffusions via Hyperspherical Brownian Motion
Author(s):Panaretos, Victor M.; 
Date issued:April 2006 (PDF)
Abstract:A framework is introduced for the study of general Radon shape diffusions, that is, shape diffusions induced by projections of randomly rotating shapes [Panaretos, 2006]. This is done via a convenient representation of unoriented Radon shape diffusions in (unoriented) D.G. Kendall shape space $\widetilde{\Sigma}_n^k$ through a Brownian motion on the hypersphere. This representation leads to a coordinate system for the generalized version of Radon diffusions since it is shown that shape can be essentially identified with unoriented shape in the projected case. A bijective correspondence between Brownian motion on real projective space and Radon shape diffusions is established. Furthermore, equations are derived for the general (unoriented) Radon diffusion of shape-and-size, and stationary measures are discussed. References: Panaretos, V.M. (June 2006). The diffusion of Radon shape. Adv. App. Prob. 38 (2), forthcoming.
Keyword note:Panaretos__Victor
Report ID:707

Title:Embracing Statistical Challenges in the Information Technology Age
Author(s):Yu, Bin; 
Date issued:March 2006 (PDF)
Abstract:Information Technology is creating an exciting time for statistics. In this article, we review the diverse sources of IT data in three clusters: IT core, IT systems, and IT fringe. The new data forms, huge data volumes, and high data speeds of IT are contrasted against the constraints on storage, transmission and computation to point to the challenges and opportunities. In particular, we describe the impacts of IT on a typical statistical investigation of data collection, data visualization, and model fitting, with an emphasis on computation and feature selection. Moreover, two research projects on network tomography and arctic cloud detection are used throughout the paper to bring the discussions to a concrete level.
Keyword note:Yu__Bin
Report ID:706

Title:Non-equilibrium theory of the allele frequency spectrum
Author(s):Evans, Steven N.; Shvets, Yelena; Slatkin, Montgomery; 
Date issued:April 2006 (PDF) (PostScript)
Abstract:A forward diffusion equation describing the evolution of the allele frequency spectrum is presented. The influx of mutations is accounted for by imposing a suitable boundary condition. For a Wright-Fisher diffusion with or without selection and varying population size, the boundary condition is $\lim_{x \downarrow 0} x f(x,t)=\theta \rho(t)$, where $f(\cdot,t)$ is the frequency spectrum of derived alleles at independent loci at time $t$ and $\rho(t)$ is the relative population size at time $t$. When population size and selection intensity are independent of time, the forward equation is equivalent to the backwards diffusion usually used to derive the frequency spectrum, but the forward equation allows computation of the time dependence of the spectrum both before an equilibrium is attained and when population size and selection intensity vary with time. From the diffusion equation, we derive a set of ordinary differential equations for the moments of $f(\cdot,t)$ and express the expected spectrum of a finite sample in terms of those moments. We illustrate the use of the forward equation by considering neutral and selected alleles in a highly simplified model of human history. For example, we show that approximately 30\% of the expected heterozygosity of neutral loci is attributable to mutations that arose since the onset of population growth in roughly the last 150,000 years.
Keyword note:Evans__Steven_N Shvets__Yelena Slatkin__Montgomery
Report ID:705

Title:Comparison of MISR aerosol optical thickness with AERONET measurements in Beijing Metropolitan Area
Author(s):Jiang, Xin; Liu, Yang; Yu, Bin; Jiang, Ming; 
Date issued:March 2006 (PDF)
Abstract:Aerosol optical thickness (AOT) retrieved by the Multi-angle Imaging SpectroRadiometer (MISR) from 2002 to 2004 was compared with AOT measurements from an Aerosol Robotic Network (AERONET) site located in the Beijing urban area. MISR and AERONET AOTs were highly correlated, with an overall linear correlation coefficient of 0.93 at 558 nm wavelength. On average, MISR AOT at 558 nm was 30% lower than the AERONET AOT at 558 nm interpolated from 440 nm and 675 nm. A linear regression analysis using AERONET AOT as the response yielded a slope of 0.58 and an intercept of 0.07 in the green band with similar results in the other three bands, indicating that MISR substantially underestimates AERONET AOT. After applying a narrower averaging time window to control for temporal variability, the agreement between MISR and AERONET AOTs was significantly improved, with a correlation coefficient of 0.97 and a slope of 0.71 in an ordinary linear least squares fit. A weighted linear least squares fit, which reduces the impact of spatial averaging, yielded a better result with the slope going up to 0.73. The best agreement was achieved with a slope of 0.91 when only the central points are included in the regression analysis. By investigating the PM10 spatial distribution of Beijing, we found substantial spatial variations of aerosol loading, which can introduce uncertainty when validating MISR AOT. Our findings also suggest that the MISR aerosol retrieval algorithm might need to be adjusted for the extremely high aerosol loadings and substantial spatial variations that it will probably encounter in heavily polluted metropolitan areas.
Keyword note:Jiang__Xin Liu__Yang Yu__Bin Jiang__Ming
Report ID:704
