Statistics Technical Reports: Browse by year

Term(s):2007
Results:21

Title:The spectrum of kernel random matrices
Author(s):El Karoui, Noureddine; 
Date issued:December 2007
http://nma.berkeley.edu/ark:/28722/bk0005d085d (PDF)
Abstract:We place ourselves in the setting of high-dimensional statistical inference, where the number of variables p in a dataset of interest is of the same order of magnitude as the number of observations n. We consider the spectrum of certain kernel random matrices, in particular n x n matrices whose (i,j)-th entry is f(X_i'X_j/p) or f(\|X_i - X_j\|^2/p), where p is the dimension of the data, and X_i are independent data vectors. Here f is assumed to be a locally smooth function. The study is motivated by questions arising in statistics and computer science, where these matrices are used to perform, among other things, non-linear versions of principal component analysis. Surprisingly, we show that in high dimensions, and for the models we analyze, the problem becomes essentially linear, which is at odds with heuristics sometimes used to justify the usage of these methods. The analysis also highlights certain peculiarities of models widely studied in random matrix theory and raises some questions about their relevance as tools to model high-dimensional data encountered in practice.
Keyword note:El__Karoui__Noureddine
Report ID:748
Relevance:100
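
The construction in this abstract is easy to reproduce numerically. Below is a minimal numpy sketch (not the author's code) that builds an n x n kernel matrix with entries f(X_i'X_j/p) from Gaussian data and computes its spectrum; the choice f = tanh and the sizes are illustrative assumptions.

    # Toy illustration (not the paper's code): spectrum of a kernel random
    # matrix with entries f(X_i' X_j / p) for i.i.d. standard Gaussian data.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 500, 500                      # n and p of the same order
    X = rng.standard_normal((n, p))      # rows X_i are independent data vectors

    f = np.tanh                          # any locally smooth f; tanh is arbitrary
    K = f(X @ X.T / p)                   # K[i, j] = f(X_i' X_j / p)

    eigs = np.linalg.eigvalsh(K)         # K is symmetric, so use eigvalsh
    print(eigs[:3], eigs[-3:])           # extreme eigenvalues of the spectrum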

Title:The Effectiveness of Internet Content Filters
Author(s):Stark, P.B.; 
Date issued:November 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0905 (PDF)
Abstract:As part of its defense of the Child Online Protection Act, which seeks to prevent minors from viewing commercially published harmful-to-minors material on the World Wide Web, the U.S. Department of Justice commissioned a study of the prevalence of adult materials and the effectiveness of Internet content filters in blocking them. As of 2005-6, about 1.1% of webpages indexed by Google and MSN were adult: hundreds of millions of pages. About 6% of a set of 1.3 billion searches executed on AOL, MSN and Yahoo! in summer 2005 retrieved at least one adult webpage among the first ten results, and about 1.7% of the first ten results were adult webpages. These estimates are based on simple random samples of webpages indexed by search engines and a stratified random sample of searches. Webpages with sexually explicit content intended for adult entertainment (i.e., not in an educational, medical or artistic context) were used to test a variety of Internet content filters for underblocking (failing to block webpages that they are intended to block). A random sample of clean webpages with no sexual content or reference to sex was used to test the filters for overblocking (blocking webpages they are not intended to block). Webpages retrieved by the most popular searches according to Wordtracker were also categorized and used to test filters. Generally, filters with lower rates of underblocking had higher rates of overblocking. If the filter most effective at blocking adult materials were applied to search indexes, typical query results or the results of popular queries, the number of clean pages blocked in error would exceed the number of adult pages blocked correctly.
Keyword note:Stark__Philip_B
Report ID:746
Relevance:100
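
The headline percentages above rest on simple-random-sample estimates of a proportion. A toy version of that calculation, with hypothetical counts rather than the study's data:

    # Toy version of the sampling calculation: estimate the proportion of
    # "adult" webpages from a simple random sample, with a normal-theory CI.
    import math

    n_sampled = 10_000      # hypothetical sample size
    n_adult = 112           # hypothetical number classified as adult

    p_hat = n_adult / n_sampled
    se = math.sqrt(p_hat * (1 - p_hat) / n_sampled)
    lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
    print(f"estimate {p_hat:.3%}, approx 95% CI ({lo:.3%}, {hi:.3%})")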

Title:Covariance regularization by thresholding
Author(s):Bickel, Peter J.; Levina, Elizaveta; 
Date issued:October 2007
http://nma.berkeley.edu/ark:/28722/bk0005d094c (PDF)
Abstract:This paper considers regularizing a covariance matrix of p variables estimated from n observations, by hard thresholding. We show that the thresholded estimate is consistent in the operator norm as long as the true covariance matrix is sparse in a suitable sense, the variables are Gaussian or sub-Gaussian, and (log p)/n → 0, and obtain explicit rates. The results are uniform over families of covariance matrices which satisfy a fairly natural notion of sparsity. We discuss an intuitive resampling scheme for threshold selection and prove a general cross-validation result that justifies this approach. We also compare thresholding to other covariance estimators in simulations and on an example from climate data.
Keyword note:Bickel__Peter_John Levina__Elizaveta
Report ID:744
Relevance:100
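
The estimator itself is one line of linear algebra. A minimal sketch, assuming a tridiagonal (hence sparse) truth and a threshold of order sqrt(log(p)/n) with an arbitrary constant:

    # Minimal sketch of hard thresholding a sample covariance matrix.
    # The threshold level t ~ sqrt(log(p)/n) matches the rate in the
    # abstract; the constant and the data below are our own choices.
    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 200, 100
    Sigma = np.eye(p) + 0.4 * np.diag(np.ones(p - 1), 1) \
                      + 0.4 * np.diag(np.ones(p - 1), -1)   # sparse tridiagonal truth
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    S = np.cov(X, rowvar=False)                 # sample covariance
    t = 1.0 * np.sqrt(np.log(p) / n)            # threshold level
    S_thr = np.where(np.abs(S) >= t, S, 0.0)    # hard-thresholded estimate

    op = lambda A: np.linalg.norm(A, 2)         # operator (spectral) norm
    print(op(S - Sigma), op(S_thr - Sigma))     # thresholding should help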

Title:Joint covariate selection for grouped classification
Author(s):Obozinski, Guillaume; Taskar, Ben; Jordan, Michael; 
Date issued:October 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0971 (PDF)
Abstract:We address the problem of recovering a common set of covariates that are relevant simultaneously to several classification problems. We propose a joint measure of complexity for the group of problems that couples covariate selection. By penalizing the sum of L2-norms of the blocks of coefficients associated with each covariate across different classification problems, we encourage similar sparsity patterns in all models. To fit parameters under this regularization, we propose a blockwise boosting scheme that follows the regularization path. As the regularization coefficient decreases, the algorithm maintains and updates concurrently a growing set of covariates that are simultaneously active for all problems. We show empirically that this approach outperforms independent L1-based covariate selection on several data sets, both in accuracy and number of selected covariates.
Keyword note:Obozinski__Guillaume Taskar__Ben Jordan__Michael_I
Report ID:743
Relevance:100
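
The joint penalty described above is simple to compute. A small sketch of the sum of L2 norms of per-covariate coefficient blocks across K problems; the coefficient matrix below is made up:

    # Sketch of the joint complexity measure: the sum over covariates of the
    # L2 norm of each covariate's coefficients across the K classification
    # problems (an L1/L2 block penalty). Names and data are illustrative.
    import numpy as np

    p, K = 6, 3                              # covariates x tasks
    B = np.zeros((p, K))                     # coefficient matrix, B[j, k]
    B[1] = [0.8, -0.5, 0.3]                  # covariate 1 active in all tasks
    B[4] = [0.0, 0.2, -0.1]                  # covariate 4 active in two tasks

    block_norms = np.linalg.norm(B, axis=1)  # L2 norm of each covariate's block
    penalty = block_norms.sum()              # the joint L1/L2 penalty
    active = np.flatnonzero(block_norms)     # covariates selected jointly
    print(penalty, active)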

Title:Conservative Statistical Post-Election Audits
Author(s):Stark, Philip B.; 
Date issued:October 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0c4d (PDF)
Abstract:There are many sources of error in counting votes on election day: the apparent winner might not be the rightful winner. Hand tallies of the votes in a random sample of precincts can be used to test the hypothesis that the wrong candidate was named the winner. This paper develops a conservative sequential test based on the vote-counting errors found in a hand tally of a simple or stratified random sample of precincts.
Keyword note:Stark__Philip_B
Report ID:741
Relevance:100
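
The test statistic itself is beyond a short sketch, but the sampling step it relies on is not. A toy stratified draw of precincts for hand tallies (county names and counts are hypothetical):

    # Toy sketch of the sampling step only (not Stark's test itself): draw a
    # stratified random sample of precincts, e.g. one stratum per county,
    # and flag the sampled precincts for hand tallies.
    import numpy as np

    rng = np.random.default_rng(2)
    strata = {"county_A": 120, "county_B": 80, "county_C": 40}  # precinct counts
    sample_per_stratum = {"county_A": 6, "county_B": 4, "county_C": 2}

    for county, n_precincts in strata.items():
        k = sample_per_stratum[county]
        picks = rng.choice(n_precincts, size=k, replace=False)
        print(county, sorted(picks))          # precinct indices to hand-tally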

Title:On spectral properties of large dimensional correlation matrices and covariance matrices computed from elliptically distributed data
Author(s):El Karoui, Noureddine; 
Date issued:October 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0d3c (PDF)
Abstract:We place ourselves in the setting of high-dimensional statistical inference, where the number of variables p in a dataset of interest is of the same order of magnitude as the number of observations n. More formally we study the asymptotic properties of correlation and covariance matrices under the setting that p/n tends to rho in (0,infinity), for general population covariance. We show that spectral properties for large dimensional correlation matrices are similar to those of large dimensional covariance matrices, for a large class of models studied in random matrix theory. We also derive a Marcenko-Pastur-type system of equations for the limiting spectral distribution of covariance matrices computed from elliptically distributed data. The motivation for this study comes from the relevance of such distributional assumptions to problems in econometrics and portfolio optimization. A mathematical theme of the paper is the important use we make of concentration inequalities.
Keyword note:El__Karoui__Noureddine
Report ID:740
Relevance:100
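
Elliptical data are easy to simulate as a Gaussian vector rescaled by an independent radial factor per observation. A toy comparison of the extreme eigenvalues of correlation and covariance matrices under a multivariate-t law (our choice of elliptical family):

    # Toy comparison of correlation vs covariance spectra for elliptically
    # distributed data; here the elliptical law is multivariate t.
    import numpy as np

    rng = np.random.default_rng(3)
    n, p, df = 1000, 500, 5
    Z = rng.standard_normal((n, p))
    r = np.sqrt(df / rng.chisquare(df, size=n))   # radial mixing factors
    X = Z * r[:, None]                            # rows are elliptical vectors

    S = np.cov(X, rowvar=False)                   # covariance matrix
    Rho = np.corrcoef(X, rowvar=False)            # correlation matrix
    print(np.linalg.eigvalsh(S)[-1], np.linalg.eigvalsh(Rho)[-1])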

Title:Consistent Estimates of Deformed Isotropic Gaussian Random Fields on the Plane
Author(s):Anderes, Ethan; Chatterjee, Sourav; 
Date issued:October 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0f2b (PDF)
Abstract:This paper proves fixed domain asymptotic results for estimating a smooth invertible transformation $f\colon \Bbb R^2\rightarrow \Bbb R^2$ when observing the deformed random field $Z\circ f$ on a dense grid in a bounded simply connected domain $\Omega$ where $ Z$ is assumed to be an isotropic Gaussian random field on $\Bbb R^2$. The estimate, $\hat f$, is constructed on a simply connected domain $U$ such that $\overline U\subset\Omega$ and is defined using kernel smoothed quadratic variations, Bergman projections and results from quasiconformal theory. We show under mild assumptions on the random field $Z$ and the deformation $f$ that $\hat f\rightarrow R_\theta f+c$ uniformly on compact subsets of $U$ with probability one as the grid spacing goes to zero, where $R_\theta$ is an unidentifiable rotation and $c$ is an unidentifiable translation.
Keyword note:Anderes__Ethan Chatterjee__Sourav
Report ID:739
Relevance:100

Title:Convergence Analysis of Reweighted Sum-Product Algorithms
Author(s):Roosta, Tanya; Wainwright, Martin J.; Sastry, Shankar; 
Date issued:August 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0g7m (PDF)
Abstract:Markov random fields are designed to represent structured dependencies among large collections of random variables, and are well-suited to capture the structure of real-world signals. Many fundamental tasks in signal processing (e.g., smoothing, denoising, segmentation etc.) require efficient methods for computing (approximate) marginal probabilities over subsets of nodes in the graph. The marginalization problem, though solvable in linear time for graphs without cycles, is computationally intractable for general graphs with cycles. This intractability motivates the use of approximate "message-passing" algorithms. This paper studies the convergence and stability properties of the family of reweighted sum-product algorithms, a generalization of the widely-used sum-product or belief propagation algorithm, in which messages are adjusted with graph-dependent weights. For homogeneous models, we provide a complete characterization of the potential settings and message weightings that guarantee uniqueness of fixed points, and convergence of the updates. For more general inhomogeneous models, we derive a set of sufficient conditions that ensure convergence, and provide bounds on convergence rates. The experimental simulations on various classes of graphs validate our theoretical results.
Keyword note:Roosta__Tanya Wainwright__Martin Sastry__Shankar
Report ID:737
Relevance:100
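
A toy implementation of the reweighted update on the smallest graph with a cycle. On a triangle, each edge lies in 2 of the 3 spanning trees, so all edge weights can be taken as rho = 2/3. The code follows the standard tree-reweighted form of the update rather than anything specific to this paper; potentials and schedule are illustrative assumptions:

    # Toy reweighted sum-product on a 3-node binary cycle, edge weights 2/3.
    import numpy as np

    nodes, edges, rho = [0, 1, 2], [(0, 1), (1, 2), (0, 2)], 2.0 / 3.0
    psi_node = {i: np.array([1.0, 2.0]) for i in nodes}          # node potentials
    psi_edge = {e: np.array([[2.0, 1.0], [1.0, 2.0]]) for e in edges}

    M = {(s, t): np.ones(2) for s, t in edges}                   # messages s->t
    M.update({(t, s): np.ones(2) for s, t in edges})

    def neighbors(t):
        return [u for u in nodes if (u, t) in M]

    for _ in range(100):
        for (t, s) in list(M):                                   # message t -> s
            e = (min(s, t), max(s, t))
            pw = psi_edge[e] ** (1.0 / rho)                      # psi_st^{1/rho}
            prod = psi_node[t].copy()
            for u in neighbors(t):
                if u != s:
                    prod = prod * M[(u, t)] ** rho               # incoming^rho
            prod = prod / M[(s, t)] ** (1.0 - rho)               # reweighted divisor
            msg = pw @ prod                                      # sum over x_t
            M[(t, s)] = msg / msg.sum()                          # normalize

    b = {i: psi_node[i] * np.prod([M[(u, i)] ** rho for u in neighbors(i)], axis=0)
         for i in nodes}
    print({i: (bi / bi.sum()).round(3) for i, bi in b.items()})  # pseudo-marginals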

Title:Spectra of random linear combinations of matrices defined via representations and Coxeter generators of the symmetric group
Author(s):Evans, Steven N.; 
Date issued:August 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0h1t (PDF)
Abstract:We consider the asymptotic behavior as $n \rightarrow \infty$ of the spectra of random matrices of the form \[\frac{1}{\sqrt{n-1}} \sum_{k=1}^{n-1} Z_{nk} \rho_n((k,k+1)),\] where for each $n$ the random variables $Z_{nk}$ are i.i.d. standard Gaussian and the matrices $\rho_n((k,k+1))$ are obtained by applying an irreducible unitary representation $\rho_n$ of the symmetric group on $\{1,2,\ldots,n\}$ to the transposition $(k,k+1)$ that interchanges $k$ and $k+1$ (thus $\rho_n((k,k+1))$ is both unitary and self-adjoint, with all eigenvalues either $+1$ or $-1$). Irreducible representations of the symmetric group on $\{1,2,\ldots,n\}$ are indexed by partitions $\lambda_n$ of $n$. A consequence of the results we establish is that if $\lambda_{n,1} \ge \lambda_{n,2} \ge \cdots \ge 0$ is the partition of $n$ corresponding to $\rho_n$, $\mu_{n,1} \ge \mu_{n,2} \ge \cdots \ge 0$ is the corresponding conjugate partition of $n$ (that is, the Young diagram of $\mu_n$ is the transpose of the Young diagram of $\lambda_n$), $\lim_{n \rightarrow \infty} \frac{\lambda_{n,i}}{n} = p_i$ for each $i \ge 1$, and $\lim_{n \rightarrow \infty} \frac{\mu_{n,j}}{n} = q_j$ for each $j \ge 1$, then the spectral measure of the resulting random matrix converges in distribution to a random probability measure that is Gaussian with mean $\theta Z$ and variance $1 - \theta^2$, where $\theta$ is the constant $\sum_i p_i^2 - \sum_j q_j^2$ and $Z$ is a standard Gaussian random variable.
Keyword note:Evans__Steven_N
Report ID:736
Relevance:100
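
The random matrices above are concrete enough to simulate, though a faithful simulation needs irreducible representations. The sketch below substitutes the defining (permutation) representation, which is not irreducible (it splits as trivial plus standard), so it only approximates the setting of the abstract:

    # Toy version using the defining permutation representation of S_n in
    # place of an irreducible one; an approximation to the abstract's setting.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 200
    Z = rng.standard_normal(n - 1)                 # i.i.d. standard Gaussians

    A = np.zeros((n, n))
    for k in range(n - 1):
        P = np.eye(n)
        P[[k, k + 1]] = P[[k + 1, k]]              # matrix of transposition (k, k+1)
        A += Z[k] * P
    A /= np.sqrt(n - 1)

    eigs = np.linalg.eigvalsh(A)                   # A is symmetric
    print(eigs.min(), eigs.max())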

Title:On the gene ranking of replicated microarray time course data
Author(s):Tai, Yu Chuan; Speed, Terence P.; 
Date issued:August 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0h51 (PDF)
Abstract:Consider the gene ranking problem of replicated microarray time course experiments where there are multiple biological conditions, and genes of interest are those whose temporal profiles are different across conditions. We derive the multi-sample multivariate empirical Bayes statistic for ranking genes in the order of differential expression, from both longitudinal and cross-sectional replicated developmental microarray time course data. Our longitudinal multi-sample model assumes that time course replicates are i.i.d. multivariate normal vectors. On the other hand, we construct our cross-sectional model using a normal regression framework with any appropriate basis for the design matrices. In both cases, we use natural conjugate priors in our empirical Bayes setting which guarantee closed form solutions for the posterior odds. Our simulations and two case studies using published worm and mouse microarray time course datasets indicate that the proposed approaches work well.
Keyword note:Tai__Yu_Chuan Speed__Terry_P
Report ID:735
Relevance:100

Title:Operator norm consistent estimation of large dimensional sparse covariance matrices
Author(s):El Karoui, Noureddine; 
Date issued:July 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0h97 (PDF)
Abstract:Estimating covariance matrices is a problem of fundamental importance in multivariate statistics. In practice it is increasingly frequent to work with data matrices $X$ of dimension $n\times p$, where $p$ and $n$ are both large. Results from random matrix theory show very clearly that in this setting, standard estimators like the sample covariance matrix perform in general very poorly. In this "large $n$, large $p$" setting, it is sometimes the case that practitioners are willing to assume that many elements of the population covariance matrix are equal to 0, and hence this matrix is sparse. We develop an estimator to handle this situation. The estimator is shown to be consistent in operator norm, when $p/n \to l \neq 0$, where $l$ is generally finite, as $p \to \infty$. In other words, the largest eigenvalue of the difference between the estimator and the population covariance matrix goes to zero. This implies consistency of all the eigenvalues and consistency of eigenspaces associated with isolated eigenvalues. We also propose a notion of sparsity for matrices that is "compatible" with spectral analysis and is independent of the ordering of the variables.
Keyword note:El__Karoui__Noureddine
Report ID:734
Relevance:100
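
The abstract's "in other words" step is Weyl's inequality: closeness in operator norm forces closeness of all eigenvalues. A small numerical check, with an artificial perturbation standing in for the estimation error:

    # If ||Sigma_hat - Sigma|| is small in operator norm, then by Weyl's
    # inequality every eigenvalue of Sigma_hat is within that error of the
    # corresponding eigenvalue of Sigma.
    import numpy as np

    rng = np.random.default_rng(5)
    p = 50
    Sigma = np.diag(np.linspace(1.0, 3.0, p))          # a simple sparse truth
    E = 0.01 * rng.standard_normal((p, p))
    Sigma_hat = Sigma + (E + E.T) / 2                  # a perturbed "estimate"

    op_err = np.linalg.norm(Sigma_hat - Sigma, 2)      # operator norm
    eig_gap = np.abs(np.linalg.eigvalsh(Sigma_hat) - np.linalg.eigvalsh(Sigma)).max()
    print(op_err, eig_gap)                             # eig_gap <= op_err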

Title:Event Weighted Tests for Detecting Periodicity in Photon Arrival Times
Author(s):Bickel, Peter; Kleijn, Bas; Rice, John; 
Date issued:June 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0j3f (PDF)
Abstract:This paper treats the problem of detecting periodicity in a sequence of photon arrival times, which occurs, for example, in attempting to detect gamma-ray pulsars. A particular focus is on how auxiliary information (typically source intensity, background intensity, and the incidence angles and energies associated with each photon arrival) should be used to maximize the detection power. We construct a class of likelihood-based tests, score tests, which give rise to event weighting in a principled and natural way, and derive expressions quantifying the power of the tests. These results can be used to compare the efficacies of different weight functions, including cuts in energy and incidence angle. The test is targeted toward a template for the periodic lightcurve, and we quantify how deviation from that template affects the power of detection.
Keyword note:Bickel__Peter_John Kleijn__Bas Rice__John_Andrew
Report ID:733
Relevance:100
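
As a stand-in for the paper's score tests, here is the classical weighted Rayleigh statistic: fold arrival times at a trial frequency and combine weighted phase terms; under the null of no periodicity the statistic is approximately chi-squared with 2 degrees of freedom. All data below are simulated noise:

    # Weighted Rayleigh-type statistic for periodicity in arrival times.
    import numpy as np

    rng = np.random.default_rng(6)
    t = np.sort(rng.uniform(0, 1000, size=2000))       # photon arrival times
    w = rng.uniform(0.5, 1.5, size=t.size)             # per-event weights
    freq = 0.7                                         # trial frequency (Hz)

    phase = 2 * np.pi * np.mod(t * freq, 1.0)          # fold into phases
    C, S = np.sum(w * np.cos(phase)), np.sum(w * np.sin(phase))
    stat = 2 * (C**2 + S**2) / np.sum(w**2)            # ~ chi^2_2 under H0
    print(stat)                                        # large => periodicity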

Title:An experimental study comparing linguistic phylogenetic reconstruction methods
Author(s):Barbancon, Francois; Warnow, Tandy; Evans, Steven N.; Ringe, Donald; Nakhleh, Luay; 
Date issued:May 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0j9r (PDF)
Abstract:The estimation of linguistic evolution has intrigued many researchers for centuries, and in just the last few years, several new methods for constructing phylogenies from languages have been produced and used to analyze a number of language families. These analyses have led to a great deal of excitement, both within the field of historical linguistics and in related fields such as archaeology and human genetics. They have also been controversial, since the analyses have not always been consistent with each other, and the differences between different reconstructions have been potentially critical to the claims made by the different groups. In this paper, we report on a simulation study we performed in order to help resolve this controversy, which compares some of the main phylogeny reconstruction methods currently being used in linguistic cladistics. Our simulated datasets varied in the number of contact edges, the degree of homoplasy, the deviation from a lexical clock, and the deviation from the rates-across-sites assumption. We find the accuracy of the unweighted methods (maximum parsimony, neighbor joining, lexico-statistics, and the method of Gray & Atkinson) to be remarkably consistent across all the model conditions we studied, with maximum parsimony being the best, followed (often closely) by Gray & Atkinson's method, then neighbor joining, and finally lexico-statistics (UPGMA). The accuracy of the two weighted methods (weighted maximum parsimony and weighted maximum compatibility) depends upon the appropriateness of the weighting scheme, and so depends upon the homoplasy levels produced by the model conditions. For low-homoplasy levels, the weighted methods generally produce the most accurate results of all methods, while the use of inappropriate weighting schemes can make for poorer results than maximum parsimony and Gray & Atkinson's method under moderate to high homoplasy levels.
Keyword note:Barbancon__Francois Warnow__Tandy Evans__Steven_N Ringe__Don Nakhleh__Luay
Report ID:732
Relevance:100

Title:Sparse, noisy Boolean functions
Author(s):Mukherjee, S.; Speed, T. P.; 
Date issued:April 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0k1v (PDF)
Abstract:This paper addresses the question of making inferences regarding Boolean functions under conditions of (i) noise, or stochastic variation in observed data, and (ii) sparsity, by which we mean that the number of inputs or predictors far exceeds the arity of the underlying Boolean function. We put forward a simple probability model for such observations, and discuss model selection, parameter estimation and prediction. We present results on synthetic data and on a proteomic dataset from a study in cancer systems biology.
Keyword note:Mukherjee__Sach Speed__Terry_P
Report ID:731
Relevance:100
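
A toy version of the inference problem: many predictors, a low-arity Boolean rule observed through noise, and model selection by Bernoulli log-likelihood over candidate (input pair, gate) models. Everything below (sizes, noise rate, the planted rule) is illustrative:

    # Enumerate candidate 2-input Boolean rules and score each by the
    # log-likelihood of the noisy observations; pick the best.
    import itertools
    import numpy as np

    rng = np.random.default_rng(7)
    n, p, noise = 300, 10, 0.1
    X = rng.integers(0, 2, size=(n, p)).astype(bool)
    y_clean = X[:, 2] & X[:, 5]                      # true rule: AND of inputs 2, 5
    y = np.where(rng.random(n) < noise, ~y_clean, y_clean)

    gates = {"AND": np.logical_and, "OR": np.logical_or, "XOR": np.logical_xor}

    def loglik(pred, obs, eps=noise):
        agree = (pred == obs).sum()
        return agree * np.log(1 - eps) + (len(obs) - agree) * np.log(eps)

    best = max(((i, j, name) for i, j in itertools.combinations(range(p), 2)
                for name in gates),
               key=lambda c: loglik(gates[c[2]](X[:, c[0]], X[:, c[1]]), y))
    print(best)                                      # hopefully (2, 5, 'AND')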

Title:Low density graph codes that are optimal for binning and coding with side information
Author(s):Wainwright, Martin J.; Martinian, Emin; 
Date issued:April 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0k52 (PDF)
Abstract:We describe and analyze the joint source/channel coding properties of a class of sparse graphical codes based on compounding a low-density generator matrix (LDGM) code with a low-density parity check (LDPC) code. Our first pair of theorems establish that there exist codes from this ensemble, with all degrees remaining bounded independently of block length, that are simultaneously optimal as both source and channel codes when encoding and decoding are performed optimally. More precisely, in the context of lossy compression, we prove that finite degree constructions can achieve any pair $(R, D)$ on the rate-distortion curve of the binary symmetric source. In the context of channel coding, we prove that finite degree codes can achieve any pair $(C, p)$ on the capacity-noise curve of the binary symmetric channel. Next, we show that our compound construction has a nested structure that can be exploited to achieve the Wyner-Ziv bound for source coding with side information (SCSI), as well as the Gelfand-Pinsker bound for channel coding with side information (CCSI). Although the current results are based on optimal encoding and decoding, the proposed graphical codes have sparse structure and high girth that renders them well-suited to message-passing and other efficient decoding procedures.
Keyword note:Wainwright__Martin Martinian__Emin
Report ID:730
Relevance:100
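
The two curves the theorems refer to are classical and easy to tabulate: R(D) = 1 - h(D) for the binary symmetric source and C = 1 - h(p) for the binary symmetric channel, with h the binary entropy function:

    # Tabulate the rate-distortion curve of the binary symmetric source and
    # the capacity-noise curve of the binary symmetric channel.
    import numpy as np

    def h(x):
        x = np.clip(x, 1e-12, 1 - 1e-12)             # avoid log(0)
        return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

    D = np.linspace(0.0, 0.5, 6)
    print("R(D):", np.round(1 - h(D), 3))            # rate-distortion curve
    pchan = np.linspace(0.0, 0.5, 6)
    print("C(p):", np.round(1 - h(pchan), 3))        # capacity-noise curve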

Title:Markov chain Monte Carlo for Structural Inference with Prior Information
Author(s):Mukherjee, Sach; Speed, Terence P.; 
Date issued:April 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0m0t (PDF)
Abstract:This paper addresses the question of making inferences regarding features of conditional independence graphs in settings characterized by the availability of rich prior information regarding such features. We focus on Bayesian networks, and use Markov chain Monte Carlo to draw samples from the relevant posterior over graphs. We introduce a class of "locally-informative priors" which are highly flexible and capable of taking account of specific information regarding graph features, and are, in addition, informative at a scale appropriate to local sampling moves. We present examples of such priors for beliefs regarding edges, groups and classes of edges, degree distributions and sparsity, applying our methods to challenging synthetic data as well as data obtained from a biological network in cancer.
Keyword note:Mukherjee__Sach Speed__Terry_P
Report ID:729
Relevance:100
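
One ingredient is easy to illustrate: an independent edge-wise prior is informative at the scale of a single add/delete move, contributing the prior odds pi_e/(1 - pi_e) to the Metropolis-Hastings ratio. The sketch below is our own toy, not the paper's prior class; edge names, beliefs, and the likelihood-ratio stub are made up:

    # Toy edge-wise prior ratio for an add/delete move in MCMC over graphs.
    import numpy as np

    pi = {("TP53", "MDM2"): 0.9, ("TP53", "EGFR"): 0.2}   # prior edge beliefs

    def prior_ratio(edge, adding):
        r = pi.get(edge, 0.5) / (1 - pi.get(edge, 0.5))   # odds of edge presence
        return r if adding else 1.0 / r

    log_lik_ratio = -2.5                                  # from the data (stub)
    accept = min(1.0, np.exp(log_lik_ratio) * prior_ratio(("TP53", "MDM2"), True))
    print(accept)                                         # MH acceptance probability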

Title:Fast Rates for Estimation Error and Oracle Inequalities for Model Selection
Author(s):Bartlett, Peter L.; 
Date issued:March 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0m41 (PDF)
Abstract:We consider complexity penalization methods for model selection. These methods aim to choose a model to optimally trade off estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. It is well known that if we use a bound on the maximal deviation between empirical and true risks as a complexity penalty, then the risk of our choice is no more than the approximation error plus twice the complexity penalty. There are many cases, however, where complexity penalties like this give loose upper bounds on the estimation error. In particular, if we choose a function from a suitably simple convex function class with a strictly convex loss function, then the estimation error (the difference between the risk of the empirical risk minimizer and the minimal risk in the class) approaches zero at a faster rate than the maximal deviation between empirical and true risks. In this note, we address the question of whether it is possible to design a complexity penalized model selection method for these situations. We show that, provided the sequence of models is ordered by inclusion, in these cases we can use tight upper bounds on estimation error as a complexity penalty. Surprisingly, this is the case even in situations when the difference between the empirical risk and true risk (and indeed the error of any estimate of the approximation error) decreases much more slowly than the complexity penalty. We give an oracle inequality showing that the resulting model selection method chooses a function with risk no more than the approximation error plus a constant times the complexity penalty.
Keyword note:Bartlett__Peter
Report ID:728
Relevance:100
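
The selection rule is easy to state in code: over a family ordered by inclusion, pick the model minimizing empirical risk plus a complexity penalty. The sketch below uses nested polynomial fits and an arbitrary penalty, not the tight estimation-error penalty analyzed in the paper:

    # Complexity-penalized model selection over a nested family of models.
    import numpy as np

    rng = np.random.default_rng(8)
    n = 200
    x = rng.uniform(-1, 1, n)
    y = np.sin(2 * x) + 0.3 * rng.standard_normal(n)

    def emp_risk(deg):
        coef = np.polyfit(x, y, deg)                  # fit degree-deg polynomial
        return np.mean((np.polyval(coef, x) - y) ** 2)

    scores = {d: emp_risk(d) + 0.05 * (d + 1) / np.sqrt(n) for d in range(1, 10)}
    print(min(scores, key=scores.get), scores)        # selected degree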

Title:Coverage Adjusted Entropy Estimation
Author(s):Vu, Vincent Q.; Yu, Bin; Kass, Robert E.; 
Date issued:March 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0m87 (PDF)
Abstract:Data on "neural coding" have frequently been analyzed using information-theoretic measures. These formulations involve the fundamental, and generally difficult, statistical problem of estimating entropy. In this paper we briefly review several methods that have been advanced to estimate entropy, and highlight a method due to Chao and Shen that appeared recently in the environmental statistics literature. This method begins with the elementary Horvitz-Thompson estimator, developed for sampling from a finite population, and adjusts for the potential new species that have not yet been observed in the sample; these become the new patterns or "words" in a spike train that have not yet been observed. The adjustment is due to I.J. Good, and is called the Good-Turing coverage estimate. We provide a new empirical regularization derivation of the coverage-adjusted probability estimator, which shrinks the MLE (the naive or "direct" plug-in estimate) toward zero. We prove that the coverage-adjusted estimator, due to Chao and Shen, is consistent and first-order optimal, with rate $O_P(1/\log n)$, in the class of distributions with finite entropy variance, and that within the class of distributions with finite $q$th moment of the log-likelihood, the Good-Turing coverage estimate and the total probability of unobserved words converge at rate $O_P(1/(\log n)^q)$. We then provide a simulation study of the estimator with standard distributions and examples from neuronal data, where observations are dependent. The results show that, with a minor modification, the coverage-adjusted estimator performs much better than the MLE and is better than the Best Upper Bound estimator, due to Paninski, when the number of possible words $m$ is unknown or infinite.
Keyword note:Vu__Vincent_Q Yu__Bin Kass__Robert_E
Report ID:727
Relevance:100
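
The Chao-Shen estimator highlighted above has a compact closed form: shrink the empirical frequencies by the Good-Turing coverage estimate 1 - f_1/n, then apply a Horvitz-Thompson correction for words unseen in the sample. A minimal implementation (counts are illustrative; it assumes not every observed word is a singleton):

    # Coverage-adjusted (Chao-Shen) entropy estimator.
    import numpy as np

    def coverage_adjusted_entropy(counts):
        counts = np.asarray([c for c in counts if c > 0], dtype=float)
        n = counts.sum()
        f1 = (counts == 1).sum()                  # number of singletons
        C_hat = 1.0 - f1 / n                      # Good-Turing coverage estimate
        p_tilde = C_hat * counts / n              # shrunken probabilities
        inclusion = 1.0 - (1.0 - p_tilde) ** n    # P(word appears in the sample)
        return -np.sum(p_tilde * np.log(p_tilde) / inclusion)

    print(coverage_adjusted_entropy([5, 3, 2, 1, 1, 1]))   # entropy in nats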

Title:The Impact of an Inaccurate Diagnostic Biomarker on Phase II Clinical Trials in The Development of Targeted Therapy.
Author(s):Wang, Nancy N.; Rabbee, Nusrat; 
Date issued:March 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0n30 (PDF)
Abstract:Current research in oncology aims at developing targeted therapies to treat the heterogeneous patient population. Successful development of a targeted therapy requires a biomarker that identifies patients who are most likely to benefit from the treatment. However, most biomarkers are inherently inaccurate. We present a simulation study to examine how the sensitivity and specificity of a single, binary biomarker influences the Cox estimates of hazard ratios in phase II clinical trials. We discuss how the bias introduced by marker inaccuracy impacts the decision of whether to carry a drug forward to a phase III clinical trial. Finally, we propose a bootstrap-based method for reducing the bias of the Cox estimator, in the presence of an inaccurate marker.
Keyword note:Wang__Nancy_N Rabbee__Nusrat
Report ID:726
Relevance:100
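
The attenuation mechanism is simple to simulate. The sketch below passes a true responder status through a marker with given sensitivity and specificity and compares hazard-ratio estimates between true and observed groups, using exponential event rates in place of the paper's Cox fits; all parameters are made up:

    # Toy simulation: an imperfect binary marker dilutes the estimated
    # hazard ratio between marker-positive and marker-negative groups.
    import numpy as np

    rng = np.random.default_rng(9)
    N, sens, spec = 100_000, 0.8, 0.9
    true_pos = rng.random(N) < 0.3                       # true responders
    hazard = np.where(true_pos, 0.5, 1.0)                # true HR = 0.5
    time = rng.exponential(1.0 / hazard)

    obs_pos = np.where(true_pos, rng.random(N) < sens,   # imperfect marker
                                 rng.random(N) > spec)

    rate = lambda m: m.sum() / time[m].sum()             # events per person-time
    print("true-group HR:", rate(true_pos) / rate(~true_pos))
    print("marker-group HR:", rate(obs_pos) / rate(~obs_pos))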

Title:Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting
Author(s):Wainwright, Martin J.; 
Date issued:January 2007
http://nma.berkeley.edu/ark:/28722/bk0005d0n76 (PDF)
Abstract:The problem of recovering the sparsity pattern of a fixed but unknown vector $\beta^* \in \mathbb{R}^p$ based on a set of $n$ noisy observations arises in a variety of settings, including subset selection in regression, graphical model selection, signal denoising, compressive sensing, and constructive approximation. Of interest are conditions on the model dimension $p$, the sparsity index $s$ (number of non-zero entries in $\beta^*$), and the number of observations $n$ that are necessary and/or sufficient to ensure asymptotically perfect recovery of the sparsity pattern.
Keyword note:Wainwright__Martin
Report ID:725
Relevance:100
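
A toy experiment in the spirit of the abstract: vary n and record how often the support of an s-sparse vector is exactly recovered from noisy linear observations. The decoder here (top-s entries of X^T y) is crude and ours; the paper's limits concern what any decoder can achieve:

    # Empirical exact-recovery rate of a crude support decoder as n grows.
    import numpy as np

    rng = np.random.default_rng(10)
    p, s = 512, 8
    support = rng.choice(p, size=s, replace=False)
    beta = np.zeros(p)
    beta[support] = 1.0

    for n in (50, 100, 200, 400):
        hits = 0
        for _ in range(50):
            X = rng.standard_normal((n, p)) / np.sqrt(n)  # ~unit-norm columns
            y = X @ beta + 0.1 * rng.standard_normal(n)
            est = np.argsort(np.abs(X.T @ y))[-s:]        # top-s correlations
            hits += set(est) == set(support)
        print(n, hits / 50)                               # exact-recovery rate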
