Statistics Technical Reports: Browse by year

Term(s):2000
Results:17

Title:Deconvolution of sparse positive spikes: is it ill-posed?
Author(s):Li, Lei; Speed, Terry; 
Date issued:Oct 2000
http://nma.berkeley.edu/ark:/28722/bk0000n3w8w (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n3w9f (PostScript)
Abstract:Deconvolution is usually regarded as one of the so-called ill-posed problems of applied mathematics if no constraints on the unknowns can be assumed. In this paper, we discuss the idea of well-defined statistical models being a counterpart of the notion of well-posedness. We show that constraints on the unknowns, such as non-negativity and sparsity, can help a great deal in overcoming the inherent ill-posedness of deconvolution. This is illustrated by a parametric deconvolution method based on the spike-convolution model. Not only does this knowledge, together with the choice of the measure of goodness of fit, help people think about data (models); it also determines the way people compute with data (algorithms). This view is illustrated by taking a fresh look at two familiar deconvolvers: the widely used Jansson method, and another which minimizes the Kullback-Leibler distance between observations and fitted values. In the latter case, we point out that a counterpart of the EM algorithm exists for minimizing the Kullback-Leibler distance in the context of deconvolution. We compare the performance of these deconvolvers using data simulated from a spike-convolution model and DNA sequencing data.
Keyword note:Li__Lei Speed__Terry_P
Report ID:586
Relevance:100
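
The EM counterpart mentioned in this abstract corresponds, in the Poisson-noise setting, to the classical multiplicative Richardson-Lucy iteration. A minimal numpy sketch under that assumption (the Gaussian point-spread function, baseline, and iteration count below are illustrative, not taken from the report):

    import numpy as np

    def kl_deconvolve(y, psf, n_iter=200):
        # Multiplicative (Richardson-Lucy-type) updates: each step keeps the
        # estimate non-negative and decreases KL(y || conv(psf, x)).
        psf = psf / psf.sum()                   # normalized point-spread function
        x = np.full_like(y, y.mean())           # flat, positive starting value
        psf_flip = psf[::-1]                    # mirrored kernel for the adjoint
        for _ in range(n_iter):
            fit = np.convolve(x, psf, mode="same")
            ratio = y / np.maximum(fit, 1e-12)  # guard against division by zero
            x *= np.convolve(ratio, psf_flip, mode="same")
        return x

    # Toy example: two positive spikes blurred by a Gaussian kernel.
    t = np.arange(-10, 11)
    psf = np.exp(-0.5 * (t / 2.0) ** 2)
    truth = np.zeros(100); truth[30] = 5.0; truth[60] = 3.0
    y = np.convolve(truth, psf / psf.sum(), mode="same") + 0.01  # small baseline
    print(np.sort(np.argsort(kl_deconvolve(y, psf))[-2:]))  # near spikes 30, 60

The multiplicative form makes the non-negativity constraint discussed in the abstract automatic: a non-negative start stays non-negative at every iteration.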

Title:Parametric deconvolution of positive spike trains
Author(s):Li, Lei; Speed, Terry; 
Date issued:May 2000
http://nma.berkeley.edu/ark:/28722/bk0000n3w57 (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n3w6s (PostScript)
Abstract:This paper describes a parametric deconvolution method (PDPS) appropriate for a particular class of signals which we call spike-convolution models. These models arise when a sparse spike train---Dirac deltas in our mathematical treatment---is convolved with a fixed point-spread function, and additive noise or measurement error is superimposed. We view deconvolution as an estimation problem, regarding the locations and heights of the underlying spikes, as well as the baseline and the measurement error variance, as unknown parameters. Our estimation scheme consists of two parts: model fitting and model selection. To fit a spike-convolution model of a specific order, we estimate peak locations by trigonometric moments, and heights and the baseline by least squares. The model selection procedure has two stages. The first stage is designed so that a model of somewhat larger order than the truth is expected to be selected. In the second stage, the final model is obtained using backwards deletion. This results not only in an estimate of the model order, but also in estimates of peak locations and heights with much smaller bias and variation than those found in a direct trigonometric moment estimate. A more efficient maximum likelihood estimate can be calculated from these estimates using a Gauss-Newton algorithm. We also present some relevant results concerning the spectral structure of Toeplitz matrices, which play a key role in the estimation. Finally, we illustrate the behavior of these estimates using simulated and real DNA sequencing data.
Keyword note:Li__Lei Speed__Terry_P
Report ID:585
Relevance:100
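
The trigonometric-moment step for peak locations can be illustrated by a noiseless Prony-type computation; the sketch below is a bare-bones stand-in, omitting the baseline, the noise model, and the two-stage model selection that PDPS adds (all names and sizes are illustrative):

    import numpy as np

    def prony_spikes(c, K):
        # Recover K spike locations and heights from trigonometric moments
        # c[m] = sum_k h_k * exp(-2j*pi*m*tau_k), m = 0..2K-1 (noiseless case).
        # The annihilating polynomial solves a small Toeplitz/Hankel system.
        A = np.array([[c[m + j] for j in range(K)] for m in range(K)])
        b = -np.array([c[m + K] for m in range(K)])
        a = np.linalg.solve(A, b)
        roots = np.roots(np.concatenate(([1.0], a[::-1])))
        tau = np.sort(np.mod(-np.angle(roots) / (2 * np.pi), 1.0))
        # Heights by least squares on the Vandermonde system, as in the paper.
        V = np.exp(-2j * np.pi * np.outer(np.arange(len(c)), tau))
        h = np.linalg.lstsq(V, c, rcond=None)[0].real
        return tau, h

    tau_true, h_true = np.array([0.21, 0.55, 0.83]), np.array([2.0, 1.0, 3.0])
    m = np.arange(6)
    c = (h_true * np.exp(-2j * np.pi * np.outer(m, tau_true))).sum(axis=1)
    print(prony_spikes(c, 3))   # recovers tau_true and h_true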

Title:Comparison of methods for image analysis on cDNA microarray data
Author(s):Yang, Yee Hwa; Buckley, Michael; Dudoit, Sandrine; Speed, Terry; 
Date issued:Nov 2000
http://nma.berkeley.edu/ark:/28722/bk0000n3v9x (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n3w0g (PostScript)
Abstract:Microarrays are part of a new class of biotechnologies which allow the monitoring of expression levels for thousands of genes simultaneously. Image analysis is an important aspect of microarray experiments, one which can have a potentially large impact on subsequent analyses such as clustering or the identification of differentially expressed genes. This paper reviews a number of existing image analysis methods used on cDNA microarray data and compares their effects on the measured log ratios of fluorescence intensities. In particular, this study examines the statistical properties of different segmentation and background adjustment methods. The different image analysis methods are applied to microarray data from a study of lipid metabolism in mice. We show that in some cases background adjustment can substantially reduce the precision (that is, increase the variability) of low-intensity spot values. In contrast, the choice of segmentation procedure has a smaller impact. In addition, this paper proposes new addressing, segmentation and background correction methods for extracting information from microarray images. The segmentation component uses a seeded region growing algorithm which makes provision for spots of different shapes and sizes. The background estimation approach uses an image analysis technique known as morphological opening. All these methods are implemented in a software package named Spot.
Keyword note:Yang__Yee_Hwa Buckley__Michael Dudoit__Sandrine Speed__Terry_P
Report ID:584
Relevance:100
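
The background estimation idea is easy to picture with scipy.ndimage (a tooling assumption; the report's own implementation is the Spot package): a grey-scale morphological opening whose structuring element is larger than any spot erases the spots while following the slowly varying background.

    import numpy as np
    from scipy.ndimage import grey_opening

    # Synthetic stand-in for a scanned array image: a slowly varying
    # background plus a few bright 7x7 "spots".
    rng = np.random.default_rng(0)
    yy, xx = np.mgrid[0:128, 0:128]
    image = 50 + 0.2 * xx + 0.1 * yy + rng.poisson(5, (128, 128)).astype(float)
    for cy, cx in [(32, 32), (64, 96), (96, 48)]:
        image[cy - 3:cy + 4, cx - 3:cx + 4] += 200.0

    # Opening with a 15x15 window (larger than any spot) as the background.
    bg_estimate = grey_opening(image, size=(15, 15))
    corrected = image - bg_estimate
    print(float(corrected.max()), float(bg_estimate.mean()))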

Title:Applications of the continuous-time ballot theorem to Brownian motion and related processes
Author(s):Schweinsberg, Jason; 
Date issued:Nov 2000
Date modified:revised January 2001
http://nma.berkeley.edu/ark:/28722/bk0000n275x (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n276g (PostScript)
Abstract:Motivated by questions related to a fragmentation process which has been studied by Aldous, Pitman, and Bertoin, we use the continuous-time ballot theorem to establish some results regarding the lengths of the excursions of Brownian motion and related processes. We show that the distribution of the lengths of the excursions below the maximum for Brownian motion conditioned to first hit $\lambda > 0$ at time $t$ is not affected by conditioning the Brownian motion to stay below a line segment from $(0,c)$ to $(t, \lambda)$. We extend a result of Bertoin by showing that the length of the first excursion below the maximum for a negative Brownian excursion plus drift is a size-biased pick from all of the excursion lengths, and we describe the law of a negative Brownian excursion plus drift after this first excursion. We then use the same methods to prove similar results for the excursions of more general Markov processes.
Keyword note:Schweinsberg__Jason
Report ID:583
Relevance:100

Title:The swine flu vaccine and Guillain-Barré syndrome: a case study in relative risk and specific causation
Author(s):Freedman, David; Stark, Philip; 
Date issued:Nov 2000
Keyword note:Freedman__David Stark__Philip_B
Report ID:582
Relevance:100

Title:Infinitely divisible laws associated with hyperbolic functions
Author(s):Pitman, Jim; Yor, Marc; 
Date issued:Oct 2000
http://nma.berkeley.edu/ark:/28722/bk0000n3v45 (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n3v5q (PostScript)
Abstract:The infinitely divisible distributions of positive random variables $C_t$, $S_t$ and $T_t$ with Laplace transforms in $x$ given by $(1/\cosh\sqrt{2x})^t$, $(\sqrt{2x}/\sinh\sqrt{2x})^t$, and $(\tanh\sqrt{2x}/\sqrt{2x})^t$ respectively, are characterized for various $t > 0$ in a number of different ways: by simple relations between their moments and cumulants, by corresponding relations between the distributions and their Lévy measures, by recursions for their Mellin transforms, and by differential equations satisfied by their Laplace transforms. Some of these results are interpreted probabilistically via known appearances of these distributions for $t = 1$ or $2$ in the description of the laws of various functionals of Brownian motion and Bessel processes, such as the heights and lengths of excursions of a one-dimensional Brownian motion. The distributions of $C_1$ and $S_2$ are also known to appear in the Mellin representations of two important functions in analytic number theory, the Riemann zeta function and the Dirichlet L-function associated with the quadratic character modulo 4. Related families of infinitely divisible laws, including the gamma, logistic and generalized hyperbolic secant distributions, are derived from $S_t$ and $C_t$ by operations such as Brownian subordination, exponential tilting, and weak limits, and characterized in various ways.
Keyword note:Pitman__Jim Yor__Marc
Report ID:581
Relevance:100
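
For a concrete handle on the $t = 1$ case of $C_t$: the exit time of a standard one-dimensional Brownian motion from $(-1, 1)$ has Laplace transform $1/\cosh\sqrt{2x}$, which a short Monte Carlo check can confirm (the step size and path count below are illustrative choices, and the Euler scheme slightly overestimates the exit time):

    import numpy as np

    rng = np.random.default_rng(1)
    dt, n_paths = 1e-3, 20000
    sqdt = np.sqrt(dt)
    b = np.zeros(n_paths)          # Brownian paths, advanced until they exit
    T = np.zeros(n_paths)          # recorded exit times from (-1, 1)
    alive = np.ones(n_paths, dtype=bool)
    t = 0.0
    while alive.any():
        t += dt
        b[alive] += sqdt * rng.standard_normal(alive.sum())
        exited = alive & (np.abs(b) >= 1.0)
        T[exited] = t
        alive &= ~exited
    for x in (0.5, 1.0, 2.0):
        # Empirical E[exp(-x T)] against the closed form 1/cosh(sqrt(2x)).
        print(x, np.exp(-x * T).mean().round(4),
              (1 / np.cosh(np.sqrt(2 * x))).round(4))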

Title:Local field U-statistics
Author(s):Evans, Steven N.; 
Date issued:Sep 2000
http://nma.berkeley.edu/ark:/28722/bk0000n3v1h (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n3v22 (PostScript)
Abstract:Using the classical theory of symmetric functions, a general distributional limit theorem is established for $U$-statistics constructed from a sequence of independent, identically distributed random variables taking values in a local field with zero characteristic.
Keyword note:Evans__Steven_N
Report ID:580
Relevance:100

Title:Some Infinity Theory for Predictor Ensembles
Author(s):Breiman, Leo; 
Date issued:Aug 2000
http://nma.berkeley.edu/ark:/28722/bk0000n3t8v (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n3t9d (PostScript)
Abstract:To dispel some of the mystery about what makes tree ensembles work, they are looked at in distribution space, i.e. the limiting case of "infinite" sample size. It is shown that the simplest kinds of trees are complete in D-dimensional space if the number of terminal nodes T is greater than D. For such trees we show that the AdaBoost minimization algorithm gives an ensemble converging to the Bayes risk. Random forests grown using i.i.d. random vectors in the tree construction are shown to be equivalent to a kernel acting on the true margin. The form of this kernel is derived for purely random tree growing and its properties are explored. The notions of correlation and strength for random forests are reflected in the symmetry and skewness of the kernel.
Keyword note:Breiman__Leo
Report ID:579
Relevance:100
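
The paper's AdaBoost result is about distribution space; for orientation, a standard finite-sample discrete AdaBoost with stump base learners looks as follows (a generic sketch, not the report's construction; the data and round count are illustrative):

    import numpy as np

    def stump_fit(X, y, w):
        # Exhaustive search for the best weighted threshold stump.
        best = (np.inf, 0, 0.0, 1)
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] <= thr, 1, -1)
                    err = w[pred != y].sum()
                    if err < best[0]:
                        best = (err, j, thr, sign)
        return best

    def adaboost(X, y, rounds=50):
        # Discrete AdaBoost on labels y in {-1, +1}.
        w = np.full(len(y), 1.0 / len(y))
        ensemble = []
        for _ in range(rounds):
            err, j, thr, sign = stump_fit(X, y, w)
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
            pred = sign * np.where(X[:, j] <= thr, 1, -1)
            w *= np.exp(-alpha * y * pred)   # up-weight misclassified points
            w /= w.sum()
            ensemble.append((alpha, j, thr, sign))
        return ensemble

    def predict(ensemble, X):
        return np.sign(sum(a * s * np.where(X[:, j] <= t, 1, -1)
                           for a, j, t, s in ensemble))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # a simple diagonal boundary
    model = adaboost(X, y)
    print((predict(model, X) == y).mean())       # training accuracy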

Title:Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments
Author(s):Dudoit, Sandrine; Yang, Yee Hwa; Callow, Matthew J.; Speed, Terence P.; 
Date issued:Aug 2000
http://nma.berkeley.edu/ark:/28722/bk0000n3t56 (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n3t6r (PostScript)
Abstract:Microarrays are part of a new class of biotechnologies which allow the monitoring of expression levels for thousands of genes simultaneously. This paper describes statistical methods for the identification of differentially expressed genes in replicated cDNA microarray experiments. Although it is not the main focus of the paper, we stress the importance of issues such as image processing and normalization. Image processing is required to extract measures of transcript abundance for each gene spotted on the array from the laser scan images. Normalization is needed to identify and remove systematic sources of variation, such as differing dye labeling efficiencies and scanning properties. There can be many systematic sources of variation and their effects can be large relative to the effects of interest. After a brief presentation of our image processing method, we describe a within-slide normalization approach which handles spatial and intensity dependent effects on the measured expression levels. Given suitably normalized data, our proposed method for the identification of single differentially expressed genes is to consider a univariate testing problem for each gene and then correct for multiple testing using adjusted p-values. No specific parametric form is assumed for the distribution of the expression levels and a permutation procedure is used to estimate the joint null distribution of the test statistics for each gene. Several data displays are suggested for the visual identification of genes with altered expression and of important features of these genes. The above methods are applied to microarray data from a study of gene expression in two mouse models with very low HDL cholesterol levels. The genes identified using data from replicated slides are compared to those obtained by applying recently published single-slide methods.
Keyword note:Dudoit__Sandrine Yang__Yee_Hwa Callow__Matthew_J Speed__Terry_P
Report ID:578
Relevance:100
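
The permutation machinery can be sketched compactly: per-gene Welch t-statistics, with adjusted p-values taken from the permutation distribution of the maximal statistic. The report uses a step-down Westfall-Young procedure; the single-step maxT version below is a simplification, and all sizes are illustrative.

    import numpy as np

    def two_sample_t(X, labels):
        # Welch t-statistic for each gene; X is genes x samples.
        a, b = X[:, labels == 0], X[:, labels == 1]
        va = a.var(axis=1, ddof=1) / a.shape[1]
        vb = b.var(axis=1, ddof=1) / b.shape[1]
        return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(va + vb)

    def maxT_adjusted_p(X, labels, n_perm=1000, seed=0):
        # Single-step maxT: compare each |t_g| with the permutation
        # distribution of max_g |t_g|, which respects the joint null.
        rng = np.random.default_rng(seed)
        t_obs = np.abs(two_sample_t(X, labels))
        max_null = np.array(
            [np.abs(two_sample_t(X, rng.permutation(labels))).max()
             for _ in range(n_perm)])
        return (1 + (max_null[None, :] >= t_obs[:, None]).sum(axis=1)) / (1 + n_perm)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 16))            # 500 genes, 8 vs 8 slides
    labels = np.array([0] * 8 + [1] * 8)
    X[:5, 8:] += 3.0                          # five truly changed genes
    print(maxT_adjusted_p(X, labels)[:10].round(3))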

Title:Linear functionals of eigenvalues of random matrices
Author(s):Diaconis, Persi; Evans, Steven N.; 
Date issued:Jun 2000
http://nma.berkeley.edu/ark:/28722/bk0000n3p9b (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n3q0w (PostScript)
Abstract:Let $M_n$ be a random $n \times n$ unitary matrix with distribution given by Haar measure on the unitary group. Using explicit moment calculations, a general criterion is given for linear combinations of traces of powers of $M_n$ to converge to a Gaussian limit as $n \rightarrow \infty$. By Fourier analysis, this result leads to central limit theorems for the measure on the circle that places a unit mass at each of the eigenvalues of $M_n$. For example, the integral of this measure against a function with suitably decaying Fourier coefficients converges to a Gaussian limit without any normalisation. Known central limit theorems for the number of eigenvalues in a circular arc and the logarithm of the characteristic polynomial of $M_n$ are also derived from the criterion.
Keyword note:Diaconis__Persi Evans__Steven_N
Report ID:577
Relevance:100
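
A moment identity for Haar unitaries of the kind used here, $E|\mathrm{Tr}(M_n^j)|^2 = \min(j, n)$ (Diaconis and Shahshahani), invites a quick numerical check. The sketch below samples Haar unitaries by QR of a complex Ginibre matrix with the standard phase correction (sample sizes are illustrative):

    import numpy as np

    def haar_unitary(n, rng):
        # QR of a complex Ginibre matrix; rescaling Q by the phases of
        # diag(R) makes the factorization unique and Q Haar-distributed.
        z = (rng.standard_normal((n, n))
             + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
        q, r = np.linalg.qr(z)
        d = np.diag(r)
        return q * (d / np.abs(d))

    rng = np.random.default_rng(0)
    n, reps = 50, 2000
    for j in (1, 2, 3):
        tr = np.array([np.trace(np.linalg.matrix_power(haar_unitary(n, rng), j))
                       for _ in range(reps)])
        # Empirical E|Tr(M^j)|^2 should be close to j for j << n, and
        # Tr(M^j) is approximately complex Gaussian with no normalisation.
        print(j, float((np.abs(tr) ** 2).mean()))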

Title:Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data
Author(s):Dudoit, Sandrine; Fridlyand, Jane; Speed, Terence P.; 
Date issued:Jun 2000
http://nma.berkeley.edu/ark:/28722/bk0000n3n7q (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n3n88 (PostScript)
Abstract:A reliable and precise classification of tumors is essential for successful treatment of cancer. cDNA microarrays and high-density oligonucleotide chips are novel biotechnologies which are being used increasingly in cancer research. By allowing the monitoring of expression levels for thousands of genes simultaneously, such techniques may lead to a more complete understanding of the molecular variations among tumors and hence to a finer and more informative classification. The ability to successfully distinguish between tumor classes (already known or yet to be discovered) using gene expression data is an important aspect of this novel approach to cancer classification. In this paper, we compare the performance of different discrimination methods for the classification of tumors based on gene expression data. These methods include: nearest neighbor classifiers, linear discriminant analysis, and classification trees. In our comparison, we also consider recent machine learning approaches such as bagging and boosting. We investigate the use of prediction votes to assess the confidence of each prediction. The methods are applied to datasets from three recently published cancer gene expression studies.
Keyword note:Dudoit__Sandrine Fridlyand__Jane Speed__Terry_P
Report ID:576
Relevance:100
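
A comparison in this spirit is straightforward to reproduce with scikit-learn (a tooling assumption; the synthetic data below merely stands in for the gene expression matrices analyzed in the paper):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in: 200 "tumor samples" x 50 "genes", three classes.
    X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                               n_classes=3, random_state=0)
    classifiers = {
        "kNN": KNeighborsClassifier(n_neighbors=5),
        "LDA": LinearDiscriminantAnalysis(),
        "tree": DecisionTreeClassifier(random_state=0),
        "bagged trees": BaggingClassifier(DecisionTreeClassifier(),
                                          random_state=0),
        "boosted stumps": AdaBoostClassifier(random_state=0),
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
        print(f"{name:15s} {scores.mean():.3f}")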

Title:Generalization bounds for incremental search classification
Author(s):Gat, Yoram; 
Date issued:Mar 2000
http://nma.berkeley.edu/ark:/28722/bk0000n2h8q (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n2h98 (PostScript)
Abstract:This paper presents generalization bounds for a certain class of classification algorithms. The bounds presented take advantage of the local nature of the search that these algorithms use in order to obtain bounds that are better than those that can be obtained using VC type bounds. The results are applied to well-known classification algorithms such as classification trees and the perceptron.
Keyword note:Gat__Yoram
Report ID:575
Relevance:100

Title:Immanants and finite point processes
Author(s):Diaconis, Persi; Evans, Steven N.; 
Date issued:Mar 2000
http://nma.berkeley.edu/ark:/28722/bk0000n2906 (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n291r (PostScript)
Abstract:Given a Hermitian, non-negative definite kernel $K$ and a character $\chi$ of the symmetric group on $n$ letters, define the corresponding immanant function $K^\chi[x_1, \ldots, x_n] := \sum_{\sigma} \chi(\sigma) \prod_{i=1}^{n} K(x_i, x_{\sigma(i)})$, where the sum ranges over all permutations $\sigma$ of $\{1, \ldots, n\}$.
Keyword note:Diaconis__Persi Evans__Steven_N
Report ID:574
Relevance:100
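
For small $n$ the immanant can be computed by direct summation over permutations; in the sketch below K[i, j] stands for the kernel values $K(x_i, x_j)$, and the sign and trivial characters recover the determinant and permanent as the two extreme cases.

    import numpy as np
    from itertools import permutations

    def immanant(K, chi):
        # K^chi = sum over sigma of chi(sigma) * prod_i K[i, sigma(i)].
        n = K.shape[0]
        return sum(chi(sigma) * np.prod([K[i, sigma[i]] for i in range(n)])
                   for sigma in permutations(range(n)))

    def sign(sigma):
        # Parity of a permutation via its inversion count.
        inv = sum(sigma[i] > sigma[j]
                  for i in range(len(sigma)) for j in range(i + 1, len(sigma)))
        return (-1) ** inv

    K = np.array([[2.0, 1.0], [1.0, 2.0]])
    print(immanant(K, sign), np.linalg.det(K))   # determinant: 3.0 and 3.0
    print(immanant(K, lambda s: 1.0))            # permanent: 5.0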

Title:Salt and Blood Pressure: Conventional Wisdom Reconsidered
Author(s):Freedman, D. A.; Petitti, D. B.; 
Date issued:Apr 2000
http://nma.berkeley.edu/ark:/28722/bk0000n3w2k (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n3w34 (PostScript)
Abstract:The "salt hypothesis" is that higher levels of salt in the diet lead to higher levels of blood pressure, with attendant risk of cardiovascular disease. Intersalt was designed to test the hypothesis, with a cross-sectional study of salt levels and blood pressures in 52 populations. The study is often cited to support the salt hypothesis, but the data are somewhat contradictory. Thus, four of the populations (Kenya, Papua, and two Indian tribes in Brazil) have very low levels of salt and blood pressure. Across the other 48 populations, however, blood pressures go down as salt levels go up-- contradicting the salt hypothesis. Regressions of blood pressure on age indicate that for young people, blood pressure is inversely related to salt intake-- another paradox. This paper discusses the Intersalt data and study design, looking at some of the statistical issues and identifying respects in which the study failed to follow its own protocol. Also considered are human experiments bearing on the salt hypothesis. The effect of salt reduction is stronger for hypertensive subjects than normotensives. Even the effect of a large reduction in salt intake on blood pressure is modest, and publication bias is a concern. To determine the health effects of salt reduction, a long-term intervention study would be needed, with endpoints defined in terms of morbidity and mortality; dietary interventions seem more promising. Funding agencies and medical journals have taken a stronger position favoring the salt hypothesis than is warranted by the evidence, raising questions about the interaction between the policy process and science.
Keyword note:Freedman__David Petitti__D_B
Report ID:573
Relevance:100

Title:An $O(n^2)$ bound for the relaxation time of a Markov chain on cladograms
Author(s):Schweinsberg, Jason; 
Date issued:Mar 2000
Date modified:revised June 2001
Abstract:A cladogram is an unrooted tree with labeled leaves and unlabeled internal branchpoints of degree $3$. Aldous has studied a Markov chain on the set of $n$-leaf cladograms in which each transition consists of removing a random leaf and its incident edge from the tree and then reattaching the leaf to a random edge of the remaining tree. Using coupling methods, Aldous has shown that the relaxation time (i.e. the inverse of the spectral gap) for this chain is $O(n^3)$. Here, we use a method based on distinguished paths to prove an $O(n^2)$ bound for the relaxation time, establishing a conjecture of Aldous.
Pub info:RSA Vol 20 (2002)
Keyword note:Schweinsberg__Jason
Report ID:572
Relevance:100
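
The transition rule of Aldous's chain is concrete enough to sketch directly. Below, a cladogram is stored as an adjacency map; one move detaches a random leaf, splices out the internal node left with degree 2, and reattaches the leaf to a uniformly chosen edge of the remaining tree (the data representation is an illustrative choice).

    import random

    def aldous_move(adj, leaves):
        # One step of the chain on n-leaf cladograms. adj: node -> set of
        # neighbours; leaves keep their labels, internal nodes have degree 3.
        leaf = random.choice(leaves)
        v = next(iter(adj[leaf]))                  # internal node next to leaf
        a, b = (u for u in adj[v] if u != leaf)    # v's two other neighbours
        # Detach the leaf; v now has degree 2, so splice it out as well.
        adj[leaf].clear()
        adj[a].discard(v); adj[b].discard(v)
        adj[a].add(b); adj[b].add(a)
        adj[v].clear()
        # Choose a uniform edge of the remaining tree ...
        edges = [(x, y) for x in adj for y in adj[x] if x < y]
        x, y = random.choice(edges)
        # ... subdivide it with v, and hang the leaf back on v.
        adj[x].discard(y); adj[y].discard(x)
        adj[v].update((x, y, leaf))
        adj[x].add(v); adj[y].add(v); adj[leaf].add(v)

    # A 4-leaf cladogram: leaves 0-3, internal nodes 4 and 5.
    adj = {0: {4}, 1: {4}, 2: {5}, 3: {5}, 4: {0, 1, 5}, 5: {2, 3, 4}}
    for _ in range(10):
        aldous_move(adj, leaves=[0, 1, 2, 3])
    print(adj)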

Title:Coalescents with simultaneous multiple collisions
Author(s):Schweinsberg, Jason; 
Date issued:Jan 2000
Abstract:We study a family of coalescent processes that undergo "simultaneous multiple collisions," meaning that many clusters of particles can merge into a single cluster at one time, and many such mergers can occur simultaneously. This family of processes, which we obtain from simple assumptions about the rates of different types of mergers, essentially coincides with a family of processes that Möhle and Sagitov obtain as a limit of scaled ancestral processes in a population model with exchangeable family sizes. We characterize the possible merger rates in terms of a single measure, show how these coalescents can be constructed from a Poisson process, and discuss some basic properties of these processes. This work generalizes some work of Pitman, who provides similar analysis for a family of coalescent processes in which many clusters can coalesce into a single cluster, but almost surely no two such mergers occur simultaneously.
Pub info:EJP Vol 5 (2000) Paper 12
Keyword note:Schweinsberg__Jason
Report ID:571
Relevance:100

Title:A probability model for census adjustment
Author(s):Freedman, D. A.; Stark, P. B.; Wachter, K. W.; 
Date issued:Mar 2000
http://nma.berkeley.edu/ark:/28722/bk0000n388k (PDF)
http://nma.berkeley.edu/ark:/28722/bk0000n3894 (PostScript)
Abstract:The census can be adjusted using capture-recapture techniques: capture in the census, recapture in a special Post Enumeration Survey (PES) done after the census. The population is estimated using the Dual System Estimator (DSE). Estimates are made separately for demographic groups called post strata; adjustment factors are then applied to these demographic groups within small geographic areas. We offer a probability model for this process, in which several sources of error can be distinguished. In this model, correlation bias arises from behavioral differences between persons counted in the census and persons missed by the census. The first group may on the whole be more likely to respond to the PES: if so, the DSE will be systematically too low, and that is an example of correlation bias. Correlation bias is distinguished from heterogeneity, which occurs if the census has a higher capture rate in some geographic areas than others. Finally, ratio estimator bias and variance are considered. The objective is to clarify the probabilistic foundations of the DSE, and the definitions of certain terms widely used in discussing that estimator.
Keyword note:Freedman__David Stark__Philip_B Wachter__Kenneth
Report ID:557
Relevance:100
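
At the heart of the model is the classical capture-recapture estimator: with N1 persons counted in the census, N2 in the PES, and M matched in both, independence and homogeneous capture rates give the estimate N1*N2/M. A minimal sketch with made-up post-stratum counts:

    def dual_system_estimate(census_count, pes_count, matched):
        # Lincoln-Petersen / Dual System Estimator: under independent
        # captures with homogeneous rates, N_hat = census * PES / matched.
        return census_count * pes_count / matched

    # Illustrative numbers, not from the paper: 900 counted in the census,
    # 950 in the PES, 855 matched -> N_hat = 1000, about 100 census misses.
    print(dual_system_estimate(900, 950, 855))

The correlation bias described in the abstract corresponds to M being larger than the independence assumption would predict (census-counted persons being more likely to respond to the PES), which drives the estimate down.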