**Statistics Technical Reports**

**Term(s):**2000**Results:**17

**Title:**Deconvolution of sparse positive spikes: is it ill-posed?**Author(s):**Li, Lei; Speed, Terry; **Date issued:**Oct 2000

http://nma.berkeley.edu/ark:/28722/bk0000n3w8w (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n3w9f (PostScript) **Abstract:**Deconvolution is usually regarded as one of the so-called ill-posed problems of applied mathematics if no constraints on the
unknowns can be assumed. In this paper, we discuss the idea of well-defined statistical models being a counterpart of the
notion of well-posedness. We show that constraints on the unknowns such as non-negativity and sparsity can help a great deal
to get over the inherent ill-posedness in deconvolution. This is illustrated by a parametric deconvolution method based on
the spike-convolution model. Not only does this knowledge, together with the choice of the measure of goodness of fit, help
people think about data (models), it also determines the way people compute with data (algorithms). This view is illustrated
by taking a fresh look at two familiar deconvolvers: the widely-used Jansson method, and another one which is to minimize
the Kullback-Leibler distance between observations and fitted values. In the latter case, we point out that a counterpart
of the EM algorithm exists for the problem of minimizing the Kullback-Leibler distance in the context of deconvolution. We
compare the performance of these deconvolvers using data simulated from a spike-convolution model and DNA sequencing data.**Keyword note:**Li__Lei Speed__Terry_P**Report ID:**586**Relevance:**100
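The EM counterpart for Kullback-Leibler deconvolution that the abstract mentions is, in its classical image-restoration form, the Richardson-Lucy multiplicative update. The sketch below is a generic illustration of that iteration on made-up spike-convolution data, not the authors' implementation or their sequencing data; the spike positions, heights, and point-spread function are all hypothetical.

```python
import numpy as np

def rl_deconvolve(observed, psf, n_iter=200):
    """Richardson-Lucy multiplicative update: the classical EM iteration
    for deconvolution under a Poisson/Kullback-Leibler criterion."""
    psf = psf / psf.sum()
    x = np.full_like(observed, observed.mean())       # flat, positive start
    for _ in range(n_iter):
        fitted = np.convolve(x, psf, mode="same")
        ratio = observed / np.maximum(fitted, 1e-12)
        x = x * np.convolve(ratio, psf[::-1], mode="same")
    return x                                          # stays non-negative

# Toy spike-convolution data (illustrative only)
truth = np.zeros(64)
truth[20], truth[40] = 5.0, 3.0                       # two sparse spikes
psf = np.exp(-0.5 * ((np.arange(9) - 4) / 1.5) ** 2)  # Gaussian spread
observed = np.convolve(truth, psf / psf.sum(), mode="same")
estimate = rl_deconvolve(observed, psf)
```

Because every update multiplies a non-negative vector by a non-negative ratio, the non-negativity constraint the abstract emphasizes is enforced automatically, and the iteration progressively concentrates mass back onto the spike locations.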

**Title:**Parametric deconvolution of positive spike trains**Author(s):**Li, Lei; Speed, Terry; **Date issued:**May 2000

http://nma.berkeley.edu/ark:/28722/bk0000n3w57 (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n3w6s (PostScript) **Abstract:**This paper describes a parametric deconvolution method (PDPS) appropriate for a particular class of signals which we call
spike-convolution models. These models arise when a sparse spike train---Dirac deltas according to our mathematical treatment---is
convolved with a fixed point-spread function, and additive noise or measurement error is superimposed. We view deconvolution
as an estimation problem, regarding the locations and heights of the underlying spikes, as well as the baseline and the measurement
error variance as unknown parameters. Our estimation scheme consists of two parts: model fitting and model selection. To
fit a spike-convolution model of a specific order, we estimate peak locations by trigonometric moments, and heights and the
baseline by least squares. The model selection procedure has two stages. Its first stage is so designed that we expect a model
of a somewhat larger order than the truth to be selected. In the second stage, the final model is obtained using backwards
deletion. This results in not only an estimate of the model order, but also an estimate of peak locations and heights with
much smaller bias and variation than that found in a direct trigonometric moment estimate. A more efficient maximum likelihood
estimate can be calculated from these estimates using a Gauss-Newton algorithm. We also present some relevant results concerning
the spectral structure of Toeplitz matrices which play a key role in the estimation. Finally, we illustrate the behavior of
these estimates using simulated and real DNA sequencing data.**Keyword note:**Li__Lei Speed__Terry_P**Report ID:**585**Relevance:**100
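The core of the fitting step, estimating peak locations from trigonometric moments via a Toeplitz system, can be sketched as a classical Prony-type recovery. This is a minimal noiseless illustration with made-up spike locations and heights; the paper's PDPS procedure adds baseline estimation, model selection, and Gauss-Newton refinement that are not shown here.

```python
import numpy as np

# Hypothetical spike train on [0, 1): locations and heights are made up
true_t = np.array([0.20, 0.55, 0.80])
true_h = np.array([3.0, 1.0, 2.0])
p = len(true_t)

# Trigonometric moments c_k = sum_j h_j exp(-2*pi*i*k*t_j), k = 0..2p-1
k = np.arange(2 * p)
c = (true_h * np.exp(-2j * np.pi * np.outer(k, true_t))).sum(axis=1)

# Annihilating filter: solve a p x p Toeplitz system for a_1, ..., a_p
A = np.array([[c[p + r - m] for m in range(1, p + 1)] for r in range(p)])
a = np.linalg.solve(A, -c[p:2 * p])

# Locations from the roots of z^p + a_1 z^{p-1} + ... + a_p
roots = np.roots(np.concatenate(([1.0], a)))
t_hat = np.sort(np.mod(-np.angle(roots) / (2 * np.pi), 1.0))

# Heights (and, in general, a baseline) by least squares on the moments
V = np.exp(-2j * np.pi * np.outer(k, t_hat))
h_hat = np.linalg.lstsq(V, c, rcond=None)[0].real
```

With exact moments the locations and heights are recovered essentially to machine precision; the interesting statistical questions arise when the moments are estimated from noisy data, which is where the paper's model-selection and maximum-likelihood machinery comes in.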

**Title:**Comparison of methods for image analysis on cDNA microarray data**Author(s):**Yang, Yee Hwa; Buckley, Michael; Dudoit, Sandrine; Speed, Terry; **Date issued:**Nov 2000

http://nma.berkeley.edu/ark:/28722/bk0000n3v9x (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n3w0g (PostScript) **Abstract:**Microarrays are part of a new class of biotechnologies which allow the monitoring of expression levels for thousands of genes
simultaneously. Image analysis is an important aspect of microarray experiments, one which can have a potentially large impact
on subsequent analyses such as clustering or the identification of differentially expressed genes. This paper reviews a number
of existing image analysis methods used on cDNA microarray data and compares their effects on the measured log ratios of fluorescence
intensities. In particular, this study examines the statistical properties of different segmentation and background adjustment
methods. The different image analysis methods are applied to microarray data from a study of lipid metabolism in mice. We
show that in some cases background adjustment can substantially reduce the precision (that is, increase the variability)
of low-intensity spot values. In contrast, the choice of segmentation procedure has a smaller impact. In addition, this
paper proposes new addressing, segmentation and background correction methods for extracting information from microarray images.
The segmentation component uses a seeded region growing algorithm which makes provision for spots of different shapes and
sizes. The background estimation approach uses an image analysis technique known as morphological opening. All these methods
are implemented in a software package named Spot.**Keyword note:**Yang__Yee_Hwa Buckley__Michael Dudoit__Sandrine Speed__Terry_P**Report ID:**584**Relevance:**100
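Background estimation by morphological opening, the technique this abstract names, can be illustrated with a small pure-numpy sketch. The erosion/dilation helpers and the toy image below are stand-ins for standard image-processing routines, not the Spot package's implementation.

```python
import numpy as np

def erode(img, k):
    # grey-scale erosion: sliding minimum over a k x k window
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = p[i:i + k, j:j + k].min()
    return out

def dilate(img, k):
    # grey-scale dilation: sliding maximum over a k x k window
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = p[i:i + k, j:j + k].max()
    return out

def opening(img, k):
    # morphological opening = erosion then dilation; with a structuring
    # element larger than a spot, the spot is flattened away and what
    # remains is a local background estimate
    return dilate(erode(img, k), k)

# Toy image: flat background of 10 with a 3x3 "spot" of intensity 100
img = np.full((21, 21), 10.0)
img[9:12, 9:12] = 100.0
background = opening(img, 7)                 # 7x7 element > 3x3 spot
spot_signal = (img - background)[9:12, 9:12]
```

Because the 7x7 structuring element cannot fit inside the 3x3 spot, the opening removes the spot entirely and returns the flat background, so subtracting it recovers the spot intensity.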

**Title:**Applications of the continuous-time ballot theorem to Brownian motion and related processes**Author(s):**Schweinsberg, Jason; **Date issued:**Nov 2000**Date modified:**revised January 2001

http://nma.berkeley.edu/ark:/28722/bk0000n275x (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n276g (PostScript) **Abstract:**Motivated by questions related to a fragmentation process which has been studied by Aldous, Pitman, and Bertoin, we use the
continuous-time ballot theorem to establish some results regarding the lengths of the excursions of Brownian motion and related
processes. We show that the distribution of the lengths of the excursions below the maximum for Brownian motion conditioned
to first hit $\lambda > 0$ at time $t$ is not affected by conditioning the Brownian motion to stay below a line segment from
$(0,c)$ to $(t, \lambda)$. We extend a result of Bertoin by showing that the length of the first excursion below the maximum
for a negative Brownian excursion plus drift is a size-biased pick from all of the excursion lengths, and we describe the
law of a negative Brownian excursion plus drift after this first excursion. We then use the same methods to prove similar
results for the excursions of more general Markov processes.**Keyword note:**Schweinsberg__Jason**Report ID:**583**Relevance:**100

**Title:**The swine flu vaccine and Guillain-Barre syndrome: a case study in relative risk and specific causation**Author(s):**Freedman, David; Stark, Philip; **Date issued:**November 2000**Keyword note:**Freedman__David Stark__Philip_B**Report ID:**582**Relevance:**100

**Title:**Infinitely divisible laws associated with hyperbolic functions**Author(s):**Pitman, Jim; Yor, Marc; **Date issued:**Oct 2000

http://nma.berkeley.edu/ark:/28722/bk0000n3v45 (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n3v5q (PostScript) **Abstract:**The infinitely divisible distributions of positive random variables C_t, S_t and T_t, with Laplace transforms in $x$ given by $(1/\cosh\sqrt{2x})^t$, $(\sqrt{2x}/\sinh\sqrt{2x})^t$, and $((\tanh\sqrt{2x})/\sqrt{2x})^t$ respectively, are
characterized for various $t > 0$ in a number of different ways: by simple relations between their moments and cumulants, by
corresponding relations between the distributions and their Lévy measures, by recursions for their Mellin transforms, and
by differential equations satisfied by their Laplace transforms. Some of these results are interpreted probabilistically via
known appearances of these distributions for t =1 or 2 in the description of the laws of various functionals of Brownian motion
and Bessel processes, such as the heights and lengths of excursions of a one-dimensional Brownian motion. The distributions
of C_1 and S_2 are also known to appear in the Mellin representations of two important functions in analytic number theory,
the Riemann zeta function and the Dirichlet L-function associated with the quadratic character modulo 4. Related families
of infinitely divisible laws, including the gamma, logistic and generalized hyperbolic secant distributions, are derived
from S_t and C_t by operations such as Brownian subordination, exponential tilting, and weak limits, and characterized in
various ways.**Keyword note:**Pitman__Jim Yor__Marc**Report ID:**581**Relevance:**100

**Title:**Local field U-statistics**Author(s):**Evans, Steven N.; **Date issued:**Sep 2000

http://nma.berkeley.edu/ark:/28722/bk0000n3v1h (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n3v22 (PostScript) **Abstract:**Using the classical theory of symmetric functions, a general distributional limit theorem is established for $U$--statistics
constructed from a sequence of independent, identically distributed random variables taking values in a local field with zero
characteristic.**Keyword note:**Evans__Steven_N**Report ID:**580**Relevance:**100

**Title:**Some Infinity Theory for Predictor Ensembles**Author(s):**Breiman, Leo; **Date issued:**Aug 2000

http://nma.berkeley.edu/ark:/28722/bk0000n3t8v (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n3t9d (PostScript) **Abstract:**To dispel some of the mystery about what makes tree ensembles work, they are looked at in distribution space, i.e. the limit
case of "infinite" sample size. It is shown that the simplest kind of trees are complete in D-dimensional space if the number
of terminal nodes T is greater than D. For such trees we show that the Adaboost minimization algorithm gives an ensemble converging
to the Bayes risk. Random forests, which are grown using i.i.d. random vectors in the tree construction, are shown to be equivalent
to a kernel acting on the true margin. The form of this kernel is derived for purely random tree growing and its properties
explored. The notions of correlation and strength for random forests are reflected in the symmetry and skewness of the kernel.**Keyword note:**Breiman__Leo**Report ID:**579**Relevance:**100
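The Adaboost minimization the abstract analyzes in distribution space can be illustrated in the finite-sample setting with decision stumps, the simplest trees. This is a generic textbook sketch on a made-up one-dimensional interval concept, not the paper's infinite-sample analysis; the data and round count are illustrative.

```python
import numpy as np

def fit_stump(x, y, w):
    # exhaustive search over thresholds and polarities for the stump
    # with smallest weighted 0-1 error; one threshold sits below min(x)
    # so that constant classifiers are also available
    ts = np.concatenate(([x.min() - 1.0], (x[:-1] + x[1:]) / 2.0))
    best = None
    for t in ts:
        for s in (1.0, -1.0):
            pred = s * np.where(x >= t, 1.0, -1.0)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

def adaboost(x, y, n_rounds):
    w = np.full(len(x), 1.0 / len(x))   # uniform initial weights
    F = np.zeros(len(x))                # additive score of the ensemble
    for _ in range(n_rounds):
        err, t, s = fit_stump(x, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = s * np.where(x >= t, 1.0, -1.0)
        F += alpha * pred
        w = w * np.exp(-alpha * y * pred)   # upweight mistakes
        w = w / w.sum()
    return np.where(F >= 0, 1.0, -1.0)

# Interval concept: positive in the middle, negative at the ends --
# no single stump can represent it, but a small weighted ensemble can
x = np.arange(10, dtype=float)
y = np.where((x >= 3) & (x <= 6), 1.0, -1.0)
pred = adaboost(x, y, n_rounds=3)
```

Three boosting rounds suffice here because a weighted vote of two threshold stumps plus a constant stump represents the interval exactly, a small finite echo of the completeness property the abstract establishes for trees with enough terminal nodes.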

**Title:**Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments**Author(s):**Dudoit, Sandrine; Yang, Yee Hwa; Callow, Matthew J.; Speed, Terence P.; **Date issued:**Aug 2000

http://nma.berkeley.edu/ark:/28722/bk0000n3t56 (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n3t6r (PostScript) **Abstract:**Microarrays are part of a new class of biotechnologies which allow the monitoring of expression levels for thousands of genes
simultaneously. This paper describes statistical methods for the identification of differentially expressed genes in replicated
cDNA microarray experiments. Although it is not the main focus of the paper, we stress the importance of issues such as image
processing and normalization. Image processing is required to extract measures of transcript abundance for each gene spotted
on the array from the laser scan images. Normalization is needed to identify and remove systematic sources of variation, such
as differing dye labeling efficiencies and scanning properties. There can be many systematic sources of variation and their
effects can be large relative to the effects of interest. After a brief presentation of our image processing method, we describe
a within-slide normalization approach which handles spatial and intensity dependent effects on the measured expression levels.
Given suitably normalized data, our proposed method for the identification of single differentially expressed genes is to
consider a univariate testing problem for each gene and then correct for multiple testing using adjusted p-values. No specific
parametric form is assumed for the distribution of the expression levels and a permutation procedure is used to estimate the
joint null distribution of the test statistics for each gene. Several data displays are suggested for the visual identification
of genes with altered expression and of important features of these genes. The above methods are applied to microarray data
from a study of gene expression in two mouse models with very low HDL cholesterol levels. The genes identified using data
from replicated slides are compared to those obtained by applying recently published single-slide methods.**Keyword note:**Dudoit__Sandrine Yang__Yee_Hwa Callow__Matthew_J Speed__Terry_P**Report ID:**578**Relevance:**100
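The pipeline of per-gene test statistics, a permutation estimate of the joint null distribution, and multiplicity-adjusted p-values can be sketched in a few lines. This is a generic maxT-style sketch on simulated data (gene count, sample sizes, and the planted effect are all made up), not the authors' exact procedure or statistics.

```python
import numpy as np

rng = np.random.default_rng(1)

def tstats(X, labels):
    # pooled-variance two-sample t-statistic for every gene (row) at once
    a, b = X[:, labels == 0], X[:, labels == 1]
    na, nb = a.shape[1], b.shape[1]
    sp2 = ((na - 1) * a.var(axis=1, ddof=1)
           + (nb - 1) * b.var(axis=1, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(sp2 * (1/na + 1/nb))

# Toy data: 50 genes x 12 arrays (6 per condition); gene 0 is shifted
genes, na, nb = 50, 6, 6
labels = np.array([0] * na + [1] * nb)
X = rng.standard_normal((genes, na + nb))
X[0, labels == 1] += 6.0                      # one truly changed gene

obs = np.abs(tstats(X, labels))
B = 500
exceed = np.zeros(genes)
for _ in range(B):
    perm = rng.permutation(labels)            # relabel the arrays
    m = np.abs(tstats(X, perm)).max()         # max |t| across genes
    exceed += (m >= obs)
adj_p = (exceed + 1) / (B + 1)                # maxT-style adjusted p-values
```

Using the permutation distribution of the maximum statistic controls the family-wise error rate without assuming a parametric form for the expression levels, which is the point the abstract makes about avoiding distributional assumptions.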

**Title:**Linear functionals of eigenvalues of random matrices**Author(s):**Diaconis, Persi; Evans, Steven N.; **Date issued:**Jun 2000

http://nma.berkeley.edu/ark:/28722/bk0000n3p9b (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n3q0w (PostScript) **Abstract:**Let $M_n$ be a random $n \times n$ unitary matrix with distribution given by Haar measure on the unitary group. Using explicit
moment calculations, a general criterion is given for linear combinations of traces of powers of $M_n$ to converge to a Gaussian
limit as $n \rightarrow \infty$. By Fourier analysis, this result leads to central limit theorems for the measure on the
circle that places a unit mass at each of the eigenvalues of $M_n$. For example, the integral of this measure against a function
with suitably decaying Fourier coefficients converges to a Gaussian limit without any normalisation. Known central limit
theorems for the number of eigenvalues in a circular arc and the logarithm of the characteristic polynomial of $M_n$ are also
derived from the criterion.**Keyword note:**Diaconis__Persi Evans__Steven_N**Report ID:**577**Relevance:**100
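The moment identities behind the criterion can be probed numerically. The sketch below samples Haar-distributed unitaries by the standard QR-with-phase-correction construction and checks the exact identity $E|\mathrm{Tr}(M_n)|^2 = 1$; the matrix size and replication count are arbitrary choices for illustration, not from the paper.

```python
import numpy as np

def haar_unitary(n, rng):
    # QR of a complex Ginibre matrix, with the phases of R's diagonal
    # absorbed into Q so that the resulting distribution is Haar
    z = (rng.standard_normal((n, n))
         + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)
    q, r = np.linalg.qr(z)
    d = np.diagonal(r)
    return q * (d / np.abs(d))        # multiply column j by phase of r_jj

rng = np.random.default_rng(0)
n, reps = 10, 2000
traces = np.array([np.trace(haar_unitary(n, rng)) for _ in range(reps)])
mean_sq = np.mean(np.abs(traces) ** 2)   # should be close to 1
```

Consistent with the central limit behaviour the abstract describes, the empirical distribution of Tr(M_n) looks standard complex Gaussian already for modest n, with no normalisation needed.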

**Title:**Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data**Author(s):**Dudoit, Sandrine; Fridlyand, Jane; Speed, Terence P.; **Date issued:**Jun 2000

http://nma.berkeley.edu/ark:/28722/bk0000n3n7q (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n3n88 (PostScript) **Abstract:**A reliable and precise classification of tumors is essential for successful treatment of cancer. cDNA microarrays and high-density
oligonucleotide chips are novel biotechnologies which are being used increasingly in cancer research. By allowing the monitoring
of expression levels for thousands of genes simultaneously, such techniques may lead to a more complete understanding of the
molecular variations among tumors and hence to a finer and more informative classification. The ability to successfully distinguish
between tumor classes (already known or yet to be discovered) using gene expression data is an important aspect of this novel
approach to cancer classification. In this paper, we compare the performance of different discrimination methods for the
classification of tumors based on gene expression data. These methods include: nearest neighbor classifiers, linear discriminant
analysis, and classification trees. In our comparison, we also consider recent machine learning approaches such as bagging
and boosting. We investigate the use of prediction votes to assess the confidence of each prediction. The methods are applied
to datasets from three recently published cancer gene expression studies.**Keyword note:**Dudoit__Sandrine Fridlyand__Jane Speed__Terry_P**Report ID:**576**Relevance:**100
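The idea of using prediction votes to assess confidence is easy to illustrate for the simplest of the compared methods, the nearest-neighbour classifier. This is a generic sketch on made-up two-dimensional points, not the paper's tumor data or its exact vote definition.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    # nearest-neighbour classification with a "prediction vote": the
    # fraction of the k neighbours backing the winning class serves as
    # a rough confidence measure for the prediction
    d = np.linalg.norm(train_X - query, axis=1)
    nearest = train_y[np.argsort(d)[:k]]
    classes, counts = np.unique(nearest, return_counts=True)
    winner = classes[np.argmax(counts)]
    return winner, counts.max() / k

train_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])
label, vote = knn_predict(train_X, train_y, np.array([0.2, 0.2]), k=3)
```

A unanimous vote (fraction 1.0) flags a confident prediction; split votes flag samples near class boundaries, which is the diagnostic use the abstract has in mind.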

**Title:**Generalization bounds for incremental search classification**Author(s):**Gat, Yoram; **Date issued:**Mar 2000

http://nma.berkeley.edu/ark:/28722/bk0000n2h8q (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n2h98 (PostScript) **Abstract:**This paper presents generalization bounds for a certain class of classification algorithms. The bounds presented
take advantage of the local nature of the search that these algorithms use in order to obtain bounds that are better than
those that can be obtained using VC type bounds. The results are applied to well-known classification algorithms such as classification
trees and the perceptron.**Keyword note:**Gat__Yoram**Report ID:**575**Relevance:**100

**Title:**Immanants and finite point processes**Author(s):**Diaconis, Persi; Evans, Steven N.; **Date issued:**Mar 2000

http://nma.berkeley.edu/ark:/28722/bk0000n2906 (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n291r (PostScript) **Abstract:**Given a Hermitian, non-negative definite kernel $K$ and a character $\chi$ of the symmetric group on $n$ letters, define the
corresponding immanant function $K^\chi[x_1, \ldots, x_n]$.**Keyword note:**Diaconis__Persi Evans__Steven_N**Report ID:**574**Relevance:**100

**Title:**Salt and Blood Pressure: Conventional Wisdom Reconsidered**Author(s):**Freedman, D. A.; Petitti, D. B.; **Date issued:**Apr 2000

http://nma.berkeley.edu/ark:/28722/bk0000n3w2k (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n3w34 (PostScript) **Abstract:**The "salt hypothesis" is that higher levels of salt in the diet lead to higher levels of blood pressure, with attendant risk
of cardiovascular disease. Intersalt was designed to test the hypothesis, with a cross-sectional study of salt levels and
blood pressures in 52 populations. The study is often cited to support the salt hypothesis, but the data are somewhat contradictory.
Thus, four of the populations (Kenya, Papua, and two Indian tribes in Brazil) have very low levels of salt and blood pressure.
Across the other 48 populations, however, blood pressures go down as salt levels go up-- contradicting the salt hypothesis.
Regressions of blood pressure on age indicate that for young people, blood pressure is inversely related to salt intake--
another paradox. This paper discusses the Intersalt data and study design, looking at some of the statistical issues and
identifying respects in which the study failed to follow its own protocol. Also considered are human experiments bearing
on the salt hypothesis. The effect of salt reduction is stronger for hypertensive subjects than normotensives. Even the
effect of a large reduction in salt intake on blood pressure is modest, and publication bias is a concern. To determine
the health effects of salt reduction, a long-term intervention study would be needed, with endpoints defined in terms of morbidity
and mortality; dietary interventions seem more promising. Funding agencies and medical journals have taken a stronger position
favoring the salt hypothesis than is warranted by the evidence, raising questions about the interaction between the policy
process and science.**Keyword note:**Freedman__David Petitti__D_B**Report ID:**573**Relevance:**100

**Title:**An $O(n^2)$ bound for the relaxation time of a Markov chain on cladograms.**Author(s):**Schweinsberg, Jason; **Date issued:**Mar 2000**Date modified:**revised June 2001**Abstract:**A cladogram is an unrooted tree with labeled leaves and unlabeled internal branchpoints of degree $3$. Aldous has studied
a Markov chain on the set of $n$-leaf cladograms in which each transition consists of removing a random leaf and its incident
edge from the tree and then reattaching the leaf to a random edge of the remaining tree. Using coupling methods, Aldous has
shown that the relaxation time (i.e. the inverse of the spectral gap) for this chain is $O(n^3)$. Here, we use a method based
on distinguished paths to prove an $O(n^2)$ bound for the relaxation time, establishing a conjecture of Aldous.**Pub info:**RSA Vol 20 (2002)**Keyword note:**Schweinsberg__Jason**Report ID:**572**Relevance:**100

**Title:**Coalescents with simultaneous multiple collisions**Author(s):**Schweinsberg, Jason; **Date issued:**Jan 2000**Abstract:**We study a family of coalescent processes that undergo "simultaneous multiple collisions," meaning that many clusters of
particles can merge into a single cluster at one time, and many such mergers can occur simultaneously. This family of processes,
which we obtain from simple assumptions about the rates of different types of mergers, essentially coincides with a family
of processes that Möhle and Sagitov obtain as a limit of scaled ancestral processes in a population model with exchangeable
family sizes. We characterize the possible merger rates in terms of a single measure, show how these coalescents can be constructed
from a Poisson process, and discuss some basic properties of these processes. This work generalizes some work of Pitman,
who provides similar analysis for a family of coalescent processes in which many clusters can coalesce into a single cluster,
but almost surely no two such mergers occur simultaneously.**Pub info:**EJP Vol 5 (2000) Paper 12**Keyword note:**Schweinsberg__Jason**Report ID:**571**Relevance:**100

**Title:**A probability model for census adjustment**Author(s):**Freedman, D. A.; Stark, P. B.; Wachter, K. W.; **Date issued:**Mar 2000

http://nma.berkeley.edu/ark:/28722/bk0000n388k (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n3894 (PostScript) **Abstract:**The census can be adjusted using capture-recapture techniques: capture in the census, recapture in a special Post Enumeration
Survey (PES) done after the census. The population is estimated using the Dual System Estimator (DSE). Estimates are made
separately for demographic groups called post strata; adjustment factors are then applied to these demographic groups within
small geographic areas. We offer a probability model for this process, in which several sources of error can be distinguished.
In this model, correlation bias arises from behavioral differences between persons counted in the census and persons missed
by the census. The first group may on the whole be more likely to respond to the PES: if so, the DSE will be systematically
too low, and that is an example of correlation bias. Correlation bias is distinguished from heterogeneity, which occurs if
the census has a higher capture rate in some geographic areas than others. Finally, ratio estimator bias and variance are
considered. The objective is to clarify the probabilistic foundations of the DSE, and the definitions of certain terms widely
used in discussing that estimator.**Keyword note:**Freedman__David Stark__Philip_B Wachter__Kenneth**Report ID:**557**Relevance:**100
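The Dual System Estimator itself is the standard capture-recapture formula, which can be stated in two lines. The counts below are purely illustrative, not from the report or any census.

```python
def dual_system_estimate(census_count, pes_count, matched):
    # capture-recapture: if capture in the census and capture in the
    # PES were independent, N_hat = n_census * n_pes / n_matched
    return census_count * pes_count / matched

# Illustrative numbers: 900 counted in the census, 800 in the PES,
# 720 matched in both lists
n_hat = dual_system_estimate(900, 800, 720)   # -> 1000.0

# Correlation bias, as described above: if people counted by the census
# are also more likely to respond to the PES, the match count is
# inflated relative to independence and the DSE is systematically low
```

The probability model in the paper makes this bias precise by distinguishing it from heterogeneity in capture rates across geographic areas.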