**Statistics Technical Reports**

**Term(s):** 2004 **Results:** 24

**Title:** Boosted Lasso and Reverse Boosting **Author(s):** Zhao, Peng; Yu, Bin **Date issued:** December 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2x7c (PDF) **Abstract:** This paper introduces the concept of a "backward" step, in contrast with forward-fashion algorithms like Boosting and Forward
Stagewise Fitting. Like classical elimination methods, this "backward" step works by shrinking the model complexity of an
ensemble learner. Through a step analysis, we show that this additional step is necessary for minimizing $L_1$ penalized loss
(Lasso loss). We also propose a BLasso algorithm as a combination of both backward and forward steps which is able to produce
the complete regularization path for Lasso problems. Moreover, BLasso can be generalized to solve problems with a general convex
loss and a general convex penalty. **Keyword note:** Zhao__Peng Yu__Bin **Report ID:** 678 **Relevance:** 100

**Title:** Estimating the Proportion of False Null Hypotheses among a Large Number of Independently Tested Hypotheses **Author(s):** Meinshausen, Nicolai; Rice, John **Date issued:** October 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2x58 (PDF) **Abstract:** We consider the problem of estimating the number of false null hypotheses among a very large number of independently tested
hypotheses, focusing on the situation in which the proportion of false null hypotheses is very small. We propose a family
of methods for establishing lower $100(1-\alpha)\%$ confidence bounds for this proportion, based on the empirical distribution
of the p-values of the tests. Methods in this family are then compared in terms of ability to consistently estimate the proportion
by letting $\alpha \rightarrow 0$ as the number of hypothesis tests increases and the proportion decreases. This work is
motivated by a signal detection problem occurring in astronomy. **Keyword note:** Meinshausen__Nicolai Rice__John_Andrew **Report ID:** 676 **Relevance:** 100
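The flavor of such bounds can be sketched as follows. Writing the p-value distribution as a mixture $F(t) = \pi F_1(t) + (1-\pi)t$, any $t$ gives $\pi \geq (F(t)-t)/(1-t)$, and a uniform confidence band around the empirical CDF converts this into a lower confidence bound. The sketch below uses the Dvoretzky-Kiefer-Wolfowitz inequality as the bounding device; it illustrates the general idea, not the paper's exact family of methods, and the synthetic data are invented:

```python
import numpy as np

def lower_bound_false_nulls(pvals, alpha=0.05):
    """Illustrative 100(1-alpha)% lower confidence bound on the proportion
    of false nulls, via a DKW band around the empirical p-value CDF."""
    p = np.sort(np.asarray(pvals))
    n = len(p)
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))  # DKW band half-width
    Fn = np.arange(1, n + 1) / n                    # empirical CDF at sorted p
    mask = p < 1.0
    # pi >= (F(t) - t) / (1 - t); plug in Fn - eps as a lower bound on F
    bounds = (Fn[mask] - p[mask] - eps) / (1.0 - p[mask])
    return max(0.0, float(bounds.max()))

rng = np.random.default_rng(0)
# synthetic mixture: 5% false nulls (p-values piled near 0), 95% uniform nulls
pvals = np.concatenate([rng.beta(0.05, 1.0, 500), rng.uniform(size=9500)])
lb = lower_bound_false_nulls(pvals)
print(lb)  # a positive lower bound on the true proportion of 0.05
```

Because the DKW band is uniform in $t$, taking the maximum over all observed p-values keeps the $100(1-\alpha)\%$ coverage guarantee.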

**Title:** Variational inference for Dirichlet process mixtures **Author(s):** Blei, David M.; Jordan, Michael I. **Date issued:** October 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2x35 (PDF) **Abstract:** Dirichlet process (DP) mixture models are the cornerstone of nonparametric Bayesian statistics, and the development of Markov
chain Monte Carlo (MCMC) sampling methods for DP mixtures has enabled their applications to a variety of practical data analysis
problems. However, MCMC sampling can be prohibitively slow, and it is important to explore alternatives. One class of alternatives
is provided by variational methods, a class of deterministic algorithms that convert inference problems into optimization
problems (Opper & Saad, 2001; Wainwright & Jordan, 2003). Thus far, variational methods have mainly been explored in the
parametric setting, in particular within the formalism of the exponential family (Attias, 2000; Ghahramani & Beal, 2001; Blei,
et al., 2003). In this paper, we present a variational inference algorithm for DP mixtures. We present experiments that
compare the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and present an application to a large-scale
image analysis problem. **Keyword note:** Blei__David_M Jordan__Michael_I **Report ID:** 674 **Relevance:** 100

**Title:** A stochastic model of language evolution that incorporates homoplasy and borrowing **Author(s):** Warnow, Tandy; Evans, Steven N.; Ringe, Don; Nakhleh, Luay **Date issued:** September 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2w3n (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n2w46 (PostScript) **Abstract:** We propose a stochastic model of language evolution that permits both homoplasy and non-tree evolution (so as to reflect borrowing
between lineages). We discuss the issues involved in reconstructing evolutionary histories under the model, specifically
showing that the tree component of the model is identifiable, that efficient methods exist for reconstructing the tree, and
that full likelihoods can be calculated in linear time. We conclude with a discussion of data selection and analysis, and
compare our stochastic model for language evolution to existing models in molecular evolution. **Keyword note:** Warnow__Tandy Evans__Steven_N Ringe__Don Nakhleh__Luay **Report ID:** 673 **Relevance:** 100

**Title:** A comparison of phylogenetic reconstruction methods on an IE dataset **Author(s):** Nakhleh, Luay; Warnow, Tandy; Ringe, Don; Evans, Steven N. **Date issued:** September 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2w00 (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n2w1j (PostScript) **Abstract:** Researchers interested in the history of the Indo-European family of languages have used a variety of methods to estimate
the phylogeny of the family, and have obtained widely differing results. In this paper we explore the reconstructions of Indo-European
phylogeny obtained by using the major phylogeny estimation procedures on an existing database for 24 Indo-European languages
compiled by linguists Don Ringe and Ann Taylor. Our study finds that the different methods agree in part, but that there
are also several striking differences. We discuss the reasons for these differences, and make proposals with respect to phylogenetic
reconstruction in historical linguistics. **Keyword note:** Nakhleh__Luay Warnow__Tandy Ringe__Don Evans__Steven_N **Report ID:** 672 **Relevance:** 100

**Title:** Treewidth-based conditions for exactness of the Sherali-Adams and Lasserre relaxations **Author(s):** Wainwright, M. J.; Jordan, M. I. **Date issued:** September 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2w9z (PDF) **Abstract:** The Sherali-Adams (SA) and Lasserre (LA) approaches are "lift-and-project" methods that generate nested sequences of linear
and/or semidefinite relaxations of an arbitrary 0-1 polytope $P \subseteq [0,1]^n$. Although both procedures are known to
terminate with an exact description of $P$ after $n$ steps, there are various open questions associated with characterizing,
for particular problem classes, whether exactness is obtained at some step $s < n$. This paper provides sufficient conditions
for exactness of these relaxations based on the hypergraph-theoretic notion of treewidth. More specifically, we relate the
combinatorial structure of a given polynomial system to an underlying hypergraph. We prove that the complexity of assessing
the global validity of moment sequences, and hence the tightness of the SA and LA relaxations, is determined by the \emph{treewidth}
of this hypergraph. We provide some examples to illustrate this characterization. **Keyword note:** Wainwright__Martin Jordan__Michael_I **Report ID:** 671 **Relevance:** 100

**Title:** A New Look at Survey Propagation and its Generalizations **Author(s):** Maneva, E.; Mossel, E.; Wainwright, M. J. **Date issued:** September 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2v8w (PDF) **Abstract:** We study the survey propagation algorithm~\cite{MZ02, BMZ03, BMWZ03}, which is an iterative technique that appears to be very
effective in solving random $k$-SAT problems even with densities close to threshold. We first describe how any SAT formula
can be associated with a novel family of Markov random fields (MRFs), parameterized by a real number $\rho \in [0,1]$. We
then show that applying belief propagation---a well-known "message-passing" technique---to this family of MRFs recovers various
algorithms, ranging from pure survey propagation at one extreme ($\rho = 1$) to standard belief propagation on the uniform
distribution over SAT assignments at the other extreme ($\rho = 0$). Configurations in these MRFs have a natural interpretation
as generalized satisfiability assignments, on which a partial order can be defined. We isolate \emph{cores} as minimal elements
in this partial ordering, and prove that any core is a fixed point of survey propagation. We investigate the associated lattice
structure, and prove a weight-preserving identity that shows how any MRF with $\rho > 0$ can be viewed as a "smoothed" version
of the naive factor graph representation of the $k$-SAT problem ($\rho = 0$). Our experimental results suggest that random
formulas typically do not possess non-trivial cores. This result and additional experiments indicate that message-passing
on our family of MRFs is most effective for values of $\rho \neq 1$ (i.e., distinct from survey propagation). Finally, we
isolate properties of Gibbs sampling and message-passing algorithms that are typical for an ensemble of $k$-SAT problems. **Keyword note:** Maneva__E Mossel__Elchanan Wainwright__Martin **Report ID:** 669 **Relevance:** 100

**Title:** Unidentifiable divergence times in rates-across-sites models **Author(s):** Evans, Steven N.; Warnow, Tandy **Date issued:** August 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2v57 (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n2v6s (PostScript) **Abstract:** The rates-across-sites assumption in phylogenetic inference posits that the rate matrix governing the Markovian evolution
of a character on an edge of the putative phylogenetic tree is the product of a character-specific scale factor and a rate
matrix that is particular to that edge. Thus, evolution follows basically the same process for all characters, except that
it occurs faster for some characters than others. To allow estimation of tree topologies and edge lengths for such models,
it is commonly assumed that the scale factors are not arbitrary unknown constants, but rather unobserved, independent, identically
distributed draws from a member of some parametric family of distributions. A popular choice is the gamma family. We consider
an example of a clock-like tree with three taxa, one unknown edge length, and a parametric family of scale factor distributions
that contains the gamma family. This model has the property that, for a generic choice of unknown edge length and scale factor
distribution, there is another edge length and scale factor distribution which generates data with exactly the same distribution,
so that even with infinitely many data it will typically be impossible to make correct inferences about the unknown edge length. **Keyword note:** Evans__Steven_N Warnow__Tandy **Report ID:** 668 **Relevance:** 100

**Title:** A multivariate empirical Bayes statistic for replicated microarray time course data **Author(s):** Tai, Yu Chuan; Speed, Terence P. **Date issued:** August 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2v34 (PDF) **Abstract:** In this paper we derive a one-sample multivariate empirical Bayes statistic (the $MB$-statistic) to rank genes in the order
of differential expression from replicated microarray time course experiments. We do this by testing the null hypothesis
that the expectation of a $k$-vector of a gene's expression levels is a multiple of $1_k$, the vector of $k$ $1$s. The importance
of moderation in this context is explained. Together with the $MB$-statistic we have the one-sample $\widetilde{T}^2$ statistic,
a variant of the one-sample Hotelling $T^2$. Both the $MB$-statistic and $\widetilde{T}^2$ statistic can be used to rank genes
in the order of evidence of nonconstancy, incorporating any correlation structure among time point samples and the replication.
In a simulation study we show that the one-sample $MB$-statistic, $\widetilde{T}^2$ statistic, and moderated Hotelling $T^2$
statistic achieve the smallest number of false positives and false negatives, and all perform equally well. Several special
and limiting cases of the $MB$-statistic are derived, and two-sample versions are described. **Keyword note:** Tai__Yu_Chuan Speed__Terry_P **Report ID:** 667 **Relevance:** 100

**Title:** Using Random Forest to Learn Imbalanced Data **Author(s):** Chen, Chao; Liaw, Andy; Breiman, Leo **Date issued:** July 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2v11 (PDF) **Abstract:** In this paper we propose two ways to deal with the imbalanced data classification problem using random forest. One is based
on cost-sensitive learning, and the other is based on a sampling technique. Performance metrics such as precision and recall,
false positive rate and false negative rate, $F$-measure and weighted accuracy are computed. Both methods are shown to improve
the prediction accuracy of the minority class, and have favorable performance compared to the existing algorithms. **Keyword note:** Chen__Chao Liaw__Andy Breiman__Leo **Report ID:** 666 **Relevance:** 100
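Both ideas can be sketched with scikit-learn; this is an assumption of the illustration, not the paper's implementation (in particular, the paper's sampling variant rebalances each bootstrap sample per tree, whereas the proxy below down-samples the training set once):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# synthetic two-class problem with a ~5% minority class
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# (1) cost-sensitive: penalize minority-class errors more via class weights
rf_weighted = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                     random_state=0).fit(Xtr, ytr)

# (2) sampling-based: down-sample the majority class to match the minority
rng = np.random.default_rng(0)
maj, mino = np.where(ytr == 0)[0], np.where(ytr == 1)[0]
keep = np.concatenate([rng.choice(maj, size=len(mino), replace=False), mino])
rf_sampled = RandomForestClassifier(n_estimators=200,
                                    random_state=0).fit(Xtr[keep], ytr[keep])

# minority-class recall is what both methods aim to improve
r_weighted = recall_score(yte, rf_weighted.predict(Xte))
r_sampled = recall_score(yte, rf_sampled.predict(Xte))
print(r_weighted, r_sampled)
```

A per-bootstrap balancing closer to the paper's sampling method is available in the `imbalanced-learn` package's `BalancedRandomForestClassifier`.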

**Title:** Estimating velocity fields on a freeway from low resolution video recordings **Author(s):** Cho, Young; Rice, John **Date issued:** July 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2t9x (PDF) **Abstract:** We present an algorithm to estimate velocity fields from low resolution video recordings. The algorithm does not attempt
to identify and track individual vehicles, nor does it attempt to estimate derivatives of the field of pixel intensities.
Rather, we compress a frame by obtaining an intensity profile in each lane along the direction of traffic flow. The speed
estimate is then computed by searching for a best matching profile in a frame at a later time. Because the algorithm does
not need high quality images, it is directly applicable to a compressed-format digital video stream, such as MPEG, from conventional
traffic video cameras. We illustrate the procedure using a 15-minute VHS recording to obtain speed estimates on a one-mile
stretch of highway I-80 in Berkeley, California. **Keyword note:** Cho__Young Rice__John_Andrew **Report ID:** 665 **Relevance:** 100
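The profile-matching step can be sketched as follows; the Gaussian "vehicle" bump, the 0.5 m-per-sample calibration, and the 0.5 s spacing between compared frames are all invented for illustration, and the paper's actual matching criterion and calibration may differ:

```python
import numpy as np

def estimate_shift(profile_t0, profile_t1, max_shift):
    """Displacement (in samples) that best matches the later lane-intensity
    profile to the earlier one, by minimum mean squared difference."""
    n = len(profile_t0)
    best_s, best_err = 0, np.inf
    for s in range(max_shift + 1):
        # hypothesis: traffic moved s samples downstream between frames
        err = np.mean((profile_t1[s:] - profile_t0[:n - s]) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s

# synthetic lane profiles: a single "vehicle" bump moving 7 samples downstream
x = np.arange(200)
p0 = np.exp(-0.5 * ((x - 60) / 4.0) ** 2)
p1 = np.exp(-0.5 * ((x - 67) / 4.0) ** 2)

shift = estimate_shift(p0, p1, max_shift=30)
meters_per_sample, dt = 0.5, 0.5   # assumed calibration and frame spacing
print(shift, shift * meters_per_sample / dt)  # 7 samples, 7.0 m/s
```

Because each frame is compressed to one intensity profile per lane, the search is one-dimensional and cheap, which is what makes the method viable on low-resolution streams.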

**Title:** Measuring Traffic **Author(s):** Bickel, Peter; Chen, Chao; Kwon, Jaimyoung; Rice, John; van Zwet, Erik; Varaiya, Pravin **Date issued:** June 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2t7t (PDF) **Abstract:** A traffic performance measurement system, PeMS, currently functions as a statewide repository for traffic data gathered by
thousands of automatic sensors. It has integrated data collection, processing, and communications infrastructure with data
storage and analytical tools. In this paper, we discuss statistical issues that have emerged as we attempt to process a data
stream of two GB per day of wildly varying quality. In particular, we focus on detecting sensor malfunction, imputation of
missing or bad data, estimation of velocity, and forecasting of travel times on freeway networks. **Keyword note:** Bickel__Peter_John Chen__Chao Kwon__Jaimyoung Rice__John_Andrew van_Zwet__Erik Varaiya__Pravin **Report ID:** 664 **Relevance:** 100

**Title:** Cloud Detection over Snow and Ice Using MISR Data **Author(s):** Shi, Tao; Yu, Bin; Clothiaux, Eugene E.; Braverman, Amy J. **Date issued:** June 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2t5q (PDF) **Abstract:** Clouds play a major role in Earth's climate and cloud detection is a crucial step in the processing of satellite observations
in support of radiation budget, numerical weather prediction and global climate model studies. To advance the observational
capabilities of detecting clouds and retrieving their cloud-top altitudes, NASA launched the Multi-angle Imaging SpectroRadiometer
(MISR) in 1999, which provides data in nine different views of the same scene using four spectral channels. Cloud detection
is particularly difficult in the snow- and ice-covered polar regions and availability of the novel MISR angle-dependent radiances
motivates the current study on cloud detection using statistical methods. Three schemes using MISR data for polar cloud detection
are investigated in this study. Using domain knowledge, three physical features are developed for detecting clouds in daylight
polar regions. The features measure the correlations between MISR angle-dependent radiances, the smoothness of the reflecting
surfaces, and the amount of forward scattering of radiances. The three features are the basis of the first scheme, called
Enhanced Linear Correlation Matching Classification (ELCMC). The ELCMC algorithm thresholds on three features and the thresholds
are either fixed or found through the EM algorithm based on a mixture of two 1-dim Gaussians. The ELCMC algorithm results
are subsequently used as training data in the development of two additional schemes, one Fisher's Quadratic Discriminant Analysis
(ELCMC-QDA) and the other a Gaussian kernel Support Vector Machine (ELCMC-SVM). For both QDA- and SVM-based experiments, two
types of inputs are tested, the set of three physical features and the red radiances of the nine MISR cameras. All three
schemes are applied to two polar regions where expert labels show that the MISR operational cloud detection algorithm does
not work well, with a 53% misclassification rate in one region and a 92% nonretrieval rate in the other region. The ELCMC
algorithm produces misclassification rates of 6.05% and 6.28% relative to expert labelled regions across the two polar scenes.
The misclassification rates are reduced to approximately 4% by ELCMC-QDA and ELCMC-SVM in one region and approximately 2%
in the other. Overall, all three schemes provided significantly more accurate results and greater spatial coverage than the
MISR operational stereo-based cloud detection algorithm. Compared with ELCMC-QDA, ELCMC-SVM is more robust against mislabels
in the ELCMC results and provides slightly better results, but it is computationally slower. **Keyword note:** Shi__Tao Yu__Bin Clothiaux__Eugene_E Braverman__Amy_J **Report ID:** 663 **Relevance:** 100
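The EM-based threshold selection mentioned in the abstract can be illustrated with a two-component one-dimensional Gaussian mixture; the synthetic "clear"/"cloudy" feature values and the use of scikit-learn's `GaussianMixture` are assumptions of this sketch, not the ELCMC implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# synthetic feature values: a "clear" mode and a "cloudy" mode
rng = np.random.default_rng(0)
feature = np.concatenate([rng.normal(0.2, 0.05, 2000),
                          rng.normal(0.7, 0.10, 1000)]).reshape(-1, 1)

# fit a mixture of two 1-dim Gaussians by EM
gm = GaussianMixture(n_components=2, random_state=0).fit(feature)

# threshold = point between the two means where the posteriors are equal
lo, hi = np.sort(gm.means_.ravel())
grid = np.linspace(lo, hi, 1001).reshape(-1, 1)
post = gm.predict_proba(grid)
threshold = float(grid[np.argmin(np.abs(post[:, 0] - post[:, 1]))][0])
print(round(threshold, 3))  # lies between the two fitted component means
```

Fitting the mixture per scene lets the threshold adapt to local conditions, which is the advantage over a fixed threshold noted in the abstract.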

**Title:** Balls-in-boxes duality for coalescing random walks and coalescing Brownian motion **Author(s):** Evans, Steven N.; Zhou, Xiaowen **Date issued:** June 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2t22 (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n2t3m (PostScript) **Abstract:** We present a duality between two systems of coalescing random walks and an analogous duality between two systems of coalescing
Brownian motions. Our results extend previous work in the literature and we apply them to the study of a system of coalescing
Brownian motions with Poisson immigration. **Keyword note:** Evans__Steven_N Zhou__Xiaowen **Report ID:** 662 **Relevance:** 100

**Title:** Sparse Gaussian process classification with multiple classes **Author(s):** Seeger, Matthias; Jordan, Michael I. **Date issued:** April 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2s4n (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n2s56 (PostScript) **Abstract:** Sparse approximations to Bayesian inference for nonparametric Gaussian Process models scale linearly in the number of training
points, allowing for the application of these powerful kernel-based models to large datasets. We show how to generalize the
binary classification informative vector machine (IVM) to multiple classes. In contrast to earlier efficient approaches to
kernel-based non-binary classification, our method is a principled approximation to Bayesian inference which yields valid
uncertainty estimates and allows for hyperparameter estimation via marginal likelihood maximization. While most earlier proposals
suggest fitting independent binary discriminants to heuristically chosen partitions of the data and combining these in a heuristic
manner, our method operates jointly on the data for all classes. Crucially, we still achieve a linear scaling in both the
number of classes and the number of training points. **Keyword note:** Seeger__Matthias Jordan__Michael_I **Report ID:** 661 **Relevance:** 100

**Title:** Inference of divergence times as a statistical inverse problem **Author(s):** Evans, Steven N.; Ringe, Don; Warnow, Tandy **Date issued:** April 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2s10 (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n2s2j (PostScript) **Abstract:** Dating of divergence times in historical linguistics is an instance of a statistical inverse problem, and many of the issues
that complicate the proper treatment of other inverse problems are also present there. **Keyword note:** Evans__Steven_N Ringe__Don Warnow__Tandy **Report ID:** 660 **Relevance:** 100

**Title:** Stochastic models of language evolution and an application to the Indo-European family of languages **Author(s):** Warnow, Tandy; Evans, Steven N.; Ringe, Don; Nakhleh, Luay **Date issued:** April 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2r8b (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n2r9w (PostScript) **Abstract:** We propose several models of how languages evolve, and discuss statistical estimation of evolution under these models. **Keyword note:** Warnow__Tandy Evans__Steven_N Ringe__Don Nakhleh__Luay **Report ID:** 659 **Relevance:** 100

**Title:** Decentralized Detection and Classification using Kernel Methods **Author(s):** Nguyen, XuanLong; Wainwright, Martin J.; Jordan, Michael I. **Date issued:** April 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2s79 (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n2s8v (PostScript) **Abstract:** We consider the problem of decentralized detection under constraints on the number of bits that can be transmitted by each
sensor. In contrast to most previous work, in which the joint distribution of sensor observations is assumed to be known,
we address the problem when only a set of empirical samples is available. We propose a novel algorithm using the framework
of empirical risk minimization and marginalized kernels, and analyze its computational and statistical properties both theoretically
and empirically. We provide an efficient implementation of the algorithm, and demonstrate its performance on both simulated
and real data sets. **Keyword note:** Nguyen__XuanLong Wainwright__Martin Jordan__Michael_I **Report ID:** 658 **Relevance:** 100

**Title:** A generalized model of mutation-selection balance with applications to aging **Author(s):** Steinsaltz, David; Evans, Steven N.; Wachter, Kenneth W. **Date issued:** March 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2r1g (PDF)

http://nma.berkeley.edu/ark:/28722/bk0000n2r21 (PostScript) **Abstract:** A probability model is presented for the dynamics of mutation-selection balance in a haploid infinite-population infinite-sites
setting sufficiently general to cover mutation-driven changes in full age-specific demographic schedules. The model accommodates
epistatic as well as additive selective costs. Closed form characterizations are obtained for solutions in finite time,
along with proofs of convergence to stationary distributions and a proof of the uniqueness of solutions in a restricted case.
Examples are given of applications to the biodemography of aging, including instabilities in current formulations of mutation
accumulation. **Keyword note:** Steinsaltz__David Evans__Steven_N Wachter__Kenneth **Report ID:** 657 **Relevance:** 100

**Title:** Supplement to "Consistent Independent Component Analysis and Prewhitening" **Author(s):** Chen, A.; Bickel, P. J. **Date issued:** March 2004

http://nma.berkeley.edu/ark:/28722/bk0000n2t0z (PDF) **Abstract:** In this paper we study the statistical properties of a characteristic-function based algorithm for independent component analysis
(ICA), which was proposed by Eriksson et al. (2003) and Chen & Bickel (2003) independently. First, statistical consistency
of this algorithm with prewhitening is analyzed, especially in the presence of heavy-tailed sources. Second, without prewhitening
this algorithm is shown to be robust against small additive noise. **Keyword note:** Chen__Aiyou Bickel__Peter_John **Report ID:** 656 **Relevance:** 100