**Statistics Technical Reports**

**Term(s):**2011**Results:**12

**Title:**Adjusting Treatment Effect Estimates by Post-Stratification in Randomized Experiments**Author(s):**Miratrix, Luke W.; Sekhon, Jasjeet S.; Yu, Bin; **Date issued:**November 2011

http://nma.berkeley.edu/ark:/28722/bk0010v4t5r (PDF) **Abstract:**Experimenters often use post-stratification to adjust estimates. Post-stratification is akin to blocking, except that the
number of treated units in each stratum is a random variable because stratification occurs after treatment assignment. We
analyze both post-stratification and blocking under the Neyman model and compare the efficiency of these designs. We derive
the variances for a post-stratified estimator and a simple difference-in-means estimator under different randomization schemes.
Post-stratification is nearly as efficient as blocking: the difference in their variances is on the order of $1/n^2$, provided the
treatment proportion is not too close to 0 or 1. Post-stratification is therefore a reasonable alternative to blocking when
the latter is not feasible. However, in finite samples, post-stratification can increase variance if the number of strata
is large and the strata are poorly chosen. To examine why the estimators' variances are different, we extend our results by
conditioning on the observed number of treated units in each stratum. Conditioning also provides more accurate variance estimates
because it takes into account how close (or far) a realized random sample is from a comparable blocked experiment. We then
show that the practical substance of our results remains under an infinite population sampling model. Finally, we provide an
analysis of an actual experiment to illustrate our analytical results.**Keyword note:**Miratrix__Luke Sekhon__Jasjeet_S Yu__Bin**Report ID:**809**Relevance:**100
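
The contrast between the two estimators named in this abstract can be sketched in a few lines. This is a minimal illustration, not code from the report: it assumes the usual form of the post-stratified estimator (within-stratum differences in means, weighted by stratum share $n_k/n$), and the function names and toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def difference_in_means(y, treated):
    """Simple difference in means: mean treated outcome minus mean control outcome."""
    t = [yi for yi, ti in zip(y, treated) if ti]
    c = [yi for yi, ti in zip(y, treated) if not ti]
    return mean(t) - mean(c)


def post_stratified_estimate(y, treated, strata):
    """Weighted average of within-stratum differences in means, with weights n_k / n."""
    groups = defaultdict(lambda: {"t": [], "c": []})
    for yi, ti, si in zip(y, treated, strata):
        groups[si]["t" if ti else "c"].append(yi)
    n = len(y)
    estimate = 0.0
    for label, g in groups.items():
        # The estimator is undefined if a stratum has no treated or no control units.
        if not g["t"] or not g["c"]:
            raise ValueError(f"stratum {label!r} lacks treated or control units")
        n_k = len(g["t"]) + len(g["c"])
        estimate += (n_k / n) * (mean(g["t"]) - mean(g["c"]))
    return estimate


# Toy data (made up): stratum B has higher outcomes and, by chance, more treated units.
y = [3, 1, 1, 12, 12, 10]
treated = [1, 0, 0, 1, 1, 0]
strata = ["A", "A", "A", "B", "B", "B"]

sdim = difference_in_means(y, treated)
ps = post_stratified_estimate(y, treated, strata)
print(sdim, ps)
```

In this toy sample the within-stratum effect is 2 in both strata, so the post-stratified estimate is 2, while the simple difference in means is 5 because the chance imbalance in treated counts is confounded with the stratum difference.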

**Title:**Killed Brownian motion with a prescribed lifetime distribution and models of default**Author(s):**Ettinger, Boris; Evans, Steven N.; Hening, Alexandru; **Date issued:**November 2011

http://nma.berkeley.edu/ark:/28722/bk0010j9z6r (PDF) **Abstract:**The inverse first passage time problem asks whether, for a Brownian motion $B$ and a nonnegative random variable $\zeta$,
there exists a time-varying barrier $b$ such that $\mathbb{P}\{B_s > b(s), \, 0 \le s \le t\} = \mathbb{P}\{\zeta > t\}$.
We study a "smoothed" version of this problem and ask whether there is a "barrier" $b$ such that $\mathbb{E}[\exp(-\lambda
\int_0^t \psi(B_s - b(s)) \, ds)] = \mathbb{P}\{\zeta > t\}$, where $\lambda$ is a killing rate parameter and $\psi: \mathbb{R}
\to [0,1]$ is a non-increasing function. We prove that if $\psi$ is suitably smooth, the function $t \mapsto \mathbb{P}\{\zeta
> t\}$ is twice continuously differentiable, and the condition $0 < -\frac{d \log \mathbb{P}\{\zeta > t\}}{dt} < \lambda$
holds for the hazard rate of $\zeta$, then there exists a unique continuously differentiable function $b$ solving the smoothed
problem. We show how this result leads to flexible models of default for which it is possible to compute expected values
of contingent claims.**Keyword note:**Ettinger__Boris Evans__Steven_N Hening__Alexandru**Report ID:**808**Relevance:**100

**Title:**Phylogenetic analyses of alignments with gaps**Author(s):**Evans, Steven N.; Warnow, Tandy; **Date issued:**October 2011

http://nma.berkeley.edu/ark:/28722/bk0010j9z8v (PDF) **Abstract:**Most statistical methods for phylogenetic estimation in use today treat a gap (generally representing an insertion or deletion,
i.e., indel) within the input sequence alignment as missing data. However, the statistical properties of this treatment of
indels have not been fully investigated. We prove that treating indels as missing data can be inconsistent for a general
(and rather simple) model of sequence evolution, even when given the true alignment. We also prove that the true tree can
be identified solely from the pattern of gaps in the true alignment (that is, character states can be ignored). Our results
show that the standard statistical techniques used to estimate phylogenies from sequence alignments may have unfavorable statistical
properties, even when the sequence alignment is accurate and the assumed substitution model matches the generation model.
Moreover, the pattern of gaps in an accurate alignment may give substantial information about the underlying phylogeny, over
and above what is present in the character states. These observations suggest that the recent focus on developing statistical
methods that treat indel events properly is an important direction for phylogeny estimation.**Keyword note:**Evans__Steven_N Warnow__Tandy**Report ID:**807**Relevance:**100

**Title:**Lipschitz minorants of Brownian Motion and Levy processes**Author(s):**Abramson, Joshua; Evans, Steven N.; **Date issued:**October 2011

http://nma.berkeley.edu/ark:/28722/bk0010k0002 (PDF) **Abstract:**For $\alpha > 0$, the $\alpha$-Lipschitz minorant of a function $f: \mathbb{R} \to \mathbb{R}$ is the greatest function $m
: \mathbb{R} \to \mathbb{R}$ such that $m \leq f$ and $|m(s)-m(t)| \le \alpha |s-t|$ for all $s,t \in \mathbb{R}$, should
such a function exist. If $X=(X_t)_{t \in \mathbb{R}}$ is a real-valued L\'evy process that is not pure linear drift with
slope $\pm \alpha$, then the sample paths of $X$ have an $\alpha$-Lipschitz minorant almost surely if and only if $| \mathbb{E}[X_1]
| < \alpha$. Denoting the minorant by $M$, we investigate properties of the random closed set $\mathcal{Z} := \{t \in \mathbb{R}
: M_t = X_t \wedge X_{t-}\}$, which, since it is regenerative and stationary, has the distribution of the closed range of some
subordinator "made stationary" in a suitable sense. We give conditions for the contact set $\mathcal{Z}$ to be countable or
to have zero Lebesgue measure, and we obtain formulas that characterize the L\'evy measure of the associated subordinator.
We study the limit of $\mathcal{Z}$ as $\alpha \to \infty$ and find for the so-called abrupt L\'evy processes introduced by
Vigon that this limit is the set of local infima of $X$. When $X$ is a Brownian motion with drift $\beta$ such that $|\beta|
< \alpha$, we calculate explicitly the densities of various random variables related to the minorant.**Keyword note:**Abramson__Joshua Evans__Steven_N**Report ID:**806**Relevance:**100

**Title:**A limit theorem for occupation measures of Levy processes in compact groups**Author(s):**Berger, Arno; Evans, Steven N.; **Date issued:**September 2011

http://nma.berkeley.edu/ark:/28722/bk0010k0025 (PDF) **Abstract:**A short proof is given of a necessary and sufficient condition for the normalized occupation measure of a Levy process in
a metrizable compact group to be asymptotically uniform with probability one.**Keyword note:**Berger__Arno Evans__Steven_N**Report ID:**805**Relevance:**100

**Title:**Estimation and correction for GC-content bias in high throughput sequencing**Author(s):**Benjamini, Yuval; Speed, Terence P.; **Date issued:**June 2011

http://nma.berkeley.edu/ark:/28722/bk0008s4h77 (PDF) **Abstract:**GC-content bias describes the dependence between fragment count (read coverage) and GC content found in high-throughput sequencing
assays, particularly the Illumina Genome Analyzer technology. This bias can dominate the signal of interest for analyses that
focus on measuring fragment abundance within a genome, such as copy number estimation. The bias is not consistent between
samples, and current methods to remove it in a single sample do not assume any knowledge of the curve shape or scale. In this
work we analyze regularities in the GC-bias patterns, and find a compact description for this curve family. It is the GC content
of the full DNA fragment, not only the sequenced read, that most influences fragment count. This GC effect is unimodal:
both GC rich fragments and AT rich fragments are under-represented in the sequencing results. Based on these findings,
we propose a new method to calculate predicted coverage and correct for the bias. This parsimonious model produces single
bp prediction which suffices to predict the GC effect on fragment coverage at all scales, all chromosomes and for both strands;
this allows optimal GC-effect correction regardless of the downstream smoothing or binning. We demonstrate our model's potential
for improving on current approaches to copy-number estimation. These GC-modeling considerations can also inform other high-throughput
sequencing analyses such as ChIP-seq and RNA-seq. Finally, our analysis provides empirical evidence strengthening the hypothesis
that PCR is the most important cause of the GC bias.**Keyword note:**Benjamini__Yuval Speed__Terry_P**Report ID:**804**Relevance:**100

**Title:**Stochastic equations on projective systems of groups**Author(s):**Evans, Steven N.; Gordeeva, Tatyana; **Date issued:**June 2011

http://nma.berkeley.edu/ark:/28722/bk0008s4h9b (PDF) **Abstract:**We consider stochastic equations of the form $X_k = \phi_k(X_{k+1}) Z_k$, $k \in \mathbb{N}$, where $X_k$ and $Z_k$ are random
variables taking values in a compact group $G_k$, $\phi_k: G_{k+1} \to G_k$ is a continuous homomorphism, and the noise $(Z_k)_{k
\in \mathbb{N}}$ is a sequence of independent random variables. We take the sequence of homomorphisms and the sequence of
noise distributions as given, and investigate what conditions on these objects result in a unique distribution for the "solution"
sequence $(X_k)_{k \in \mathbb{N}}$ and what conditions permit the existence of a solution sequence that is a function of
the noise alone (that is, the solution does not incorporate extra input randomness "at infinity"). Our results extend previous
work on stochastic equations on a single group that was originally motivated by Tsirelson's example of a stochastic differential
equation that has a unique solution in law but no strong solutions.**Keyword note:**Evans__Steven_N Gordeeva__Tatyana**Report ID:**803**Relevance:**100

**Title:**Stochastic population growth in spatially heterogeneous environments**Author(s):**Evans, Steven N.; Ralph, Peter L.; Schreiber, Sebastian J.; Sen, Arnab; **Date issued:**May 2011

http://nma.berkeley.edu/ark:/28722/bk0008n9185 (PDF) **Abstract:**Classical ecological theory predicts that environmental stochasticity increases extinction risk by reducing the average per-capita
growth rate of populations. To understand the interactive effects of environmental stochasticity, spatial heterogeneity, and
dispersal on population growth, we study the following model for population abundances in $n$ patches: the conditional law
of $X_{t+dt}$ given $X_t=x$ is such that when $dt$ is small the conditional mean of $X_{t+dt}^i-X_t^i$ is approximately $[x^i\mu_i+\sum_j(x^j
D_{ji}-x^i D_{ij})]dt$, where $X_t^i$ and $\mu_i$ are the abundance and per capita growth rate in the $i$-th patch, respectively,
and $D_{ij}$ is the dispersal rate from the $i$-th to the $j$-th patch, and the conditional covariance of $X_{t+dt}^i-X_t^i$
and $X_{t+dt}^j-X_t^j$ is approximately $x^i x^j \sigma_{ij}dt$. We show for such a spatially extended population that if
$S_t=X_t^1+\cdots+X_t^n$ is the total population abundance, then $Y_t=X_t/S_t$, the vector of patch proportions, converges
in law to a random vector $Y_\infty$ as $t\to\infty$, and the stochastic growth rate $\lim_{t\to\infty}t^{-1}\log S_t$ equals
the space-time average per-capita growth rate $\sum_i\mu_i\mathbb{E}[Y_\infty^i]$ experienced by the population minus half of the
space-time average temporal variation $\mathbb{E}[\sum_{i,j}\sigma_{ij}Y_\infty^i Y_\infty^j]$ experienced by the population. We derive
analytic results for the law of $Y_\infty$, find which choice of the dispersal mechanism $D$ produces an optimal stochastic
growth rate for a freely dispersing population, and investigate the effect on the stochastic growth rate of constraints on
dispersal rates. Our results provide fundamental insights into "ideal free" movement in the face of uncertainty, the persistence
of coupled sink populations, the evolution of dispersal rates, and the single large or several small (SLOSS) debate in conservation
biology.**Keyword note:**Evans__Steven_N Ralph__Peter Schreiber__Sebastian_J Sen__Arnab**Report ID:**802**Relevance:**100
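
As a sanity check on the growth-rate formula stated in this abstract, the single-patch case can be worked by hand. The reduction below is an illustration rather than part of the report; the symbols $\sigma$ and $W_t$ (a standard Brownian motion) are introduced here as assumptions, with $\sigma_{11} = \sigma^2$. With $n = 1$ there is no dispersal and $Y_\infty \equiv 1$, so the model is geometric Brownian motion and the formula reduces to the familiar rate:

$$
dX_t = \mu_1 X_t \, dt + \sigma X_t \, dW_t
\quad\Longrightarrow\quad
\log X_t = \log X_0 + \Bigl(\mu_1 - \tfrac{1}{2}\sigma^2\Bigr) t + \sigma W_t ,
$$

$$
\lim_{t\to\infty} t^{-1} \log S_t
= \mu_1 \, \mathbb{E}[Y_\infty^1] - \tfrac{1}{2}\, \mathbb{E}\bigl[\sigma_{11} (Y_\infty^1)^2\bigr]
= \mu_1 - \tfrac{1}{2}\sigma^2 ,
$$

since $W_t / t \to 0$ almost surely. The subtracted term $\tfrac{1}{2}\sigma^2$ is the reduction in growth rate due to environmental stochasticity that the abstract's opening sentence refers to.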

**Title:**Summarizing large-scale, multiple-document news data: sparse methods & human validation**Author(s):**Miratrix, Luke; Jia, Jinzhu; Gawalt, Brian; Yu, Bin; El Ghaoui, Laurent; **Date issued:**May 2011

http://nma.berkeley.edu/ark:/28722/bk0008n9208 (PDF) **Abstract:**News media significantly drives the course of events. Understanding how has long been an active and important area of research.
Now, as the amount of online news media available grows, there is even more information calling for analysis, an ever increasing
range of inquiry that one might conduct. We believe subject-specific summarization of multiple news documents at once can
help. In this paper we adapt scalable statistical techniques to perform this summarization under a predictive framework using
a vector space model of documents. We reduce corpora of many millions of words to a few representative key-phrases that describe
a specified subject of interest. We propose this as a tool for news media study. We consider the efficacies of four different
feature selection approaches---phrase co-occurrence, phrase correlation, $L^1$ regularized logistic regression (L1LR), and
$L^1$ regularized linear regression (Lasso)---under many different pre-processing choices. To evaluate these different summarizers
we establish a survey by which non-expert human readers rate generated summaries. Data pre-processing decisions are important;
we also study the impact of several different techniques for vectorizing the documents, and identifying which documents concern
a subject. We find that the Lasso, which consistently produces high-quality summaries across the many pre-processing schemes
and subjects, is the best choice of feature selection engine. Our findings also reinforce the many years of work suggesting
the tf-idf representation is a strong choice of vector space, but only for longer units of text. Though we focus here on print
media (newspapers), our methods are general and could be applied to any corpora, even ones of considerable size.**Keyword note:**Miratrix__Luke Jia__Jinzhu Gawalt__Brian Yu__Bin El__Ghaoui__Laurent**Report ID:**801**Relevance:**100

**Title:**Using Control Genes to Correct for Unwanted Variation in Microarray Data**Author(s):**Gagnon-Bartsch, Johann A.; Speed, Terence P.; **Date issued:**March 2011

http://nma.berkeley.edu/ark:/28722/bk0007t9k2z (PDF) **Abstract:**Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been
proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor
analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning
the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended
for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative
control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological
factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation.
We name this method "Remove Unwanted Variation, 2-step" (RUV-2). We discuss various techniques for assessing the performance
of an adjustment method, and compare the performance of RUV-2 with that of other commonly used adjustment methods such as
Combat and SVA. We present several example studies, each concerning genes differentially expressed with respect to gender
in the brain, and find that RUV-2 performs as well as or better than other methods. Finally, we discuss the possibility of adapting
RUV-2 for use in studies not concerned with differential expression, and conclude that there may be promise, but substantial
challenges remain.**Keyword note:**Gagnon-Bartsch__Johann_A Speed__Terry_P**Report ID:**800**Relevance:**100

**Title:**Edge principal components and squash clustering: using the special structure of phylogenetic placement data for sample comparison**Author(s):**Matsen, Frederick A.; Evans, Steven N.; **Date issued:**March 2011

http://nma.berkeley.edu/ark:/28722/bk0007t9k42 (PDF) **Abstract:**It is becoming increasingly common to analyze collections of sequence reads by first assigning each read to a location on
a phylogenetic tree. In parallel, quantitative methods are being developed to compare samples of reads using the information
provided by such phylogenetic placements: one example is the phylogenetic Kantorovich-Rubinstein (KR) metric which calculates
a distance between pairs of samples using the evolutionary distances between the assigned positions of the reads on the phylogenetic
tree. The KR distance generalizes the weighted UniFrac metric. Classical, general-purpose ordination and clustering methods
can be applied to KR distances, but we argue that more interesting and interpretable results are produced by two new methods
that leverage the special structure of phylogenetic placement data. Edge principal components analysis enables the detection
of important differences between samples containing closely related taxa and allows the visualization of the principal component
axes in terms of edges of the phylogenetic tree. Squash clustering produces informative internal edge lengths for clustering
trees by incorporating distances between averages of samples, rather than the averages of distances between samples used in
general-purpose procedures such as UPGMA. We present these methods and illustrate their use with data from the microbiome
of the human vagina.**Keyword note:**Matsen__Frederick_A Evans__Steven_N**Report ID:**799**Relevance:**100
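
The KR distance mentioned in this abstract can be illustrated on a toy tree. The sketch below is not the report's implementation: it assumes the standard edge-sum form of the earth mover's distance on a tree (each edge contributes its length times the absolute difference in the mass the two samples place below it), and it simplifies by putting read mass only at nodes rather than at points along edges. The function name, tree encoding, and data are hypothetical.

```python
def kr_distance(parent, length, p, q):
    """Earth mover's (KR) distance between two mass distributions on a rooted tree.

    parent: dict mapping each node to its parent (the root maps to None)
    length: dict mapping each non-root node to the length of the edge above it
    p, q:   dicts mapping nodes to probability mass (missing nodes carry zero mass)
    """
    # Net mass difference sitting at each node.
    diff = {v: p.get(v, 0.0) - q.get(v, 0.0) for v in parent}

    def depth(v):
        d = 0
        while parent[v] is not None:
            v = parent[v]
            d += 1
        return d

    total = 0.0
    # Visit the deepest nodes first so each subtree is fully accumulated
    # before the edge above it is scored.
    for v in sorted(parent, key=depth, reverse=True):
        if parent[v] is None:
            continue
        total += length[v] * abs(diff[v])
        diff[parent[v]] += diff[v]
    return total


# Toy tree (made up): root r with children a and b; c is a child of a.
parent = {"r": None, "a": "r", "b": "r", "c": "a"}
length = {"a": 1.0, "b": 2.0, "c": 0.5}

sample_p = {"c": 0.5, "a": 0.5}
sample_q = {"b": 1.0}
d = kr_distance(parent, length, sample_p, sample_q)
print(d)
```

Here half of the mass travels from c to b (path length 3.5) and half from a to b (path length 3), so the distance is 3.25.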

**Title:**Transcriptional regulation: Effects of promoter proximal pausing on speed, synchrony and reliability**Author(s):**Boettiger, Alistair N.; Ralph, Peter L.; Evans, Steven N.; **Date issued:**March 2011

http://nma.berkeley.edu/ark:/28722/bk0007t9k0v (PDF) **Abstract:**Recent whole genome polymerase binding assays in the Drosophila embryo have shown that a large proportion of unexpressed genes
have pre-assembled RNA pol II transcription initiation complex stably bound to their promoters. These constitute a subset
of promoter proximally paused genes which are regulated at transcription elongation rather than at initiation, and it has
been proposed that this difference allows these genes to both express faster and achieve more synchronous expression across
populations of cells, thus overcoming the molecular "noise" arising from low copy number factors. Promoter-proximal pausing
is observed mainly in metazoans, in accord with its posited role in synchrony. Regulating gene expression by controlling release
from a promoter paused state instead of by regulating access of the polymerase to the promoter DNA can be described as a rearrangement
of the regulatory topology so that it controls transcriptional elongation rather than transcriptional initiation. It has been
established experimentally that genes which are regulated at elongation tend to express faster and more synchronously; however,
it has not been shown directly whether or not it is the change in the regulated step per se that causes this increase in speed
and synchrony. We investigate this question by proposing and analyzing a continuous-time Markov chain model of polymerase
complex assembly regulated at one of two steps: initial polymerase association with DNA, or release from a paused, transcribing
state. Our analysis demonstrates that, over a wide range of physical parameters, increased speed and synchrony are functional
consequences of elongation control. Further, we make new predictions about the effect of elongation regulation on the consistent
control of total transcript number between cells, and identify which elements in the transcription induction pathway are most
sensitive to molecular noise and thus may be most evolutionarily constrained. Our methods produce symbolic expressions for
quantities of interest with reasonable computational effort and can be used to explore the interplay between interaction topology
and molecular noise in a broader class of biochemical networks. We provide general-purpose code implementing these methods.**Keyword note:**Boettiger__Alistair_N Ralph__Peter Evans__Steven_N**Report ID:**798**Relevance:**100