- January 10, 2024
- HDSIComm
Universal Learning for Decision-Making | Moise Blanchard
We provide general-use decision-making algorithms under provably minimal assumptions on the data, using the universal learning framework. Classical learning guarantees typically require two types of assumptions: (1) restrictions on the target policies to be learned and (2) assumptions on the data-generating process. Instead, we show that we can provide consistent algorithms with vanishing regret compared to the best policy in hindsight, (1) irrespective of the optimal policy, known as universal consistency, and (2) well beyond standard i.i.d. or stationary assumptions on the data. We present our results for the classical online regression problem as well as for the contextual bandit problem, where the learner’s rewards depend on their selected actions and an observable context. This generalizes the standard multi-armed bandit to the case where side information is available, e.g., patients’ records or customers’ history, which allows for personalized treatment. Precisely, we give necessary and sufficient conditions on the context-generating process for universal consistency to be possible. Surprisingly, for finite action spaces, universally learnable processes are the same for contextual bandits as for the supervised learning setting, suggesting that going from full feedback (supervised learning) to partial feedback (contextual bandits) comes at no extra cost in terms of learnability. We then show that there always exists an algorithm that guarantees universal consistency whenever this is achievable. In particular, such an algorithm is universally consistent under provably minimal assumptions: if it fails to be universally consistent for some context-generating process, then no other algorithm would succeed either. In the case of finite action spaces, this algorithm balances a fine trade-off between generalization (similar to structural risk minimization) and personalization (tailoring actions to specific contexts).
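To make the interaction protocol concrete, the sketch below simulates the contextual bandit loop the abstract describes: contexts arrive, the learner picks one action, observes only that action's reward, and regret is measured against the best action per context in hindsight. The discrete context types, Bernoulli rewards, and epsilon-greedy baseline are illustrative assumptions, not the universally consistent algorithm from the talk.

```python
import numpy as np

# Illustrative contextual bandit protocol (not the speaker's algorithm):
# at each round the learner sees a context, picks one of K actions, and
# observes the reward of that action only (partial feedback).

rng = np.random.default_rng(0)
K, T = 3, 5000                      # actions, rounds
contexts = rng.integers(0, 10, T)   # assumed: contexts come in 10 discrete "types"
true_means = rng.uniform(0, 1, (10, K))  # assumed mean reward per (context, action)

counts = np.zeros((10, K))
means = np.zeros((10, K))
regret = 0.0

for t in range(T):
    x = contexts[t]
    # epsilon-greedy baseline: explore with small probability, otherwise exploit
    if rng.random() < 0.05:
        a = int(rng.integers(K))
    else:
        a = int(np.argmax(means[x]))
    r = float(rng.random() < true_means[x, a])   # Bernoulli reward, seen only for the chosen action
    counts[x, a] += 1
    means[x, a] += (r - means[x, a]) / counts[x, a]
    regret += true_means[x].max() - true_means[x, a]

print(f"average regret after {T} rounds: {regret / T:.3f}")
```

Binning contexts this finely is the "personalization" extreme; the trade-off mentioned at the end of the abstract is between such per-context tailoring and pooling contexts for generalization.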
- January 10, 2024
- HDSIComm
Algorithm Dynamics in Modern Statistical Learning: Universality and Implicit Regularization | Tianhao Wang
Modern statistical learning is characterized by high-dimensional data and over-parameterized models. In this regime, analyzing the dynamics of the algorithms used is challenging but crucial for understanding the performance of the learned models. This talk will present recent results on the dynamics of two pivotal algorithms: Approximate Message Passing (AMP) and Stochastic Gradient Descent (SGD). Specifically, AMP refers to a class of iterative algorithms for solving large-scale statistical problems, whose dynamics asymptotically admit a simple yet exact description known as state evolution. We will demonstrate the universality of AMP’s state evolution over large classes of random matrices and provide illustrative applications of these universality results. Second, for SGD, a workhorse for training deep neural networks, we will introduce a novel mathematical framework for analyzing its implicit regularization, which is essential to SGD’s ability to find solutions with strong generalization performance, particularly in the over-parameterized regime. Our framework offers a general method for characterizing the implicit regularization induced by gradient noise. Finally, in the context of underdetermined linear regression, we will show that both AMP and SGD can provably achieve sparse recovery, yet they do so from markedly different perspectives.
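For readers unfamiliar with AMP, the sketch below shows one standard instance: AMP with a soft-thresholding denoiser for noisy sparse linear regression. The problem sizes, i.i.d. Gaussian design, and threshold rule are illustrative assumptions; the point of the universality results discussed in the talk is that the same state-evolution description persists well beyond the Gaussian ensemble used here.

```python
import numpy as np

def soft(u, theta):
    """Soft-thresholding denoiser eta(u; theta)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

rng = np.random.default_rng(1)
n, p, k = 500, 1000, 50                       # assumed problem sizes (n samples, p features, k nonzeros)
A = rng.normal(0, 1 / np.sqrt(n), (n, p))     # i.i.d. Gaussian design
x_true = np.zeros(p)
x_true[rng.choice(p, k, replace=False)] = rng.normal(0, 1, k)
y = A @ x_true + 0.01 * rng.normal(size=n)

x, z = np.zeros(p), y.copy()
delta = n / p
for _ in range(30):
    u = x + A.T @ z                           # pseudo-data: behaves like x_true plus Gaussian noise (state evolution)
    theta = 1.5 * np.sqrt(np.mean(z**2))      # assumed threshold rule, proportional to the residual noise level
    x_new = soft(u, theta)
    onsager = np.mean(np.abs(x_new) > 0) / delta   # Onsager correction: average denoiser derivative over delta
    z = y - A @ x_new + onsager * z
    x = x_new

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```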
- January 5, 2024
- HDSIComm
Statistical insight for biomedical data science, with applications to single-cell RNA-sequencing data | Yiqun Chen
My research centers on bringing statistical insight and understanding to the practice of modern data science, and I will cover two projects related to this research vision in this talk.
The first part of the talk is motivated by the practice of testing data-driven hypotheses. In the biomedical sciences, it has become increasingly common to collect massive datasets without a pre-specified research question. In this setting, a data analyst might use the data both to generate a research question and to test the associated null hypothesis. For example, in single-cell RNA-sequencing analyses, researchers often first cluster the cells and then test for differences in the expected gene expression levels between the clusters to quantify up- or down-regulation of genes, annotate known cell types, and identify new cell types. However, this popular practice is invalid from a statistical perspective: once we have used the data to generate hypotheses, standard statistical inference tools are no longer valid. To tackle this problem, I developed a conditional selective inference approach to test for a difference in means between pairs of clusters obtained via k-means clustering. The proposed approach has appropriate statistical guarantees (e.g., selective Type I error control).
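The problem this selective test addresses can be seen in a few lines. The simulation below is an illustration of that problem, not the proposed method: it clusters data drawn from a single homogeneous population and then runs a naive two-sample t-test between the estimated clusters, and because the same data generated the hypothesis, the naive p-values are wildly anti-conservative.

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

# "Double dipping": cluster first, then naively test for a difference in means
# between the estimated clusters, even though no true clusters exist.

rng = np.random.default_rng(0)
pvals = []
for _ in range(200):
    X = rng.normal(size=(100, 1))                  # global null: one homogeneous Gaussian population
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    a, b = X[labels == 0, 0], X[labels == 1, 0]    # naive two-sample t-test between the clusters
    pvals.append(stats.ttest_ind(a, b).pvalue)

print("fraction of naive p-values below 0.05:", np.mean(np.array(pvals) < 0.05))
# A valid test would reject about 5% of the time; the naive test rejects essentially always.
```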
In the second part of the talk, I will consider how to leverage large language models (LLMs) such as ChatGPT for biomedical discovery. While significant progress has been made in customizing large language models for biomedical data, these models often require extensive data curation and resource-intensive training. In the context of single-cell RNA-sequencing data, I will show that we can achieve surprisingly competitive results on many downstream tasks via a much simpler alternative: I input textual descriptions of genes into an off-the-shelf LLM, such as ChatGPT, to obtain low-dimensional representations of the genes, or “embeddings,” and then use these embeddings as features in downstream tasks. A similar approach enables LLM-derived embeddings of cells. This work highlights the potential of LLMs to provide meaningful and concise representations of biomedical data, and it also raises a number of challenging statistical questions; addressing these questions requires bringing principled statistical thinking to the practice of modern data science.
- January 5, 2024
- HDSIComm
Inference and Decision-Making amid Social Interactions | Shuangning Li
From social media trends to family dynamics, social interactions shape our daily lives. In this talk, I will present tools I have developed for statistical inference and decision-making in light of these social interactions.
(1) Inference: I will talk about the estimation of causal effects in the presence of interference. In causal inference, the term “interference” refers to a situation where, due to interactions between units, the treatment assigned to one unit affects the observed outcomes of others. I will discuss large-sample asymptotics for treatment effect estimation under network interference where the interference graph is a random draw from a graphon. When targeting the direct effect, we show that popular estimators in our setting are considerably more accurate than existing results suggest. Meanwhile, when targeting the indirect effect, we propose a consistent estimator in a setting where no other consistent estimators are currently available. (A small simulation sketch of this interference setup follows part (2) below.)
(2) Decision-Making: Turning to reinforcement learning amid social interactions, I will focus on a problem inspired by a specific class of mobile health trials involving both target individuals and their care partners. These trials feature two types of interventions: those targeting individuals directly and those aimed at improving the relationship between the individual and their care partner. I will present an online reinforcement learning algorithm designed to personalize the delivery of these interventions. The algorithm’s effectiveness is demonstrated through simulation studies conducted on a realistic test bed, which was constructed using data from a prior mobile health study. The proposed algorithm will be implemented in the ADAPTS HCT clinical trial, which seeks to improve medication adherence among adolescents undergoing allogeneic hematopoietic stem cell transplantation.
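As a point of reference for part (1), here is a minimal simulation sketch of estimation under network interference. The specific graphon, the outcome model with a spillover term, and the difference-in-means estimator of the direct effect are all illustrative assumptions rather than the estimators analyzed in the talk.

```python
import numpy as np

# Minimal sketch of direct-effect estimation under network interference
# (illustrative assumptions throughout).

rng = np.random.default_rng(0)
n = 2000

# Interference graph drawn from an assumed graphon W(u, v) = 0.05 * u * v.
u = rng.uniform(size=n)
P = 0.05 * np.outer(u, u)
G = (rng.uniform(size=(n, n)) < P).astype(float)
G = np.triu(G, 1)
G = G + G.T                                   # symmetric adjacency matrix, no self-loops

# Bernoulli(1/2) treatment; outcomes depend on own treatment and the treated fraction of neighbors.
W = rng.binomial(1, 0.5, n)
deg = np.maximum(G.sum(axis=1), 1)
frac_treated_nbrs = (G @ W) / deg
Y = 1.0 * W + 0.5 * frac_treated_nbrs + rng.normal(0, 1, n)   # assumed direct effect 1.0, spillover 0.5

# Difference-in-means estimate of the direct effect
tau_hat = Y[W == 1].mean() - Y[W == 0].mean()
print(f"difference-in-means estimate of the direct effect: {tau_hat:.3f}")
```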
- March 1, 2023
- Kaleigh O'Merry
Sampling from Graphical Models via Spectral Independence | Zongchen Chen
In many scientific settings, we use a statistical model to describe a high-dimensional distribution over many variables. Such models are often represented as a weighted graph encoding the dependencies between the variables and are known as graphical models. Graphical models arise in a wide variety of fields throughout science and engineering.
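The abstract stops short of the sampling algorithms themselves; for orientation, the sketch below runs Glauber dynamics (a single-site Gibbs sampler) on an Ising model, the kind of Markov chain whose mixing the spectral independence framework is used to analyze. The grid graph, inverse temperature, and step count are illustrative assumptions.

```python
import numpy as np

# Minimal Glauber dynamics (single-site Gibbs sampler) for an Ising model on a 2D grid graph.

rng = np.random.default_rng(0)
side = 20
n = side * side
beta = 0.3                                     # assumed inverse temperature (weak interaction)

def neighbors(i):
    """Grid neighbors of site i."""
    r, c = divmod(i, side)
    out = []
    if r > 0: out.append(i - side)
    if r < side - 1: out.append(i + side)
    if c > 0: out.append(i - 1)
    if c < side - 1: out.append(i + 1)
    return out

nbrs = [neighbors(i) for i in range(n)]
spins = rng.choice([-1, 1], size=n)

for _ in range(100_000):
    i = int(rng.integers(n))                   # pick a uniformly random site
    s = sum(spins[j] for j in nbrs[i])         # sum of neighboring spins
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))   # conditional law of spin i given its neighbors
    spins[i] = 1 if rng.random() < p_plus else -1

print("average magnetization:", spins.mean())
```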