Statistical insight for biomedical data science, with applications to single-cell RNA-sequencing data | Yiqun Chen

Start: 2024-01-19T11:00:00-08:00
End: 2024-01-19T12:00:00-08:00

January 19, 2024 @ 11:00 am - 12:00 pm

My research centers around bringing statistical insights and understanding to the practice of modern data science, and I will cover two projects related to this research vision in this talk.

The first part of the talk is motivated by the practice of testing data-driven hypotheses. In the biomedical sciences, it has become increasingly common to collect massive datasets without a pre-specified research question. In this setting, a data analyst might use the data both to generate a research question and to test the associated null hypothesis. For example, in single-cell RNA-sequencing analyses, researchers often first cluster the cells and then test for differences in the expected gene expression levels between the clusters to quantify up- or down-regulation of genes, annotate known cell types, and identify new cell types. However, this popular practice is invalid from a statistical perspective: once we have used the data to generate hypotheses, standard statistical inference tools are no longer valid. To tackle this problem, I developed a conditional selective inference approach to test for a difference in means between pairs of clusters obtained via k-means clustering. The proposed approach has appropriate statistical guarantees (e.g., selective Type I error control).

In the second part of the talk, I will consider how to leverage large language models (LLMs) such as ChatGPT for biomedical discovery. While significant progress has been made in customizing LLMs for biomedical data, these models often require extensive data curation and resource-intensive training. In the context of single-cell RNA-sequencing data, I will show that we can achieve surprisingly competitive results on many downstream tasks via a much simpler alternative: I input textual descriptions of genes into an off-the-shelf LLM, such as ChatGPT, to obtain low-dimensional representations of the genes, or “embeddings.” I then use these embeddings as features in downstream tasks. A similar approach enables LLM-derived embeddings of cells. This work highlights the potential of LLMs to provide meaningful and concise representations for biomedical data, and also raises a number of challenging statistical questions. Addressing these questions requires bringing principled statistical thinking to the practice of modern data science.
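
The problem described in the first part of the talk, using the same data to form clusters and then to test for a difference between them, can be illustrated with a small simulation. The sketch below is not the speaker's method and does not implement the selective inference correction; it only shows why the naive test fails. It assumes one-dimensional data drawn from a single normal population (so there are no true clusters), a hand-written two-cluster Lloyd's algorithm as a stand-in for k-means, and a two-sample z-test with a normal approximation. All function names are hypothetical.

```python
import math
import numpy as np


def two_means_1d(x, n_iter=50):
    """Stand-in for k-means: Lloyd's algorithm with k=2 on 1-D data.

    Returns a boolean array marking membership in the upper cluster.
    """
    lo, hi = x.min(), x.max()  # initialise the two centres at the extremes
    for _ in range(n_iter):
        upper = np.abs(x - hi) < np.abs(x - lo)  # assign to nearest centre
        lo, hi = x[~upper].mean(), x[upper].mean()  # update centres
    return upper


def z_test_pvalue(a, b):
    """Two-sided two-sample z-test p-value (normal approximation)."""
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rate(split, n_sims=500, n=100, alpha=0.05, seed=0):
    """Fraction of simulated null datasets on which the test rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(size=n)  # null data: one population, no true clusters
        labels = split(x, rng)
        rejections += z_test_pvalue(x[labels], x[~labels]) < alpha
    return rejections / n_sims


# Groups chosen by clustering the same data that is then tested.
naive_rate = rejection_rate(lambda x, rng: two_means_1d(x))
# Control: groups chosen at random, independently of the data values.
control_rate = rejection_rate(lambda x, rng: rng.random(len(x)) < 0.5)

print(f"rejection rate after clustering:  {naive_rate:.3f}")
print(f"rejection rate with random split: {control_rate:.3f}")
```

Because the clusters are constructed to separate the data, the naive test should reject far more often than the nominal 5% level even though the null hypothesis is true, while the random split should stay close to that level.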

Details

Date: January 19, 2024
Time: 11:00 am - 12:00 pm

Series: Special Seminar Series

Event Category: Seminar
Event Tags: Statistics

Venue

3234 Matthews Ln
La Jolla, CA 92093 United States

Organizer

HDSI General

Other

Format: Hybrid
Speaker: Yiqun Chen
Event Recording Link: http://bit.ly/HDSI-Seminars
