Data Science Theory, Methods and Tools

Researchers in this cluster work on the theoretical foundations of Data Science, design machine learning algorithms with provable guarantees, and develop broadly useful methods and tools that help practitioners combat the “deluge” of data caused by ever-growing sources of data. Researchers with core expertise in algorithms, mathematics, and statistics work with domain experts in areas where there is a perceived benefit to collecting large amounts of data. The constant interplay between the particulars of a domain and the generality of methods is essential to the advances we seek in algorithmic data sciences.

View the complete list of Data Science Theory, Methods and Tools cluster founding faculty.

High Dimensional Data Analysis - Dimensionality Reduction, Optimization Methods

Analysis of high-dimensional data is a foundational pillar of modern data science and its applications, and is behind many of the recent advances in “Artificial Intelligence” applications. This group of researchers has expertise in various methods, including multivariate analysis (and its recent growth as unsupervised learning), clustering, dimensionality reduction, and reconstruction. Our focus is on analytic methods and tools that enable data professionals to efficiently navigate and analyze real-world data that is influenced by a large number of parameters. Many approaches to high-dimensional data require additional assumptions on the data in order to be successful, ranging from sparsity in some representation, to manifold-type behavior, to the behavior of the tail distribution of the data. These assumptions have been motivated by physical and scientific properties inherent to the application area.

A subgroup of researchers focuses on the computational challenges in optimization and tensor computation. Algorithms for optimization and tensor computation have widespread application in signal processing (blind source separation, phase retrieval, low-rank matrix completion), machine learning (latent variable analysis, clustering), hypergraph theory and high-order statistics. These data-driven applications rely on the formulation and analysis of efficient methods for a range of problems, including nonconvex and global optimization, tensor decomposition, low-rank approximation, and the estimation of tensor eigenvalues.

High Dimensional Data Analysis - Dimensionality Reduction, Optimization Methods Members

Henry Abarbanel Physics and SIO

Ludmil Alexandrov Bioengineering

Ery Arias-Castro Mathematics

Natasha Balac Qualcomm Institute

Nuno Bandeira Computer Science & Engineering and Skaggs School of Pharmacy & Pharmaceutical Sciences

Jelena Bradic Mathematics

Richard Carson Economics

Manmohan Chandraker Computer Science & Engineering

Kamalika Chaudhuri Computer Science & Engineering

Alex Cloninger Mathematics

Todd Coleman Bioengineering

Bruce Cornuelle Scripps Institution of Oceanography

Jade d'Alpoim Guedes Anthropology and Scripps Institution of Oceanography

Sanjoy Dasgupta Computer Science & Engineering

Virginia de Sa Cognitive Science

Massimiliano Di Ventra Physics

Graham Elliott Economics

Hadi Esmaeilzadeh Computer Science & Engineering

James Fowler Medicine and Political Science

Ron Graham Computer Science & Engineering and Mathematics

Barry Grant Molecular Biology

Trey Ideker Health Sciences and Bioengineering

Tara Javidi Electrical & Computer Engineering

Andrew Kahng Computer Science & Engineering and Electrical & Computer Engineering

Todd Kemp Mathematics

Young-Han Kim Electrical & Computer Engineering

Rob Knight Pediatrics and Computer Science & Engineering

Melvin Leok Mathematics

Bo Li Mathematics

Thomas Liu Center for fMRI and Radiology Psychiatry and Bioengineering

Shachar Lovett Computer Science & Engineering

Sonia Martinez Mechanical & Aerospace Engineering

Paul Mischel Pathology and Ludwig Institute for Cancer Research

Eran Mukamel Cognitive Science

Jiawang Nie Mathematics

Alon Orlitsky Electrical & Computer Engineering

Piya Pal Electrical & Computer Engineering

Kim Prather Chemistry & Biochemistry

Bhaskar Rao Electrical & Computer Engineering

Rayan Saab Mathematics

Lawrence Saul Computer Science & Engineering

Armin Schwartzman Biostatistics

Alan Simmons Psychiatry

George Sugihara McQuown Chair (SIO), Distinguished Professor of Natural Science

Yixiao Sun Economics

Nuno Vasconcelos Electrical & Computer Engineering

Yiqing Xu Political Science

Ronghui (Lily) Xu Mathematics and Family Medicine & Public Health

Angela Yu Cognitive Science

Danna Zhang Mathematics

Kun Zhang Bioengineering

Wenxin Zhou Mathematics

Peter Gerstoft Scripps Institution of Oceanography and Electrical and Computer Engineering

Computer-intensive and non-parametric statistical methods

With the advent of the personal computer in the latter part of the 20th century, statisticians have been gradually moving away from parametric models, which often rely on restrictive and/or unreliable assumptions, and towards nonparametric models that are more flexible. The accompanying computer-intensive methods include resampling/bootstrap, subsampling/jackknife, and cross-validation, which provide practitioners with a general way to conduct statistical inference (e.g. hypothesis tests, confidence intervals, and prediction) in a nonparametric context. Short-term goals of this cluster include: a) bootstrap prediction intervals for the volatility of financial data; b) permutation tests applied to modern detection problems; c) improved estimation of conditional distributions in regression; d) model-free bootstrap for nonparametric regression; and e) multiple hypothesis testing and control of false discovery rate via subsampling.
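As a rough illustration of the resampling idea, the following minimal sketch computes a percentile bootstrap confidence interval for a sample mean. The function name `bootstrap_ci`, its parameters, and the sample values are hypothetical choices made for this example and are not taken from the cluster's own work.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data).

    Illustrative sketch only: no parametric model is assumed for the data;
    the sampling distribution of the statistic is approximated by resampling.
    """
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on resamples of size n drawn with replacement.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the estimates.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Made-up observations, used only to exercise the function.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
low, high = bootstrap_ci(sample)
print(f"approximate 95% interval for the mean: [{low:.2f}, {high:.2f}]")
```

Passing a different `stat` (for example `statistics.median`) reuses the same procedure, which is the sense in which resampling gives a general route to inference.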

Computer-intensive and non-parametric statistical methods Members

Analysis of Time Series and Dependent Data

Dependent data are typically obtained as time series, random fields, spatial data, or marked point processes. Applications of statistical methods for dependent data are numerous in the fields of physics, engineering, signal processing, medical imaging, acoustics, geostatistics, geophysics, epidemiology, econometrics, finance, marketing, meteorology, environmental science, forestry, seismology, oceanography, and others. Our ongoing research concerns include: a) improved multi-step-ahead forecasting for multivariate financial returns; b) model-free analysis of time-varying correlations in financial data; c) statistical analysis of fMRI and medical imaging data in an effort to identify connectivity biomarkers in the brain; d) statistical analysis of locally stationary time series with application to climate data; e) estimation in functional time series models with application to traffic data; and f) time series analysis of high-dimensional microbiome and metabolome data.

Analysis of Time Series and Dependent Data Members

Henry Abarbanel Physics and SIO

John Ahlquist Global Policy & Strategy

Christine Alvarado Computer Science & Engineering

Natasha Balac Qualcomm Institute

Brendan Beare Economics

Todd Coleman Bioengineering

Bruce Cornuelle Scripps Institution of Oceanography

Jade d'Alpoim Guedes Anthropology and Scripps Institution of Oceanography

Anders Dale Neurosciences

Virginia de Sa Cognitive Science

Shlomo Dubnov Music

Graham Elliott Economics

Jeff Elman Cognitive Science

Hadi Esmaeilzadeh Computer Science & Engineering

James Fowler Medicine and Political Science

William Griswold Computer Science & Engineering and Design Lab

James Hamilton Economics

Seth Hill Political Science

Andrew Kahng Computer Science & Engineering and Electrical & Computer Engineering

Todd Kemp Mathematics

Young-Han Kim Electrical & Computer Engineering

Rob Knight Pediatrics and Computer Science & Engineering

Thomas Liu Center for fMRI and Radiology Psychiatry and Bioengineering

Scott Makeig Institute for Neural Computation

Eran Mukamel Cognitive Science

Alon Orlitsky Electrical & Computer Engineering

Piya Pal Electrical & Computer Engineering

Dimitris Politis Mathematics

Kim Prather Chemistry & Biochemistry

Yannis Papakonstantinou Computer Science & Engineering

Bhaskar Rao Electrical & Computer Engineering

Tajana Rosing Computer Science & Engineering

Rayan Saab Mathematics

Debashish Sahoo Pediatrics and Computer Science & Engineering

Alan Simmons Psychiatry

George Sugihara McQuown Chair (SIO), Distinguished Professor of Natural Science

Yixiao Sun Economics

Frank Vernon Scripps Institution of Oceanography

Bradley Voytek Cognitive Science and Neurosciences

Yiqing Xu Political Science

Danna Zhang Mathematics

Sheng Zhong Bioengineering

Peter Gerstoft Scripps Institution of Oceanography and Electrical and Computer Engineering

Accelerated Learning Methods: Hardware and Software

As learning methods continue to find new applications and enable new system-level capabilities such as automated driving, efficient implementation of these methods into customized hardware/software solutions becomes essential for continued proliferation to new platforms. This group of researchers explores algorithmic, architectural and hardware accelerator designs and co-design methods to provide orders of magnitude increases in performance and energy efficiency of machine learning systems.

Accelerated Learning Methods: Hardware and Software Members

Chaitan Baru San Diego Supercomputer Center

Gert Cauwenberghs Bioengineering

Manmohan Chandraker Computer Science & Engineering

Todd Coleman Bioengineering

Bruce Cornuelle Scripps Institution of Oceanography

Jordan Crandall Visual Arts

Tom DeFanti Qualcomm Institute

Massimiliano Di Ventra Physics

Hadi Esmaeilzadeh Computer Science & Engineering

James Fowler Medicine and Political Science

Rajesh Gupta Computer Science & Engineering

Young-Han Kim Electrical & Computer Engineering

Shachar Lovett Computer Science & Engineering

Michael Norman Physics and San Diego Supercomputer Center

Shyue Ping Ong Nanoengineering

Alex Orailoglu Computer Science & Engineering

Tajana Rosing Computer Science & Engineering

Larry Smarr Computer Science & Engineering

Brett Stalbaum Visual Arts

Yiqing Xu Political Science

Avi Yagil Physics

Experimental Design and Hypothesis Testing

Even when backed by large data sets and sophisticated reasoning tools, poorly designed experiments can easily lead researchers to false conclusions, only with more confidence. To reduce false discoveries, the automated exploration of large data sets to establish a scientific fact, or to prove or disprove an assertion, requires careful design of data experiments and statistical analysis, especially in online settings. We explore mathematical foundations, formal methods and tools to help Data Science practitioners design sound experiments and make deductions against specified confidence levels.

Experimental Design and Hypothesis Testing Members

Christine Alvarado Computer Science & Engineering

Ery Arias-Castro Mathematics

Jan Hughes-Austin Orthopedic Surgery

Nuno Bandeira Computer Science & Engineering and Skaggs School of Pharmacy & Pharmaceutical Sciences

Eric Bennett Biological Sciences

Jelena Bradic Mathematics

Jennifer Burney Global Policy & Strategy

Richard Carson Economics

Todd Coleman Bioengineering

Sarah Creel Cognitive Science

Virginia de Sa Cognitive Science

Scott Desposato Political Science

Jeff Elman Cognitive Science

James Fowler Medicine and Political Science

Yoav Freund Computer Science & Engineering

Ron Graham Computer Science & Engineering and Mathematics

Young-Han Kim Electrical & Computer Engineering

Marta Kutas Cognitive Science

Eran Mukamel Cognitive Science

Molly Roberts Political Science

Federico Rossano Cognitive Science

Yixiao Sun Economics

Wesley Thompson Biostatistics and Family Medicine & Public Health

Bradley Voytek Cognitive Science and Neurosciences

Ed Vul Psychology

Yiqing Xu Political Science

Ronghui (Lily) Xu Mathematics and Family Medicine & Public Health

Kirk Christian Bansak Political Science

Peter Gerstoft Scripps Institution of Oceanography and Electrical and Computer Engineering

Causality and Inference

The inference of causality from empirical data is one of the deepest and most central goals in science. Arguably, virtually all aspects of scholarly inquiry involve the search for the causal forces that shape physical, social, and mental phenomena, among others. The question is one that has vexed and challenged scholars across a wide range of disciplines. A variety of approaches have been proposed, each of which has strengths as well as limitations. This group will bring together faculty who deal with diverse sets of data and phenomena across different disciplines, but who are joined by a common interest in exploring and applying existing methods for inferring causality, as well as in developing new approaches.

Causality and Inference Members

Christine Alvarado Computer Science & Engineering

Natasha Balac Qualcomm Institute

Tarik Benmarhnia Family Medicine and Public Health

Eli Berman Economics

Jelena Bradic Mathematics

Jennifer Burney Global Policy & Strategy

Richard Carson Economics

Manmohan Chandraker Computer Science & Engineering

Todd Coleman Bioengineering

Bruce Cornuelle Scripps Institution of Oceanography

Sanjoy Dasgupta Computer Science & Engineering

Scott Desposato Political Science

Graham Elliott Economics

Jeff Elman Cognitive Science

James Fowler Medicine and Political Science

Teevrat Garg Global Policy & Strategy

Joshua Graff Zivin Global Policy & Strategy and Economics

Rajesh Gupta Computer Science & Engineering

Seth Hill Political Science

Andrew Kahng Computer Science & Engineering and Electrical & Computer Engineering

Young-Han Kim Electrical & Computer Engineering

Eran Mukamel Cognitive Science

Alon Orlitsky Electrical & Computer Engineering

Juan Pablo Pardo-Guerra Sociology

Molly Roberts Political Science

Akos Rona-Tas Sociology

Debashish Sahoo Pediatrics and Computer Science & Engineering

Alan Simmons Psychiatry

Hao Su Computer Science and Engineering

George Sugihara Scripps Institution of Oceanography

Yixiao Sun Economics

Xin Tu Biostatistics and Family Medicine & Public Health

Zhuowen Tu Cognitive Science

Kamala Visweswaran Ethnic Studies

Ed Vul Psychology

Yiqing Xu Political Science

Ronghui (Lily) Xu Mathematics and Family Medicine & Public Health

Pinar Yoldas Visual Arts

Angela Yu Cognitive Science

Danna Zhang Mathematics

Wenxin Zhou Mathematics

Kirk Christian Bansak Political Science

Peter Gerstoft Scripps Institution of Oceanography and Electrical and Computer Engineering

Databases and Data Processing Principles and Systems

This group studies the foundational principles of, and builds systems for, data management and processing: ingesting, integrating, cleaning, storing, querying, analyzing, and maintaining data of large volume and variety. Major research directions pursued by the members of this group include (but are not limited to) scalable analytics over graphs, analytics over distributed and heterogeneous information sources, specification and verification of data-centric workflows and data-driven applications, privacy and access control in data exchange and analytics, abstractions and systems for ML-based data analytics, data sourcing issues in ML applications, integrating data management systems with deep learning-based machine perception for new analytics and interaction capabilities, inference and learning in the IoT, extending database platforms and query processors beyond centralized relational databases and into semistructured data, distributed datasets, spatiotemporal & IoT data, infrastructure for data-driven applications and data discovery, datacenter and cluster network topologies and protocols, interface between computer systems and the network, packet and circuit switching, power- and cost-efficient data-intensive computing, foundations of query languages for structured and unstructured data, managing incomplete and imprecise data, scientific workflows, data and process provenance, reproducibility, scalable data-driven computing, process management for the practice of data science, performance management for heterogeneous distributed computing, information integration over heterogeneous data sources, ontology management, polystore solutions to support heterogeneous analytics, social data management and query processing, and text analytics for social and biomedical applications.

Databases and Data Processing Principles and Systems Members

Henry Abarbanel Physics and SIO

Chaitan Baru San Diego Supercomputer Center

Todd Coleman Bioengineering

Scott Desposato Political Science

Hadi Esmaeilzadeh Computer Science & Engineering

William Griswold Computer Science & Engineering and Design Lab

Amarnath Gupta San Diego Supercomputer Center

Young-Han Kim Electrical & Computer Engineering

Arun Kumar Computer Science & Engineering

Michael Norman Physics and San Diego Supercomputer Center

Yannis Papakonstantinou Computer Science & Engineering

Bhaskar Rao Electrical & Computer Engineering

Alan Simmons Psychiatry

Brett Stalbaum Visual Arts

Alexander Vardy Electrical & Computer Engineering and Computer Science & Engineering

Nuno Vasconcelos Electrical & Computer Engineering

Victor Vianu Computer Science & Engineering

Kamala Visweswaran Ethnic Studies

Frank Wuerthwein Physics and San Diego Supercomputer Center

Sonia Martinez Diaz Mechanical and Aerospace Engineering

Streaming and Sub-linear Learning Algorithms

Traditional algorithms need to read and manipulate the entire input set given to them in order to compute a solution to the problem they are designed to solve. The amounts of data in machine learning make many of these traditional algorithms too costly to be used in practice. One approach to resolving this is the design of sub-linear algorithms, which use sampling techniques to consider only a small portion of the input, with the guarantee that, with high probability over the sample, it is representative of the full input. Examples include algorithms on large sparse graphs, such as social networks or networks arising in biology. Another approach is to use streaming algorithms, which process the entire input but at any given point remember only a concise representation of the important information about the input so far. This is also critical when the data is observed on the fly and cannot be stored in memory due to its volume.
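One standard textbook instance of the streaming idea is reservoir sampling, which maintains a uniform random sample of fixed size while reading the input once. The sketch below is illustrative only; the function name `reservoir_sample` and its parameters are hypothetical, and it is not presented as a method developed by this group.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Illustrative sketch: only the k-item reservoir is held in memory,
    no matter how many items the stream produces.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-indexed) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# The generator is consumed one item at a time and never materialized as a list.
picked = reservoir_sample((x * x for x in range(1_000_000)), k=5)
print(picked)
```

The memory footprint depends only on `k`, which is what makes the approach usable when data arrives on the fly and cannot be stored.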

Streaming and Sub-linear Learning Algorithms Members

Models and Analysis of Multimedia Data Members

Data Security and Privacy

We will explore system designs, new programming languages and paradigms, and ML techniques that can provide strong security and data privacy guarantees. At the same time, we will design new scalable program analysis and ML techniques to find bugs and vulnerabilities in large systems (e.g., browsers and operating systems).

Data Security and Privacy Members

Cinnamon Bloss Family Medicine & Public Health

Kamalika Chaudhuri Computer Science & Engineering

Hadi Esmaeilzadeh Computer Science & Engineering

Kelly Gates Communication & Science Studies

John Graham Qualcomm Institute

Ron Graham Computer Science & Engineering and Mathematics

Lucila Ohno-Machado Medicine

Alex Orailoglu Computer Science & Engineering

Deian Stefan Computer Science & Engineering

Alexander Vardy Electrical & Computer Engineering and Computer Science & Engineering

Kamala Visweswaran Ethnic Studies

Sonia Martinez Diaz Mechanical and Aerospace Engineering

Learning and Reasoning with Large Data Sets

Learning and Reasoning with Large Data Sets Members

John Ahlquist Global Policy & Strategy

Ludmil Alexandrov Bioengineering

Christine Alvarado Computer Science & Engineering

Natasha Balac Qualcomm Institute

Nuno Bandeira Computer Science & Engineering and Skaggs School of Pharmacy & Pharmaceutical Sciences

Jelena Bradic Mathematics

Richard Carson Economics

Manmohan Chandraker Computer Science & Engineering

Alex Cloninger Mathematics

Todd Coleman Bioengineering

Bruce Cornuelle Scripps Institution of Oceanography

Jade d'Alpoim Guedes Anthropology and Scripps Institution of Oceanography

Sanjoy Dasgupta Computer Science & Engineering

Jeff Elman Cognitive Science

Hadi Esmaeilzadeh Computer Science & Engineering

Sicun Gao Computer Science & Engineering

Joshua Graff Zivin Global Policy & Strategy and Economics

Barry Grant Molecular Biology

William Griswold Computer Science & Engineering and Design Lab

Seth Hill Political Science

Trey Ideker Health Sciences and Bioengineering

Daniel Kane Computer Science & Engineering and Mathematics

Young-Han Kim Electrical & Computer Engineering

Thomas Liu Center for fMRI and Radiology Psychiatry and Bioengineering

Shachar Lovett Computer Science & Engineering

Julian McAuley Computer Science & Engineering

Lucila Ohno-Machado Medicine

Eran Mukamel Cognitive Science

Alon Orlitsky Electrical & Computer Engineering

Piya Pal Electrical & Computer Engineering

Kim Prather Chemistry & Biochemistry

Juan Pablo Pardo-Guerra Sociology

Bhaskar Rao Electrical & Computer Engineering

Tajana Rosing Computer Science & Engineering

Rayan Saab Mathematics

Debashish Sahoo Pediatrics and Computer Science & Engineering

Alan Simmons Psychiatry

Brett Stalbaum Visual Arts

Hao Su Computer Science and Engineering

Shankar Subramaniam Bioengineering

Xin Tu Biostatistics and Family Medicine & Public Health

Zhuowen Tu Cognitive Science

Nuno Vasconcelos Electrical & Computer Engineering

Kamala Visweswaran Ethnic Studies

Bradley Voytek Cognitive Science and Neurosciences

Frank Wuerthwein Physics and San Diego Supercomputer Center

Yiqing Xu Political Science

Avi Yagil Physics

Wenxin Zhou Mathematics

Peter Gerstoft Scripps Institution of Oceanography and Electrical and Computer Engineering

Learning from Unstructured Textual Data - Models of Linguistic Data, Natural Language Processing

Learning from Unstructured Textual Data - Models of Linguistic Data and NLP Members

Henry Abarbanel Physics and SIO

Amy Alexander Visual Arts

Christine Alvarado Computer Science & Engineering

Natasha Balac Qualcomm Institute

Nuno Bandeira Computer Science & Engineering and Skaggs School of Pharmacy & Pharmaceutical Sciences

Leon Bergen Linguistics

Sanjoy Dasgupta Computer Science & Engineering

Virginia de Sa Cognitive Science

Jeff Elman Cognitive Science

Hadi Esmaeilzadeh Computer Science & Engineering

Victor Ferreira Psychology

James Fowler Medicine and Political Science

Joshua Graff Zivin Global Policy & Strategy and Economics

Amarnath Gupta San Diego Supercomputer Center

Young-Han Kim Electrical & Computer Engineering

Eran Mukamel Cognitive Science

Ndapa Nakashole Computer Science & Engineering

Alon Orlitsky Electrical & Computer Engineering

Juan Pablo Pardo-Guerra Sociology

Molly Roberts Political Science

Akos Rona-Tas Sociology

Brett Stalbaum Visual Arts

Wei Wang Chemistry & Biochemistry and Cellular & Molecular Medicine

Yiqing Xu Political Science