Data Science Theory, Methods and Tools

Researchers in this cluster work on the theoretical foundations of Data Science, design machine learning algorithms with provable guarantees, and develop methods and tools for practitioners that are broadly useful in combating the “deluge” of data from ever-growing sources. Researchers with core expertise in algorithms, mathematics, and statistics work with domain experts in areas where there is a perceived benefit to collecting large amounts of data. The constant interplay between the particulars of a domain and the generality of methods is essential to the advances we seek in algorithmic data sciences.
High-dimensional Data Analysis - Dimensionality Reduction, Optimization Methods
This group of researchers has expertise in various methods including multivariate analysis (and its recent growth as unsupervised learning), clustering, dimensionality reduction, and reconstruction. Our focus is on analytic methods and tools that enable data professionals to efficiently navigate and analyze real-world data that is influenced by a large number of parameters.
- Henry Abarbanel Physics and SIO
- Ludmil Alexandrov Bioengineering
- Ery Arias-Castro Mathematics
- Natasha Balac Qualcomm Institute
- Nuno Bandeira Computer Science & Engineering and Skaggs School of Pharmacy & Pharmaceutical Sciences
- Jelena Bradic Mathematics
- Richard Carson Economics
- Manmohan Chandraker Computer Science & Engineering
- Kamalika Chaudhuri Computer Science & Engineering
- Alex Cloninger Mathematics
- Todd Coleman Bioengineering
- Bruce Cornuelle Scripps Institution of Oceanography
- Jade d’Alpoim Guedes Anthropology and Scripps Institution of Oceanography
- Sanjoy Dasgupta Computer Science & Engineering
- Virginia de Sa Cognitive Science
- Massimiliano Di Ventra Physics
- Graham Elliott Economics
- Hadi Esmaeilzadeh Computer Science & Engineering
- James Fowler Medicine and Political Science
- Ron Graham Computer Science & Engineering and Mathematics
- Barry Grant Molecular Biology
- Trey Ideker Health Sciences and Bioengineering
- Tara Javidi Electrical & Computer Engineering
- Andrew Kahng Computer Science & Engineering and Electrical & Computer Engineering
- Todd Kemp Mathematics
- Young-Han Kim Electrical & Computer Engineering
- Rob Knight Pediatrics and Computer Science & Engineering
- Melvin Leok Mathematics
- Bo Li Mathematics
- Thomas Liu Center for fMRI, Radiology, Psychiatry, and Bioengineering
- Shachar Lovett Computer Science & Engineering
- Sonia Martinez Mechanical & Aerospace Engineering
- Paul Mischel Pathology and Ludwig Institute for Cancer Research
- Eran Mukamel Cognitive Science
- Jiawang Nie Mathematics
- Alon Orlitsky Electrical & Computer Engineering
- Piya Pal Electrical & Computer Engineering
- Kim Prather Chemistry & Biochemistry
- Bhaskar Rao Electrical & Computer Engineering
- Rayan Saab Mathematics
- Lawrence Saul Computer Science & Engineering
- Armin Schwartzman Biostatistics
- Alan Simmons Psychiatry
- George Sugihara McQuown Chair (SIO), Distinguished Professor of Natural Science
- Yixiao Sun Economics
- Glenn Tesler Mathematics
- Nuno Vasconcelos Electrical & Computer Engineering
- Ruth Williams Mathematics
- Yiqing Xu Political Science
- Ronghui (Lily) Xu Mathematics and Family Medicine & Public Health
- Angela Yu Cognitive Science
- Danna Zhang Mathematics
- Kun Zhang Bioengineering
- Wenxin Zhou Mathematics
- Peter Gerstoft Scripps Institution of Oceanography and Electrical and Computer Engineering
Computer-intensive and Non-parametric Statistical Methods
With the advent of the personal computer in the latter part of the 20th century, statisticians have been gradually moving away from parametric models, which often rely on restrictive and/or unreliable assumptions, and towards nonparametric models that are more flexible. The associated computer-intensive methods include resampling/bootstrap, subsampling/jackknife, and cross-validation, which provide practitioners with a general way to conduct statistical inference (e.g. hypothesis tests, confidence intervals, and prediction) in a nonparametric context. Short-term goals of this cluster include: a) bootstrap prediction intervals for the volatility of financial data; b) permutation tests applied to modern detection problems; c) improved estimation of conditional distributions in regression; d) model-free bootstrap for nonparametric regression; and e) multiple hypothesis testing and control of false discovery rate via subsampling.
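As a rough illustration of the resampling idea mentioned above, the following minimal sketch computes a percentile bootstrap confidence interval for a sample mean. It is not taken from the cluster's own work: the function name `bootstrap_ci`, its parameters, and the sample values are all hypothetical, and the percentile interval is only one of several bootstrap interval constructions.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data).

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = random.Random(seed)
    n = len(data)
    # Draw n points with replacement, recompute the statistic, and repeat.
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    # Read the interval endpoints off the empirical quantiles of the estimates.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Made-up sample values, used only to exercise the function.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 3.0]
low, high = bootstrap_ci(sample)
print(f"95% percentile-bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

No distributional form is assumed for the data here; the only inputs are the observed sample and a statistic to recompute, which is what makes the approach attractive in a nonparametric setting.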
- Ian Abramson Mathematics
- Dimitris Politis Mathematics
- Yixiao Sun Economics
- Peter Gerstoft Scripps Institution of Oceanography and Electrical and Computer Engineering
- Ruth Williams Mathematics
Analysis of Time Series and Dependent Data
Dependent data are typically obtained as time series, random fields, spatial data, or marked point processes. Applications of statistical methods for dependent data are numerous in the fields of physics, engineering, signal processing, medical imaging, acoustics, geostatistics, geophysics, epidemiology, econometrics, finance, marketing, meteorology, environmental science, forestry, seismology, oceanography, and others. Our ongoing research concerns include: a) improved multi-step-ahead forecasting for multivariate financial returns; b) model-free analysis of time-varying correlations in financial data; c) statistical analysis of fMRI and medical imaging data in an effort to identify connectivity biomarkers in the brain; d) statistical analysis of locally stationary time series with application to climate data; e) estimation in functional time series models with application to traffic data; and f) time series analysis of high-dimensional microbiome and metabolome data.
- Henry Abarbanel Physics and SIO
- John Ahlquist Global Policy & Strategy
- Christine Alvarado Computer Science & Engineering
- Natasha Balac Qualcomm Institute
- Brendan Beare Economics
- Todd Coleman Bioengineering
- Bruce Cornuelle Scripps Institution of Oceanography
- Jade d’Alpoim Guedes Anthropology and Scripps Institution of Oceanography
- Anders Dale Neurosciences
- Virginia de Sa Cognitive Science
- Shlomo Dubnov Music
- Graham Elliott Economics
- Jeff Elman Cognitive Science
- Hadi Esmaeilzadeh Computer Science & Engineering
- James Fowler Medicine and Political Science
- William Griswold Computer Science & Engineering and Design Lab
- James Hamilton Economics
- Seth Hill Political Science
- Tara Javidi Electrical & Computer Engineering
- Andrew Kahng Computer Science & Engineering and Electrical & Computer Engineering
- Todd Kemp Mathematics
- Young-Han Kim Electrical & Computer Engineering
- Rob Knight Pediatrics and Computer Science & Engineering
- Thomas Liu Center for fMRI, Radiology, Psychiatry, and Bioengineering
- Scott Makeig Institute for Neural Computation
- Eran Mukamel Cognitive Science
- Alon Orlitsky Electrical & Computer Engineering
- Piya Pal Electrical & Computer Engineering
- Dimitris Politis Mathematics
- Kim Prather Chemistry & Biochemistry
- Yannis Papakonstantinou Computer Science & Engineering
- Bhaskar Rao Electrical & Computer Engineering
- Tajana Rosing Computer Science & Engineering
- Rayan Saab Mathematics
- Debashish Sahoo Pediatrics and Computer Science & Engineering
- Alan Simmons Psychiatry
- George Sugihara McQuown Chair (SIO), Distinguished Professor of Natural Science
- Yixiao Sun Economics
- Frank Vernon Scripps Institution of Oceanography
- Bradley Voytek Cognitive Science and Neurosciences
- Ruth Williams Mathematics
- Yiqing Xu Political Science
- Danna Zhang Mathematics
- Sheng Zhong Bioengineering
- Peter Gerstoft Scripps Institution of Oceanography and Electrical and Computer Engineering
Accelerated Learning Methods: Hardware and Software
As learning methods continue to find new applications and enable new system-level capabilities such as automated driving, efficient implementation of these methods into customized hardware/software solutions becomes essential for continued proliferation to new platforms. This group of researchers explores algorithmic, architectural and hardware accelerator designs and co-design methods to provide orders of magnitude increases in performance and energy efficiency of machine learning systems.
- Chaitan Baru San Diego Supercomputer Center
- Gert Cauwenberghs Bioengineering
- Manmohan Chandraker Computer Science & Engineering
- Todd Coleman Bioengineering
- Bruce Cornuelle Scripps Institution of Oceanography
- Jordan Crandall Visual Arts
- Tom DeFanti Qualcomm Institute
- Massimiliano Di Ventra Physics
- Hadi Esmaeilzadeh Computer Science & Engineering
- James Fowler Medicine and Political Science
- Rajesh Gupta Computer Science & Engineering
- Young-Han Kim Electrical & Computer Engineering
- Shachar Lovett Computer Science & Engineering
- Michael Norman Physics and San Diego Supercomputer Center
- Shyue Ping Ong Nanoengineering
- Alex Orailoglu Computer Science & Engineering
- Tajana Rosing Computer Science & Engineering
- Larry Smarr Computer Science & Engineering
- Brett Stalbaum Visual Arts
- Ruth Williams Mathematics
- Yiqing Xu Political Science
- Avi Yagil Physics
Experimental Design and Hypothesis Testing
When backed by large data sets and sophisticated reasoning tools, poorly designed experiments can easily lead researchers to false conclusions, only with more confidence. To reduce false discoveries, automated exploration of large data sets to establish a scientific fact, or to prove or disprove an assertion, requires careful design of data experiments and statistical analysis, especially in online settings. We explore mathematical foundations, formal methods and tools to help Data Science practitioners design sound experiments and make deductions against specified confidence levels.
- Christine Alvarado Computer Science & Engineering
- Ery Arias-Castro Mathematics
- Jan Hughes-Austin Orthopedic Surgery
- Nuno Bandeira Computer Science & Engineering and Skaggs School of Pharmacy & Pharmaceutical Sciences
- Eric Bennett Biological Sciences
- Jelena Bradic Mathematics
- Jennifer Burney Global Policy & Strategy
- Richard Carson Economics
- Todd Coleman Bioengineering
- Sarah Creel Cognitive Science
- Virginia de Sa Cognitive Science
- Scott Desposato Political Science
- Jeff Elman Cognitive Science
- James Fowler Medicine and Political Science
- Yoav Freund Computer Science & Engineering
- Ron Graham Computer Science & Engineering and Mathematics
- Tara Javidi Electrical & Computer Engineering
- Young-Han Kim Electrical & Computer Engineering
- Marta Kutas Cognitive Science
- Eran Mukamel Cognitive Science
- Molly Roberts Political Science
- Federico Rossano Cognitive Science
- Yixiao Sun Economics
- Wesley Thompson Biostatistics and Family Medicine & Public Health
- Bradley Voytek Cognitive Science and Neurosciences
- Ed Vul Psychology
- Ruth Williams Mathematics
- Yiqing Xu Political Science
- Ronghui (Lily) Xu Mathematics and Family Medicine & Public Health
- Kirk Christian Bansak Political Science
- Peter Gerstoft Scripps Institution of Oceanography and Electrical and Computer Engineering
Causality and Inference
The inference of causality from empirical data is one of the deepest and most central goals in science. Arguably, virtually all aspects of scholarly inquiry involve the search for the causal forces that shape physical, social, and mental phenomena (among others, undoubtedly). The question is one that has vexed and challenged scholars across a wide range of disciplines. A variety of approaches have been proposed, each of which has strengths as well as limitations. This group will bring together faculty who deal with diverse sets of data and phenomena across different disciplines, but are joined by a common interest in exploring and applying existing methods for inferring causality, as well as in developing new approaches.
- Christine Alvarado Computer Science & Engineering
- Natasha Balac Qualcomm Institute
- Tarik Benmarhnia Family Medicine and Public Health
- Eli Berman Economics
- Jelena Bradic Mathematics
- Jennifer Burney Global Policy & Strategy
- Richard Carson Economics
- Manmohan Chandraker Computer Science & Engineering
- Todd Coleman Bioengineering
- Bruce Cornuelle Scripps Institution of Oceanography
- Sanjoy Dasgupta Computer Science & Engineering
- Scott Desposato Political Science
- Graham Elliott Economics
- Jeff Elman Cognitive Science
- James Fowler Medicine and Political Science
- Teevrat Garg Global Policy & Strategy
- Joshua Graff Zivin Global Policy & Strategy and Economics
- Rajesh Gupta Computer Science & Engineering
- Seth Hill Political Science
- Andrew Kahng Computer Science & Engineering and Electrical & Computer Engineering
- Young-Han Kim Electrical & Computer Engineering
- Eran Mukamel Cognitive Science
- Alon Orlitsky Electrical & Computer Engineering
- Juan Pablo Pardo-Guerra Sociology
- Molly Roberts Political Science
- Akos Rona-Tas Sociology
- Debashish Sahoo Pediatrics and Computer Science & Engineering
- Alan Simmons Psychiatry
- Hao Su Computer Science and Engineering
- George Sugihara Scripps Institution of Oceanography
- Yixiao Sun Economics
- Xin Tu Biostatistics and Family Medicine & Public Health
- Zhuowen Tu Cognitive Science
- Kamala Visweswaran Ethnic Studies
- Ed Vul Psychology
- Ruth Williams Mathematics
- Yiqing Xu Political Science
- Ronghui (Lily) Xu Mathematics and Family Medicine & Public Health
- Pinar Yoldas Visual Arts
- Angela Yu Cognitive Science
- Danna Zhang Mathematics
- Wenxin Zhou Mathematics
- Kirk Christian Bansak Political Science
- Peter Gerstoft Scripps Institution of Oceanography and Electrical and Computer Engineering
Databases and Data Processing Principles and Systems
This group studies the foundational principles and the data management and processing systems needed for ingesting, integrating, cleaning, storing, querying, analyzing, and maintaining data of large volume and variety. Major research directions pursued by the members of this group include (but are not limited to): scalable analytics over graphs; analytics over distributed and heterogeneous information sources; specification and verification of data-centric workflows and data-driven applications; privacy and access control in data exchange and analytics; abstractions and systems for ML-based data analytics; data sourcing issues in ML applications; integrating data management systems with deep learning-based machine perception for new analytics and interaction capabilities; inference and learning in the IoT; extending database platforms and query processors beyond centralized relational databases and into semistructured data, distributed datasets, and spatiotemporal & IoT data; infrastructure for data-driven applications and data discovery; datacenter and cluster network topologies and protocols; the interface between computer systems and the network; packet and circuit switching; power- and cost-efficient data-intensive computing; foundations of query languages for structured and unstructured data; managing incomplete and imprecise data; scientific workflows; data and process provenance; reproducibility; scalable data-driven computing; process management for the practice of data science; performance management for heterogeneous distributed computing; information integration over heterogeneous data sources; ontology management; polystore solutions to support heterogeneous analytics; social data management and query processing; and text analytics for social and biomedical applications.
- Henry Abarbanel Physics and SIO
- Chaitan Baru San Diego Supercomputer Center
- Todd Coleman Bioengineering
- Scott Desposato Political Science
- Hadi Esmaeilzadeh Computer Science & Engineering
- William Griswold Computer Science & Engineering and Design Lab
- Amarnath Gupta San Diego Supercomputer Center
- Tara Javidi Electrical & Computer Engineering
- Young-Han Kim Electrical & Computer Engineering
- Arun Kumar Computer Science & Engineering
- Michael Norman Physics and San Diego Supercomputer Center
- Yannis Papakonstantinou Computer Science & Engineering
- Bhaskar Rao Electrical & Computer Engineering
- Alan Simmons Psychiatry
- Brett Stalbaum Visual Arts
- Alexander Vardy Electrical & Computer Engineering and Computer Science & Engineering
- Nuno Vasconcelos Electrical & Computer Engineering
- Victor Vianu Computer Science & Engineering
- Kamala Visweswaran Ethnic Studies
- Ruth Williams Mathematics
- Frank Wuerthwein Physics and San Diego Supercomputer Center
- Sonia Martinez Diaz Mechanical and Aerospace Engineering
Streaming and Sub-linear Learning Algorithms
Traditional algorithms need to read and manipulate their entire input in order to compute a solution to the problem they are designed to solve. The amounts of data in machine learning make many of these traditional algorithms too costly to be used in practice. One approach to resolving this is the design of sub-linear algorithms, which use sampling techniques to consider only a small portion of the input, with the guarantee that, with high probability over the choice of sample, the sample is representative of the full input. Examples include algorithms on large sparse graphs, such as social networks or networks arising in biology. Another approach is to use streaming algorithms, which process the entire input but at any given point remember only a concise representation of the important information about the input seen so far. This is also critical when the data is observed on the fly and cannot be stored in memory due to its volume.
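To make the streaming idea concrete, here is a minimal sketch of reservoir sampling, a standard technique chosen here purely for illustration (the text above does not name a specific algorithm). It reads the input once and keeps only `k` items in memory, yet each item of the stream ends up in the sample with equal probability. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Uniform random sample of k items from a stream of unknown length.

    Hypothetical helper for illustration; uses O(k) memory and one pass.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # evicting a uniformly chosen current member.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# The stream is consumed lazily; it is never stored in full.
picked = reservoir_sample(range(100_000), k=5, seed=42)
print(picked)
```

The retained sample is the "concise representation" in this case: memory use depends on `k`, not on the length of the stream.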
- Jelena Bradic Mathematics
- Hadi Esmaeilzadeh Computer Science & Engineering
- Yoav Freund Computer Science & Engineering
- Ron Graham Computer Science & Engineering and Mathematics
- Young-Han Kim Electrical & Computer Engineering
- Shachar Lovett Computer Science & Engineering
- Alon Orlitsky Electrical & Computer Engineering
- Rayan Saab Mathematics
- Sonia Martinez Diaz Mechanical and Aerospace Engineering
- Ruth Williams Mathematics
- Fan Chung Mathematics
Models and Analysis of Multimedia Data
- Amy Alexander Visual Arts
- Manmohan Chandraker Computer Science & Engineering
- Alan Simmons Psychiatry
- Ruth Williams Mathematics
- Fan Chung Mathematics
Data Security and Privacy
- Cinnamon Bloss Family Medicine & Public Health
- Kamalika Chaudhuri Computer Science & Engineering
- Hadi Esmaeilzadeh Computer Science & Engineering
- Kelly Gates Communication & Science Studies
- John Graham Qualcomm Institute
- Ron Graham Computer Science & Engineering and Mathematics
- Lucila Ohno-Machado Medicine
- Alex Orailoglu Computer Science & Engineering
- Deian Stefan Computer Science & Engineering
- Alexander Vardy Electrical & Computer Engineering and Computer Science & Engineering
- Kamala Visweswaran Ethnic Studies
- Ruth Williams Mathematics
- Sonia Martinez Diaz Mechanical and Aerospace Engineering
Learning and Reasoning with Large Data Sets
- John Ahlquist Global Policy & Strategy
- Ludmil Alexandrov Bioengineering
- Christine Alvarado Computer Science & Engineering
- Natasha Balac Qualcomm Institute
- Nuno Bandeira Computer Science & Engineering and Skaggs School of Pharmacy & Pharmaceutical Sciences
- Jelena Bradic Mathematics
- Richard Carson Economics
- Manmohan Chandraker Computer Science & Engineering
- Alex Cloninger Mathematics
- Todd Coleman Bioengineering
- Bruce Cornuelle Scripps Institution of Oceanography
- Jade d’Alpoim Guedes Anthropology and Scripps Institution of Oceanography
- Sanjoy Dasgupta Computer Science & Engineering
- Jeff Elman Cognitive Science
- Hadi Esmaeilzadeh Computer Science & Engineering
- Sicun Gao Computer Science & Engineering
- Joshua Graff Zivin Global Policy & Strategy and Economics
- Barry Grant Molecular Biology
- William Griswold Computer Science & Engineering and Design Lab
- Seth Hill Political Science
- Trey Ideker Health Sciences and Bioengineering
- Daniel Kane Computer Science & Engineering and Mathematics
- Young-Han Kim Electrical & Computer Engineering
- Thomas Liu Center for fMRI, Radiology, Psychiatry, and Bioengineering
- Shachar Lovett Computer Science & Engineering
- Julian McAuley Computer Science & Engineering
- Lucila Ohno-Machado Medicine
- Eran Mukamel Cognitive Science
- Alon Orlitsky Electrical & Computer Engineering
- Piya Pal Electrical & Computer Engineering
- Kim Prather Chemistry & Biochemistry
- Juan Pablo Pardo-Guerra Sociology
- Bhaskar Rao Electrical & Computer Engineering
- Tajana Rosing Computer Science & Engineering
- Rayan Saab Mathematics
- Debashish Sahoo Pediatrics and Computer Science & Engineering
- Alan Simmons Psychiatry
- Brett Stalbaum Visual Arts
- Hao Su Computer Science and Engineering
- Shankar Subramaniam Bioengineering
- Xin Tu Biostatistics and Family Medicine & Public Health
- Zhuowen Tu Cognitive Science
- Nuno Vasconcelos Electrical & Computer Engineering
- Kamala Visweswaran Ethnic Studies
- Bradley Voytek Cognitive Science and Neurosciences
- Ruth Williams Mathematics
- Frank Wuerthwein Physics and San Diego Supercomputer Center
- Yiqing Xu Political Science
- Avi Yagil Physics
- Wenxin Zhou Mathematics
- Peter Gerstoft Scripps Institution of Oceanography and Electrical and Computer Engineering
- Fan Chung Mathematics
Learning from Unstructured Textual Data - Models of Linguistic Data, Natural Language Processing
- Henry Abarbanel Physics and SIO
- Amy Alexander Visual Arts
- Christine Alvarado Computer Science & Engineering
- Natasha Balac Qualcomm Institute
- Nuno Bandeira Computer Science & Engineering and Skaggs School of Pharmacy & Pharmaceutical Sciences
- Leon Bergen Linguistics
- Sanjoy Dasgupta Computer Science & Engineering
- Virginia de Sa Cognitive Science
- Jeff Elman Cognitive Science
- Hadi Esmaeilzadeh Computer Science & Engineering
- Victor Ferreira Psychology
- James Fowler Medicine and Political Science
- Joshua Graff Zivin Global Policy & Strategy and Economics
- Amarnath Gupta San Diego Supercomputer Center
- Young-Han Kim Electrical & Computer Engineering
- Eran Mukamel Cognitive Science
- Ndapa Nakashole Computer Science & Engineering
- Alon Orlitsky Electrical & Computer Engineering
- Juan Pablo Pardo-Guerra Sociology
- Molly Roberts Political Science
- Akos Rona-Tas Sociology
- Brett Stalbaum Visual Arts
- Ruth Williams Mathematics
- Wei Wang Chemistry & Biochemistry and Cellular & Molecular Medicine
- Yiqing Xu Political Science