Data Science Theory, Methods and Tools

Researchers in this cluster work on the theoretical foundations of Data Science, design machine learning algorithms with provable guarantees, and develop broadly useful methods and tools that help practitioners cope with the “deluge” of data produced by ever-growing data sources. Researchers with core expertise in algorithms, mathematics, and statistics work with domain experts in areas where there is a perceived benefit to collecting large amounts of data. The constant interplay between the particulars of a domain and the generality of methods is essential to the advances we seek in algorithmic data science.

High-dimensional Data Analysis - Dimensionality Reduction, Optimization Methods

This group of researchers has expertise in a range of methods, including multivariate analysis (and its recent growth into unsupervised learning), clustering, dimensionality reduction, and reconstruction. Our focus is on analytic methods and tools that enable data professionals to efficiently navigate and analyze real-world data influenced by a large number of parameters.
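
As a minimal illustration of one technique in this toolbox, the sketch below performs principal component analysis via the singular value decomposition in NumPy; the synthetic data, the noise level, and the choice of two components are assumptions made for the example, not a description of any project in this cluster.

```python
# Minimal sketch: dimensionality reduction via PCA computed with the SVD.
# The synthetic data and the choice of 2 components are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# 500 observations in 50 dimensions, but most variance lies in a 2-D subspace.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 50))

# Center the data and take the SVD; the right singular vectors are the principal axes.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                          # number of components to keep
scores = Xc @ Vt[:k].T         # low-dimensional representation of the data
explained = (s[:k] ** 2).sum() / (s ** 2).sum()
print(f"projected shape: {scores.shape}, variance explained by {k} components: {explained:.3f}")
```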

Computer-intensive and Non-parametric Statistical Methods

With the advent of the personal computer in the latter part of the 20th century, statisticians have been gradually moving away from parametric models, which often rely on restrictive and/or unreliable assumptions, toward more flexible nonparametric models. These include the resampling/bootstrap, subsampling/jackknife, and cross-validation methods, which give practitioners a general way to conduct statistical inference (e.g., hypothesis tests, confidence intervals, and prediction) in a nonparametric context. Short-term goals of this cluster include: a) bootstrap prediction intervals for the volatility of financial data; b) permutation tests applied to modern detection problems; c) improved estimation of conditional distributions in regression; d) model-free bootstrap for nonparametric regression; and e) multiple hypothesis testing and control of false discovery rate via subsampling.
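
For concreteness, the sketch below shows one of the simplest tools on this list, a percentile bootstrap confidence interval for a mean; the simulated sample, the number of resamples, and the 95% level are illustrative assumptions, and the method shown is the textbook bootstrap rather than any of the specific procedures under development here.

```python
# Minimal sketch: a percentile bootstrap confidence interval for the mean.
# The data, resample count, and 95% level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=200)   # skewed sample; no parametric model assumed

B = 2000                                    # number of bootstrap resamples
boot_means = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=x.size, replace=True)  # resample with replacement
    boot_means[b] = resample.mean()

lo, hi = np.percentile(boot_means, [2.5, 97.5])          # 95% percentile interval
print(f"sample mean {x.mean():.3f}, 95% bootstrap CI ({lo:.3f}, {hi:.3f})")
```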

Analysis of Time Series and Dependent Data

Dependent data are typically obtained as time series, random fields, spatial data, or marked point processes. Applications of statistical methods for dependent data are numerous in the fields of physics, engineering, signal processing, medical imaging, acoustics, geostatistics, geophysics, epidemiology, econometrics, finance, marketing, meteorology, environmental science, forestry, seismology, oceanography, and others. Our ongoing research concerns include: a) improved multi-step-ahead forecasting for multivariate financial returns; b) model-free analysis of time-varying correlations in financial data; c) statistical analysis of fMRI and medical imaging data in an effort to identify connectivity biomarkers in the brain; d) statistical analysis of locally stationary time series with application to climate data; e) estimation in functional time series models with application to traffic data; and f) time series analysis of high-dimensional microbiome and metabolome data.
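
As a toy illustration related to item a), the sketch below fits an AR(1) model by least squares and iterates the fitted recursion to produce multi-step-ahead forecasts; the simulated series, the AR(1) specification, and the five-step horizon are assumptions made for the example.

```python
# Minimal sketch: multi-step-ahead forecasting with an AR(1) model fit by least squares.
# The simulated series, AR(1) order, and 5-step horizon are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)

# Simulate an AR(1) series: y_t = 0.7 * y_{t-1} + noise
n = 300
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.normal(scale=1.0)

# Least-squares fit of y_t on y_{t-1} (with intercept).
Y = y[1:]
X = np.column_stack([np.ones(n - 1), y[:-1]])
intercept, phi = np.linalg.lstsq(X, Y, rcond=None)[0]

# Iterate the fitted recursion to forecast h steps ahead.
h, last = 5, y[-1]
forecasts = []
for _ in range(h):
    last = intercept + phi * last
    forecasts.append(last)
print("5-step-ahead forecasts:", np.round(forecasts, 3))
```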

Accelerated Learning Methods: Hardware and Software

As learning methods continue to find new applications and enable new system-level capabilities such as automated driving, efficient implementation of these methods in customized hardware/software solutions becomes essential for continued proliferation to new platforms. This group of researchers explores algorithmic, architectural, and hardware-accelerator designs and co-design methods to provide orders-of-magnitude improvements in the performance and energy efficiency of machine learning systems.

Experimental Design and Hypothesis Testing

Backed by large data sets and sophisticated reasoning tools, poorly designed experiments can easily lead researchers to false conclusions, only with more confidence. To reduce false discoveries, automated exploration of large data sets to establish a scientific fact or to prove or disprove an assertion requires careful design of data experiments and statistical analyses, especially in online settings. We explore mathematical foundations, formal methods, and tools that help Data Science practitioners design sound experiments and draw conclusions at specified confidence levels.
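
To make the multiple-testing concern concrete, the sketch below applies the classical Benjamini-Hochberg step-up procedure to a vector of p-values to control the false discovery rate at a specified level; the simulated p-values and the 5% target are illustrative assumptions, and this standard procedure is shown only as background, not as the methods this group is developing.

```python
# Minimal sketch: Benjamini-Hochberg step-up procedure for false discovery rate control.
# The simulated p-values and the 5% target FDR are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)

# 900 true nulls (uniform p-values) mixed with 100 non-nulls (p-values near zero).
pvals = np.concatenate([rng.uniform(size=900), rng.beta(0.1, 10.0, size=100)])

def benjamini_hochberg(p, q=0.05):
    """Return a boolean mask of rejected hypotheses at target FDR level q."""
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m   # BH thresholds q*i/m for ranked p-values
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()         # largest rank meeting its threshold
        rejected[order[: k + 1]] = True        # reject all hypotheses up to that rank
    return rejected

mask = benjamini_hochberg(pvals, q=0.05)
print(f"rejected {mask.sum()} of {pvals.size} hypotheses at FDR 0.05")
```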

Causality and Inference

The inference of causality from empirical data is one of the deepest and most central goals in science. Arguably, virtually all aspects of scholarly inquiry involve the search for the causal forces that shape physical, social, mental, and other phenomena. The question is one that has vexed and challenged scholars across a wide range of disciplines. A variety of approaches have been proposed, each of which has strengths as well as limitations. This group brings together faculty who deal with diverse sets of data and phenomena across different disciplines but are joined by a common interest in exploring and applying existing methods for inferring causality, as well as in developing new approaches.

Databases and Data Processing Principles and Systems

This group studies foundational principles and builds data management and processing systems for ingesting, integrating, cleaning, storing, querying, analyzing, and maintaining data of large volume and variety. Major research directions pursued by members of this group include (but are not limited to): scalable analytics over graphs; analytics over distributed and heterogeneous information sources; specification and verification of data-centric workflows and data-driven applications; privacy and access control in data exchange and analytics; abstractions and systems for ML-based data analytics; data sourcing issues in ML applications; integrating data management systems with deep learning-based machine perception for new analytics and interaction capabilities; inference and learning in the IoT; extending database platforms and query processors beyond centralized relational databases to semistructured data, distributed datasets, and spatiotemporal and IoT data; infrastructure for data-driven applications and data discovery; datacenter and cluster network topologies and protocols; the interface between computer systems and the network; packet and circuit switching; power- and cost-efficient data-intensive computing; foundations of query languages for structured and unstructured data; managing incomplete and imprecise data; scientific workflows; data and process provenance; reproducibility; scalable data-driven computing; process management for the practice of data science; performance management for heterogeneous distributed computing; information integration over heterogeneous data sources; ontology management; polystore solutions to support heterogeneous analytics; social data management and query processing; and text analytics for social and biomedical applications.

Streaming and Sub-linear Learning Algorithms

Traditional algorithms need to read and manipulate the entire input given to them in order to compute a solution to the problem they are designed to solve. The amounts of data involved in machine learning make many of these traditional algorithms too costly to use in practice. One approach to resolving this is the design of sub-linear algorithms, which use sampling techniques to examine only a small portion of the input, with the guarantee that, with high probability, the sample is representative of the full input. Examples include algorithms on large sparse graphs, such as social networks or networks arising in biology. Another approach is streaming algorithms, which process the entire input but at any given point remember only a concise representation of the important information seen so far. This is also critical when the data is observed on the fly and cannot be stored in memory due to its volume.
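
As a concrete instance of the sampling idea described above, the sketch below implements reservoir sampling, a classical streaming algorithm that maintains a uniform random sample of fixed size over a stream of unknown length using memory proportional to the sample size; the example stream and the sample size are illustrative assumptions.

```python
# Minimal sketch: reservoir sampling, a streaming algorithm that keeps a uniform random
# sample of k items from a stream of unknown length using O(k) memory.
# The stream and the sample size k are illustrative assumptions.
import random

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # replace an existing item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 items from a stream of a million integers without storing the stream.
print(reservoir_sample(range(1_000_000), k=5))
```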

Models and Analysis of Multimedia Data

Data Security and Privacy

Learning and Reasoning with Large Data Sets

Learning from Unstructured Textual Data

Models of Linguistic Data, Natural Language Processing