Enabling Scientific Discovery Cluster


Data-enabled Computational Science

Researchers in this group are drawn together from the ongoing Center for Computational Mathematics, which administers the campus-wide Computational Science, Mathematics and Engineering (CSME) graduate program. With the rise of data, the CSME area has evolved into Data-enabled Computational Science, which seeks to advance and make available integrated approaches to massively parallel computation, from architectures to algorithms, as building blocks for scientists and engineers.

Understanding and Predicting Dynamics of Complex Biological Systems - Biomedical Informatics

Although advances in biomedical technology have led to the first direct glimpses of how distributed biological networks of many kinds support our experience, behavior, and cognition, these views are still highly limited in spatiotemporal resolution and detail. Continued advances in microelectronics and computing capabilities make it possible to measure the dynamics of biological systems with increasingly high spatial and temporal resolution. Clinical windows even afford the possibility of recording this activity on multiple scales simultaneously. Signal processing of high-dimensional data to model relationships between complex brain dynamic patterns, experience, and behavior (that is, natural cognition in complex environments) is an increasing challenge. Its impact on medicine can be enormous, for instance in methods to identify, validate, and optimize usable biomarkers of neurological and psychiatric diseases. Tackling these and related problems will require drawing on continuing advances in data science across fields.

Statistics in Biology and Health Sciences

We aim to address the twin challenges of inferential rigor and scientific understanding in the biological and health sciences. Scientific discovery requires rigorous inference to ensure the validity of the studies conducted and the scientific conclusions drawn from them. The validity of the inference depends on explicit and implicit assumptions about the data, and rigorous verification of those assumptions is required. Statistical rigor is imperative for the reproducibility of scientific discoveries. Scientific understanding requires formulating answerable questions and interpretable models. While every scientific study involves experts on the particular subject matter, even experts can have difficulty formulating precise, testable hypotheses, designing studies that address those hypotheses, and developing and fitting appropriate statistical models that directly address them. We will combine strategies for developing interpretable and appropriate statistical models with statistically rigorous inference to address these data science challenges.

Open Collaborative Ecosystem for Advanced Neuroimaging eXploration (OCEANX)

We aim to address challenges in the analysis and processing of large and complex neuroimaging datasets that are acquired across a diverse range of studies of brain physiology, function, and structure. Our approach is to create a flexible storage and computing environment that will enable investigators to readily take advantage of state-of-the-art data science tools and to collaborate more easily across studies. We envision that OceanX will serve as a unique testbed for discovery, collaboration, and training in the neuroimaging data sciences. It will also enable research aimed at determining the environments, processes, and methods that best foster discovery and collaboration.

Computational Discovery in Material Sciences

We aim to develop data science as a powerful accelerator for materials discovery and design across all scales, from atomistic manipulation to nanoscale properties to device-level integration. In recent years, data science has emerged as an increasingly important tool for materials science, owing to the advent of vast computing power and efficient, accurate quantum chemical codes, as well as the development of combinatorial experimental techniques. The consequence of both trends is an explosion in the quantity of materials data generated. Increasingly, the key challenge is developing analytics that generate useful insights and design principles from this large body of materials data, that is, decoding the Materials Genome.

Computational Neurosciences

Computational neuroscience is a unique source of data science innovation, both because it requires advancing data analytic methods to grapple with high-dimensional, structured, dynamic neural signals and because its subject of study, the human mind and brain, is the most effective natural learning system we know of. Computational neuroscience thus offers two avenues for advancing data science: innovation in analytic methods to understand the neural substrates of the human mind, and the design of computational models to emulate natural intelligence.

Geosciences and Climate/Weather Predictions

Physical models and hypotheses are central to Geosciences and Climate/Weather Predictions. Earth science data sets tend to be poorly sampled, noisy, and incomplete, and are often difficult to use with standard machine learning algorithms. It would be transformative to develop a modeling framework that combines machine learning and physical modeling. In this group we use data science tools to solve inverse problems in ocean and earth observations. This will improve geophysical predictions such as those of weather and wildfires.


Understanding large biological datasets from high-throughput methods such as DNA sequencing and mass spectrometry poses considerable challenges. Advanced analytical methods are required for problems ranging from genomic evolution to gene expression to studies of human, animal and environmental microbiomes. Typical issues include sparse and/or compositional data, highly multivariate data with far more dimensions than samples, and the need to integrate with heterogeneous phenotype data including imaging and clinical records. Solving these problems has the potential to revolutionize our understanding of the living world, and our ability to control it in medical and technological applications.

Sensing Data and Sensor Networks

Sensor networks are used in most observation systems. Examples range from classical seismic sensor networks to modern 5G telecommunications, both of which use wave propagation data. Other types of networks record temperature or textual information, and a sensor network might evolve as observations accumulate. Array processing has been the main workhorse here. Graph signal processing might provide a more general approach to processing the data.

Traditional approaches to traffic engineering and network deployment rely on generic modelling assumptions and rule-of-thumb over-provisioning. Future-generation systems, such as 5G systems, aspire to network a vastly larger variety of devices in support of highly diverse applications. The design and operation of these expensive, complex, interconnected systems will be increasingly data driven and can benefit from advances in machine learning algorithms. Our goals in this regard include (1) creating datasets that capture city-scale data traffic and mobility patterns and (2) developing algorithms that infer numerous measures of value to network designers and operators, as well as to multiple disciplines, including public health, mental health, environment, transportation, and energy usage.