Developing New Algorithms to Analyze Large Data Sets

Mathematicians are developing new algorithms to make the analysis of large data sets feasible. Mathematics Professor Alexander Cloninger developed an algorithm for detecting blood diseases.

Among the data science projects underway in the Division of Physical Sciences are efforts to extract relevant information from the large data sets produced by research in the natural sciences, engineering, health and the social sciences.

Researchers in the division are developing novel statistical tests and models to handle high-dimensional data. They also create new algorithms to make the analysis of large data sets feasible.

One example of a data science project underway in the division involves the Department of Mathematics’ Alexander Cloninger, who developed a multi-dimensional two-sample test for detecting blood diseases. His novel anisotropic maximum mean discrepancy (MMD) test compares two high-dimensional point clouds while taking into account the local dimensionality and local behavior of the data.

Compared with current methods, the test has stronger statistical power. One application is detecting acute myeloid leukemia (AML) from flow cytometry analysis, an examination of patients’ cells that produces multi-dimensional clouds of thousands of cells based on their chemical and physical characteristics.
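
For readers who want a feel for how a maximum mean discrepancy (MMD) two-sample test works, the sketch below may help. It is not Cloninger's anisotropic test: it swaps in the standard MMD with a single isotropic Gaussian kernel, and it calibrates the statistic with a simple permutation test. The function names, the bandwidth value and the synthetic Gaussian samples are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Isotropic Gaussian kernel matrix between the rows of a and b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two point clouds."""
    return (
        gaussian_kernel(x, x, bandwidth).mean()
        + gaussian_kernel(y, y, bandwidth).mean()
        - 2.0 * gaussian_kernel(x, y, bandwidth).mean()
    )


def permutation_test(x, y, bandwidth=1.0, n_permutations=200, seed=0):
    """Return the observed squared MMD and a permutation p-value."""
    rng = np.random.default_rng(seed)
    observed = mmd2(x, y, bandwidth)
    pooled = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_permutations):
        # Shuffle the pooled points and split them into two pseudo-samples.
        perm = rng.permutation(len(pooled))
        if mmd2(pooled[perm[:n]], pooled[perm[n:]], bandwidth) >= observed:
            exceed += 1
    # The +1 terms keep the p-value strictly positive.
    return observed, (exceed + 1) / (n_permutations + 1)


# Synthetic stand-ins for two point clouds (not real flow cytometry data).
rng = np.random.default_rng(0)
sample_a = rng.normal(0.0, 1.0, size=(100, 5))
sample_b = rng.normal(0.5, 1.0, size=(100, 5))

stat, p_value = permutation_test(sample_a, sample_b, bandwidth=2.0)
print(f"squared MMD = {stat:.4f}, permutation p-value = {p_value:.3f}")
```

A small p-value suggests the two clouds were drawn from different distributions. The anisotropic version described in the article would replace the single fixed-bandwidth kernel with one that adapts to the local geometry of the data, which this sketch does not attempt.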

The two-sample test defines a distance between any two individuals in the study, and these distances are used to build an unsupervised network connecting all participants. The result is a clustering that isolates patients with AML with high precision. The method has other favorable features and applications, including relevance to the analysis of diffusion MRI.
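
The distance-to-network-to-clustering pipeline can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the published method: it uses a standard Gaussian-kernel MMD as the distance between each pair of point clouds, turns the distances into a weighted network, and splits that network in two with a basic spectral cut. The "patients" are synthetic Gaussian clouds, and every name and parameter here is hypothetical.

```python
import numpy as np


def mmd(x, y, bandwidth=1.0):
    """Biased Gaussian-kernel MMD between two point clouds (standard, not anisotropic)."""
    def kernel(a, b):
        sq = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

    value = kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()
    return np.sqrt(max(value, 0.0))


def pairwise_distances(clouds, bandwidth=1.0):
    """Symmetric matrix of MMD distances between every pair of point clouds."""
    n = len(clouds)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = mmd(clouds[i], clouds[j], bandwidth)
    return dist


def spectral_bipartition(dist):
    """Split a distance-based network into two groups using the Fiedler vector."""
    # Convert distances to edge weights, scaled by the median pairwise distance.
    scale = np.median(dist[np.triu_indices_from(dist, k=1)])
    weights = np.exp(-((dist / scale) ** 2))
    np.fill_diagonal(weights, 0.0)
    laplacian = np.diag(weights.sum(axis=1)) - weights
    # eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector.
    _, vectors = np.linalg.eigh(laplacian)
    return (vectors[:, 1] > 0).astype(int)


# Synthetic stand-ins: four clouds from one distribution, four from a shifted one.
rng = np.random.default_rng(0)
group_one = [rng.normal(0.0, 1.0, size=(300, 3)) for _ in range(4)]
group_two = [rng.normal(1.5, 1.0, size=(300, 3)) for _ in range(4)]
clouds = group_one + group_two

distances = pairwise_distances(clouds)
labels = spectral_bipartition(distances)
print(labels)
```

No labels are supplied at any point, which is what makes the clustering unsupervised: the grouping emerges only from how far apart the individuals' point clouds are. Which group receives label 0 or 1 is arbitrary, since the sign of an eigenvector is not fixed.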