HDSI Faculty Feature Article Series
with Arya Mazumdar, Ph.D., Associate Professor, UC San Diego
by Trista Sobeck & Bobby Gordon
Data science is deeper and bigger than many think it is. It entails extracting information that is used for making decisions, making predictions, or even strategizing. There are layers upon layers to data science. It is the true meeting place of statistics and computer science, two disciplines that have only just begun to hit their stride within the last century.
And Professor Arya Mazumdar says that data science is the science of the twenty-first century.
“I think ‘data science’ is extracting information from data,” he says. “How much information we can extract out of data is related directly with how much structure we have in data, and extracting it will reveal key properties of the data. To me this is data science,” concludes Professor Mazumdar.
Arya earned his Ph.D. from the University of Maryland, College Park, in 2011, specializing in information theory. “Information theory deals with the notion of measures of information from a statistical perspective and forms the foundational science for reliable communication,” he explains. While in school, Arya became enchanted with error-correcting codes, which became the topic of his Ph.D. thesis.
“From that time, I viewed any dataset to be a facade that hides some simple structure. And information-theoretic thinking can be useful to design algorithms to extract information about that structure,” he says.
From there, as he kept asking questions, he saw that the same ways of thinking are useful in many unsupervised and semi-supervised machine learning problems. That is when he started working on statistical learning theory, which forms the theoretical foundation of data science.
Today, at HDSI, he researches statistical reconstruction (that is, extracting structure from noisy data) and information-computation trade-offs, and he teaches Theoretical Foundations of Data Science, Algorithms for Data Science, Probability and Statistics, and Optimization Methods. “My research and teaching are both representative of the foundational aspects of data science in HDSI,” he says. That is fitting for someone whose interests stemmed from an information theory course.
A Weird Geometry of Data
Arya is interested in the high dimensionality of data. “Since in modern applications of machine learning there can be a very large number of features, the algorithms often work against our intuition because of the weird geometry of data,” he says.
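One common way to see this “weird geometry” is distance concentration: as the number of features grows, the nearest and farthest points from a given query end up almost equally far away. The short Python sketch below is an illustration added here, not something taken from Professor Mazumdar’s work. The function name relative_contrast, the uniform random data, the point count, and the seed are all assumptions chosen for the example.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (max_dist - min_dist) / min_dist for distances from one
    random query point to n_points random points in the unit cube [0, 1]^dim.

    A large value means "near" and "far" neighbors are clearly different;
    a value close to 0 means all points are roughly equally far away.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # Compare a low-dimensional setting with increasingly high-dimensional ones.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

In runs of this sketch the contrast is large in two dimensions and shrinks steadily as the dimension grows, which is one reason nearest-neighbor-style intuition can mislead when data has a very large number of features.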
He is also interested in data heterogeneity. “Many times data of different nature are mixed, and inferring meaning out of such mixtures is significantly challenging. A somewhat related theme is to come up with algorithms that are robust to adversarial perturbations. While this phenomenon is well-studied in statistics, in machine learning such perturbation can also happen in the prediction time,” he explains.
De-mixing is also the topic of two papers by Arya and his students at the upcoming NeurIPS conference. “Both papers provide efficient algorithms for the respective problems,” he says.
- “Support Recovery of Sparse Signals from a Mixture of Linear Measurements,” with student Soumyabrata Pal and postdoc Venkata Gandikota. This paper deals with de-mixing data coming from different models.
- “Fuzzy Clustering with Similarity Queries,” with student Soumyabrata Pal and Wasim Huleihel, a collaborator from Tel Aviv University. In this paper, the team shows how to segregate data by using some minimal information.
Professor Mazumdar’s first piece of advice for someone who is just getting interested in the field of data science is to work with some real data so you can understand the challenges of data science. “Real datasets are widely available, and simple exercise with them will enhance your understanding of the topic, which will be useful both in academic research and in industry,” he explains.
Second, data-driven methods can differ widely across disciplines. “While every data scientist should have some basic knowledge, it is useful to have expertise on some aspects of data science or in some domain,” he says.
The final, but perhaps most important, piece of advice Prof. Mazumdar has is to be responsible with data work. “Data science is a quickly evolving field and critical decision-making may hinge on your analysis of data. Be mindful of that.”
Follow Arya Mazumdar on Twitter @MountainOfMoon
Arya Mazumdar’s website: http://mazumdar.ucsd.edu