PhD Course Requirements
Group A courses are introductory-level graduate courses in the foundational areas of data science. Group B courses are core graduate-level courses with prerequisites from Group A. Group C courses are advanced, specialized, and free-standing courses, often part of the required courses in the Data Science specialization of the graduate program in other departments. In all three groups, required courses are indicated as such; they cannot be replaced by other courses without an exception approved by the graduate program committee.
The doctoral program is structured as a total of 52 units in courses from Groups A, B, and C, as described in detail here. Out of the 52 units, 48 units (or 12 courses) must be taken for a letter grade, and at least 40 units must come from graduate-level courses.
The remaining 4 (= 52 – 48) units are for professional preparation, consisting of 1 unit of faculty research seminar, 2 units of TA/tutor training, and 1 unit of a survival skills course, each taken for a passing (satisfactory) grade. As noted above, out of the 12 regular courses, at least 10 must be graduate-level courses; at most two can be upper-level undergraduate courses. In addition, 36 units (9 courses) must be completed within six quarters from the start of the degree program.
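The unit arithmetic above can be sketched as a small check. This is an illustrative sketch only, not an official degree-audit tool: the `Course` record, its field names, the `check_units` function, and the placeholder course codes are all hypothetical. The thresholds (52 total units, 48 letter-graded units, at least 40 graduate-level units, 4 professional preparation units) are taken from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the requirements text.
TOTAL_UNITS = 52
LETTER_GRADE_UNITS = 48      # 12 courses x 4 units
MIN_GRADUATE_UNITS = 40      # at least 10 of the 12 letter-graded courses
PROFESSIONAL_PREP_UNITS = 4  # 1 seminar + 2 TA/tutor training + 1 survival skills


@dataclass(frozen=True)
class Course:
    """Hypothetical record of one completed course."""
    code: str
    units: int
    letter_grade: bool    # True if taken for a letter grade, False if S/U
    graduate_level: bool  # False for upper-level undergraduate courses


def check_units(courses):
    """Return a dict mapping each unit rule to True (met) or False (not met)."""
    letter = [c for c in courses if c.letter_grade]
    letter_units = sum(c.units for c in letter)
    grad_units = sum(c.units for c in letter if c.graduate_level)
    prep_units = sum(c.units for c in courses if not c.letter_grade)
    return {
        "letter_grade_units": letter_units >= LETTER_GRADE_UNITS,
        "graduate_units": grad_units >= MIN_GRADUATE_UNITS,
        "professional_prep_units": prep_units >= PROFESSIONAL_PREP_UNITS,
        "total_units": letter_units + prep_units >= TOTAL_UNITS,
    }


# Assumed example plan: 10 graduate courses, 2 upper-level undergraduate
# courses, and the three S/U professional preparation courses.
example_plan = (
    [Course(f"GRAD-{i}", 4, True, True) for i in range(10)]
    + [Course(f"UGRAD-{i}", 4, True, False) for i in range(2)]
    + [
        Course("DSC 293", 1, False, True),
        Course("DSC 599", 2, False, True),
        Course("DSC 295", 1, False, True),
    ]
)

result = check_units(example_plan)
```

For this assumed plan every rule is met: 48 letter-graded units, 40 of them graduate-level, plus 4 professional preparation units, for 52 in total. Replacing one of the graduate courses with a third undergraduate course would drop the graduate-level count to 36 units and fail the `graduate_units` rule.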
Group A: Preparatory Knowledge and Skill Areas
Credit for a maximum of 3 courses
Five knowledge and skill areas are necessary for understanding (and advancing) core data science. It is therefore important that all entering students either have the background preparation or have courses available in the program to ensure successful completion of the doctoral degree program. A student can receive credit toward the Ph.D. degree for a maximum of three courses from the list of courses below.
DSC 200: Data Science Programming; 4 units:
Computing structures and programming concepts such as object orientation, data structures such as queues, heaps, lists, search trees and hash tables. Laboratory skills include Jupyter notebooks, RESTful interfaces and various software development kits (SDKs).
DSC 202: Data Management for Data Science; 4 units:
Principles of data management, relational data model, relational algebra, SQL for data science, NoSQL Databases (document, key–value, graph, column-family), Multidimensional data management (data warehousing, OLAP Queries, OLAP Cubes, Visualizing multidimensional data). [Recommended: Multivariate calculus, optimization]
DSC 210: Numerical Linear Algebra; 4 units:
Linear algebraic systems, least squares problems, orthogonalization methods, ill-conditioned problems, eigenvalue and singular value decomposition, principal component analysis.
DSC 211: Introduction to Optimization; 4 units:
Continuity and differentiability of a function of several variables, gradient vector, Hessian matrices, Taylor approximation, fundamentals of optimization, Lagrange multipliers, convexity, gradient descent. [Prerequisite: DSC 210]
DSC 212: Probability and Statistics for Data Science; 4 units:
Probability, random variables, distributions, central limit theorem, maximum likelihood estimation, method of moments, confidence intervals, hypothesis testing, Bayesian estimation, introduction to simulation and the bootstrap.
Group B: Core Knowledge and Skill Areas
Doctoral students are required to take a minimum of 6 courses for letter-grade credit from Group B. Students can take more than 6 courses from this group to satisfy the letter-grade course requirements, which exclude the professional preparation courses (teaching, survival skills, and research seminar) completed on a satisfactory basis. Students who have satisfied all letter-grade course requirements are expected to enroll in individual research (DSC 298) in a section offered by their faculty advisor to meet residency requirements and maintain graduate student standing during the period of dissertation research.
Four core courses are required for all Ph.D. students, including those with a bachelor's degree in Data Science. The four required courses are:
Required Core Courses [minimum 4 courses]
Required Core Courses [minimum 2 courses]
DSC 240: Machine Learning; 4 units:
A graduate-level course in machine learning algorithms: decision trees, principal component analysis, k-means, clustering, logistic regression, random forests, boosting, neural networks, deep learning. [Prerequisites: DSC 210 + DSC 212] [Recommended Preparation: multivariate calculus and optimization. In addition to the listed prerequisites, undergraduate-level calculus is required.]
DSC 260: Data Ethics and Fairness; 4 units:
Ethical considerations regarding privacy and control of information. Principles of fairness, accountability, and transparency. Use of metadata to inform algorithms. Algorithmic fairness. Policy issues such as the Fair Information Practices Principles Act, and laws concerning the “right to be forgotten.”
DSC 241: Statistical Models; 4 units:
Linear/nonlinear models, generalized linear models, model fitting and model selection (cross-validation, knockoffs, etc.), regularization and penalization (ridge regression, lasso, etc.), robust methods, nonparametric regression, conformal prediction, causal inference.
[Prerequisites: DSC 210 + DSC 212] [Recommended Preparation: basic programming (e.g., DSC 200) or prior exposure to a language like R, Python, Matlab, etc., is required.]
DSC 204A: Scalable Data Systems; 4 units:
Storage/memory hierarchy, distributed scalable computing (i.e., cluster, cloud, edge) principles. Big Data storage, management and processing at scale. Dataflow programming systems and programming models (MapReduce/Hadoop and Spark).
[Prerequisite: DSC 202]
DSC 206: Algorithms for Data Science; 4 units:
With the advent of large-scale machine learning, online social networks, and computationally intensive models, data scientists must deal with data that is massive in size, arrives fast, and must be processed in an interactive or online manner. This course studies the mathematical foundations of massive data processing, developing algorithms and analyzing them. We explore methods for sampling, sketching, and distributed processing of large-scale databases, clustering, dimensionality reduction, and methods of optimization for the purpose of scalable statistical description, querying, pattern mining, and learning from data. [Prerequisite: DSC 212]
DSC 203: Data Visualization and Scalable Visual Analytics; 4 units:
Commonly used algorithms and techniques in data visualization. Interactive reasoning and exploratory analysis through visual interfaces. Application of data visualization in various domains including science, engineering, and medicine. Scalable interactive methods involving exploring with big data and visualization methods. Techniques to evaluate effectiveness and interpretability of analytical products for diverse users to obtain insights in support of assessment, planning, and decision making. [Prerequisite: DSC 202]
DSC 204B: Big Data Analytics & Applications; 4 units:
This course is a hands-on introduction to big data analytics. Topics covered include:
I/O bottleneck and the memory hierarchy; HDFS and Spark; RMS error minimization, PCA and percent of variance explained. Analysis of NOAA weather data. Data collection and curation. Limitations of train/test methodology and leaderboards. K-means and intrinsic dimension. Classification, boosting and XGBoost. Margins. Neural networks and TensorFlow.
Students will develop the skills and attitudes required to write Jupyter notebooks that can be understood by domain experts. [Prerequisites: DSC 200 + DSC 210 + DSC 212]
DSC 215: Statistical Critical Thinking; 4 units:
We hold science in high regard; however, not all scientific claims are correct. How do we know which claims to trust and which not to? This fundamental question is at the heart of this course. The goal of this course is to enable the student to evaluate any paper in data science, regardless of application area. Topics covered include experimental design, claims, evidence and statistical significance, the replication crisis, falsifiability, philosophy of science, and the history of probability and statistics. About half of the class meetings, as well as the final project, will be devoted to evaluating contemporary papers in data science. This class will be in the form of an open discussion, based on provided reading materials. The only prerequisite is a statistics class that covers hypothesis testing and p-values. [Recommended Preparation: Hypothesis testing and p-values, basic statistics.]
DSC 242: High-dimensional Probability and Statistics; 4 units:
Concentration inequalities, Markov processes and ergodicity, martingale inequalities, empirical processes, sparse linear models in high dimensions, principal component analysis in high dimensions, estimation of large covariance matrices.
[Recommended Preparation: undergraduate probability theory.]
DSC 243: Advanced Optimization; 4 units:
Linear/quadratic programming, optimization under constraints, gradient descent (deterministic and stochastic), convergence rate of gradient descent, acceleration phenomena in convex optimization, stochastic optimization with large data sets, complexity lower bounds for convex optimization. [Prerequisites: DSC 211 + DSC 212]
DSC 244: Large-Scale Statistical Analysis; 4 units:
Exploratory data analysis, diagnostics, bootstrap, large-scale (multiple) hypothesis testing, false discovery rate, empirical Bayes methods. [Prerequisites: DSC 210 + DSC 212 + DSC 241]
DSC 245: Introduction to Causal Inference; 4 units:
Causal versus predictive inference, potential outcomes and randomized experiments (A/B testing), structural causal models (interventions, counterfactuals, causal diagram, do-operator, d-separation), identification of causal effect (back-door and front-door criterion, do-calculus), estimation of causal effect (matching, propensity score, g-computation, doubly robust estimation, regression discontinuity and instrumental variables, conditional effects), structure learning (constraint and score-based algorithms), advanced topics (mediation and path-specific effects, bounding causal effect, selection bias, external validity and transportability, processing missing data, causal inference in networks). [Prerequisites: DSC 212 + DSC 240]
DSC 250: Advanced Data Mining; 4 units:
Graph mining and basic text analysis (including keyphrase extraction and generation), set expansion and taxonomy construction, graph representation learning, graph convolutional neural networks, heterogeneous information networks, label propagation, and truth finding.
[Recommended Preparation: knowledge about Machine Learning and Data Mining, coding with Python, C/C++, Java; statistics.]
DSC 261: Responsible Data Science; 4 units:
Responsible data management, algorithmic fairness (fairness definitions, impossibility results, causal fairness, building fair ML models, fairness beyond classification), algorithmic transparency (interpretability vs explainability, auditing black-box algorithms, algorithmic recourse), privacy and data protection, sampling bias, reproducibility.
[Prerequisites: DSC 212 + DSC 240] [Recommended Preparation: machine learning, causal inference, data management.]
Group C: Professional Preparation and Elective Courses
Group C courses aim to provide either practical experiences in chosen specialization areas, or advanced training for students preparing for doctoral programs.
Professional Preparation Courses:
Required professional preparation courses include: 2 units of TA/tutor training (DSC 599), 1 unit of academic survival skills (DSC 295), and 1 unit of faculty research seminar (DSC 293), all of which must be completed with a Satisfactory (S) grade using the S/U option.
DSC 599: TA/Tutor Training; 2 units (S/U):
Expected TA duties, evaluation methods. Rules governing TA appointment, conduct and evaluation. Practice effective teaching strategies including communications with students and instructors, conduct of discussion sessions, formulating learning objectives and implementation of active learning strategies.
DSC 293: Faculty Research Seminar; 1 unit (S/U):
Weekly faculty research seminar. Individual HDSI colloquia and distinguished lecturers may be included at the discretion of the instructor.
DSC 294: Research Rotation; 4 units (S/U):
Special topics research under the direction of an HDSI faculty member. The research topics may include training in specific research methodologies consisting of practical laboratory skills, computational skills or proof systems in a research group/laboratory in which the student may pursue doctoral dissertation research.
Prerequisites: Data Science graduate students and consent of the instructor.
DSC 295: Academia Survival Skills; 1 unit (S/U):
Basic skills necessary to succeed as a researcher in Data Science, including scripting, cloud computing skills, fellowship proposal preparation, CV preparation, writing reviews, preparing posters, etc.
General Elective Courses:
Courses here aim to provide advanced training for students in the doctoral programs, or practical experiences in chosen specialization areas. Students can choose from the following elective or specialization tracks. Additional elective courses will be offered based on faculty interest and availability.
Data Science Electives
DSC 205: Geometry of Data; 4 units:
Graph-based data modeling, analysis and representation. Topics include: spectral graph theory, spectral clustering, kernel-based manifold learning, dimensionality reduction and visualization, multiway data analysis, multimodal and multiview data representation, graph neural networks. [Prerequisites: DSC 210 or ECE 269 + DSC 212 + DSC 240] [Recommended Preparation: Matlab/Python coding, linear algebra, probability theory/statistics. Review basic linear algebra (inner products, orthogonality, eigen-decomposition) and probability theory (multivariate random variable, statistical independence, covariance).]
DSC 213: Statistics on Manifolds; 4 units:
This is a graduate topics course covering statistics with manifold constraints. Topics include: Fréchet means and variances, principal geodesic analysis, directional statistics, random fields on manifolds, statistical distances between distributions, transport problems, and information geometry. Manifold constraints will be considered on simplexes, spheres, Stiefel manifold, stratified manifolds, cone of positive definite matrices, trees, compositional data, and other relevant manifolds. [Prerequisites: DSC 210 + DSC 212] [Recommended Preparation: Differential geometry.]
DSC 214: Topological Data Analysis; 4 units:
Topological methods provide powerful tools for analyzing complex data. This course introduces basic concepts and topological structures, as well as recent theoretical and algorithmic developments, together with examples of applications. Some topics include: basics in topology, simplicial complexes to model data, persistent homology, discrete Morse theory, topology inference, the Mapper methodology, hierarchical clustering, and integration of topological methods with machine learning. [Recommended Preparation: Linear Algebra and programming.]
DSC 231: Embedded Sensing and IOT Data Models and Methods; 4 units:
Sensory data and control are mediated by devices near the edge of sensor networks, referred to as IOT (Internet of Things) devices. Components of IOT platforms: signal processing, communications/networking, control, real-time operating systems. Interfaces to cloud computing stack, publish-subscribe protocols such as MQTT, embedded software/middleware components, metadata schema, metadata normalization methods, applications in selected cyber-physical systems (CPS). [Recommended Preparation: embedded systems and embedded software, basic courses in digital hardware, algorithms and data structures, programming, and computer architecture.]
DSC 251: Machine Learning in Control; 4 units:
Estimation of stability and uncertainty, optimal control, and sequential decision making. [Prerequisites: DSC 211 + DSC 240] [Recommended Preparation: Probability theory.]
DSC 252: Statistical Natural Language Processing; 4 units:
Diving deep into the classical NLP pipeline: tokenization, stemming, lemmatization, part-of-speech tagging, named entity recognition, parsing, and machine translation. Finite-state transducer, context-free grammar, Hidden Markov Models (HMM), and Conditional Random Fields (CRF) will be covered in detail. [Recommended Preparation: Introduction-level Machine Learning.]
DSC 253: Advanced Data-driven Text Mining; 4 units:
Unsupervised, weakly supervised, and distantly supervised methods for text mining problems, including information retrieval, open-domain information extraction, text summarization (both extractive and generative), and knowledge graph construction. Bootstrapping, comparative analysis, learning from seed words and existing knowledge bases will be the key methodologies. [Recommended Preparation: knowledge about Machine Learning and Data Mining, coding with Python, C/C++, Java; statistics.]
DSC 254: Statistical Signal and Image Analysis; 4 units:
A graduate-level course on signal and image analysis spanning three main themes. Statistical signal processing: random processes, stochasticity, stationarity, Wiener filter, Kalman filter, matched filter; Signal processing: time-frequency representations, wavelets, signal processing with sparse representation (dictionary learning); Image processing: registration, image degradation and restoration: noise models + denoising, image pyramids, random fields. [Prerequisites: DSC 210 or ECE 269 + DSC 212 + DSC 220]
Possible electives from other disciplines
CSE 234: Data Systems for Machine Learning; 4 units:
Data management and systems issues across the whole lifecycle of ML-based analytics in real-world applications, including: data sourcing, preparation, and organization for ML; programming models and systems for scalable ML training, feature engineering, and model selection; systems for ML inference, deployment, and explanations; and governed ML platforms and feature stores.
MATH 281A-B-C: Mathematical Statistics (4-4-4 units).
Math 281A covers statistical models, sufficiency, efficiency, optimal estimation, least squares and maximum likelihood, and large sample theory. Math 281B continues with hypothesis testing and confidence intervals, one-sample and two-sample problems, Bayes theory, statistical decision theory, linear models, and regression. Math 281C finishes the sequence with nonparametrics: tests, regression, density estimation, bootstrap and jackknife.
MATH 284: Survival Analysis; 4 units:
Survival analysis is an important tool in many areas of application, including biomedicine, economics, and engineering. It deals with the analysis of time-to-event data with censoring. This course discusses the concepts and theories associated with survival data and censoring, comparing survival distributions, proportional hazards regression, nonparametric tests, competing risk models, and frailty models. The emphasis is on semiparametric inference, and material is drawn from recent literature.
MATH 285: Stochastic Processes; 4 units:
Elements of stochastic processes, Markov chains, hidden Markov models, martingales, Brownian motion, Gaussian processes.
[Recommended Preparation: undergraduate probability theory.]
MATH 287A: Time Series Analysis; 4 units:
Discussion of finite parameter schemes in the Gaussian and non-Gaussian context. Estimation for finite parameter schemes. Linear vs. nonlinear time series. Stationary processes and their spectral representation. Spectral estimation.
[Students who have not taken MATH 282A may enroll with consent of the instructor.]
MATH 287B: Multivariate Analysis; 4 units:
Bivariate and more general multivariate normal distribution. Study of tests based on Hotelling’s T². Principal components, canonical correlations, and factor analysis will be discussed as well as some competing nonparametric methods, such as cluster analysis.
[Students who have not taken MATH 282A may enroll with consent of the instructor.]
MATH 287D: Statistical Learning Theory; 4 units:
Topics include regression methods: (penalized) linear regression and kernel smoothing; classification methods: logistic regression and support vector machines; model selection; and mathematical tools and concepts useful for theoretical results such as VC dimension, concentration of measure, and empirical processes.
COGS 243: Statistical Inference and Data Analysis; 4 units:
This course provides a rigorous treatment of hypothesis testing, statistical inference, model fitting, and exploratory data analysis techniques used in the cognitive and neural sciences. Students will acquire an understanding of mathematical foundations and hands-on experience in applying these methods using Matlab.
Cognitive science PhD students must enroll for four units and will be required to do assignments and a final project. All other students can enroll for two units and will be required to complete all assignments but not a final project (or, by request, a final project in place of the assignments).