COMP_SCI 497: Explanation and reproducibility in data-driven science



CS PhD students or permission of instructor


In this seminar course, we will consider what it means to produce reproducible explanations in data-driven science. As the complexity and size of available data increase, intuitive explanations of what has been learned from data are in high demand. However, what does it mean for an explanation to be accurate and reproducible, and how do threats to the validity of data-driven inferences differ depending on the underlying goal of statistical modeling? Course readings will be drawn from recent and classic literature pertaining to reproducibility, replication, and explanation in data inference published in computer science, statistics, and related fields. The course is structured in three parts. In part one, we will examine recent evidence of problems of reproducibility, replicability, and robustness in data-driven science. In part two, we will examine theories and evidence related to the causes of these problems. In part three, we will consider solutions and open questions. Topics include: ML reproducibility, the social science replication crisis, adaptive data analysis, causal inference, generalizability, and uncertainty communication.

  • This course satisfies the Project or Technical Elective.

Coursework: Students will prepare and lead discussions on selected papers. Coursework includes assignments related to replication and pre-registration, as well as a preliminary study for a research project.

Prerequisites: For graduate students and senior undergraduates only. Students should have practical experience with and exposure to methods in both predictive (e.g., ML) and explanatory (e.g., low-dimensional regression) approaches to statistical modeling. If you are interested in taking this class, contact the CS Department for a permission number.

INSTRUCTOR: Prof. Jessica Hullman