Events
-
Jul 31
EVENT DETAILS
Data analysis processes are regularly employed to inform high-stakes decisions. However, typical data analysis workflows overlook the considerable subjectivity that is baked into the analysis process and how it might affect the results. This subjectivity reflects ontological uncertainty: implicit, qualitative uncertainty about how data should be analysed and modelled. While statisticians have proposed techniques such as multiverse analysis to surface implicit ontological uncertainty, analysts currently lack the tools to implement, evaluate and communicate the results of such analyses. This dissertation bridges that gap by providing a pipeline for systematically reasoning about the ontological uncertainty that is implicit in data analysis. In the first study, I developed and evaluated multiverse, an R library that lowers the barrier to implementing multiverse analysis. The library provides flexible and expressive syntax that allows analysts to declare any alternative data analysis step through local changes in code. It is designed to integrate into both computational notebook and scripting workflows, and it optimises execution by pruning redundant computations. I evaluated how the multiverse R library supports programming multiverse analyses using (a) principles of cognitive ergonomics, and (b) case studies based on semi-structured interviews with researchers who have successfully implemented an end-to-end analysis using multiverse. I identified design trade-offs (e.g., increased flexibility versus learnability) and suggested future directions for supporting analysts in adopting multiverse analyses (e.g., how to evaluate a multiverse analysis).
In the second study, I addressed the issue of evaluation by first identifying principles for validating the composition of, and interpreting the uncertainty in, the results of a multiverse analysis. I then designed Milliways, a novel interactive visualisation system, to support the principled validation and interpretation of multiverse analyses. Milliways provides interlinked panels presenting result distributions, individual analysis composition, multiverse code specification, and data summaries. In the third study, I compared two approaches for depicting ontological uncertainty (ensembles and p-boxes) by conducting experiments to investigate how the visual representation affects the way multiple uncertainty distributions are interpreted. Based on these results, I identified how the results of multiverse analyses should be visualised so that viewers adopt the desired (possibilistic) interpretation of ontological uncertainty. Together, these three studies outline a systematic approach for surfacing, reasoning about, and communicating the ontological uncertainty that is often implicit in data analysis processes.
TIME Thursday, July 31, 2025 at 2:00 PM - 4:00 PM
LOCATION ITW, Ford Motor Company Engineering Design Center
CONTACT Wynante R Charles wynante.charles@northwestern.edu
CALENDAR Department of Computer Science (CS)
-
Aug 4
EVENT DETAILS
How much does prior knowledge about the number of clusters, $k$, influence the statistical feasibility and algorithmic performance of clustering methods? In this thesis, we explore two complementary clustering paradigms to elucidate the central role played by cluster cardinality. We first investigate Gaussian Mixture Models, a widely used framework for clustering high-dimensional data. Here, knowledge of $k$ proves critical: we demonstrate a fundamental statistical barrier wherein mixtures of spherical Gaussians with an unknown number of components become indistinguishable from a single Gaussian distribution, unless their pairwise mean separation is on the order of $\min(\sqrt{\log k}, \sqrt{d})$. Without prior knowledge or stronger assumptions, even the detection of multiple clusters is impossible, highlighting that knowing the correct number of components is an inherent necessity in Gaussian Mixture Models. In contrast, we examine Correlation Clustering, which is explicitly formulated without reference to the number of clusters. Objects are clustered based solely on pairwise similarity or dissimilarity labels, which may be inconsistent. We provide improved approximation algorithms for the local-error objective in both complete and incomplete information settings. Furthermore, we introduce a more general model, which we call Correlation Clustering with Asymmetric Classification Errors, and present novel approximation guarantees tailored to this richer scenario.
Together, these results reveal a fundamental dichotomy: the cluster cardinality is either a crucial piece of structural information that defines feasibility, as in Gaussian Mixture Models, or an intentionally omitted parameter whose absence motivates alternative local objective functions, as in Correlation Clustering. Leveraging this dichotomy is thus essential for effectively choosing and employing clustering methods in practice.
TIME Monday, August 4, 2025 at 1:30 PM - 4:00 PM
LOCATION 3501, Mudd Hall (formerly Seeley G. Mudd Library)
CONTACT Wynante R Charles wynante.charles@northwestern.edu
CALENDAR Department of Computer Science (CS)
-
Sep 25
EVENT DETAILS
TBA
TIME Thursday, September 25, 2025 at 9:00 AM - 11:00 AM
CONTACT Wynante R Charles wynante.charles@northwestern.edu
CALENDAR Department of Computer Science (CS)