McCormick Events | Northwestern Engineering

CS Colloquium - Nan Tang "From Absolute Good Data to Relative Good Data: Data Preparation for Data-centric AI"

Department of Computer Science

12:00 PM

EVENT DETAILS

Abstract

A well-known principle in data science (and AI) is garbage-in-garbage-out: if you feed bad data to your analytical tools, you will get bad results. The only way to get good results is to use good data. Moreover, data science (or AI) = code (or algorithms, models) + data. On one hand, because 99% of research effort goes into the code part, many analytical tools and AI models have become commodities. On the other hand, real-world data is getting bigger and messier, leaving practitioners with the dilemma that good data is harder and harder to get. The mission of data democratization, which allows anyone to have access to good data, is more important than ever.

In this talk, I will first introduce what data preparation is and how to prepare absolute good data, where absolute means the ground truth. The need for absolute good data is evident, because it can be used to support many data science tasks such as statistical analysis, data visualization, and data mining. However, the pursuit of the ground truth is impossible without enough general and domain knowledge. I will give an overview of two main directions that I have been pursuing to make it possible: with human intelligence and with artificial intelligence.

Despite all the effort put into preparing it, absolute good data is not enough for AI applications, which typically face unseen test data. In this case, what we need is relative good data. I will introduce four main directions to prepare relative good data for data-centric AI: (1) not enough training data, (2) misaligned training and test data, (3) non-ideal data preparation pipelines, and (4) inconsistent data and labels in the training data. In particular, I will present my recent research on model charging, which discovers relative good data in the wild to handle (1) not enough training data, and on model patching, which handles (2) misaligned training and test data caused by noise shift. For the other two directions, (3) and (4), I will give concrete use cases and discuss possible ways to address them.

Biography

Dr. Nan Tang is a senior scientist at the Qatar Center for Artificial Intelligence, Qatar Computing Research Institute (QCRI), Qatar. His research interests center around preparing good data for successful data science. Prior to joining QCRI in Dec 2011, he was a Research Fellow at LFCS (Laboratory for Foundations of Computer Science) at the University of Edinburgh, Edinburgh, UK (2010-2011). Before that, he was a scientific staff member with the CWI (Dutch National Research Center for Mathematics and Computer Science), Amsterdam, Netherlands (2008-2010). He received his PhD degree from The Chinese University of Hong Kong, China (2007). He has held visiting positions at MIT, US (07/2017-08/2017) and at the University of Waterloo, Canada (03/2007-08/2007). He has been a PC member of many flagship conferences in data science, such as SIGMOD, VLDB, ICDE, KDD, VIS, and CHI. He has received the VLDB 2010 best paper award, the SIGMOD 2020 reproducibility award, the VLDB 2021 distinguished reviewer award, and four other best paper nominations in VLDB and ICDE.

TIME Wednesday January 26, 2022 at 12:00 PM - 1:00 PM

CONTACT Pamela Villalovoz pmv@northwestern.edu

CALENDAR Department of Computer Science