COMP_SCI 326: Introduction to the Data Science Pipeline

This course is not currently offered.


Prerequisites: (CS 212 and CS 214) or graduate standing or instructor consent


This course covers tools used across the data science process: obtaining, cleaning, visualizing, modeling, and interpreting data. Most of the tools introduced are Python-based, although the ideas apply to similar tools in other programming languages. The goal is not to teach the foundations of the relevant technologies but rather when and how to use them in the data science pipeline. Students complete a quarter-long, self-defined course project that exercises the tools covered in lecture. By the end of the course, students should be able to work independently on large-scale, real-life datasets and gain insights from them.

  • This course fulfills the Technical Elective area.
  • Formerly COMP_SCI 396; last offered Spring 2022

COURSE INSTRUCTOR: Huiling Hu or Joshua D'Arcy


Related Materials

  1. “Python Data Science Handbook: Essential Tools for Working with Data” by Jake VanderPlas
  2. “Learning Data Mining with Python” by Robert Layton


Grades are based on the components below. Letter grades are assigned using a percentage-to-letter-grade mapping.

  • Homework assignments (35%)
    • 5 individual assignments
  • Midterm exam (25%)
  • Course project (40%): students define their own topic. The project includes:
    • Proposal
    • Milestone
    • Presentation
    • Final Report

Course Outline

Main Topics Include

  • Course overview and logistics
  • Obtaining and managing data
  • Data cleaning
  • Exploratory Data Analysis
  • Statistics
    • Correlation, Independence and Association
    • Hypothesis Testing
  • Basic machine learning
    • Basic concepts and algorithms
    • Assessment and Overfitting
    • Feature selection
  • Text mining
  • Data Visualization and Storytelling
  • Ethics in Data Science