COMP_SCI 396: Introduction to the Data Science Pipeline

Quarter Offered

Winter: TuTh 9:30 - 10:50; Hu

Prerequisites

Juniors, Seniors, or Graduate students

Description

This course covers tools for each stage of the data science process: obtaining, cleaning, visualizing, modeling, and interpreting data. Most of the tools introduced are Python-based, although the ideas apply to similar tools in other programming languages. By the end of the course, students should be able to work independently on large-scale real-life datasets and gain insights from them.
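
As a rough illustration of how these stages fit together, the sketch below runs a toy pipeline using only the Python standard library. It is not course material: the inline CSV data, the column names, and the function names (`obtain`, `clean`, `model`, `interpret`) are all made up for this example, the "model" is just a per-city mean, and the visualization stage is left out to keep the sketch short.

```python
import csv
import io
import statistics

# Hypothetical raw data standing in for a real source (file, API, database).
# It contains one missing value and one inconsistently formatted city name.
RAW_CSV = """city,temp_c
Evanston,21.5
Chicago,
Evanston,23.0
Chicago,19.0
chicago ,20.0
"""


def obtain(text):
    """Obtain: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: normalize city names and drop rows with a missing temperature."""
    cleaned = []
    for row in rows:
        temp = row["temp_c"].strip()
        if not temp:
            continue  # skip rows where the measurement is missing
        cleaned.append({"city": row["city"].strip().title(), "temp_c": float(temp)})
    return cleaned


def model(rows):
    """Model: a deliberately simple stand-in, the mean temperature per city."""
    by_city = {}
    for row in rows:
        by_city.setdefault(row["city"], []).append(row["temp_c"])
    return {city: statistics.mean(temps) for city, temps in by_city.items()}


def interpret(means):
    """Interpret: turn the numbers into one readable line per city."""
    return [f"{city}: mean {mean:.1f} C" for city, mean in sorted(means.items())]


rows = clean(obtain(RAW_CSV))
summary = interpret(model(rows))
for line in summary:
    print(line)
```

In practice each stage would likely be handled by a dedicated library rather than hand-written functions, but the shape of the flow (obtain, clean, model, interpret) stays the same.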

This course is open to senior undergraduates and master's students, or to other students with instructor permission.

COURSE INSTRUCTOR: Huiling Hu

COURSE COORDINATOR: Huiling Hu

Related Materials

    1. “Python Data Science Handbook: Essential Tools for Working with Data” by Jake VanderPlas
    2. “Learning Data Mining with Python” by Robert Layton

Grading

Grades will be assigned according to the distribution below. Letter grades follow the default percentage-to-letter-grade mapping on Canvas. In the course project, students work through the whole data science pipeline on real data.

  • Homework assignments (30%)
  • Midterm exam (20%)
  • Course Project (50%)

Course Outline

  • Introduction to the Data Science Pipeline (1 lecture)
  • Obtaining Data (1 lecture)
  • Data management (2 lectures)
    • Relational databases
    • Scrubbing/Cleaning data
  • Exploratory Data Analysis (5 lectures)
    • Overview
    • Dimensionality reduction
    • Statistical and hypothesis testing
    • Data visualization
    • In-class demonstration
  • Midterm Exam (1 lecture)
  • Modeling Data with Machine Learning (5 lectures)
    • Overview
    • Basic concepts of applied text mining
    • Applied text mining using NLTK
    • Basic concepts of network analysis
    • Large-scale network analysis using NetworkX
  • Interpreting Data and Storytelling (2 lectures)
    • Data visualization
    • Data Storytelling
  • Project Presentation (1 lecture)
  • Course Review (1 lecture)
  • Final Exam