MLDS 430: Data Warehousing and Workflow Management



The course provides a foundation for coupling analytics and decision-making with the three V’s of Big Data: Volume, Velocity, and Variety. Specifically, it examines the trade-offs between the properties of the data and the tools and techniques used to generate information and knowledge from it, under the impact of each of the three V’s. Students will gain both:

  1. Foundational knowledge: how do the properties of the data affect the choice of its representation, and what does that choice entail for operations over the data? and
  2. Hands-on knowledge: which technologies enable querying the data under a particular representation model, and what are the main differences among them?

The course has the following major units:

U1: (Volume) Data Warehousing – how to pose analytical queries over huge datasets that are integrated from traditional databases as sources.
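
As a rough illustration of what an analytical query in U1 might look like (a sketch only, not taken from the course materials), the snippet below builds a tiny star-style schema in an in-memory SQLite database and rolls sales up by region. The table names `dim_store` and `fact_sales`, the function `sales_by_region`, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical miniature warehouse: one fact table and one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (store_id INTEGER, amount REAL);
    INSERT INTO dim_store VALUES (1, 'East'), (2, 'East'), (3, 'West');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 5.0), (2, 20.0), (3, 7.5);
    """
)


def sales_by_region(connection):
    """Roll up individual sales to one total per region."""
    query = """
        SELECT d.region, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_store AS d ON f.store_id = d.store_id
        GROUP BY d.region
        ORDER BY d.region
    """
    return connection.execute(query).fetchall()


print(sales_by_region(conn))
```

The query joins the fact table to a dimension and aggregates, which is the basic shape of a roll-up; a real warehouse would differ mainly in scale and in how the data is integrated from its sources.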

U2: (Velocity) Streaming Data – what constraints arise when data arrives faster (and in larger volume) than can be stored in memory? What does it mean to query such data?
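
One way to make the memory constraint in U2 concrete is reservoir sampling, which keeps a fixed-size uniform sample of a stream in a single pass. The sketch below is an illustrative example chosen here, not necessarily a technique covered in the course; the function name `reservoir_sample` and its parameters are assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of k items using O(k) memory."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), k=5, seed=42)
print(sample)
```

Queries are then answered approximately from the sample, since the full stream is never stored.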

U3: (Variety) Graph Databases – although one could, in theory, use relational databases, why are large volumes of data of practical interest better represented under a different model? What are the operators over such a model, and how are queries constructed and processed?
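
To illustrate the kind of operator U3 alludes to, the sketch below runs a bounded breadth-first traversal over an adjacency-list graph. In a relational encoding, each additional hop would typically require another self-join on an edge table, whereas here it is one more step of the traversal. The function `within_hops` and the sample graph are hypothetical, and this is plain Python rather than any particular graph database's API.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Mapping


def within_hops(
    graph: Mapping[Hashable, Iterable[Hashable]], start: Hashable, max_hops: int
) -> Dict[Hashable, int]:
    """Return every node reachable from start in at most max_hops edges,
    mapped to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


follows = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(within_hops(follows, "a", 2))
```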

U4: (Potpourri) – for example, time series data: what is the impact of large volumes of such data? How does one reason about the similarity of time series and cluster them? How does one deal with the “curse of dimensionality”?
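
As one possible illustration of time series similarity, the sketch below computes a dynamic time warping (DTW) distance, which allows two series to match even when one is stretched or shifted in time. DTW is a common choice but is picked here as an assumption; the course may use other measures. The function name `dtw_distance` is hypothetical.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping distance with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0: same shape, different pacing
```

This straightforward version takes O(n·m) time and memory per pair, which is one reason large collections of long series call for dimensionality reduction or approximate methods before clustering.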