COMP_ENG 368, 468: Programming Massively Parallel Processors with CUDA

This course is not currently offered.


PREREQUISITES: COMP_SCI 213, graduate standing, or consent of instructor


A hands-on introduction to parallel programming and optimizations for 1000+ core GPU processors, their architecture, the CUDA programming model, and performance analysis. Students implement various optimizations in massively parallel workloads on modern GPUs. Students may not receive credit for both COMP_ENG 368-0 and COMP_ENG 468-0.


This course focuses on developing and optimizing application software on massively parallel graphics processing units (GPUs). Such processing units routinely come with hundreds to thousands of cores per chip and sell for a few hundred to a few thousand dollars. The massive parallelism they offer allows applications to run 2x-450x faster than on conventional multicores. However, to reach this performance improvement, the application must fully utilize the computational, storage, and communication resources provided by the device. This course discusses state-of-the-art parallel programming optimization methods to achieve this goal.

Ideally, this course will bring together people with strong programming skills and people with a strong need to solve compute-intensive problems that can benefit from programming graphics processors. The initial part of the course will discuss a popular programming interface for graphics processors, the CUDA programming tools for NVIDIA processors. The course will continue with a closer look at the internal architecture of graphics processors and how it impacts performance. Finally, implementations of applications and algorithms on graphics processors will be discussed.

The course is lab intensive, and it will utilize the machines at the Wilkinson lab. Students taking the course for COMP_ENG/COMP_SCI-368 credit will work on labs that utilize advanced parallel programming, data layout, and algorithm decomposition concepts. Students taking the course for COMP_ENG/COMP_SCI-468 credit will work on the same labs and also a quarter-long open-ended final project that draws upon their own interests and line of research. Ideally, in their final project these students will form interdisciplinary teams and complete the first steps of optimizing a real-world compute-intensive problem in science or engineering (e.g., materials science, astrophysics, civil engineering, etc.).

COURSE COORDINATOR:  Prof. Nikos Hardavellas (Instructor)



Required textbook:

  • Programming Massively Parallel Processors: A Hands-on Approach. D. Kirk and W.-M. Hwu. Morgan Kaufmann

Reference textbooks (not required):

  • CUDA by Example. J. Sanders and E. Kandrot. Pearson/Addison-Wesley
  • GPU Computing Gems Emerald Edition. W.-M. Hwu. Morgan Kaufmann
  • GPU Computing Gems Jade Edition. W.-M. Hwu. Morgan Kaufmann
  • Patterns for Parallel Programming. T.G. Mattson, B.A. Sanders, and B.L. Massingill. Pearson/Addison-Wesley
  • Introduction to Parallel Computing. A. Grama, A. Gupta, G. Karypis, and V. Kumar. Pearson/Addison-Wesley
  • Heterogeneous Computing with OpenCL. B.R. Gaster, L. Howes, D.R. Kaeli, P. Mistry, and D. Schaa. Morgan Kaufmann


There will be several programming assignments. Each programming assignment will involve successively more sophisticated programming skills. The labs will be done in groups of two. The list below is tentative and subject to change:

  1. Matrix multiplication. The lab’s focus is on producing correct code. This project reinforces the acquisition of basic GPU/CUDA programming skills, the software interface, and the basic architecture of the device.
  2. Tiled matrix multiplication. This lab focuses on data layout and decomposition, and full utilization of shared memory resources and global bandwidth through bank conflict avoidance and memory coalescing.
  3. Histograms. In this lab you are asked to define optimization goals and a strategy, implement them, and keep a research lab journal in which you report statistics and analyze every optimization you tried, even ones that did not work or degraded performance. For this assignment you will need to read recent research papers that outline some of the best-known ways to solve this problem.
  4. Parallel prefix sum / vector reduction. This lab focuses on the application of efficient parallel algorithms that utilize shared memory and synchronization and minimize path divergence.
  5. 2D convolution (tentative – typically we do not have time for it, but I always try). While the previous labs involved detailed instructions and scaffolding code, this lab provides students only with a problem statement, along with optimization goals and hints. This lab’s focus is on independent thinking and reinforces concepts learned in the previous labs.
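
As a rough illustration of the step from lab 1 to lab 2, the decomposition behind tiled matrix multiplication can be sketched on the host in plain C++. This is only a sketch under stated assumptions, not the lab's scaffolding code: the names `matmul_naive`, `matmul_tiled`, and `TILE` are hypothetical, the matrices are assumed square and row-major, and the small local arrays merely stand in for the shared-memory tiles a CUDA thread block would use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative tile width (an assumption); on a GPU this would typically
// match the thread-block dimensions.
constexpr std::size_t TILE = 4;

// Lab 1 analogue: straightforward triple loop over row-major n x n matrices.
std::vector<float> matmul_naive(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n) {
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    return c;
}

// Lab 2 analogue: the same product computed tile by tile. Each (ii, jj)
// output tile corresponds to what one thread block would own on a GPU.
std::vector<float> matmul_tiled(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n) {
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t jj = 0; jj < n; jj += TILE)
            for (std::size_t kk = 0; kk < n; kk += TILE) {
                // Stage one tile of A and one tile of B into small buffers
                // (stand-ins for shared memory). Zero padding handles edges
                // when n is not a multiple of TILE.
                float ta[TILE][TILE] = {};
                float tb[TILE][TILE] = {};
                for (std::size_t r = 0; r < TILE; ++r)
                    for (std::size_t s = 0; s < TILE; ++s) {
                        if (ii + r < n && kk + s < n)
                            ta[r][s] = a[(ii + r) * n + (kk + s)];
                        if (kk + r < n && jj + s < n)
                            tb[r][s] = b[(kk + r) * n + (jj + s)];
                    }
                // Accumulate this tile pair's partial products into C.
                for (std::size_t r = 0; r < TILE; ++r)
                    for (std::size_t s = 0; s < TILE; ++s) {
                        if (ii + r >= n || jj + s >= n) continue;
                        float acc = 0.0f;
                        for (std::size_t k = 0; k < TILE; ++k)
                            acc += ta[r][k] * tb[k][s];
                        c[(ii + r) * n + (jj + s)] += acc;
                    }
            }
    return c;
}
```

Calling both functions on the same inputs and comparing the results is one simple way to check a tiled version against a known-correct baseline. Because the two versions add terms in a different order, results on general floating-point data may differ in the last bits, so a tolerance-based comparison is the safer check.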


The project focuses on open-ended research. The work will be a quarter-long open-ended final project that draws upon the students’ own interests and line of research. Ideally, in their final project the students will form interdisciplinary teams and complete the first steps of optimizing a real-world contemporary compute-intensive problem in science or engineering (e.g., materials science, astrophysics, civil engineering, etc.). The students are expected to base their ideas on recent literature and their own line of research and needs. It is the students’ responsibility to propose a research project suitable for this class. The project selection requires the permission of the instructor. For this, there will be a project proposal phase. The final project will culminate in a report paper and a presentation/demo held during the final few class meetings.

COURSE OBJECTIVES: When a student completes this course, s/he should be able to:

  • (a) Apply knowledge of mathematics, science, and engineering (parallel programming techniques, kernel decomposition, synchronization, memory access optimizations, data layout transformations, branching/loop optimizations, algorithm cascading, Brent’s theorem, input diagonalization, FP representations, regularization, compaction, binning, thread coarsening).
  • (b) Design and conduct experiments, analyze, and interpret data (design and conduct experiments on real massively parallel applications written using CUDA, utilize industrial tools to identify and overcome performance bottlenecks, measure execution time and speedup on GPU devices).
  • (c) Design a software system to meet desired needs within realistic constraints (design optimized massively parallel programs for GPUs by amplifying the utilization of constrained resources including PCIe bandwidth, global memory bandwidth, shared memory banks, texture and constant memory, warps, thread blocks, SMP registers, etc.).
  • (d) Ability to function on multidisciplinary teams (the collaborative term projects will ideally pair together computer scientists and engineers with domain experts).
  • (e) Ability to identify, formulate, and solve engineering problems using industry-strength CUDA tools.
  • (g) Ability to communicate effectively (research papers and reports, presentations, posters).
  • (i) Recognize the need for, and have the ability to engage in life-long learning (read recent research papers in an unfamiliar subject and assimilate knowledge without direct supervision).
  • (j) Gain knowledge of contemporary issues (state-of-the-art in GPU programming, energy-efficiency in computer architectures, high-performance parallel programming, heterogeneous computing).
  • (k) Ability to use the techniques, skills, and modern engineering tools necessary for engineering practice (GPUs, CUDA, occupancy calculator, Ocelot, OpenCL, parallel programming techniques).
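
Objective (a) names Brent's theorem and algorithm cascading. As a hedged illustration of how the two connect for a parallel sum of n elements, the usual statement can be written as follows. The symbols W, D, p, and T_p are notation introduced here for the sketch and are not taken from the course materials.

```latex
% W   = total work (number of operations)
% D   = depth (length of the longest dependency chain)
% p   = number of processors
% T_p = running time on p processors
\[
  T_p \;\le\; \frac{W}{p} + D
\]
% Tree reduction of n elements: W = n - 1 additions, D = \lceil \log_2 n \rceil.
\[
  T_p \;\le\; \frac{n-1}{p} + \lceil \log_2 n \rceil
\]
% Algorithm cascading: each of the p threads first sums n/p elements
% sequentially, then the p partial sums are combined by a tree.
\[
  T_p \;\approx\; \frac{n}{p} + \log_2 p,
  \qquad
  p = \frac{n}{\log_2 n}
  \;\Rightarrow\;
  T_p = O(\log n), \quad p \cdot T_p = O(n)
\]
```

Read this way, giving each thread on the order of log2 n elements to sum sequentially keeps the total cost linear in n while the running time stays logarithmic.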


  • Week 1: Introduction to GPUs and their Programming Model. Topics include: grids, blocks, threads, memory model, execution model, software architecture, basic API.
  • Week 2: GPU Architecture Overview and Execution API. Topics include: streaming processors, streaming multiprocessors, texture clusters, streaming arrays, block scheduling and execution, warps, scoreboarding, memory parallelism, register file, execution staging, subtyping, measuring time, compilation overview.
  • Week 3: Memory Performance and Control Flow; Example application: Matrix Transpose. Topics include: coalescing, bank conflicts, DRAM memory controller organization, DRAM burst mode, tiling, padding, flow divergence.
  • Week 4: Performance and Occupancy; PTX Assembly and Profiler. Topics include: TLP, ILP, OoO execution, RAW/WAW hazards, occupancy measurements, hardware counters, visual profiler, predicated execution, PTX assembly overview.
  • Week 5: Putting everything to work: Parallel Reductions. Topics include: global synchronization, kernel decomposition, memory coalescing, non-divergent branching, eliminating shared-memory bank conflicts, loop unrolling, Brent’s Theorem, algorithm cascading.
  • Week 6: Work with atomics: Histograms; Vector Programming. Topics include: data races, atomics, input diagonalization, privatization, bank mapping, warp voting, vector loop semantics, order of evaluation, vectorization with forward dependencies, pragma directives for data, temporal evolution, elemental functions, uniform/linear clauses, outer loop vectorization, vectorizing function calls.
  • Week 7: Parallel Prefix Scan; Sparse Arrays. Topics include: inclusive scan, enforcing block ordering, exclusive scan, multi block/kernel parallel prefix scan, warp voting, fences, GPU implementations of compressed sparse array representations (CSR, diagonal, ELLPACK, coordinate, hybrid ELL/COO, packet format).
  • Week 8: Advanced Blocking/Tiling; Convolution; Textures. Topics include: thread coarsening, stencil computation on GPUs, cache blocking, register tiling, two-pass methods, loop skewing, texture arrays, nearest point sampling, linear filtering, clamping, CUDA arrays, 2D texture locality.
  • Week 9: Input Binning; Parallel Programming Overview. Topics include: binning, scatter/gather, cut-off summation, bin design, thread coarsening and binning, horizontal vs. vertical scaling, shared memory programming, message passing, data sharing models, algorithm structures vs. parallel programming coding styles, parallel program models vs. programming models, regularization, compaction, binning, data layout transformations, thread coarsening, scatter/gather, tessellation, limiter theory.
  • Week 10: (optional reading) Floating Point Considerations; Introduction to OpenCL. Topics include: floating-point representation, flush-to-zero, denormalization, rounding, units-in-the-last-place, mixed-precision methods, OpenCL kernel structure and invocation, context, devices, command queues, memory allocation, argument construction.
  • Weeks 11-12: Project Demos and Project Deliverables (Report Paper, Code, Presentation)
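
To make the Week 7 scan topics a little more concrete, here is a minimal host-side C++ sketch of an exclusive prefix sum built from an up-sweep and a down-sweep. This is one well-known work-efficient formulation (often attributed to Blelloch); whether the course uses this exact variant is an assumption, and the function name `exclusive_scan_sketch` is hypothetical. The loops run sequentially here. On a GPU, the iterations of each inner loop are independent and would run in parallel, with a barrier between strides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
// The input is padded with zeros to the next power of two so that the
// tree-shaped sweeps are well defined, then truncated back to its length.
std::vector<int> exclusive_scan_sketch(std::vector<int> data) {
    const std::size_t n = data.size();
    if (n == 0) return data;

    std::size_t m = 1;
    while (m < n) m <<= 1;
    data.resize(m, 0);

    // Up-sweep (reduce): build partial sums up a balanced binary tree.
    for (std::size_t stride = 1; stride < m; stride <<= 1)
        for (std::size_t i = 2 * stride - 1; i < m; i += 2 * stride)
            data[i] += data[i - stride];

    // Clear the root, then push prefixes back down the tree.
    data[m - 1] = 0;
    for (std::size_t stride = m / 2; stride >= 1; stride >>= 1)
        for (std::size_t i = 2 * stride - 1; i < m; i += 2 * stride) {
            const int left = data[i - stride];
            data[i - stride] = data[i];  // left child inherits parent's prefix
            data[i] += left;             // right child adds the left subtree sum
        }

    data.resize(n);
    return data;
}
```

For example, the input {3, 1, 7, 0, 4, 1, 6, 3} yields {0, 3, 4, 11, 11, 15, 16, 22}. An inclusive scan can be obtained from this result by adding each input element to the corresponding output element.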

ABET Content Category: 100% Engineering (Design component)