# Seminars and Tutorials

## Upcoming EVENTS

### Laurent Lessard

Assistant Professor, Electrical and Computer Engineering

University of Wisconsin-Madison

### Title: “Automating the analysis and design of large-scale optimization algorithms”

#### Wednesday, October 11, 2017

12:00 p.m. – 1:00 p.m.

Tech L440

**ABSTRACT:** Most complicated optimization problems, in particular those involving a large number of variables, are solved in practice using iterative algorithms. The problem of selecting a suitable algorithm is currently more of an art than a science; a great deal of expertise is required to know which algorithms to try and how to properly tune them. Moreover, there are seldom performance guarantees. In this talk, I will show how the problem of algorithm selection can be approached using tools from robust control theory. By solving simple semidefinite programs (that do not scale with problem size), we can derive robust bounds on convergence rates for popular algorithms such as the gradient method, proximal methods, fast/accelerated methods, and operator-splitting methods such as ADMM. The bounds derived in this manner either match or improve upon the best known bounds from the literature. The bounds also lead to a natural energy dissipation interpretation and an associated Lyapunov function. Finally, our framework can be used to search for algorithms that meet desired performance specifications, thus establishing a principled methodology for designing new algorithms.

**BIOGRAPHY:** Laurent Lessard is an Assistant Professor of Electrical and Computer Engineering at the UW–Madison and faculty member of the Optimization Group at the Wisconsin Institute for Discovery. Laurent received the B.A.Sc. in Engineering Science from the University of Toronto, and received the M.S. and Ph.D. in Aeronautics and Astronautics at Stanford University. After completing his doctoral work, Laurent was an LCCC Postdoc at Lund University in Sweden, and a postdoctoral researcher at the University of California, Berkeley. His research interests include: decentralized control, robust control, optimization, and machine learning.

### Krishna Balasubramanian

Department of ORFE, Princeton UniversityOn

### Title: Stein's Identity and Near-Optimal Estimation in

High-dimensional Index Models

#### Friday July 7, 2017

12:00p.m. - 1:00p.m.

Tech M228 -- Lunch in C211

**Abstract:** In this talk I will present a method for estimating the parametric components of semi-parametric multiple index models (a.k.a. not-so-deep neural nets) in a high-dimensional non-Gaussian setting. The proposed estimators leverage the score function based first and second-order Stein's identity and do not require Gaussian or elliptical symmetry assumptions made in the literature. Moreover, to handle score functions and response variables that are heavy-tailed, the estimators are con-structed via carefully thresholding their empirical counterparts. On the theoretical front, I will show that the proposed estimator achieves near-optimal statistical rate of convergence in several settings. I will end this talk with a discussion of on-going work on extending similar ideas for near-optimal estimation in 2-layer neural networks under high-dimensional non-Gaussian assumptions.

### Chao Qu

Postdoctoral CAndidate, National University of Singapore

### Title: variance reduction for restricted strong convexity

#### Thursday, April 13, 2017

11 am -12:00 pm

M228- (lunch in C211)

**Abstract: **Computational challenges of statistical estimators and machine learning algorithms are an active area of study thanks to countless applications involving big data. In particular, several variants of stochastic first order methods based on variance reduction, namely SVRG, SDCA and SAGA, have attracted significant attention because of light computation load for each iteration, and fast convergence speed - for strongly convex problems they all converge with a linear rate. Yet, many powerful statistical and machine learning models (e.g., Lasso, group Lasso, nuclear norm regularized matrix completion), are not strongly convex, and sometimes even non-convex (a prominent example being the SCAD regularizer). In this talk, I will argue that a nice statistical property of those problems, namely restricted strong convexity, ensures that linear convergence of variance reduced stochastic first order methods for those statistical models. From a high level, our results re-confirmed a well-observed phenomenon that statistically easy problems tend to be more computationally friendly. We believe this intriguing fact indicates that there are fundamental relationships between statistics and optimization, and understanding such relationships may shed deep insights.

**Bio-Sketch**: Chao Qu is a PhD student in the Department of Mechanical Engineering at National University of Singapore, advised by Huan Xu. His research interests mainly focuses on machine learning, high dimensional statistics and large scale optimization. Prior to National University of Singapore, he received his M.Phil. degree in ECE from Hong Kong University of Science and Technology and B. Eng. degree in Information Engineering from Xi’an Jiao Tong University.

A box lunch will be served- please rsvp to tara.sadera@northwestern.edu

### Jonathan Eckstein

Professor, Management Science and Information Systems

Rutgers university

### Title: Approximate Versions of the Alternating Direction Method of Multipliers

#### Thursday, November 3, 2016

12:30-1:30 pm

Tech L324

**Abstract:** The Alternating Direction Method of Multipliers (ADMM) is a decomposition method for convex optimization currently enjoying popularity in the solution of machine learning and image processing problems. This talk reviews the convergence mechanism of the ADMM and presents three new, provably convergent approximate versions in which one or both optimization subproblems arising at each iteration may be solved inexactly using practically testable approximation criteria. computational tests applying these methods on several different problem classes indicate that in cases in which an iterative method is used for at least one of the ADMM subproblems, these variants can significantly reduce total computational effort. We also discuss the possible application of these approximation techniques to the Progressive Hedging (PH) algorithm for stochastic programming, which may be viewed as a particular case of the ADMM.

Click HERE for Professor Eckstein's Bio

## PAST EVENTS

### Donald Goldfarb

Alexander and Hermine Avanessians Professor of Industrial Engineering & Operations Research

Columbia University

### Title: Optimization Methods for Learning and Big Data

#### Thursday, October 20, 2016

5:00 - 7:00 pm

Tech M228

**Abstract:** Many problems in machine learning (e.g., logistic regression, support vector machines, neural networks, robust principal component analysis) and signal processing (e.g., face recognition and compressed sensing) are solved by optimization algorithms. In today's age of big data, the size of these problems is often formidable. In this talk, we will discuss some of our work that addresses this challenge using the following approaches: first-order (including proximal and conditional gradient) methods, accelerated variants, stochastic gradient and second-order extensions (including full and limited memory block BFGS updating methods), and alternating direction (multiplier) methods for structured problems.

Click HERE for Professor Goldfarb's Bio

### Matthias Ehrgott

Professor, Head of Department,

Management Sciences

University of Lancaster

### Title: "Bridging the Gap Between Real-world Decision Making and Mathematics: Multi-objective Optimisation in Action"

#### Wednesday, October 19, 2016

5:00 - 7:00 pm

Tech L324

**Abstract:** In this talk I will first present several studies that illustrate the power of multi-objective optimisation in supporting decision makers facing conflicting goals. These studies are drawn from the fields of transportation (airline operations, traffic modelling), health radiation therapy treatment), and manufacturing (of composite materials). I will illustrate the widely differing mathematical methods used in these applications, but emphasise the common benefits of the multi-objective optimisation approach: The improved understanding of the problem achieved through additional insight into the problem structure; the improved support for decision makers through the ability to properly account for trade-offs between conflicting goals; and last but not least, the considerable benefits that result in terms of quality of decisions and/or improved decision making processes.

In the second part of the talk I will focus on a specific application in the radiotherapy treatment planning, namely the generation of deliverable treatment plans and beam angle optimisation in the presence of multiple objectives. External radiation therapy is one of the major forms of treatment of cancer. Planning a radiation treatment for a patient involves making trade-offs between the main goals of radiotherapy, namely to irradiate the tumour according to some prescription to affect tumour control and to avoid damage to surrounding healthy tissue. This conflict permeates all aspects of treatment planning from the selection of beam angles to the optimisation of fluence maps to the sequencing of the gantry for treatment delivery. In this talk I will first describe a matheuristic approach to incorporate multi-objective fluence optimisation in the beam angle optimisation problem and then describe a column generation approach to the multi-objective fluence map optimisation problem, which achieves a reduction of the number of apertures and total fluence required to deliver a Pareto-optimal treatment.

Click HERE for Professor Ehrgott's Bio

### Gesualdo Scutari

Associate Professor of Industrial Engineering

Purdue University

### TITLE: In-network Nonconvex Big Data Optimization

#### Monday, October 10, 2016

12:00-1:00pm

Tech L324

**Abstract:** Nowadays, large-scale systems are ubiquitous. Some examples/applications include wireless communication networks; electricity grid, sensor, and cloud networks; and machine learning and signal processing applications, just to name a few. In many of the above systems, i) data are distributively stored in the network (e.g., clouds, computers, sensors, robots), and ii) it is often impossible to run analytics on central fusion centers, owing to the volume of data, energy constraints, and/or privacy issues. Thus, distributed in-network processing with parallelized multi-processors is preferred. Moreover, many applications of interest lead to large-scale optimization problems with nonconvex, nonseparable objective functions. All this makes the analysis and design of distributed/parallel algorithms over networks a challenging task.

In this talk we will present our ongoing work in this area. More specifically, we consider a large-scale network composed of agents aiming to distributively minimize a (nonconvex) smooth sum-utility function plus a nonsmooth (nonseparable), convex one. The latter is usually employed to enforce some structure in the solution, e.g., sparsity. The agents have access only to their local functions but not the whole objective, and the network is modeled as a directed, time-varying, B- strongly connected graph. We propose a distributed solution method for the above optimization wherein the agents in parallel minimize a convex surrogate of the original nonconvex objective while using a novel broadcast protocol to distribute the computations over the network. We discuss several instances of the general algorithm framework tailored to specific (convex and nonconvex) applications and present some numerical results validating our theoretical findings.

Click **HERE** for Professor Scutari's Bio

**
**

### Peter Frazier, Cornell University

### May 24-25, 2016

seminar: Bayesian Optimization in the Tech Sector: The Parallel Knowledge Gradient Method, and Experiences at Uber, Yelp, and SigOpt

#### Tuesday, May 24, 11:00am-12:00pm

Location: Tech, M228

**Seminar Abstract**:

We discuss parallel derivative-free global optimization of expensive-to-evaluate functions, and the use of Bayesian optimization methods for solving these problems, in the context of applications from the tech sector: optimizing e-commerce systems, real-time economic markets, mobile apps, and hyperparameters in machine learning algorithms. We describe optimization as a decision-making task ("where should I sample next?") and study it using decision-theory, by placing a Bayesian prior distribution on the objective function, and choosing the set of points to evaluate next that provide the largest value of the information. Using this approach, we develop a novel algorithm for parallel derivative-free global optimization of expensive functions, called the parallel knowledge gradient (pKG) method, which outperforms previously state-of-the-art Bayesian methods. We conclude with examples of the impact of these methods in the tech sector, from the speakers' experiences working with Yelp, Uber, and the Bayesian Optimization startup company, SigOpt.

### Tutorial: Coding for Researchers

#### Wednesday, May 25, 4-6pm

Location: Tech, M228

**Tutorial Abstract:**

In this tutorial, we discuss techniques used in the software engineering industry to develop and manage large software products that are also useful for researchers. These techniques allow developing code more quickly, and result in code that is faster, more reliable, easier to use, and easier for others to understand, use, and extend. This tutorial will cover version control (focusing on git), unit testing, profiling, code optimization, code organization, techniques for managing computationally expensive experiments, and steps you can take to get people in the real world to actually use your ideas. This tutorial is designed for students who are writing software as part of their research, but who have not had prior exposure to techniques used in the software engineering industry. Examples will be given in Python, C, and Matlab.

**Bio: Peter I. Frazier is an Associate Professor in the School of Operations Research and Information Engineering at Cornell University, and currently on sabbatical leave at Uber, where he is a Staff Data Scientist and the Data Science Manager for uberPOOL, Uber's carpooling service. His research interest is in sequential decision-making and Bayesian statistics, focusing on Bayesian optimization and optimal learning.**

**
**

### Extracting governing equations in chaotic systems from highly corrupted data

#### Rachel Ward, University of Texas at Austin

Tuesday, May 10, 2016, at 11:00am

Select HERE for Seminar Abstract

**Bio: Rachel Ward obtained her BS in Mathematics from University of Texas at Austin, and her PhD in Applied and Computational Mathematics from Princeton University. She is currently an Assistant Professor in the Department of Mathematics and ICES member at University of Texas at Austin.**

### Recent Advancements in optimization for the big data regime

#### Sham Kakade, University of Washington

Monday, April 25, at 3:30pm

### MULTI-VIEW REPRESENTATION LEARNING WITH CANONICAL CORRELATION ANALYSIS

#### Weiran Wang, Toyota Technological Institute at Chicago

Monday, April 18, at 1:00pm in Tech Room M228

Canonical correlation analysis (CCA) has been the main workhorse for multi-view feature learning, where we have access to multiple ''views'' of data at training time while only one primary view is available at test time. The idea of CCA is to project the views to a common space such that the projections of different views are maximally correlated.

In the first part of the talk, we compare different nonlinear extensions of CCA, and find that the deep neural network extension of CCA, termed deep CCA (DCCA), has consistently good performance while being computationally efficient for large datasets. We further compare DCCA with deep autoencoder-based approaches, as well as new variants. We find an advantage for correlation-based representation learning, while the best results on most tasks are obtained with a new variant, deep canonically correlated autoencoders (DCCAE).

In the second part of the talk, we study the stochastic optimization of canonical correlation analysis, whose objective is nonconvex and does not decouple over training samples. Although several stochastic optimization algorithms have been recently proposed to solve this problem, no global convergence guarantee was provided by any of them. Based on the alternating least squares formulation of CCA, we propose a globally convergent stochastic algorithm, which solves the resulting least squares problems approximately to sufficient accuracy with state-of-the-art stochastic gradient methods. We provide the overall time complexity of our algorithm which improves upon that of previous work.

This talk includes joint work with Raman Arora (JHU), Jeff Bilmes (UW), Jialei Wang (U Chicago), Nati Srebro (TTIC), and Karen Livescu (TTIC).

Weiran Wang is a postdoctoral scholar at Toyota Technological Institute at Chicago. He obtained the PhD degree from the EECS Department at University of California, Merced in 2013. His research includes algorithms for deep learning, multi-view representation learning, structured output (sequence) prediction, manifold learning, optimization for machine learning, and applications to acoustic modeling for speech recognition.

### FAST AND SIMPLE PCA VIA CONVEX OPTIMIZATION

#### Dan Garber, Toyota Technological Institutue of Chicago

Tuesday, April 5, at 12:30pm in Tech E211 (Mechanical Engineering Conference Room)

Computing the leading eigenvector of a real symmetric matrix is a fundamental problem in numerical linear algebra with numerous important applications, the most notable one probably being principal component analysis. Thus, in face of the ever increasing size of modern datasets, designing faster algorithms for this problem is an important challenge. Unfortunately, this problem is inherently non-convex, which makes it very difficult to adapt to it powerful techniques that were developed for convex problems, such as Nesterov's acceleration and various stochastic gradient methods, that proved highly useful for handling very large scale convex problems.

In this talk I will show that the leading eigenvector problem, though not convex, is reducible to solving a short (i.e. only poly-logarithmic in the important parameters of the problem) sequence of well-conditioned unconstrained convex optimization problems. This in turn, allows us to apply state-of-the-art stochastic gradient methods for solving this sequence of problems. As a result we derive algorithms for computing the leading principal component which are the fastest to date in a wide regime of parameters.

Dan Garber is a Research Assistant Professor at the Toyota Technological Institute at Chicago, a philanthropically endowed academic computer science institute located on the University of Chicago campus. Dan's research interests lie in the intersection of machine learning and continuous optimization, with the main focus being the development of efficient algorithms with novel and provable performance guarantees for basic machine learning, data analysis, decision making and optimization problems. Research highlights include faster projection-free first-order convex optimization algorithms, sublinear-time algorithms for semidefinite programming, and algorithms with stronger guarantees for principal component analysis in a variety of settings including offline (classical) computation, streaming and online.

Dan received both his Ph.D and his M.Sc degrees from the Technion - Israel Institute of Technology, where he worked under the supervision of Prof. Elad Hazan. Before that, Dan completed his bachelor's degree in computer engineering, also in the Technion.

### MARK SCHMIDT, UNIVERSITY OF BRITISH COLUMBIA

Tuesday, March 8, 2016

Seminar: Advances in fitting statistical models to huge datasets

**Seminar Abstract**:

The talk will contain two parts:

In the first part, I will consider the problem of minimizing a finite sum of smooth functions. This is a ubiquitous computational problem across science and engineering, and includes special cases like least squares and logistic regression. I will describe the stochastic average gradient algorithm which, despite over 60 years of work on stochastic gradient algorithms, is the first method to achieve the low iteration cost of stochastic gradient methods while achieving a linear convergence rate as in deterministic gradient methods that process the entire dataset on every iteration.

In the second part, I will consider the even-more-specialized case where we have a linearly-parameterized model (such as linear least squares or logistic regression). I will talk about how coordinate descent methods, though a terrible idea for minimizing general functions, are theoretically and empirically well-suited to solving such problems. I will also discuss how we can design clever coordinate selection rules, that are much more efficient than the classic cyclic and randomized choices.

### Tutorial: The Why and How of Releasing applied-math code

In this talk I will argue that researchers should be releasing more code, whether it be implementations of algorithms proposed in their papers or even reference implementations of basic methods from their field. I'll discuss the benefits of this to the person releasing the code as well as to their field and science/engineering as a whole. Then I'll turn to my personal views on the factors that should go into designing a good code: design principles and making code extendable, performance vs. generality, licensing options, and user support/interfaces. Finally, for the specific case of codes related to solving optimization (where continuous or discrete or decision), I will discuss the *real* objective function that all users care about but that is completely ignored in most research on this topic.

Bio: Mark Schmidt has been an assistant professor in the Department of Computer Science at the University of British Columbia since 2014, and a Canada Research Chair since 2016. His research focuses on developing faster algorithms for large-scale machine learning, with an emphasis on methods with provable convergence rates and that can be applied to structured prediction problems. From 2011 through 2013 he worked at the École normale supérieure in Paris on inexact and stochastic convex optimization methods. He finished his M.Sc. in 2005 at the University of Alberta working as part of the Brain Tumor Analysis Project, and his Ph.D. in 2010 at the University of British Columbia working on graphical model structure learning with L1-regularization. He has also worked at Siemens Medical Solutions on heart motion abnormality detection, with Michael Friedlander in the Scientific Computing Laboratory at the University of British Columbia on semi-stochastic optimization methods, and with Anoop Sarkar at Simon Fraser University on large-scale training of natural language models.

### Robust Bivariate Error Detection in Skewed Data with Application to Historical Radiosonde Winds

#### Amanda Hering, Colorado School of Mines

Tuesday, February 23, at 11:00am

Select HERE for Seminar Abstract

**Bio: Amanda Hering graduated from Baylor University with a bachelors degree in mathematics in 1999. She joined the Department of Applied Mathematics and Statistics at The Colorado School of Mines as an Assistant Professor in 2009 after earning her Ph.D. in Statistics from Texas A&M University. She is the CSM Site Director for the Center for Research and Education in Wind, and her research interests are in the areas of spatial and space-time modeling, wind speed forecasting, model validation, and multivariate methods. She is an Associate Editor for four journal, including Technometricsand Environmetrics. Association and past Associate Editor or Editorial Board Member of Technometrics, Journal of Quality Technology, Operations Research, and Naval Research Logistics.**

### ROBUST MAXIMUM LIKELIHOOD ESTIMATION

#### Omid Nohadani, Northwestern University, Department of Industrial Engineering & Management Sciences

#### Tuesday, November 17th, at 12:30pm in Tech L324

In many applications, statistical estimators serve to derive conclusions from data, most prominently in finance, medical decision-making and clinical trials. However, the conclusions are typically dependent on uncertainties in the data. We use robust optimization principles to construct robust maximum likelihood estimators that are immune against data errors. Both types of errors are considered: a) adversarial type, modeled using the notion of uncertainty sets, and b) probabilistic type, modeled by distributions. We provide efficient local and global search algorithms to compute the robust estimators and discuss them in details for the case of multivariate normally distributed data. The estimator performance is demonstrated on two datasets. First, using computer simulations, we demonstrate that the proposed estimators are robust against both types of data uncertainty and provide significantly more accurate estimates, compared to classical estimators which degrade significantly, when errors are encountered. We establish a range of uncertainty sizes, for which robust estimators are superior. Second, we analyze deviations in cancer radiation therapy planning. Uncertainties amongst plans are caused by patients’ individual anatomies and the trial-and-error nature of the process. Robust estimators prove to be result in more reliable decisions when applied to a large set of past treatments.

### LEARNING (FROM) NETWORKS: FUNDAMENTAL LIMITS, ALGORITHMS, AND APPLICATIONS

#### Soheil Feizi, MIT

#### Friday, November 6th, 12:00pm, in Tech L324

Network models provide a unifying framework for understanding dependencies among variables in medical, biological, and other sciences. Networks can be used to reveal underlying data structures, infer functional modules, and facilitate experiment design. In practice, however, size, uncertainty and complexity of the underlying associations render these applications challenging.

In this talk, we illustrate the use of spectral, combinatorial, and statistical inference techniques in several significant network science problems. First, we consider the problem of network alignment where the goal is to find a bijective mapping between nodes of two networks to maximize their overlapping edges while minimizing mismatches. To solve this combinatorial problem, we present a new scalable spectral algorithm, and establish its efficiency theoretically and experimentally over several synthetic and real networks. Next, we introduce network maximal correlation (NMC) as an essential measure to capture nonlinear associations in networks. We characterize NMC using geometric properties of Hilbert spaces and illustrate its application in learning network topology when variables have unknown nonlinear dependencies. Finally, we discuss the problem of learning low dimensional structures (such as clusters) in large networks, where we introduce logistic Random Dot Product Graphs, a new class of networks which includes most stochastic block models as well as other low dimensional structures. Using this model, we propose a spectral network clustering algorithm that possesses robust performance under different clustering setups. In all of these problems, we examine underlying fundamental limits and present efficient algorithms for solving them. We also highlight applications of the proposed algorithms to data-driven problems such as functional and regulatory genomics of human diseases, and cancer.

### TUESDAY, OCTOBER 27TH, AT 12:30PM IN TECH L324

#### Yuan Luo, Assistant Professor, Northwestern University, Division of Health & Biomedical Informatics and Department of Industrial Engineering & Management Sciences

This talk covers our works on applying subgraph mining and factorization algorithms to clinical narrative text, ICU physiologic time series and computational genomics. The common theme of these works aims at building clinical models that improve both prediction accuracy and interpretability, by exploring relational information in each data modality.

This talk focuses on three concrete examples, implicating neurodevelopmentally coregulated exon clusters in phenotypes of Autism Spectrum Disorder (ASD), predicting mortality risk of ICU patients based on their physiologic measurement time series, and identifying subtypes of lymphoma patients based on pathology report text. In each example, we show how to automatically extract relational information into a graph representation and how to collect important subgraphs that are of interest. Depending on the degree of structure in the data format, heavier machinery of factorization models becomes necessary to reliably group important subgraphs. We demonstrate that these methods lead to not only improved performance but also better interpretability in each application.