Patrick Boueri Speaks at MSiA's Spring Quarter Seminar Series

Interviewed & Written by: Naomi Kaduwela

Patrick Boueri

Each quarter, the Master of Science in Analytics program at Northwestern University invites a guest lecturer to campus to take part in our Seminar Series. For Spring 2019, we invited Patrick Boueri to campus to give a talk on machine learning systems used to create business value. 

Patrick says, 

Effectively serving large-scale machine learning systems to create business value is still somewhat of a dark art. Complexities such as train-serve skew, concept drift, latency requirements, champion/challenger replacement, hidden technical debt, and many more crop up that are not as apparent while crafting a single trained model. To add even more complexity, mixing these machine learning concerns with a stateful stream processing engine can cause headaches, predominantly around processing late-arriving and out-of-order data.
At Uptake, we predict whether an industrial asset will fail in the near future. An asset’s history (state) and other data sources, such as weather, are crucial to ensuring an effective, timely prediction. To understand why this is so, imagine a semi-truck whose engine coolant is running too low, causing the engine to overheat. The trend in the engine temperature is an important feature that characterizes that failure mode, and it requires not only the current engine temperature, but previous observations as well. These requirements have led us to build a stateful streaming system, serving heterogeneous, low-latency stateful machine learning models. We have learned lessons from that process and from deploying hundreds of models making millions of predictions a day.
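
To make the engine-temperature example concrete, here is a minimal sketch of a stateful trend feature. It is an illustration only, not Uptake's system: the class name `TrendTracker`, the `max_points` parameter, and the asset id are all hypothetical. It keeps a short per-asset history sorted by timestamp, so a late-arriving reading is placed where it belongs before the trend (a least-squares slope) is recomputed.

```python
from bisect import insort
from collections import defaultdict


class TrendTracker:
    """Hypothetical helper: keeps a short per-asset history and reports its trend."""

    def __init__(self, max_points=5):
        self.max_points = max_points
        # asset_id -> list of (timestamp, value), always sorted by timestamp
        self.history = defaultdict(list)

    def update(self, asset_id, timestamp, value):
        """Insert a reading in timestamp order, then return the current slope."""
        readings = self.history[asset_id]
        insort(readings, (timestamp, value))  # late data lands in the right slot
        # Keep only the most recent max_points readings as state.
        del readings[:-self.max_points]
        return self.slope(asset_id)

    def slope(self, asset_id):
        """Least-squares slope of value over time, or None if it is undefined."""
        readings = self.history[asset_id]
        n = len(readings)
        if n < 2:
            return None
        mean_t = sum(t for t, _ in readings) / n
        mean_v = sum(v for _, v in readings) / n
        denom = sum((t - mean_t) ** 2 for t, _ in readings)
        if denom == 0:
            return None
        return sum((t - mean_t) * (v - mean_v) for t, v in readings) / denom


# Made-up readings for one truck; the reading at t=1 arrives late.
tracker = TrendTracker()
tracker.update("truck-1", 0, 90.0)
tracker.update("truck-1", 2, 94.0)
print(tracker.update("truck-1", 1, 92.0))  # 2.0 degrees per time unit
```

A real streaming engine would also need to decide how late is too late and how to persist this state; the sketch ignores both.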

As part of the Seminar Series, Patrick sat down with one of our students for a philosophical discussion about machine learning and shared his insights on the data science industry today!

Naomi Kaduwela: What was your path to data science?

Patrick Boueri: When I was young I wanted to be an architect, but I couldn’t draw a straight line with a ruler. At Northwestern, I was into theoretical physics and cosmology, but went for something I felt was more practical-minded: electrical engineering. When I went to get my Master’s degree in Operations Research, I was exposed to Python and modern databases, and gravitated towards them. Operations research is a field that is adjacent to data science.

NK: What is your favorite part about being a data scientist?

PB: My favorite is definitely the team. It is composed of people from different walks of life who have strong, passionate technical interests, which speaks to me. It’s interesting to see the confluence of those across social sciences, math, and hard sciences. They are willing to share knowledge and experience across domains, which fosters creativity and fun discussions. Also, when you get into data work, you can answer your own questions, which is very cool to me. If you want to be able to measure something, you can go out and do the empirical work yourself. You’re constantly uncovering new things.

NK: What has been your favorite project?

PB: My favorite project was an error analysis where we trained a tree and some error accumulated. We took the tree’s leaves (its terminal nodes) as a feature vector and clustered the terminal predictions to do error analysis on them. It was a way to introspect on the tree and find out where the errors occur. It was fun to do because you get into the guts of the algorithm. Practically, it’s hard to come up with rules that transcend any data set.
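
The idea of using terminal nodes for error analysis can be sketched without any ML library. The snippet below is a hypothetical illustration, not the project's actual code: it assumes a tree has already been trained and that each row's leaf id, prediction, and true value are known. Grouping errors by leaf is used here as the simplest stand-in for the clustering step; the function names and data are made up.

```python
from collections import defaultdict


def error_by_leaf(leaf_ids, predictions, actuals):
    """Return {leaf_id: (row_count, mean_absolute_error)} per terminal node."""
    errors = defaultdict(list)
    for leaf, pred, actual in zip(leaf_ids, predictions, actuals):
        errors[leaf].append(abs(pred - actual))
    return {leaf: (len(errs), sum(errs) / len(errs)) for leaf, errs in errors.items()}


def worst_leaves(summary, top=1):
    """Leaf ids sorted by mean absolute error, largest first."""
    ranked = sorted(summary.items(), key=lambda item: item[1][1], reverse=True)
    return [leaf for leaf, _ in ranked[:top]]


# Made-up data: the leaf each row landed in, the prediction, and the truth.
leaf_ids = [3, 3, 5, 5, 5, 8]
predictions = [10.0, 10.0, 20.0, 20.0, 20.0, 30.0]
actuals = [11.0, 9.0, 26.0, 14.0, 23.0, 30.0]

summary = error_by_leaf(leaf_ids, predictions, actuals)
print(worst_leaves(summary))  # [5]: the leaf where errors concentrate
```

With scikit-learn, a fitted decision tree's `apply` method returns these leaf ids; with an ensemble, the per-tree leaf ids form a vector that could be fed to a clustering algorithm instead of this simple grouping.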

NK: You've referred to serving large machine learning systems in order to create business value as a "dark art". What makes it a dark art?

PB: Data science in products is new, and as a result, there’s some mystery behind what it means. When we say AI, do we mean logistic regression, deep learning, or reinforcement learning?

Also, I’m a wizard. Have you seen those pattern detection magic wands that match motions to spells using a microcontroller computer? (Check them out here!)

NK: What are the challenges to getting to more productionized AI solutions?

PB: It’s a confluence of things: expertise, right data, right system.

Getting the right consumers is also important, which is why UI/UX initiatives are key to ensure it drives actions downstream and the right behavior down the road.

Marquee products like Google were built for ad and web search so they have phenomenal ML in that space.

Most companies sell products or services, and analytics is an offering or optimization on a business process. For many companies, analytics is not their core. However, if they have good analytics on their products, they can get increases in margin. But it’s harder to do for them because it’s more spread out and distributed.

NK: Is it possible that there is too much abstraction with programming languages and libraries?

PB: Too much abstraction is only a bad thing if they are leaky abstractions. For example, you don’t care about the low level details of what your compiler is doing because it’s a very hermetic abstraction. But data is leaky, and that’s where you get results that make you question every move, because you don’t have a solid foundation.

I worry if they are used incorrectly, in the sense that decisions are being made off of them. But, I’m not sure we are at that point yet, where so many machine learning algorithms are being productionized effectively.

But it’s cool! I wouldn’t want to write stochastic gradient descent by hand every time I train a model.

NK: What are your thoughts on the future of data science tool sets? I also see a lot of customization within each domain with regard to the data science packages.

PB: Even with the abstractions, we get to grapple with the domain. And I think that’s where the future is going to go - to get more productive. You can throw as much machine learning as you want at it, but the way to narrow the search space to get things that are useful, without infinite compute and memory is to bundle domains into ML libraries.

I think we’ll see a lot of specialized domains. What we see growing in audio and natural language processing will expand, for example, to CRM.

You’ll see providers that have a specific domain - like speech to text - with an API built on top of it so plug and play is easy for businesses. We have a lot of disparate tools, so to be able to pull one off the shelf quickly and use it ourselves easily would be a great thing.

NK: What are the challenges to building generalized toolsets?

PB: There’s an opportunity for standardization, as the foundation of more automated data science tools relies on data collection standards. For example, if Electronic Health Records can be standardized to the point where the data is easy to collect, many use cases will be enabled.

NK: What are your thoughts on Auto ML (ML creating ML)? What is the current usability?

PB: It’s glorified searching, not intelligence. Computers are good at doing things repeatedly. Where you get intelligence is when you infuse the domain. That’s why Convolutional Neural Networks are so good - they encode spatial locality into the architecture. There is an equivalent of that everywhere else. For example, two signals might be correlated, but you need to know that. The semantics of the data are important.

Levels of AutoML, according to a Kaggle grandmaster:
1. Hand-rolled libraries that give you off-the-shelf capabilities - i.e. bootstrapping scikit-learn
2. Large searches
3. Domain-based searches
4. Then who knows?

All these things are part of larger systems, so you’ve got to design the entire system around them, instead of just one piece at a time.

NK: In our Master of Science in Analytics program this quarter we are learning to build and deploy end-to-end, production-ready applications. Can you talk about the differences and overlap between the data scientist and software developer roles these days?

PB: I would call that role a machine learning engineer. There are strong cultural differences between data scientists and software developers. Software developers can unit-test locally and there’s a sense of certainty where the uncertainty is from users. With data science it can be a bit frustrating because there is uncertainty in the data. So when someone says, “If I give you this data, will it work?” The answer is “I think so”. There is also technical rigor. We can learn from software developers: version control and a DevOps mindset.

NK: What are your thoughts on ethics?

PB: Be kind, be ethical, make sure the AI you build is true to your values. Abstracting out a decision and substituting an algorithm for a human can make lines hard, where they shouldn’t be.

There’s a great book written by a social scientist, talking about how we’ve used data to determine if folks should get benefits from the government or folks that have children at risk and if we should protect those children.

Check out the book!

NK: What are your thoughts on Agile in the industry?

PB: Most Agile is not real Agile. It’s a process and a way to structure work, more than pivoting on a dime. It’s better than what it was before: the packets of work are smaller and the reevaluation checkpoints are at different times. But people have a need for certainty and planning, and that’s not bad.

NK: As a student of guitar, what are your thoughts on AI in the domains of music and art?

PB: Magenta (an open source research project exploring the role of machine learning as a tool in the creative process) is cool!

In 9th or 10th grade I wrote a random sampler of blues-scale jazz in Java. It could generate a whole note, a half note, a B-flat. Machine learning will augment people’s reality. I think it would be cool for kids to have their dreams come true as they draw them out. I don’t think machine learning in this domain will displace creativity, only enhance it. It’s another tool for artists to use.
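
A random blues-scale sampler of the kind Patrick describes fits in a few lines. The sketch below is written in Python rather than his original Java and is purely illustrative: the key of C, the set of durations, and the function names are all assumptions, not a reconstruction of his program.

```python
import random

# C blues scale (1, b3, 4, b5, 5, b7); the choice of key is an assumption.
BLUES_SCALE = ["C", "Eb", "F", "Gb", "G", "Bb"]
# Note lengths in beats, assuming 4/4 time.
DURATIONS = {"whole": 4.0, "half": 2.0, "quarter": 1.0}


def sample_phrase(length, seed=None):
    """Draw `length` random (pitch, duration) pairs from the blues scale."""
    rng = random.Random(seed)  # a seed makes the phrase reproducible
    names = list(DURATIONS)
    return [(rng.choice(BLUES_SCALE), rng.choice(names)) for _ in range(length)]


def total_beats(phrase):
    """Total length of a phrase in beats."""
    return sum(DURATIONS[duration] for _, duration in phrase)


phrase = sample_phrase(8, seed=42)
print(phrase)
print(total_beats(phrase), "beats")
```

Every pitch is drawn uniformly and independently, so the result is a random walk over the scale rather than a composed melody.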

Did you see Google’s TensorFlow doodle that would compose in the style of Bach? (Image below)



NK: We used to have a professor in Predictive Analytics who had a few equations that he felt were “Wake up” formulas - meaning, if he woke us up in the middle of the night, we should be able to recite them! Do you have a “Wake up” formula?

PB: The ideal gas law: PV = nRT, which is useful when dealing with pressures, volumes, and temperatures.

My electrical engineering professors might not be happy with that answer, though! So there is also the electric power equation, which measures the power delivered by an electrical circuit:

P = I × V
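
As a quick worked example of both formulas, the sketch below plugs in some illustrative numbers. The function names and inputs are made up for this example; the gas constant R is the standard molar gas constant in SI units.

```python
R = 8.314462618  # molar gas constant in J/(mol*K)


def pressure_pa(moles, temperature_k, volume_m3):
    """Ideal gas law solved for pressure: P = nRT / V, in pascals."""
    return moles * R * temperature_k / volume_m3


def power_watts(current_amps, voltage_volts):
    """Electric power: P = I * V, in watts."""
    return current_amps * voltage_volts


# One mole of gas at 0 degrees Celsius in 22.4 litres: roughly 1 atmosphere.
print(pressure_pa(1.0, 273.15, 0.0224))

# A circuit drawing 2 A at 12 V.
print(power_watts(2.0, 12.0))  # 24.0
```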

NK: Have you played against AlphaGo?

PB: We had a screening party at work for the documentary, and it was pretty cool! But I have yet to play.

In areas where you have very constrained rules, and the rules of the game are very well laid out, this type of self-play AI flourishes.

NK: What would you do if you didn’t have to work? Any passions?

PB: Working in public policy and city government to push for change.

Uptake has Data Fellows who work for organizations and local governments. Since it’s not a fully established field with professional organizations already built around it, people are looking for help and sounding boards because they might be one of the few, or the only one, in the organization practicing the dark arts. It can be lonely. That’d be fun.

NK: What is your advice for the next generation of data scientists? What’s after this AI hype cycle?

PB: There is still such a big bay of opportunity. Production systems are few and far between. Realizing value across the entire lifecycle chain is a gap.

Don’t be afraid to become an expert. Become the go-to person for a while - it’s easier to progress earlier on that way.

Don’t underestimate the value of people and communication. If you work on something for six months, and then are not able to communicate the results effectively at the end, what’s the point? Communication is a coefficient on the whole equation. It’s generally around 0.2, because communicating is difficult, which is why you should always repeat things!

NK: What are your tips for success in the workplace?

PB: Set up 1:1s with your boss and ask for a career development plan so that the criteria by which you are measured are objective and transparently laid out.

Take on a practitioner’s mindset. Do very rapid risk assessments. For example, pharmaceutical companies are good at this. They have research wings and gates with clear yes/no decisions to continue.

Be very upfront and transparent. Separating yourself from your work is important. It’s hard not to, but it’s important to be mindful of. There’s a good quote:

“Don’t attach your identity to your product, because either is liable to change.”

-Kelsey Hightower (Kubernetes)

Naomi Kaduwela

Naomi Kaduwela is an aspiring Chief Analytics Officer, passionate about propelling healthcare and education into the digital age through innovation. Naomi is taking a hiatus from her five-year corporate career to join the MSiA program at Northwestern to broaden and deepen her data science skill set.
