Projects
Agentic AI Data Scientist
As LLM capabilities grow, so do their agentic skills. Real-world business problems are complex and rarely yield to a single agent. This CDL project designs a multi-agent orchestration for open-source language models, mapping the end-to-end data-science workflow across specialized roles (ingestion, analysis, modelling, validation, and reporting). We experiment with routing strategies (task/role routing, tool-aware routing), measure task success, and identify the minimal, most effective number of agents. The goal is a robust blueprint for a fully autonomous (and auditable) "AI data scientist" that is reliable, efficient, and production-ready.
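To illustrate one of the routing strategies mentioned above, here is a minimal sketch of task/role routing across the specialist roles named in the project. The keyword rules and the `route_task` helper are illustrative assumptions, not the CDL implementation (which routes with language models, not keywords):

```python
# Toy role router: assign an incoming task to the specialist agent whose
# keyword profile best matches it. Role names follow the project description;
# the keyword lists are made up for illustration.

ROLE_KEYWORDS = {
    "ingestion": ["load", "ingest", "csv", "database"],
    "analysis": ["explore", "summarize", "correlation"],
    "modelling": ["train", "fit", "predict", "model"],
    "validation": ["validate", "evaluate", "test"],
    "reporting": ["report", "chart", "summary"],
}

def route_task(task: str) -> str:
    """Return the specialist role whose keywords best match the task."""
    words = task.lower().split()
    scores = {
        role: sum(w in words for w in kws)
        for role, kws in ROLE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to analysis when no rule fires.
    return best if scores[best] > 0 else "analysis"

print(route_task("train a churn model"))  # modelling
```

An LLM-based router replaces the keyword scoring with a model call but keeps the same contract: task in, role out, which is what makes routing decisions auditable.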
Link to Agentic AI Data Scientist Application: https://genai-hub.centerfordeeplearning.northwestern.edu/
Agentic AI Application Validation
LLM applications that leverage agentic AI capabilities, including external tools and databases, are a preferred choice for solving complex business problems. Generating an answer to a complex problem, however, is not sufficient. This CDL project develops practical evaluation criteria for agentic and RAG systems across open-source and closed-source validation frameworks. We validate not only the final output of an agentic AI application but also the intermediate steps, such as tool selection, routing decisions, retrieval quality, and code execution, so that the application functions efficiently and reliably.
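The idea of validating intermediate steps rather than only the final answer can be sketched as checks over an execution trace. The trace schema and check names below are assumptions for illustration; real validation frameworks expose richer, model-graded signals:

```python
# Hedged sketch: score an agent run step by step. Each trace step records
# which tool was called, what was retrieved, and whether generated code ran.
# A step passes only if every check passes.

def validate_trace(trace, expected_tools):
    """Return a per-step pass/fail list for an agent execution trace."""
    results = []
    for step in trace:
        checks = {
            "tool_selected": step["tool"] in expected_tools,
            "retrieval_nonempty": len(step.get("retrieved", [])) > 0,
            "code_ran": step.get("exit_code", 0) == 0,
        }
        results.append(all(checks.values()))
    return results

trace = [
    {"tool": "sql_query", "retrieved": ["row1"], "exit_code": 0},
    {"tool": "web_search", "retrieved": [], "exit_code": 0},
]
print(validate_trace(trace, expected_tools={"sql_query", "python"}))
# [True, False]
```

A run with a correct final answer can still fail such checks (for example, by selecting the wrong tool and recovering by luck), which is precisely why intermediate validation matters.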
Large Language Model for Automated Feature Engineering
Automated feature engineering (AutoFE) aims to liberate data scientists from manual feature construction, which is crucial for improving the performance of machine learning models on tabular data. The semantic information of datasets provides rich context for AutoFE but is exploited by few existing works. In this project, we introduce Automated Feature Engineering by Prompting (FEBP), a novel AutoFE approach that leverages large language models (LLMs) to process dataset descriptions and automatically construct features. FEBP iteratively improves its solutions through in-context learning of top-performing examples and is able to semantically explain the constructed features. Experiments on seven public datasets show that FEBP outperforms state-of-the-art AutoFE methods by a significant margin.
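The iterative improvement loop can be sketched as follows. The `propose` stub stands in for the LLM call, and the hard-coded scores are a toy proxy for downstream model performance; both are assumptions for illustration, not FEBP itself:

```python
# Sketch of a FEBP-style loop: propose candidate features, score them, and
# feed the top performers back as in-context examples for the next round.

import random

def propose(examples, rng):
    """Stand-in for an LLM prompted with the dataset description plus
    the current top-performing example features."""
    ops = ["a+b", "a*b", "a-b", "a/b", "log(a)"]
    return rng.choice(ops)

def score(feature):
    # Toy proxy for validation performance of a model using this feature.
    return {"a+b": 0.61, "a*b": 0.70, "a-b": 0.55,
            "a/b": 0.64, "log(a)": 0.58}[feature]

def febp_loop(rounds=10, k=2, seed=0):
    rng = random.Random(seed)
    pool = {}
    for _ in range(rounds):
        top = sorted(pool, key=pool.get, reverse=True)[:k]
        candidate = propose(top, rng)
        pool[candidate] = score(candidate)
    return sorted(pool, key=pool.get, reverse=True)[:k]

print(febp_loop())
```

In the real system the proposal step conditions on dataset semantics, which is what lets the constructed features come with natural-language explanations.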
Churn Prediction for CRM
With NoSQL databases, companies capture every interaction between a customer and the company. Such temporal and unstructured data present opportunities to apply relatively new memory-augmented recurrent neural networks to model sequences of customer events, with the particular goal of predicting retention. The combination of heterogeneous events mixing random and regular time intervals, together with the relatively short length of many individual observations, makes traditional time series modeling unsuitable.
By using data from corporate partners, we conclude that deep learning approaches have higher predictive power than traditional machine learning approaches.
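One way to make the irregular spacing of customer events usable by a recurrent model is to encode each event together with the time elapsed since the previous one. The event names and encoding below are illustrative assumptions, not the partner data:

```python
# Sketch: turn a timestamped, heterogeneous event stream into
# (event-type id, days since previous event) pairs, so an RNN sees the
# irregular time intervals explicitly instead of assuming a fixed step.

from datetime import datetime

def encode_sequence(events, vocab):
    """events: list of (timestamp, event_type) tuples sorted by time."""
    encoded, prev = [], None
    for ts, etype in events:
        delta = 0.0 if prev is None else (ts - prev).total_seconds() / 86400
        encoded.append((vocab[etype], round(delta, 2)))
        prev = ts
    return encoded

vocab = {"email_open": 0, "purchase": 1, "support_call": 2}
events = [
    (datetime(2024, 1, 1), "email_open"),
    (datetime(2024, 1, 4), "purchase"),
    (datetime(2024, 1, 4, 12), "support_call"),
]
print(encode_sequence(events, vocab))
# [(0, 0.0), (1, 3.0), (2, 0.5)]
```

This is exactly the kind of representation a fixed-interval time series model cannot accommodate, which motivates the sequence-model approach.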
Anomaly Detection with Recurrent Neural Networks
We consider a classification problem on labeled temporal data where one label, the minority label, is seldom present in the training data, so we consider the dataset to be imbalanced. Typical applications of this type are churn prediction, fraud detection, and rare event prediction. It is difficult to achieve high classification accuracy on imbalanced datasets, as most algorithms assume that the dataset is balanced. Most methods for improving model performance on imbalanced data involve decreasing the imbalance in the training set in order to satisfy the assumptions made by an algorithm.
We study methods that cope with imbalanced datasets and are based on recurrent neural networks. In particular, we have designed models relying on ensembles, autoencoders, and generative adversarial networks. Data from a research partner is used in the study.
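The autoencoder idea can be illustrated with reconstruction-error scoring: a model trained only on majority-class sequences reconstructs them well, so a large reconstruction error flags a likely rare event. The `reconstruct` stub below imitates a trained autoencoder with a smoothed copy; the threshold and data are made up for illustration:

```python
# Sketch of autoencoder-style anomaly scoring: flag a sequence as the rare
# class when its reconstruction error exceeds a threshold fit on
# majority-class data.

def reconstruct(seq):
    # Stub for a trained RNN autoencoder: here, crude smoothing via the mean.
    mean = sum(seq) / len(seq)
    return [mean] * len(seq)

def recon_error(seq):
    rec = reconstruct(seq)
    return sum((a - b) ** 2 for a, b in zip(seq, rec)) / len(seq)

def is_anomaly(seq, threshold):
    return recon_error(seq) > threshold

normal = [1.0, 1.1, 0.9, 1.0]   # typical pattern -> low error
spiky = [1.0, 9.0, 1.0, 1.0]    # rare pattern -> high error
print(is_anomaly(normal, 0.5), is_anomaly(spiky, 0.5))
# False True
```

Because the threshold is fit on majority-class data alone, this approach sidesteps the need to rebalance the training set.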
Robust Embeddings from Multiple Corpora
Corporations have many corpora in different areas. Maintaining a separate word2vec embedding model for each corpus is burdensome and fragile. We use generative adversarial networks to generate a single set of embeddings that is robust across all corpora.
http://dynresmanagement.com/uploads/3/5/2/7/35274584/gan-corpora.pdf
Nested Multi-Instance Image Classification
There are classification tasks that take as inputs groups of images rather than single images. To address such situations, we introduce a nested multi-instance deep network. The approach is generic in that it is applicable to general data instances, not just images. The network has several convolutional neural networks grouped together at different stages.
We also introduce methods to replace missing instances, as well as a manual dropout scheme for when a whole group of instances is missing.
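The nested aggregation can be sketched as two levels of pooling: instance vectors are pooled within each group, and group vectors are pooled into one bag-level representation. The max-pooling choice and the zero-vector placeholder for missing instances are assumptions for illustration, standing in for the paper's learned replacement:

```python
# Sketch of nested multi-instance aggregation over per-instance feature
# vectors (in the real network these come from the stage-wise CNNs).

PLACEHOLDER = [0.0, 0.0, 0.0]  # illustrative stand-in for a missing instance

def pool(vectors):
    """Element-wise max over a list of equal-length vectors."""
    return [max(col) for col in zip(*vectors)]

def bag_embedding(groups):
    pooled = []
    for group in groups:
        instances = [v if v is not None else PLACEHOLDER for v in group]
        pooled.append(pool(instances))   # pool within each group
    return pool(pooled)                  # pool across groups

groups = [
    [[0.2, 0.5, 0.1], None],            # second instance missing
    [[0.4, 0.1, 0.3], [0.0, 0.6, 0.2]],
]
print(bag_embedding(groups))  # [0.4, 0.6, 0.3]
```

A single classifier head on the bag-level vector then yields one prediction per group of images.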
http://dynresmanagement.com/uploads/3/5/2/7/35274584/multi-instance-image.pdf
Definition Modeling for Explainable AI
Today's Natural Language Processing systems rely on word embeddings, which are vector representations of terms. Embeddings have been shown to capture word syntax and semantics, but because the embeddings are opaque numeric vectors, it is difficult to determine exactly what word information they capture. This limits our ability to use embeddings in applications or to improve them. We are addressing this limitation by developing definition models, which make the semantics captured by embeddings explicit by writing out a natural-language definition for each word vector. See our demo and the paper for more.
TabEL: Entity Linking in Web Tables
A large amount of the relational data on the Web is expressed in tables. It has been estimated that the Web contains more than 150 million relational tables, and English Wikipedia alone features more than a million high-quality relational tables. A key step in turning this Web content into machine-processable knowledge involves linking the entities mentioned in the tables to a knowledge base. This problem is difficult due to ambiguity: for example, the word "Chicago" in a table may refer to the city, the stage musical, the movie, and so on. The TabEL system automatically links entities in tables to their referent entry in a knowledge base, and has been shown to be more accurate than previous systems.
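The disambiguation step can be illustrated by ranking knowledge-base candidates against the table row's context. The mini knowledge base and overlap scoring below are toy assumptions, not TabEL's actual collective model:

```python
# Toy sketch: pick the knowledge-base entry for an ambiguous cell value
# ("Chicago") by word overlap between the row context and each candidate's
# description. TabEL itself uses richer, jointly optimized signals.

KB = {
    "Chicago (city)": "city illinois united states population lake",
    "Chicago (musical)": "stage musical broadway 1975",
    "Chicago (film)": "movie film 2002 musical oscar",
}

def link_entity(mention_context, kb=KB):
    ctx = set(mention_context.lower().split())
    def overlap(entity):
        return len(ctx & set(kb[entity].split()))
    return max(kb, key=overlap)

row = "Chicago Illinois 2,746,388 population"
print(link_entity(row))  # Chicago (city)
```

Surrounding cells in the same row and column provide exactly this kind of context, which is why tables are a comparatively tractable setting for entity linking.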
