MSIA 414: Text Analytics

Quarter Offered

Fall; Roxana Girju


This course explores techniques for analyzing unstructured text such as that found in emails, text messages, conversation transcripts, web pages, books, and scientific journals. The course strives to balance breadth and depth, presenting both an overview of the field and some insight into the mathematical underpinnings of a few representative techniques.

Students in the course will gain a deep understanding of a wide range of probabilistic computational techniques applied to language data. The foundational computational models explored include finite-state transducers, n-gram models, noisy channel models, naive Bayes, hidden Markov models, maximum entropy models, latent Dirichlet allocation, and probabilistic grammars. Students will also learn to apply these techniques to many common text analytics problems, such as tokenization, stemming, search, retrieval, co-occurrence analysis, spelling correction, part-of-speech tagging, named entity recognition, relation extraction, coreference resolution, and syntactic parsing. This approach teaches students to apply state-of-the-art techniques while giving them a generalizable foundation that will enable them to understand new text analytics models throughout their careers.