Student Research | NER Wars: Evaluating Named Entity Recognition Methods on the Star Wars Original Trilogy

By Macon McLean, Class of 2016

This excerpt is taken from an MSiA student research blog post. Each month, students in our program submit original extracurricular research as part of our blog competition. The winners are published on the MSiA Student Research Blog and our program website, and receive a chance to attend an analytics conference of their choice. Visit our blog to see more.

Like all people with good taste[1], I have long been a fan of the Star Wars film series.  Ever since witnessing the theatrical re-release of the holy triptych when I was a kid, I have marveled at the way George Lucas invented a living, breathing universe for viewers to enjoy.

Part of making such a universe feel fully realized is developing a unique vocabulary for its characters to use, and the Star Wars movies pass this test with flying colors. The evidence is that nearly everybody who’s witnessed one of these adventures from a galaxy far, far away remembers the names of “Luke Skywalker” and “Darth Vader”, and can describe a “Jedi” or “the Force” with ease. These four examples are some of the Star Wars universe’s named entities: words or phrases that clearly describe certain concepts in a way that differentiates them from other concepts with similar attributes.

Named entities are the focus of named entity recognition (NER), the task of automatically identifying and classifying these entities in a given corpus. NER can be used in creative and meaningful ways, like examining locations in nineteenth-century American fiction[2] for an analysis of named locations in literature[3]. It is often used as part of relation extraction, in which computers pore through large volumes of unstructured text to identify possible relationships and record them in a standardized, tabular form. NER classifiers are computational models trained to support these efforts, minimizing the need for manual tagging by humans and preparing the data for inference. Consider the following sample sentence:

“Steve Spurrier played golf with Darius Rucker at Kiawah Island in May.”

Using just a few simple classes (PERSON, LOCATION, DATE), plus O for any word that falls outside a named entity, we can examine each word in the sentence and tag it as follows (commas inserted):

“PERSON, PERSON, O, O, O, PERSON, PERSON, O, LOCATION, LOCATION, O, DATE.”

Tagged sentences like this one can then be used to train an NER classifier, helping it learn when and how to apply those three types of tags.
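
To make the tagging concrete, here is a minimal Python sketch of how the sample sentence and its tags could be paired up as training data, and how consecutive tags could be merged back into whole entities. The helper names (`tokenize`, `group_entities`) and the whitespace tokenizer are my own assumptions for illustration, not part of any particular NER library.

```python
# Sample sentence and the per-word tags from the example above.
sentence = "Steve Spurrier played golf with Darius Rucker at Kiawah Island in May."
tags = ["PERSON", "PERSON", "O", "O", "O", "PERSON", "PERSON",
        "O", "LOCATION", "LOCATION", "O", "DATE"]


def tokenize(text):
    """Naive tokenizer (an assumption): split on whitespace, trim periods and commas."""
    return [word.strip(".,") for word in text.split()]


def group_entities(tokens, tags):
    """Merge runs of consecutive tokens that share a non-O tag into one entity."""
    entities = []
    current_tokens, current_tag = [], None
    for token, tag in zip(tokens, tags):
        if tag == current_tag and tag != "O":
            # Same entity continues, e.g. "Steve" followed by "Spurrier".
            current_tokens.append(token)
        else:
            # Tag changed: store the finished entity (if any) and start a new run.
            if current_tag not in (None, "O"):
                entities.append((" ".join(current_tokens), current_tag))
            current_tokens, current_tag = [token], tag
    # Store the final run if the sentence ends on an entity.
    if current_tag not in (None, "O"):
        entities.append((" ".join(current_tokens), current_tag))
    return entities


tokens = tokenize(sentence)
training_pairs = list(zip(tokens, tags))   # (word, tag) pairs a classifier could learn from
entities = group_entities(tokens, tags)

for text, label in entities:
    print(f"{label}: {text}")
```

One caveat about this simple scheme: two different entities of the same class that sit directly next to each other would be merged into one, which is why many tagging schemes add begin/inside prefixes to the class labels.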
