Student Research | Semantic Network Analysis of Ideological Communities on Twitter

By Logan Wilson, Class of 2018

This excerpt is taken from an MSiA student research blog posting. Each month, students in our program submit original extracurricular research as part of our blog competition. The winner(s) are published to the MSiA Student Research Blog, our program website, and receive a chance to attend an analytics conference of their choice. Visit our blog to see more.


One of the primary contributions of Twitter to modern media is the notion of the “hashtag” – a single word or phrase that captures an idea, concept, or movement. As more and more people see and share the hashtag, the movement grows and evolves, giving way to new ideas and forming new hashtags. By applying modern analytical techniques of network analysis and natural language processing, paired with the huge troves of available Twitter data, we can examine these hashtags and their relationships to better capture the underlying ideas.

On October 6th, 2018, Brett Kavanaugh was sworn into the Supreme Court after several weeks of fierce debates and accusations regarding Kavanaugh’s alleged sexual misconduct. Throughout the entire period, Twitter served as a platform for humans (and bots) all around the world to voice their opinions and advocate for ideas via their usage of hashtags. Three days later (which is conveniently when I was able to get my Twitter API access sorted out), Twitter was still awash with memes and emotions surrounding the event. The idea was simple – design a network of related hashtags, identify the internal communities, and semantically analyze the tweets associated with each community.

Data Collection

The first problem was obtaining the data. The search functionality of the Twitter API is fairly robust, but determining what to search for proved an interesting puzzle. Many ideas and perspectives on Twitter can be related to a subject without referring to it explicitly. For example, consider the tweet in Figure 1 that was retweeted over 126,000 times.

Fig. 1: Tweet by @LynzyLab

It’s clearly referring to a social perspective connected to the allegations against Kavanaugh. Yet, if we only searched for tweets explicitly referencing Kavanaugh, we would miss tweets like this altogether.

My solution required a little bit of creativity and a lot of patience. I started by scraping 1,000 tweets containing the hashtag “#kavanaugh.” I then extracted all of the hashtags from these tweets (roughly 500). Then, for each of those hashtags, I scraped another 1,000 tweets. So in the first 1,000 tweets directly referencing #kavanaugh we’d likely find at least one tweet referencing #TheResistance, which would then allow us to scrape Lynzy’s tweet above. The Twitter API limits you to a maximum of 18,000 tweets every 15 minutes, so this can take quite a while. Additionally, a lot of the results returned are retweets, so after removing all the duplicates we end up with around 130,000 tweets. While this is certainly not reflective of the entire universe of related tweets, it’s a good place to get started.
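The two-round collection described above can be sketched roughly as follows. This is only an illustration, not the code used for the post: `search` is a hypothetical callable standing in for whatever Twitter API client you use, and the tweet dictionaries with `id` and `text` keys are an assumed shape. Deduplication here is by tweet ID only; collapsing retweets would need extra handling that depends on the payload your client returns.

```python
import re
from typing import Callable, Dict, List

# Matches "#word" and captures the word without the leading "#".
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text: str) -> List[str]:
    """Return the lowercased hashtags found in a tweet's text."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def snowball_collect(
    seed: str,
    search: Callable[[str, int], List[dict]],  # hypothetical: (hashtag, limit) -> tweets
    per_query: int = 1000,
) -> Dict[int, dict]:
    """Collect tweets for a seed hashtag, then for every hashtag seen alongside it."""
    tweets: Dict[int, dict] = {}

    # Round 1: tweets that mention the seed hashtag directly.
    for tweet in search(seed, per_query):
        tweets[tweet["id"]] = tweet

    # Gather every other hashtag that appears in the round-1 tweets.
    related = set()
    for tweet in tweets.values():
        related.update(extract_hashtags(tweet["text"]))
    related.discard(seed.lower())

    # Round 2: one more search per related hashtag, skipping tweet IDs already seen.
    for tag in sorted(related):
        for tweet in search(tag, per_query):
            tweets.setdefault(tweet["id"], tweet)

    return tweets
```

In practice the `search` function would also need to pause between calls to stay under the rate limit mentioned above.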

Building the Network

After a little bit of processing (okay, a lot – Twitter data is ugly), we can start building our hashtag network. We’ve got way too many hashtags (around 95,000), so we first eliminate all hashtags occurring fewer than 100 times, since hashtags used by only a few people won’t be able to tell us much. We create links in our network using the Jaccard index formula in Figure 2.
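As a minimal sketch of the frequency cut, assuming each tweet has already been reduced to a list of its hashtags (the function name and input shape are illustrative, not taken from the original code), and counting each hashtag at most once per tweet:

```python
from collections import Counter
from typing import Iterable, List, Set


def frequent_hashtags(tweet_hashtags: Iterable[List[str]], min_count: int = 100) -> Set[str]:
    """Return the hashtags that appear in at least `min_count` tweets."""
    # set(tags) ensures a hashtag repeated inside one tweet is counted once.
    counts = Counter(tag for tags in tweet_hashtags for tag in set(tags))
    return {tag for tag, n in counts.items() if n >= min_count}
```

Whether to count tweets or raw occurrences is a judgment call; counting tweets keeps a single spammy tweet from inflating a hashtag's total.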

Fig. 2: Jaccard Index Formula
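Written out, the standard Jaccard index for two hashtags is the size of the intersection divided by the size of the union. The notation here is an assumption for illustration: T_A and T_B denote the sets of tweets containing hashtags #A and #B.

```latex
J(A, B) = \frac{\lvert T_A \cap T_B \rvert}{\lvert T_A \cup T_B \rvert}
```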

Specifically, we define an undirected edge to exist between two nodes if their Jaccard index is greater than 0.1. This means that hashtags #A and #B are connected if more than 10% of tweets containing either #A or #B contain both #A and #B. We can build and visualize the network with the Python library NetworkX, as shown in Figure 3.
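The edge rule can be sketched in plain Python as below. This is an illustrative sketch rather than the post's actual code: the function names, the list-of-hashtag-lists input, and the `keep` set of hashtags to retain (for example, those that pass a frequency cut) are all assumptions.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_edges(
    tweet_hashtags: Iterable[List[str]],
    keep: Set[str],
    threshold: float = 0.1,
) -> List[Tuple[str, str, float]]:
    """Return (hashtag, hashtag, score) for every pair whose Jaccard index exceeds the threshold."""
    # Map each retained hashtag to the set of tweet indices that contain it.
    tag_to_tweets: Dict[str, Set[int]] = {}
    for idx, tags in enumerate(tweet_hashtags):
        for tag in set(tags) & keep:
            tag_to_tweets.setdefault(tag, set()).add(idx)

    # Compare every unordered pair of hashtags once.
    edges = []
    for a, b in combinations(sorted(tag_to_tweets), 2):
        score = jaccard(tag_to_tweets[a], tag_to_tweets[b])
        if score > threshold:
            edges.append((a, b, score))
    return edges
```

The resulting triples can be loaded into an undirected NetworkX graph by creating `networkx.Graph()` and calling its `add_weighted_edges_from` method with the edge list. Comparing all pairs is quadratic in the number of hashtags, which is one more reason to prune rare hashtags first.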

Fig. 3: Initial Hashtag Network
