What do people think about Super Bowl 2022?

In this post, I am going to download 50,000 Super Bowl tweets from the last two weeks and perform sentiment analysis on them to gather market intelligence on the Super Bowl 2022 playoffs. What are people’s opinions about the Super Bowl 2022?

To do this, I will need to use Natural Language Processing as a way to gain insights into my data. One of the most common forms of analysis we can exploit using NLP is called sentiment analysis, and it consists of converting a text into a score that estimates its sentiment. There are several models we can use to perform sentiment analysis, but they all fulfill the same purpose.
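To make the idea of "converting a text into a score" concrete, here is a minimal sketch of a lexicon-based scorer. It is a toy illustration only: the word lists and the toy_polarity function are made up for this example and are not the model used later in this post.

```python
# Toy lexicon-based scorer: NOT the model used later in this post.
# The two word lists are invented purely for illustration.
POSITIVE_WORDS = {"good", "great", "love", "best", "win"}
NEGATIVE_WORDS = {"bad", "worst", "hate", "lose", "wrong"}


def toy_polarity(text: str) -> float:
    """Return a score in [-1, 1]: (positive hits - negative hits) / total hits."""
    words = text.lower().split()
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    hits = pos + neg
    # A text without any known word is treated as neutral.
    return 0.0 if hits == 0 else (pos - neg) / hits


print(toy_polarity("what a great game"))
print(toy_polarity("worst call ever"))
print(toy_polarity("kickoff is at six"))
```

Real sentiment models are far richer than this, but the contract is the same: text goes in, a number that estimates its sentiment comes out.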

The most common use case of sentiment analysis is to estimate the market demand for a certain product, ideally spotting a trend just as it begins. In finance, this is one of the most sought-after ML applications.

The project will be following these steps:

  1. Download data from Twitter
  2. Preprocess the data
  3. Perform sentiment analysis
  4. Analyze results

1. Download data from Twitter

To download data from Twitter without using its metered API, and therefore without any limit on the volume of data I wish to scrape, I can choose among different libraries. One of the most common is called twint; however, after the latest Twitter updates, it has not been working very well.

As a valid and also simpler alternative, I will be using snscrape.

!pip install snscrape

After installing the library with pip, I need to declare the search parameters. Because I may want to run more than one query (for example, I could search for the sentiment on the top 10 billionaires), I want a control panel that tells the program what to do.

As such, I will use movie_dict as a variable to store the instructions for all the searches: each key names a search, and its value holds the query string and the maximum number of tweets to download. For each search, a csv will be created with all the data I have been able to scrape from Twitter:

import snscrape.modules.twitter as sntwitter
import pandas as pd
import progressbar
from time import sleep
from datetime import datetime
import os

movie_dict = {'super_bowl': ['super bowl superbowl since:2022-01-01 until:2022-01-17', 50000]}
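If the control panel grows to several searches, writing each query string by hand becomes error-prone. The sketch below shows one possible way to build the strings. The build_query helper, the search_dict name and the second entry are hypothetical and only illustrate the pattern; the first entry reproduces the query used in this post.

```python
from datetime import date


def build_query(keywords: str, since: date, until: date) -> str:
    """Hypothetical helper: join keywords with since:/until: date filters."""
    return f"{keywords} since:{since.isoformat()} until:{until.isoformat()}"


# Same layout as the control panel above: name -> [query string, max tweets].
search_dict = {
    "super_bowl": [
        build_query("super bowl superbowl", date(2022, 1, 1), date(2022, 1, 17)),
        50000,
    ],
    # Made-up second search, only to show how another entry would be added.
    "halftime_show": [
        build_query("halftime show", date(2022, 1, 1), date(2022, 1, 17)),
        1000,
    ],
}

for name, (query, limit) in search_dict.items():
    print(name, query, limit)
```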

The following is the code that executes the scrape:

today = datetime.today().strftime('%Y%m%d')[2:]+'_'
for index, movie_name in enumerate(movie_dict):
    print(movie_name, '%')
    tweets_list1 = []
    bar = progressbar.ProgressBar(maxval=movie_dict[movie_name][1]+2, widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    bar.start()
    for i,tweet in enumerate(sntwitter.TwitterSearchScraper(f'{movie_dict[movie_name][0]}').get_items()): #run the search query
        bar.update(i+1)
        if i>=movie_dict[movie_name][1]: #stop once the requested number of tweets has been collected
            break
        #print(movie_name, i, tweet)
        tweets_list1.append([tweet.date, tweet.id, tweet.content, tweet.user.username]) #declare the attributes to be returned
    tweets_df1 = pd.DataFrame(tweets_list1, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])

    tweets_df1[['Datetime', 'Text']].to_csv(f'{index}.csv')
    bar.finish()

This code is an improved version of the standard code used to run a query that filters the tweets you wish to download from Twitter. You can use it to download not just one query, but a whole list of queries.
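An alternative way to cap the number of downloaded items is itertools.islice, which stops consuming an iterator after a fixed count. The sketch below uses a stand-in generator (fake_scraper, a made-up name) instead of a real scraper, so it runs offline; it assumes the scraper's get_items() can be consumed like any ordinary Python iterator, as the loop above does.

```python
from itertools import islice


def fake_scraper():
    """Stand-in for a scraper's item generator: yields integers forever."""
    i = 0
    while True:
        yield i
        i += 1


def take(items, limit):
    """Collect at most `limit` items, then stop pulling from the iterator."""
    return list(islice(items, limit))


tweets = take(fake_scraper(), 5)
print(len(tweets))
```

Because islice never asks for more than the requested number of items, no manual counter or break statement is needed.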

2. Preprocess the data

Now that a csv file has been created for every query in my control panel, let us look at the raw data of a single query:

import pandas as pd

#empty rows are read as NaN, so drop them and reset the index
df = pd.read_csv('download/merged.csv')[['text']].dropna().reset_index(drop=True)
df

Because some of the rows may be null when importing the dataset, I am dropping them and resetting the index. I am also going to apply a small preprocessing snippet. Preprocessing is a step that you can customize depending on your needs. In this case, because I only want to get rid of links, mentions, and punctuation, I am going to use the following two steps:

#get rid of links and mentions
df["text"] = df["text"].apply(lambda x : ' '.join([s for s in x.split(' ') if s.find('@') == -1 and s.find('www') == -1 and s.find('https') == -1]))

#replace runs of non-word characters (punctuation, symbols) with a space
df = df.replace(r'\W+', ' ', regex=True)
df
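To see what these two steps do to a single tweet, here is a minimal standalone sketch that applies the same logic to a plain string. The clean_tweet name and the sample tweet are made up for illustration, and the final strip() is an addition that the dataframe version does not perform.

```python
import re


def clean_tweet(text: str) -> str:
    # Drop whitespace-separated tokens that contain a mention or link marker.
    kept = [
        tok for tok in text.split(" ")
        if "@" not in tok and "www" not in tok and "https" not in tok
    ]
    # Collapse every run of non-word characters into a single space.
    # strip() is extra: it only trims the leading and trailing spaces.
    return re.sub(r"\W+", " ", " ".join(kept)).strip()


print(clean_tweet("@nfl Go Bills!!! https://t.co/abc #SuperBowl"))
# prints: Go Bills SuperBowl
```

Note that a hashtag keeps its text: only the # symbol is removed, because it is a non-word character.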

This is a screenshot of the dataframe after preprocessing:

raw data sample

3. Perform sentiment analysis

I am now going to apply sentiment analysis to our cleaned data. There is a myriad of libraries you can use to perform this task, such as transformers, textblob, and spacy. For this tutorial I am going to use the latest version of spacy, and its extension called spacytextblob.

To install it, I will need to run the following commands and restart the notebook:

!pip install spacytextblob==3.0.1
!pip install spacy==3.2.1
!python -m textblob.download_corpora
!python -m spacy download en_core_web_sm

Once the installation is complete, we can run the sentiment analysis and append the score to our dataframe:

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("spacytextblob")

df['sentiment'] = df['text'].apply(lambda x : nlp(x)._.polarity)
df_sentiment = df.sort_values('sentiment').reset_index(drop=True)
df_sentiment

As we can see, this is the final result:

sample of the sentiment analysis

I decided to sort the values starting from the most negative, so that we could see some of the most shocking comments about the Super Bowl.
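The polarity score is a number between -1 (most negative) and 1 (most positive). A simple way to read it is to turn it into a label by splitting at zero, which is also the split used in the analysis in section 4. The sketch below is only an illustration: the label_from_polarity name and the sample scores are made up.

```python
def label_from_polarity(score: float) -> str:
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up scores, not taken from the dataset.
scores = [-0.8, 0.0, 0.35]
labels = [label_from_polarity(s) for s in scores]
print(labels)
```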

4. Analyze results

Before analyzing the content of the tweets, we are first going to preprocess our data even more. There are several preprocessing strategies; in this post, we are going to:

  • Lemmatize each word
  • Delete extra characters
  • Remove stop words

I am using my own function to perform this cleaning. Similar preprocessing functions are widely available, so if you wish to try other code, perhaps simpler or performing only a single preprocessing step, you can easily find it online:

import re
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer() 

#adding a counter to check the progress of the algo while it runs
global counter
counter = 0
def preprocess(sentence, stemming=False, lemmatizing=False):
  global counter
  counter += 1
  if counter % 100 == 0:
    pass
    #print(counter)

  #clean as much as possible, but not apply strong editing to the text, yet
  sentence=str(sentence)
  tokenizer = RegexpTokenizer(r'\w+')

  sentence = sentence.lower()
  sentence=sentence.replace('{html}',"") 
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', sentence)
  rem_url=re.sub(r'http\S+', '',cleantext)
  rem_num = re.sub('[0-9]+', '', rem_url)
  tokens = tokenizer.tokenize(rem_num)
  
  #load the stop words once per sentence instead of once per word
  stop_words = set(stopwords.words('english'))
  filtered_words = [w for w in tokens if len(w) > 2 and w not in stop_words]
  
  if stemming == True and lemmatizing == False:
    stem_words=[stemmer.stem(w) for w in filtered_words]
    return " ".join(stem_words)

  if stemming == False and lemmatizing == True:
    lemma_words=[lemmatizer.lemmatize(w) for w in filtered_words]
    return " ".join(lemma_words)

  if stemming == True and lemmatizing == True:
    stem_words=[stemmer.stem(w) for w in filtered_words]
    lemma_words=[lemmatizer.lemmatize(w) for w in stem_words]
    return " ".join(lemma_words)
  
  #at the end of the algo we return filtered words
  return " ".join(filtered_words)

#preprocess the sentiment text
df_sentiment['text'] = df_sentiment['text'].apply(lambda x: preprocess(x, stemming=False, lemmatizing=True))
df_sentiment
sample of preprocessed data

There are several ways we can analyze the results from the sentiment analysis. One common practice is to separate the samples with negative sentiment from the ones with positive sentiment and extract the most common words from each group.

df_neg = df_sentiment[df_sentiment['sentiment'] < 0]
df_pos = df_sentiment[df_sentiment['sentiment'] > 0]

First of all, let us see how many positive and negative tweets we have inferred from our data, to get a general idea of the public's opinion regarding the Super Bowl:

print(len(df_neg))  #5136
print(len(df_pos))  #40177
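To put the two counts in perspective, the short sketch below computes the share of positive tweets among the polarized ones (tweets scored exactly 0 are excluded, since they fall in neither group) and the ratio of positive to negative tweets. The variable names are made up; the two counts are the ones reported above.

```python
n_neg = 5136   # tweets with sentiment < 0 (count reported above)
n_pos = 40177  # tweets with sentiment > 0 (count reported above)

polarized = n_neg + n_pos     # tweets that are not neutral
pos_share = n_pos / polarized  # fraction of polarized tweets that are positive
ratio = n_pos / n_neg          # positive tweets per negative tweet

print(f"{pos_share:.1%} of polarized tweets are positive")
print(f"{ratio:.1f} positive tweets per negative tweet")
```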

Let us extract the most common words found in both positive and negative tweets:

from collections import Counter

#count every word in each group and keep the 100 most frequent
positive_words = [word for word, count in Counter(' '.join(df_pos['text']).split()).most_common(100)]

negative_words = [word for word, count in Counter(' '.join(df_neg['text']).split()).most_common(100)]

These are the most common words found in the positive tweets:

['bowl', 'super', 'win', 'year', 'team', 'superbowl', 'bill', 'playoff', 'game', 'brady', 'like', 'get', 'nfl', 'going', 'time', 'would', 'one', 'fan', 'cowboy', 'winning', 'bengal', 'make', 'packer', 'see', 'season', 'last', 'first', 'good', 'patriot', 'think', 'eagle', 'chief', 'want', 'bucs', 'tom', 'back', 'know', 'play', 'got', 'lol', 'beat', 'still', 'never', 'pat', 'next', 'let', 'amp', 'raider', 'even', 'er', 'gonna', 'need', 'another', 'buffalo', 'right', 'since', 'really', 'better', 'way', 'run', 'best', 'week', 'pick', 'steelers', 'every', 'football', 'coach', 'ram', 'que', 'could', 'today', 'day', 'bay', 'guy', 'take', 'championship', 'mvp', 'ever', 'titan', 'say', 'great', 'new', 'show', 'love', 'hope', 'los', 'afc', 'two', 'card', 'also', 'ring', 'wild', 'defense', 'watch', 'dallas', 'champion', 'said', 'well', 'people', 'look']

These, instead, are the most common words found in the negative tweets:

['bowl', 'super', 'game', 'superbowl', 'year', 'team', 'playoff', 'like', 'bill', 'fan', 'get', 'brady', 'nfl', 'time', 'one', 'win', 'never', 'last', 'would', 'bad', 'patriot', 'make', 'cowboy', 'going', 'season', 'eagle', 'bengal', 'play', 'still', 'think', 'bucs', 'pat', 'championship', 'packer', 'worst', 'got', 'see', 'since', 'beat', 'even', 'hate', 'know', 'want', 'tom', 'chief', 'bay', 'back', 'amp', 'let', 'another', 'say', 'next', 'shit', 'every', 'football', 'loss', 'way', 'green', 'take', 'gonna', 'fuck', 'crazy', 'er', 'ever', 'raider', 'first', 'afc', 'made', 'watch', 'day', 'round', 'fucking', 'week', 'nfc', 'could', 'guy', 'worse', 'defense', 'buffalo', 'que', 'lose', 'look', 'least', 'coach', 'said', 'mean', 'getting', 'wrong', 'lost', 'steelers', 'run', 'today', 'call', 'big', 'really', 'also', 'went', 'point', 'remember', 'people']
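Since both lists share most of their words, it can help to look only at the words that appear in one list and not in the other. The sketch below does this for the first 20 entries of each list, copied from above. The only_in helper is a made-up name, and because only a slice of each list is compared, the result is illustrative: for example, 'never' and 'last' do appear further down the positive list.

```python
# First 20 entries of each list above, copied verbatim.
top_positive = ['bowl', 'super', 'win', 'year', 'team', 'superbowl', 'bill',
                'playoff', 'game', 'brady', 'like', 'get', 'nfl', 'going',
                'time', 'would', 'one', 'fan', 'cowboy', 'winning']
top_negative = ['bowl', 'super', 'game', 'superbowl', 'year', 'team',
                'playoff', 'like', 'bill', 'fan', 'get', 'brady', 'nfl',
                'time', 'one', 'win', 'never', 'last', 'would', 'bad']


def only_in(first, second):
    """Words of `first` that are absent from `second`, in original order."""
    seen = set(second)
    return [w for w in first if w not in seen]


print(only_in(top_negative, top_positive))  # ['never', 'last', 'bad']
print(only_in(top_positive, top_negative))  # ['going', 'cowboy', 'winning']
```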

5. Conclusion

Given the insights we have inferred using NLP, we can see that the sentiment of Super Bowl tweets is overwhelmingly positive: 40,177 tweets received a positive score, against 5,136 with a negative one.
