In this post, I am going to download and analyze 50,000 ETH tweets scraped from Twitter between January and mid-March 2022 and perform sentiment analysis on them to gather market intelligence. What are people’s opinions about Ethereum?
To do this, I will need to use Natural Language Processing as a way to gain insights into my data. One of the most common forms of analysis we can exploit using NLP is called sentiment analysis, and it consists of converting a text into a score that estimates its sentiment. There are several models we can use to perform sentiment analysis, but they all fulfill the same purpose.
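As a quick illustration of what such a score looks like, here is a minimal TextBlob snippet (the two example sentences are made up); polarity ranges from -1 for very negative to +1 for very positive:
from textblob import TextBlob
#polarity is a float in [-1, 1]: values below 0 indicate negative sentiment
print(TextBlob("Ethereum is an amazing project").sentiment.polarity)        #expected to be > 0
print(TextBlob("The gas fees are terrible right now").sentiment.polarity)   #expected to be < 0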
The most common use case of sentiment analysis is to estimate market demand for a certain product, hopefully entering a trend just as it begins. In finance, this is one of the most sought-after ML applications.
The project will be following these steps:
- Download data from Twitter
- Preprocess the data
- Perform sentiment analysis
- Analyze results
1. Download data from Twitter
To download data from Twitter without using its metered API, and hence without any limit on the volume of data I wish to scrape, I can use several libraries. One of the most common is called twint; however, after the latest Twitter updates, it has not been working very well.
As a valid and also simpler alternative, I will be using snscrape.
!pip install snscrape
After installing the library with pip, I need to declare the search parameters. Because I may want to run more queries later (for example, searching for the sentiment on the top 10 billionaires), I want a control panel that gives instructions to the program.
As such, I will use the variable movie_dict to store all the instructions needed to perform multiple searches. For each search, a csv file will be created with all the data I have been able to scrape from Twitter:
import snscrape.modules.twitter as sntwitter
import pandas as pd
import progressbar
from time import sleep
from datetime import datetime
import os
movie_dict = {'ETH': ['Ethereum ETH since:2022-01-01 until:2022-03-17', 50000]}
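If I wanted to scrape more than one topic in the same run, I could simply add more entries to the control panel; the extra queries below are hypothetical examples:
#each entry maps a name to [search query, number of tweets to scrape]
movie_dict = {'ETH': ['Ethereum ETH since:2022-01-01 until:2022-03-17', 50000],
              'BTC': ['Bitcoin BTC since:2022-01-01 until:2022-03-17', 50000],   #hypothetical extra query
              'SOL': ['Solana SOL since:2022-01-01 until:2022-03-17', 50000]}    #hypothetical extra query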
The following is the code that executes the scrape:
today = datetime.today().strftime('%Y%m%d')[2:]+'_'

for index, movie_name in enumerate(movie_dict):
    print(movie_name)
    tweets_list1 = []
    bar = progressbar.ProgressBar(maxval=movie_dict[movie_name][1]+2, widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    bar.start()
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(f'{movie_dict[movie_name][0]}').get_items()): #iterate over the tweets returned by the query
        bar.update(i+1)
        if i > movie_dict[movie_name][1]: #number of tweets you want to scrape
            break
        #print(movie_name, i, tweet)
        tweets_list1.append([tweet.date, tweet.id, tweet.content, tweet.user.username]) #declare the attributes to be returned
    tweets_df1 = pd.DataFrame(tweets_list1, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])
    tweets_df1[['Datetime', 'Text']].to_csv(f'{index}.csv')
    bar.finish()
This code is an improved version of the standard snippet used to run a query and filter the tweets you wish to download from Twitter. You can use it to download not just one query, but a whole list of queries.
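The scraper writes one csv file per query, while the next section reads a single file called download/merged.csv. A minimal sketch of that merge step could look like the following (the download/ folder and the rename of the Text column to lower-case are my assumptions):
import glob
import os
import pandas as pd
#combine the per-query csv files written above (0.csv, 1.csv, ...) into a single file
files = sorted(glob.glob('[0-9]*.csv'))
merged = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
merged = merged.rename(columns={'Text': 'text'})  #lower-case column name used in the next section
os.makedirs('download', exist_ok=True)
merged.to_csv('download/merged.csv', index=False)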
2. Preprocess the data
Now that a csv file has been created for every query in my control panel, let us look at the raw data of a single query:
import pandas as pd
#empty rows are imported as NaN, so drop them and reset the index
df = pd.read_csv('download/merged.csv')[['text']].dropna().reset_index(drop=True)
df
Because some of the rows may be null when importing the dataset, I am dropping them and resetting the index. I am also going to apply a small preprocessing snippet. Preprocessing is a step that you can customize depending on your needs. In this case, because I only want to get rid of mentions, links, and non-word characters, I am going to use the following two lines:
#get rid of mentions and links
df["text"] = df["text"].apply(lambda x : ' '.join([s for s in x.split(' ') if s.find('@') == -1 and s.find('www') == -1 and s.find('https') == -1]))
#replace non-word characters (punctuation, emojis, etc.) with spaces
df = df.replace(r'\W+', ' ', regex=True)
df
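To see what these two steps actually do, here is a quick check on a single made-up tweet:
sample = pd.DataFrame({'text': ['Check https://example.com @someone ETH to the moon!!!']})
#drop tokens containing @, www or https, then replace non-word characters with spaces
sample['text'] = sample['text'].apply(lambda x : ' '.join([s for s in x.split(' ') if s.find('@') == -1 and s.find('www') == -1 and s.find('https') == -1]))
sample = sample.replace(r'\W+', ' ', regex=True)
print(sample['text'][0])  #-> 'Check ETH to the moon '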
This is a screenshot of the dataframe after preprocessing:

3. Perform sentiment analysis
I am now going to apply sentiment analysis to our cleaned data. There is a myriad of sentiment analysis libraries you can use to perform this task, such as transformers, textblob, and spacy. For this tutorial I am going to use the latest version of spacy and its extension called spacytextblob.
To install it, I will need to run the following commands and restart the notebook:
!pip install spacytextblob==3.0.1
!pip install spacy==3.2.1
!python -m textblob.download_corpora
!python -m spacy download en_core_web_sm
Once the installation is complete, we can run the sentiment analysis and append the score to our dataframe:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("spacytextblob")
df['sentiment'] = df['text'].apply(lambda x : nlp(x)._.polarity)
df_sentiment = df.sort_values('sentiment').reset_index(drop=True)
df_sentiment
As we can see, this is the final result:

I decided to sort the values starting from the most negative, so that we can see some of the harshest comments regarding ETH.
4. Analyze results
Before analyzing the content of the tweets, we are first going to preprocess our data even more. There are several preprocessing strategies; in this post, we are going to take the following steps (a quick lemmatization vs. stemming example follows the list):
- Lemmatize each word
- Delete extra characters
- Remove stop words
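To see concretely what lemmatization does, and how it differs from the stemming option that the function below also supports, here is a quick check on a couple of sample words:
import nltk
nltk.download('wordnet')  #the lemmatizer needs the WordNet corpus
from nltk.stem import WordNetLemmatizer, PorterStemmer
#lemmatizing maps a word to its dictionary form, stemming just chops off suffixes
print(WordNetLemmatizer().lemmatize('prices'))  #-> price
print(PorterStemmer().stem('analysis'))         #-> analysi (not a real word)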
I am using my own function to perform this cleaning. Similar preprocessing functions are widely available, so if you wish to try other code, perhaps something simpler or that performs only a single preprocessing step, you can easily find one online:
import re
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

#adding a counter to check the progress of the algo while it runs
global counter
counter = 0

def preprocess(sentence, stemming=False, lemmatizing=False):
    global counter
    counter += 1
    if counter % 100 == 0:
        pass
        #print(counter)
    #clean as much as possible, but do not apply strong editing to the text yet
    sentence = str(sentence)
    tokenizer = RegexpTokenizer(r'\w+')
    sentence = sentence.lower()
    sentence = sentence.replace('{html}', "")
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', sentence)
    rem_url = re.sub(r'http\S+', '', cleantext)
    rem_num = re.sub('[0-9]+', '', rem_url)
    tokens = tokenizer.tokenize(rem_num)
    filtered_words = [w for w in tokens if len(w) > 2 if not w in stopwords.words('english')]
    if stemming == True and lemmatizing == False:
        stem_words = [stemmer.stem(w) for w in filtered_words]
        return " ".join(stem_words)
    if stemming == False and lemmatizing == True:
        lemma_words = [lemmatizer.lemmatize(w) for w in filtered_words]
        return " ".join(lemma_words)
    if stemming == True and lemmatizing == True:
        stem_words = [stemmer.stem(w) for w in filtered_words]
        lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]
        return " ".join(lemma_words)
    #if neither stemming nor lemmatizing was requested, return the filtered words
    return " ".join(filtered_words)
#preprocess the sentiment text
df_sentiment['text'] = df_sentiment['text'].apply(lambda x: preprocess(x, stemming=False, lemmatizing=True))
df_sentiment

There are several ways we can analyze the results of the sentiment analysis. One common practice is to separate the samples with negative sentiment from the ones with positive sentiment and extract the most common words from each group.
df_neg = df_sentiment[df_sentiment['sentiment'] < 0]
df_pos = df_sentiment[df_sentiment['sentiment'] > 0]
First of all, let us see how many positive and negative tweets we have inferred from our data, to get a general idea of the public's opinion regarding Ethereum:
print(len(df_neg))
print(len(df_pos))
7894
17322
Let us extract the 100 most common words found in the positive and the negative tweets:
from collections import Counter
positive_words = pd.DataFrame([dict(Counter(' '.join(df_pos['text'].values.tolist()).split(' ')))]).T.sort_values(0, ascending=False)[0:100].index
negative_words = pd.DataFrame([dict(Counter(' '.join(df_neg['text'].values.tolist()).split(' ')))]).T.sort_values(0, ascending=False)[0:100].index
These are the most common words found in the positive tweets:
['ethereum', 'eth', 'price', 'nft', 'bitcoin', 'last', 'btc', 'crypto', 'gas', 'nfts', 'market', 'right', 'new', 'gwei', 'tweet', 'hour', 'drop', 'nftcommunity', 'dropped', 'compared', 'level', 'increased', 'cryptocurrency', 'bought', 'project', 'fast', 'soon', 'current', 'blockchain', 'primary', 'sorare', 'season', 'fantasyfootball', 'serial', 'defi', 'get', 'social', 'slow', 'amp', 'normal', 'instant', 'ethgasprice', 'gasprice', 'binance', 'latest', 'coin', 'one', 'see', 'like', 'top', 'contract', 'news', 'low', 'via', 'usdt', 'free', 'opensea', 'buy', 'nftart', 'win', 'result', 'bnb', 'token', 'high', 'first', 'network', 'polygon', 'prediction', 'luna', 'time', 'solana', 'xrp', 'nftcollection', 'fee', 'day', 'chain', 'good', 'airdrop', 'amount', 'nftartist', 'best', 'usd', 'nftdrop', 'check', 'wallet', 'collection', 'make', 'support', 'address', 'worth', 'link', 'follow', 'transaction', 'available', 'bullish', 'nftcollector', 'lose', 'min', 'mention', 'great']
These, instead, are the most common words found in the negative tweets:
['eth', 'ethereum', 'price', 'usd', 'nft', 'average', 'day', 'market', 'last', 'level', 'number', 'bought', 'secondary', 'sorare', 'fantasyfootball', 'serial', 'season', 'bitcoin', 'date', 'btc', 'gwei', 'sale', 'floor', 'owner', 'utc', 'hapebeastgang', 'crypto', 'gas', 'coin', 'cryptocurrency', 'past', 'nfts', 'wallet', 'hour', 'amp', 'mention', 'unknown', 'blockchain', 'defi', 'current', 'time', 'via', 'fee', 'link', 'high', 'ethgasprice', 'move', 'chainlink', 'low', 'xrp', 'nftcommunity', 'rarible', 'symbol', 'nonfungible', 'gmt', 'reddit', 'digitalasset', 'detail', 'collectible', 'biz', 'chan', 'like', 'take', 'long', 'slow', 'look', 'ada', 'luna', 'fast', 'week', 'polygon', 'get', 'avax', 'address', 'one', 'ether', 'cardano', 'atom', 'cosmos', 'avalanche', 'lunarcrush', 'opensea', 'chain', 'solana', 'year', 'nftart', 'ftm', 'month', 'internet', 'maybe', 'network', 'terra', 'etc', 'bnb', 'fantom', 'mint', 'transaction', 'change', 'created', 'ethereumgas']
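If you prefer a visual summary over raw word lists, a minimal sketch like the one below (assuming matplotlib is installed) plots the 20 most frequent words in the positive tweets; the same can be done for df_neg:
import matplotlib.pyplot as plt
from collections import Counter
#count word frequencies in the positive tweets and plot the 20 most common ones
counts = Counter(' '.join(df_pos['text'].values.tolist()).split(' ')).most_common(20)
words, freqs = zip(*counts)
plt.figure(figsize=(10, 4))
plt.bar(words, freqs)
plt.xticks(rotation=75)
plt.title('Most common words in positive ETH tweets')
plt.tight_layout()
plt.show()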
5. Conclusion
Given the insights we have drawn using NLP, we can see that there is a prevalently positive opinion regarding ETH: positive tweets outnumber negative ones by more than two to one (17,322 against 7,894).