Another twitter sentiment analysis with python part 3


Photo by Samuel Ramos on Unsplash

This is the third part of the Twitter sentiment analysis project I am currently working on as a capstone for General Assembly London’s Data Science Immersive course. You can find the links to the previous posts below.

At the end of the second blog post, I had created a term frequency data frame that looks like this.

[Image: the term frequency data frame]

The indexes are the tokens from the tweets dataset (“Sentiment140”), and the numbers in the “negative” and “positive” columns represent how many times each token appeared in negative tweets and positive tweets.

Zipf’s Law

Zipf’s Law was first presented by the French stenographer Jean-Baptiste Estoup and later named after the American linguist George Kingsley Zipf. Zipf’s Law states that a small number of words are used all the time, while the vast majority are used very rarely. There is nothing surprising about this: we know that we use some words very frequently, such as “the”, “of”, etc., and we rarely use words like “aardvark” (an animal species native to Africa). However, what’s interesting is that “given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.”
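As a quick numeric illustration of that inverse proportionality (the top frequency of 60,000 below is entirely made up):

```python
# If the most frequent word occurs 60,000 times (a made-up number),
# Zipf's Law predicts the r-th ranked word occurs about 60,000 / r times.
f1 = 60000
expected = [f1 / r for r in range(1, 6)]
print(expected)  # [60000.0, 30000.0, 20000.0, 15000.0, 12000.0]
```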

If you want to know a bit more about Zipf’s Law, I recommend the below Youtube video.

Zipf’s Law can be written as follows: the rth most frequent word has a frequency f(r) that scales according to

f(r) ∝ 1 / r^α

for α ≈ 1.

Let’s see how the tweet tokens and their frequencies look on a plot.

import numpy as np
import matplotlib.pyplot as plt

y_pos = np.arange(500)
plt.figure(figsize=(10,8))
s = 1
# expected Zipfian frequency for rank r: f(1) / r**s
expected_zipf = [term_freq_df.sort_values(by='total', ascending=False)['total'].iloc[0]/(i+1)**s for i in y_pos]
plt.bar(y_pos, term_freq_df.sort_values(by='total', ascending=False)['total'][:500], align='center', alpha=0.5)
plt.plot(y_pos, expected_zipf, color='r', linestyle='--', linewidth=2, alpha=0.5)
plt.ylabel('Frequency')
plt.title('Top 500 tokens in tweets')

[Image: bar chart of the top 500 token frequencies with the expected Zipf curve]

The X-axis is the frequency rank, from the highest-ranked token on the left down to the 500th rank on the right. The Y-axis is the frequency observed in the corpus (in this case, the “Sentiment140” dataset). One thing to note is that actual observations in most cases do not strictly follow Zipf’s distribution, but rather follow a “near-Zipfian” trend.

We can see the plot follows the trend of Zipf’s Law, but there appears to be more area above the expected Zipf curve among the higher-ranked words.

Another way to plot this is on a log-log graph, with the X-axis being log(rank) and the Y-axis being log(frequency). On a log-log scale, the result will be a roughly straight line.
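Why a straight line? Taking logs of f(r) = C / r^s gives log f(r) = log C − s·log r, i.e. a line with slope −s. A tiny check of that slope, with arbitrary constants:

```python
import numpy as np

# f(r) = C / r**s plotted on log-log axes has constant slope -s
C, s = 60000, 1
ranks = np.array([1, 10, 100, 1000])
log_f = np.log10(C / ranks**s)
slopes = np.diff(log_f) / np.diff(np.log10(ranks))
print(slopes)  # [-1. -1. -1.]
```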

import numpy as np
import matplotlib.pyplot as plt

counts = term_freq_df.total.values
tokens = term_freq_df.index.values
ranks = np.arange(1, len(counts)+1)
indices = np.argsort(-counts)
frequencies = counts[indices]
plt.figure(figsize=(8,6))
plt.ylim(1, 10**6)
plt.xlim(1, 10**6)
plt.loglog(ranks, frequencies, marker=".")
# reference line for an exact Zipf distribution (slope -1 on log-log axes)
plt.plot([1, frequencies[0]], [frequencies[0], 1], color='r')
plt.title("Zipf plot for tweets tokens")
plt.xlabel("Frequency rank of token")
plt.ylabel("Absolute frequency of token")
plt.grid(True)
# annotate a log-spaced sample of tokens along the curve
for n in np.logspace(-0.5, np.log10(len(counts)-2), 25).astype(int):
    plt.text(ranks[n], frequencies[n], " " + tokens[indices[n]],
             verticalalignment="bottom", horizontalalignment="left")

[Image: log-log Zipf plot of tweet tokens with annotated samples]

Again we see a roughly linear curve, but it deviates above the expected line among the higher-ranked words, and at the lower ranks the observed line falls below the expected linear line.

At least we have shown that tweet tokens also follow a “near-Zipfian” distribution, but this raised my curiosity about the deviation from Zipf’s Law. The law itself states that actual observations are “near-Zipfian” rather than strictly bound to the law, but is the area we observed above the expected line at the higher ranks just chance? Or does it mean that tweets use frequent words more heavily than other text corpora? Is the difference from other text corpora statistically significant?

Although these sound like very interesting research questions, they are beyond the scope of this project, so I will move on to the next step: data visualisation.

Tweet Tokens Visualisation

After having seen how the tokens are distributed through the whole corpus, the next question in my head was how different the tokens in the two classes (positive, negative) are. This time, keeping the stop words will not help much, because the same high-frequency words (such as “the”, “to”) will be equally frequent in both classes. If these stop words dominate both classes, I won’t be able to get a meaningful result. So I decided to remove stop words, and also to limit max_features to 10,000 with CountVectorizer.

I will not go through the countvectorizing steps, since this was done in a similar way in my previous blog post; the details will be in the Jupyter Notebook that I will share at the end of this post. After countvectorizing, we now have token frequency data for 10,000 tokens without stop words, and it looks as below.

[Image: term frequency data frame for 10,000 tokens without stop words]

Let’s see what the top 50 words in negative tweets are on a bar chart.

y_pos = np.arange(50)
plt.figure(figsize=(12,10))
plt.bar(y_pos, term_freq_df2.sort_values(by='negative', ascending=False)['negative'][:50], align='center', alpha=0.5)
plt.xticks(y_pos, term_freq_df2.sort_values(by='negative', ascending=False)['negative'][:50].index,rotation='vertical')
plt.ylabel('Frequency')
plt.xlabel('Top 50 negative tokens')
plt.title('Top 50 tokens in negative tweets')

[Image: bar chart of the top 50 tokens in negative tweets]

Even though some of the top 50 tokens can provide some information about the negative tweets, neutral words such as “just” and “day” are among the most frequent. Even though these are genuinely high-frequency words, it is difficult to say that they are all important words that characterise the negative class.

Let’s also take a look at the top 50 positive tokens on a bar chart.

y_pos = np.arange(50)
plt.figure(figsize=(12,10))
plt.bar(y_pos, term_freq_df2.sort_values(by='positive', ascending=False)['positive'][:50], align='center', alpha=0.5)
plt.xticks(y_pos, term_freq_df2.sort_values(by='positive', ascending=False)['positive'][:50].index,rotation='vertical')
plt.ylabel('Frequency')
plt.xlabel('Top 50 positive tokens')
plt.title('Top 50 tokens in positive tweets')

[Image: bar chart of the top 50 tokens in positive tweets]

Again, neutral words like “just” and “day” are quite high up in the rankings.

What if we plot the negative frequency of a word on X-axis, and the positive frequency on Y-axis?

import seaborn as sns
plt.figure(figsize=(8,6))
ax = sns.regplot(x="negative", y="positive",fit_reg=False, scatter_kws={'alpha':0.5},data=term_freq_df2)
plt.ylabel('Positive Frequency')
plt.xlabel('Negative Frequency')
plt.title('Negative Frequency vs Positive Frequency')

[Image: scatter plot of negative frequency vs positive frequency]

Most of the words are below 10,000 on both the X-axis and the Y-axis, and we cannot see a meaningful relation between negative and positive frequency.

In order to come up with a meaningful metric which can characterise important tokens in each class, I borrowed a metric presented by Jason Kessler at PyData Seattle 2017. In the talk, he presented a Python library called Scattertext. Even though I did not make use of the library itself, the metrics Scattertext uses as a way of visualising text data are very useful for filtering meaningful tokens from the frequency data.

Let’s explore what we can get out of the frequency of each token. Intuitively, if a word appears more often in one class than in the other, this can be a good measure of how well the word characterises that class. In the code below I named this ‘pos_rate’, and as you can see from the calculation, it is defined as

pos_rate = positive frequency / (positive frequency + negative frequency)

term_freq_df2['pos_rate'] = term_freq_df2['positive'] * 1./term_freq_df2['total']
term_freq_df2.sort_values(by='pos_rate', ascending=False).iloc[:10]

[Image: top 10 tokens by pos_rate]

Words with the highest pos_rate have zero frequency in the negative tweets, but the overall frequency of these words is too low to consider them a reliable guideline for positive tweets.
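To see the problem with toy numbers (entirely made up): a token that appears just once, only in positive tweets, gets a perfect pos_rate, while a genuinely common positive word scores lower.

```python
import pandas as pd

# Made-up counts: 'rare_token' appears once, only in positive tweets,
# yet its pos_rate beats a word seen 5,000 times in positive tweets.
df = pd.DataFrame({'negative': [0, 3000], 'positive': [1, 5000]},
                  index=['rare_token', 'good'])
df['total'] = df.negative + df.positive
df['pos_rate'] = df.positive / df.total
print(df.pos_rate)  # rare_token 1.000, good 0.625
```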

Another metric is how frequently a word occurs within the class. This is defined as

pos_freq_pct = positive frequency of the word / sum of positive frequencies over all words

term_freq_df2['pos_freq_pct'] = term_freq_df2['positive'] * 1./term_freq_df2['positive'].sum()
term_freq_df2.sort_values(by='pos_freq_pct', ascending=False).iloc[:10]

[Image: top 10 tokens by pos_freq_pct]

But since pos_freq_pct is just the frequency scaled by the total sum of frequencies, the pos_freq_pct rank is exactly the same as the rank of the plain positive frequency.

What we can do now is combine pos_rate and pos_freq_pct into a metric which reflects both. Even though each can in principle take values ranging from 0 to 1, pos_rate actually spans the full range from 0 to 1, while all the pos_freq_pct values are squashed below 0.015. If we take the arithmetic average of these two numbers, pos_rate will be too dominant, and the result will not reflect both metrics effectively.

So here we use the harmonic mean instead of the arithmetic mean. “Since the harmonic mean of a list of numbers tends strongly toward the least elements of the list, it tends (compared to the arithmetic mean) to mitigate the impact of large outliers and aggravate the impact of small ones.” The harmonic mean H of positive real numbers x1, x2, …, xn is defined as

H = n / (1/x1 + 1/x2 + … + 1/xn)
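A quick numeric illustration (with arbitrary values) of how strongly the harmonic mean is pulled toward the smaller element compared to the arithmetic mean:

```python
from scipy.stats import hmean

# one large value and one small value, like pos_rate vs pos_freq_pct
a, b = 0.9, 0.01
print((a + b) / 2)    # arithmetic mean: 0.455
print(hmean([a, b]))  # harmonic mean: ~0.0198, close to the small value
```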

from scipy.stats import hmean

term_freq_df2['pos_hmean'] = term_freq_df2.apply(
    lambda x: (hmean([x['pos_rate'], x['pos_freq_pct']])
               if x['pos_rate'] > 0 and x['pos_freq_pct'] > 0 else 0), axis=1)

term_freq_df2.sort_values(by='pos_hmean', ascending=False).iloc[:10]

[Image: top 10 tokens by pos_hmean]

The harmonic mean rank seems to be the same as the pos_freq_pct rank. In the harmonic mean, the impact of the small value (in this case, pos_freq_pct) is aggravated so much that it ends up dominating the mean. This is again exactly the same ranking as the plain frequency and doesn’t provide a very meaningful result.

What we can try next is to get the CDF (Cumulative Distribution Function) value of both pos_rate and pos_freq_pct. The CDF can be described as “the distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x”. By calculating the CDF value, we can see where a given pos_rate or pos_freq_pct value lies in its distribution in cumulative terms. In the result of the code below, we can see the word “welcome” with a pos_rate_normcdf of 0.995625 and a pos_freq_pct_normcdf of 0.999354. This means roughly 99.56% of the tokens take a pos_rate value less than or equal to 0.91535, and roughly 99.94% take a pos_freq_pct value less than or equal to 0.001521.

Next, we calculate a harmonic mean of these two CDF values, as we did earlier. By calculating the harmonic mean, we can see that pos_normcdf_hmean metric provides a more meaningful measure of how important a word is within the class.

from scipy.stats import norm

def normcdf(x):
    return norm.cdf(x, x.mean(), x.std())

term_freq_df2['pos_rate_normcdf'] = normcdf(term_freq_df2['pos_rate'])
term_freq_df2['pos_freq_pct_normcdf'] = normcdf(term_freq_df2['pos_freq_pct'])
term_freq_df2['pos_normcdf_hmean'] = hmean([term_freq_df2['pos_rate_normcdf'], term_freq_df2['pos_freq_pct_normcdf']])
term_freq_df2.sort_values(by='pos_normcdf_hmean', ascending=False).iloc[:10]

[Image: top 10 tokens by pos_normcdf_hmean]

The next step is to apply the same calculation to the negative frequency of each word.

term_freq_df2['neg_rate'] = term_freq_df2['negative'] * 1./term_freq_df2['total']
term_freq_df2['neg_freq_pct'] = term_freq_df2['negative'] * 1./term_freq_df2['negative'].sum()
term_freq_df2['neg_hmean'] = term_freq_df2.apply(
    lambda x: (hmean([x['neg_rate'], x['neg_freq_pct']])
               if x['neg_rate'] > 0 and x['neg_freq_pct'] > 0 else 0), axis=1)
term_freq_df2['neg_rate_normcdf'] = normcdf(term_freq_df2['neg_rate'])
term_freq_df2['neg_freq_pct_normcdf'] = normcdf(term_freq_df2['neg_freq_pct'])
term_freq_df2['neg_normcdf_hmean'] = hmean([term_freq_df2['neg_rate_normcdf'], term_freq_df2['neg_freq_pct_normcdf']])
term_freq_df2.sort_values(by='neg_normcdf_hmean', ascending=False).iloc[:10]

Now let’s see how the values look on a plot. For comparison, I will first plot neg_hmean vs pos_hmean, and then neg_normcdf_hmean vs pos_normcdf_hmean.

plt.figure(figsize=(8,6))
ax = sns.regplot(x="neg_hmean", y="pos_hmean",fit_reg=False, scatter_kws={'alpha':0.5},data=term_freq_df2)
plt.ylabel('Positive Rate and Frequency Harmonic Mean')
plt.xlabel('Negative Rate and Frequency Harmonic Mean')
plt.title('neg_hmean vs pos_hmean')

[Image: scatter plot of neg_hmean vs pos_hmean]

Not much different from the plain positive and negative frequencies. How about the CDF harmonic mean?

plt.figure(figsize=(8,6))
ax = sns.regplot(x="neg_normcdf_hmean", y="pos_normcdf_hmean",fit_reg=False, scatter_kws={'alpha':0.5},data=term_freq_df2)
plt.ylabel('Positive Rate and Frequency CDF Harmonic Mean')
plt.xlabel('Negative Rate and Frequency CDF Harmonic Mean')
plt.title('neg_normcdf_hmean vs pos_normcdf_hmean')

[Image: scatter plot of neg_normcdf_hmean vs pos_normcdf_hmean]

It seems like the harmonic mean of rate CDF and frequency CDF has created an interesting pattern on the plot. If a data point is near to the upper left corner, it is more positive, and if it is closer to the bottom right corner, it is more negative.

It is good that the metric has extracted some meaningful insight out of frequency, but with text data, showing every token as just a dot loses the important information of which token each data point represents. With 10,000 points, it is difficult to annotate them all on the plot. I tried several methods and concluded that directly annotating the data points on the plot is not very practical or feasible.

So I took the alternative route of an interactive plot with Bokeh. Bokeh is an interactive visualisation library for Python which creates graphics in the style of D3.js. Bokeh can output the result in HTML format or render it within the Jupyter Notebook. Below is the plot created with Bokeh.

from bokeh.plotting import figure
from bokeh.io import output_notebook, show
from bokeh.models import LinearColorMapper
from bokeh.models import HoverTool
output_notebook()
color_mapper = LinearColorMapper(palette='Inferno256', low=min(term_freq_df2.pos_normcdf_hmean), high=max(term_freq_df2.pos_normcdf_hmean))
p = figure(x_axis_label='neg_normcdf_hmean', y_axis_label='pos_normcdf_hmean')
p.circle('neg_normcdf_hmean', 'pos_normcdf_hmean', size=5, alpha=0.3,
         source=term_freq_df2,
         color={'field': 'pos_normcdf_hmean', 'transform': color_mapper})
hover = HoverTool(tooltips=[('token', '@index')])
p.add_tools(hover)
show(p)

[Image: interactive Bokeh scatter plot of neg_normcdf_hmean vs pos_normcdf_hmean]

Since the interactive plot can’t be embedded in a Medium post, I attached a picture; the Bokeh plot somehow does not render on GitHub either, so I am sharing it via a link you can access.

With the above Bokeh plot, you can see which token each data point represents by hovering over it. For example, the points in the top left corner are tokens like “thank”, “welcome”, “congrats”, etc., and some of the tokens in the bottom right corner are “sad”, “hurts”, “died”, “sore”, etc. The dots are coloured with the “Inferno256” colour map, so yellow is the most positive and black the most negative, with the colour going gradually from black to purple to orange to yellow as it goes from negative to positive.

Depending on which model I will use later for classification of positive and negative tweets, this metric can also come in handy.

The next phase of the project is model building: in this case, a classifier that will classify each tweet as either negative or positive. I will keep sharing my progress through Medium.

Thank you for reading. You can find the Jupyter Notebook at the link below.

https://github.com/tthustla/twitter_sentiment_analysis_part3/blob/master/Capstone_part3-Copy2.ipynb