We Are Just a Loquacious Lot: The 2019 Kenyan Social Beat

Herman Wandabwa
8 min read · Jan 22, 2020
The city of Nairobi at night as seen from the International Space Station.

More often than not, I find myself overthinking about Kenya, my motherland. I think of how ambitious, sometimes combative and often kind-hearted the citizenry are in general. Not sure if Miguna Miguna fits this bill but, probably yes, in his own right. This gave me the idea of coming up with a data-driven description of our nature as Kenyans, at least for 2019. The idea actually popped up as I thought hard about my life aboard a Boeing 717 flight from Honolulu to Kahului, Maui, for the 53rd HICSS conference. These planes are sometimes not for the faint-hearted. For those in academia and research, more so in the Information Systems and Computer Science domains, please make a point of attending it. This year's was the 53rd edition and was held at the Grand Wailea, a true definition of the island's opulence. A copy of my paper is here and is about profiling online sports bettors for interested parties. I'm not sure if it's accessible outside some academic networks, but do try. One other interesting presentation, by Jevin West, titled Calling BS in an Age of Misinformation, is worth listening to.

Back to the questions of interest:-

1. Are we able to deduce the nature of Kenyans based on their daily chatter? Do they talk about substantive issues?
2. Are they topically consistent in their talk over time?

To sum up the business problem in a few words: “What do Kenyans actually talk about in a typical year? Does it define who they are?” 2019 was the year of interest in this instance.

Data Collection

For those who have read my articles, I'm sure you'll guess tweets to be the dataset of choice. A friend was quite concerned as to why I incline towards tweets when most tweeters are middle class and do not mirror the entire society, thus introducing some bias into the dataset. Unfortunately, there is no indicative study to support his assertion, more so in the Kenyan domain. All in all, tweets evolve in terms of topics, largely based on the news of the day. Weighty issues in the Kenyan domain were highlighted throughout the year and were thus expected to dominate the citizenry's chatter.

I collected 1,210,969 unique tweets in this study, split across two CSVs as all the records could not fit in one. Just as I indicated in this article, please have a look at Jefferson's GitHub repo on how to download tweets based on different parameters.

Collecting up to 10M tweets disseminated from or near Nairobi in 2019 is as simple as the code below. I replaced Nairobi with Kenya in my setup to collect tweets originating from anywhere in Kenya.
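Jefferson's scraper essentially composes a Twitter advanced-search query from a location, a date range and a tweet cap. As a minimal, library-free sketch of those parameters (the helper name here is illustrative; the actual download uses the scraper itself):

```python
# Hedged sketch: builds the advanced-search operators a scraper like
# Jefferson's uses. build_query is an illustrative name, not from the repo.
def build_query(near, since, until):
    """Compose Twitter's advanced-search operators for a location + date range."""
    return f"near:{near} since:{since} until:{until}"

# Swap "Nairobi" for "Kenya" to widen the net to the whole country
query = build_query("Kenya", "2019-01-01", "2020-01-01")
print(query)  # near:Kenya since:2019-01-01 until:2020-01-01
```

The scraper then pages through the search results for this query until it hits the tweet cap.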

Processing and Modelling

Straight to the process: I imported the necessary packages and the related CSV files. The process, just like in my other articles, follows the pattern below. I had two large CSVs that I could not combine externally, so concatenating them once imported as dataframes was the plausible option. The process is shown in the below code.
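The concatenation step boils down to `pd.concat`. A minimal sketch, with the file names as assumptions and small in-memory stand-ins so it runs end to end:

```python
import io
import pandas as pd

# In the real run the two large exports would be read from disk, e.g.:
# df1 = pd.read_csv("kenya_tweets_part1.csv")   # file names are illustrative
# df2 = pd.read_csv("kenya_tweets_part2.csv")

# Small stand-ins for the two CSVs:
df1 = pd.read_csv(io.StringIO("text,date\nhello nairobi,2019-01-02\n"))
df2 = pd.read_csv(io.StringIO("text,date\nhabari kenya,2019-07-15\n"))

# ignore_index=True renumbers the rows so the combined frame has a clean index
tweets = pd.concat([df1, df2], ignore_index=True)
print(len(tweets))  # total rows across both files
```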

What you'll note at the end is that about 37 columns will have been added to the initial ones, most of them empty. Therefore, I chose just the essential ones as opposed to loading the entire dataframe into memory, as below. In the same process, I removed duplicates, empty tweets, etc.
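Column selection and the dedupe/empty-row cleanup can be sketched as below; the column names are assumptions, not the original export's schema:

```python
import pandas as pd

# Hedged sketch: keep only the essential columns out of the ~37 extras,
# then drop duplicate and empty tweets.
keep = ["date", "text", "favorites", "retweets"]

df = pd.DataFrame({
    "date": ["2019-03-01", "2019-03-01", "2019-05-09"],
    "text": ["habari kenya", "habari kenya", None],
    "favorites": [3, 3, 0],
    "retweets": [1, 1, 0],
    "junk_col": [None, None, None],   # stands in for the mostly-empty extras
})

df = df[keep]                          # drop the non-essential columns
df = df.drop_duplicates()              # remove duplicate tweets
df = df.dropna(subset=["text"])        # remove empty tweets
print(df.shape)
```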

It's always a good idea to be a bit curious about the dataset. I was interested in finding the most favourited and most retweeted tweets of 2019 in the Kenyan Twitterspace. The process is simple: every time a tweet is favourited or retweeted, Twitter adds to its count score in the dataset. The one with the highest score definitely struck a nerve with many users in the Kenyan context. Coincidentally, the most retweeted and most favourited tweets are from the same user. I don't know the user personally, but @SylviaWanjira_'s tweet below

had 67,505 favourites and 16,838 retweets respectively, as shown in the code below.
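Looking up those rows is a one-liner with `idxmax`. A sketch, with the column names as assumptions:

```python
import pandas as pd

# Hedged sketch: column names are assumptions based on the post.
df = pd.DataFrame({
    "username": ["SylviaWanjira_", "another_user"],
    "text": ["most loved tweet", "ordinary tweet"],
    "favorites": [67505, 120],
    "retweets": [16838, 40],
})

top_fav = df.loc[df["favorites"].idxmax()]  # row with the highest favourite count
top_rt = df.loc[df["retweets"].idxmax()]    # row with the highest retweet count
print(top_fav["username"], top_fav["favorites"], top_rt["retweets"])
```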

The cleanup process I shared earlier is as in the below code:-
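The original cleanup cell is not reproduced here, so below is a minimal sketch of a typical tweet-cleanup pass; the exact regexes the notebook used are an assumption:

```python
import re

# Hedged sketch of a standard tweet-cleanup function: lowercase, then strip
# URLs, @mentions, hashtag symbols and leftover punctuation.
def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", "", text)   # strip URLs
    text = re.sub(r"@\w+", "", text)               # strip @mentions
    text = re.sub(r"#", "", text)                  # keep hashtag words, drop '#'
    text = re.sub(r"[^a-z\s]", "", text)           # drop digits and punctuation
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

print(clean_tweet("RT @user: Nairobi is GREAT!! #Kenya https://t.co/xyz"))
# rt nairobi is great kenya
```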

Tokenization

I'll be looking at terms and patterns in the tweets. The best way to go about this is to tokenize the cleaned-up text, which simply means dividing each cleaned-up tweet into individual words. The below function encapsulates this process:-

The function to tokenize the tweets is then called in the below code:-

The output of this process should look as below.
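A minimal sketch of the tokenizer and the call that applies it; since the cleanup left plain lowercase text, a whitespace split suffices here, though the original may have used a library tokenizer:

```python
import pandas as pd

# Hedged sketch: column names are assumptions.
def tokenize(text):
    """Split a cleaned-up tweet into individual word tokens."""
    return text.split()

df = pd.DataFrame({"clean_text": ["kenya government news",
                                  "arsenal liverpool easter",
                                  ""]})
df["tokens"] = df["clean_text"].apply(tokenize)

# Empty records add nothing to the model, so drop them
df = df[df["tokens"].map(len) > 0]
print(df["tokens"].tolist())
```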

The tokens column above will be the input to our model. Empty records will be removed as they add no value to the model.

Quarterly Data

The best way, in my opinion, to know what users talk about over time, and by extension the nature of Kenyans, is to divide the data into quarters. Thus, the data is grouped into three-month blocks from January to December 2019. To get these subsets, the date column is converted and the dataframe's quarter accessor is used, as in the below code.
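In pandas this is `pd.to_datetime` followed by the `.dt.quarter` accessor. A sketch, with the date column name as an assumption:

```python
import pandas as pd

# Hedged sketch: parse the date column, then derive the quarter (1..4)
df = pd.DataFrame({"date": ["2019-02-14", "2019-05-01", "2019-11-30"]})
df["date"] = pd.to_datetime(df["date"])
df["quarter"] = df["date"].dt.quarter      # 1 = Jan-Mar ... 4 = Oct-Dec

# One subset per quarter, e.g. Q2:
q2 = df[df["quarter"] == 2]
print(df["quarter"].tolist())  # [1, 2, 4]
```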

Converting Tokens to Lists and Modelling

For the purposes of modelling, each quarterly subset was converted to a list for easier manipulation, as shown in the below code. That's the most convenient shape from which to build a dictionary and eventually a model.
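Pulling a quarter's token column out into a plain list of lists (the shape a topic-model dictionary expects) can be sketched as below; column names are assumptions:

```python
import pandas as pd

# Hedged sketch: select one quarterly subset and convert its token
# column to a list of token lists.
df = pd.DataFrame({
    "quarter": [1, 1, 2],
    "tokens": [["kenya", "news"], ["nairobi", "traffic"], ["easter", "arsenal"]],
})

q1_docs = df[df["quarter"] == 1]["tokens"].tolist()
print(q1_docs)  # [['kenya', 'news'], ['nairobi', 'traffic']]
```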

N-Gram Modelling

N-grams are simply contiguous sequences of n items from text or speech. The items can be phonemes, syllables, letters, words or base pairs. Tweets are generally not linguistically rich, given the sparseness of their vocabulary. Therefore, it's hard to get commonalities across three contiguous words, i.e. tri-grams. Bi-grams (co-occurrences of two common words in the dataset) were chosen for building the topic models that represent the chatter in the Kenyan Twitterspace.

A corpus of bi-grams from the dictionary of words in each of the subsets is generated as below. Thereafter, a corpus that’ll be fed in the topic model is then generated.

Latent Dirichlet Allocation (LDA) was the topic modelling approach in this case. An in-depth guide to this modelling process is described here. Training the LDA models for each of the subsets involves feeding in the corpus and defining the number of topics. I chose the 10 most representative topics for each quarter, though LDA gives you the option to define as many as you want. There are, however, ways of determining the optimal number of topics for a dataset. You can approach it as a clustering problem and look at the elbow metric: the elbow point should indicate the most probable number of topics, the same way it does in clustering. This can then be the input to the LDA model.

The output of the below code after training, which takes some time depending on the dictionary size and computational power, should be as below for the topics in Quarter 2 (April to June 2019):-

Each topic is represented by its 10 most representative words. As per the model, Classic 105, Maina and Kingangi in Topic 2, and Easter combined with soccer teams like Arsenal and Liverpool in Topic 6, are the more representative ones, i.e. they have dominant words that define them.

Topical Visualizations

The above outputs, fine as they are, are quite challenging to interpret. Therefore, I made use of the pyLDAvis package to visualize the generated topics better. The process is as simple as in the below code:-

This type of visualization is quite unique since it not only shows the dominant terms in a topic but also the influence each word has over the topic.

Topic 1, Quarter 1 Output

Topic 1 is the most dominant topic in the subset, as it has the largest circle among the ten. The most relevant term here is Kenya, based on its overall frequency in the dataset as well as within this topic. Government is another dominant term, and it can be assumed that the talk in this topic was about the Kenyan government or something close to it. One downside of this approach is that the user has to figure out the final topical interpretations, which may differ from individual to individual. You can hover the mouse over the circles, and the individual words and their influence over the topic will be shown as above. The relevance metric can be adjusted too; influential terms are shown when the relevance metric is set close to 1. Please click on the below link to get a feel of how the adjustments and output look. Remember to hover the cursor over each of the shapes to find the dominant words under each topic in the links below.

http://www.beyondanalyticx.com/external/lda_q1.html

The rest of the visualizations for the different subsets are in the below links. Please take time to go through each one of them for a better overview of the topics that Kenyans discussed over the year, more so the dominant ones.

Quarter 2 Outputs

http://www.beyondanalyticx.com/external/lda_q2.html

Quarter 3 Outputs

http://www.beyondanalyticx.com/external/lda_q3.html

Quarter 4 Outputs

http://www.beyondanalyticx.com/external/lda_q4.html

Conclusion

The discourse in the subsets is quite consistent across the quarters. From the outputs, we can infer the below points:-

  1. The most dominant topic across the subsets is, to a large extent, about Kenya and the government. However, no specifics can be noted, i.e. which issue about the very government, as most of the surrounding terms are not succinct enough to pin the topic down.
  2. Quarter 4 topics are quite balanced and are about festivities, i.e. Christmas, which was expected. The government was safe at this time, as there was no dominant chatter around it. Everybody was busy enjoying the holidays.
  3. To a large extent, Kenyans just tend to talk, and a lot, and can thus be termed loquacious. I didn't know such a term even existed. From the chatter, we just tend to talk about the same things over and over, with very few specifics pointed out around the same topics.
  4. KOT (Kenyans on Twitter) has been a lethal online force when it comes to articulating issues affecting Kenya, so I expected more topical inclinations that mirror their chatter. However, this didn't seem to be the case. The probable reason is that their tweets were very few compared to the rest, so their topical relevance was watered down over time.
