Julien's data blog

TOPIC MODELLING - PART 2

What you are about to read is the second part in a series of articles dedicated to Topic Modelling. Please note that, like the previous one, this article contains snippets of text and code from my dissertation thesis: “Information extraction of short text snippets using Topic Modelling algorithms.” The text that follows was written by me in 2021, and will mainly center around the theoretical aspect of Topic Modelling. Research papers, when cited, are properly referenced.

Tue, Dec 29, 2020

INTRODUCTION TO TF-IDF

Before reading the following article, you should probably take a look at my introduction to Topic Modelling. Please note that this blog post contains snippets of text and code from my dissertation thesis: “Information extraction of short text snippets using Topic Modelling algorithms.” While the next articles will likely be more focused on how to extract meaningful topics from unstructured data as well as on some frameworks that can be used to evaluate our results, this article will be mainly focused on the theoretical aspect.

Tue, Dec 1, 2020

USING TWEEPY TO EXTRACT CONTENT FROM TWITTER

This article will show you how to quickly start fetching tweets from any public Twitter handle or hashtag, or get a list of followers and friends (following) for any public Twitter handle.

Hey Twitter, I’d like to get access to your API

In order to get access to the Twitter API, you will need a developer account. After you apply, it might take a little while to get a response from Twitter, but you should be able to easily generate your keys once your request is accepted.

Thu, Oct 15, 2020

COUNTING ELEMENTS IN PANDAS

Many programming languages allow developers to count the elements of a list / array by looping through that list and checking whether each element is already present as a key within a dictionary or map.

    names = ["Ana", "John", "Ana", "Ana", "Mary", "John"]

    def countNames(data):
        # Tally how many times each name appears in data
        result = {}
        for name in data:
            if name in result:
                result[name] += 1
            else:
                result[name] = 1
        print(result)

    countNames(names)

This process can be a bit tedious, and to sort the dictionary by the most commonly found element, we have to add another line of code before our print() statement:
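The excerpt stops before showing that extra line, so what follows is only a minimal sketch of one way the sorting step could look, not necessarily the approach taken in the full post. The function name count_names_sorted is hypothetical, and the example sticks to the Python standard library rather than pandas. It sorts the dictionary items by their count in descending order, then shows collections.Counter as a shorter alternative.

```python
from collections import Counter

names = ["Ana", "John", "Ana", "Ana", "Mary", "John"]

def count_names_sorted(data):
    # Hypothetical helper: tally each name, then order by count (highest first)
    counts = {}
    for name in data:
        counts[name] = counts.get(name, 0) + 1
    # sorted() returns (name, count) pairs; dict() preserves that order
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))

sorted_counts = count_names_sorted(names)
print(sorted_counts)  # {'Ana': 3, 'John': 2, 'Mary': 1}

# Counter does the tallying and the sorting in a single call
most_common = Counter(names).most_common()
print(most_common)  # [('Ana', 3), ('John', 2), ('Mary', 1)]
```

Returning the result instead of printing it inside the function makes the counts easier to reuse, for instance when comparing them with what a dedicated library would produce.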

Thu, Jul 30, 2020

JOYPY, A MATPLOTLIB WRAPPER FOR RIDGELINE PLOTS

Visualising the distribution of a given variable within a dataset is both extremely useful and pretty simple. Whether you choose Python, R, Julia, Excel, or whichever language / framework you prefer, creating a violin plot is probably one of the first things you will learn and then iterate over when working with numerical data.

Ridgeline, or joy plots?

Where things can get a bit more complicated is when trying to compare the density of numerous variables.

Sat, Jul 11, 2020

TOPIC MODELLING - PART 1

What follows is the first in a series of articles dedicated to Topic modelling. Please note that this article contains snippets of text and code from my dissertation thesis: “Information extraction of short text snippets using Topic Modelling algorithms.” While the next articles will likely be more focused on how to extract meaningful topics from unstructured data as well as on some frameworks that can be used to evaluate our results, this article will be mainly centered around the theoretical aspect.

Wed, Jul 1, 2020