This article is the second part of a series started here in which we want to build a Machine Learning model to predict suicidal tendencies. The outcome will be a web app where users will be invited to describe how they feel. Depending on the prediction result, users will be shown empathy and invited to take appropriate action to seek help.
In this article, we will explore the choices I made to further clean and preprocess the data scraped in the first part. We will also get more familiar with our data.
Disclaimer: This blog post on suicidal tendency detection is for educational purposes only. It is not meant to be a reliable, highly accurate mental illness diagnosis system, nor has it been professionally or academically vetted.
The Data Science Process is made of several steps, as illustrated in the picture above, taken from Andrew Ng’s excellent specialization course on Machine Learning Engineering For Production.
In the previous article, we defined the scope of our project and aggregated data.
In this article, we will prepare our dataset before fitting a Machine Learning model.
To take full advantage of this article, you can download the complete code here.
Please note that the whole notebook might take some time to run in its entirety, close to 1 hour, depending on your hardware.
HTML Special Character Removal
When starting to work on this second part, I realized that despite using the Reddit Cleaner package previously, some HTML special characters remained in the dataset. We can remove them easily with the html Python package. If you are curious about the html package, you can find the documentation here.
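As a quick illustration, here is a minimal sketch of how the standard-library html module can unescape those special characters, assuming the posts live in a pandas DataFrame (the column names are hypothetical):

```python
import html

import pandas as pd

# Hypothetical DataFrame holding the scraped posts.
df = pd.DataFrame({"post": ["I can&#39;t sleep &amp; I feel alone"]})

# html.unescape turns HTML entities (&amp;, &#39;, ...) back into plain characters.
df["prep_txt"] = df["post"].apply(html.unescape)
print(df["prep_txt"].iloc[0])  # I can't sleep & I feel alone
```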
From this point onwards, all cleaning and preprocessing tasks will be done in a new column called prep_txt, so that we can check and improve the processed text against the original post.
Spelling Correction For Text Improvement
"Garbage in, garbage out" is often said to emphasize that even the most powerful, latest state-of-the-art Machine Learning model will not give you anything good if your data is not of good quality. This is why data cleaning and text preprocessing are essential steps before the modeling phase.
While looking at the data, I quickly noticed misspellings, non-standard abbreviations, and, in places, SMS-style spelling.
Leaving the text as it is would lead to a poor model and poor results, because the algorithm can only learn from the data it is fed. If a particular word is spelled in three different ways, including abbreviations, it will produce as many vectors as there are spellings instead of a single one.
To correct the spelling, I could brute force it manually, which would take a vast amount of time but is still doable on 20,000 rows of data. Another way is to do it algorithmically.
Spelling Correction With pyspellchecker
There is a lot of research on spelling correction and many exciting solutions to explore.
I first found Peter Norvig’s interesting solution and the pyspellchecker library, which was my initial choice. Data Science is an iterative process, and I wanted to try this approach before looking for the next one.
However, the algorithm was still running after more than two and a half hours, so I decided it would not be suitable for my use case. If it takes that long on 20,000 rows, it will not be practical on a dataset with millions of rows, especially if I want to experiment with different hypotheses quickly.
Spelling Correction With SymSpell
This is how I found out about Wolf Garbe’s SymSpell library. His solution is more accurate and a million times faster than pyspellchecker.
SymSpell is an algorithm for finding, in a short time, all strings within a fixed edit distance from an extensive list of strings.
The Symmetric Delete spelling correction algorithm reduces the complexity of generating edit candidates and of the dictionary lookup for a given edit distance.
It is six orders of magnitude faster than the standard approach (Peter Norvig’s, with deletes + transposes + replaces + inserts) and is language independent.
The SymSpell algorithm exploits the fact that the edit distance between two terms is symmetrical: we can meet in the middle by transforming the correct dictionary terms into erroneous strings and converting the incorrect input term into the right strings.
For more information, you can check the GitHub repository of the project here.
Another interesting point: TextBlob also has a spellcheck() function which, contrary to SymSpell, considers the context. Still, according to the benchmarks I found, the accuracy of TextBlob is lower than that of SymSpell.
In this project, I decided to use SymSpell.
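For reference, here is a minimal sketch of sentence-level correction, assuming the symspellpy Python port and the English frequency dictionary it ships with (not necessarily the exact setup used in the project):

```python
import pkg_resources
from symspellpy import SymSpell

# Load the English frequency dictionary bundled with symspellpy.
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# lookup_compound corrects a whole sentence, handling split and merged words.
text = "i cant beleive how tierd i am"
suggestions = sym_spell.lookup_compound(text, max_edit_distance=2)
print(suggestions[0].term)  # e.g. "i cant believe how tired i am"
```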
Spelling Correction With NeuSpell
I also found an interesting paper about the NeuSpell library, which combines different models and reaches higher accuracy than SymSpell. However, I decided to first explore the results with SymSpell, which is my solution of choice. If we eventually need to iterate through the data again, I might explore NeuSpell for comparison.
In the meantime, you can read the paper here and access the project on GitHub here.
Notes On Spelling Correction, Numbers, And Abbreviations
While SymSpell is an efficient algorithm, it is not a perfect solution, and manual correction is still required. For example, if bf appears in the text, SymSpell will correct it to of instead of boyfriend. This is because of is the closest word to bf in terms of spelling and, being a very common word, has the highest probability of being chosen.
Since SymSpell cannot interpret abbreviations, I decided to build a custom dictionary to correct abbreviations and words that were not fixed or processed as desired.
To avoid capital letters and as a step towards text normalization, I lowercased the text dataset.
I realized that removing the punctuation would also remove the & character, so a&e would become ae and would not be corrected to accident and emergency. I solved this by applying my custom dictionary before punctuation removal.
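Here is a minimal sketch of such a custom dictionary; the entries and helper function are hypothetical, and the replacement runs on the lowercased text before punctuation is stripped:

```python
import re

# Hypothetical entries: abbreviations SymSpell cannot interpret on its own.
custom_dict = {
    "bf": "boyfriend",
    "gf": "girlfriend",
    "a&e": "accident and emergency",
    "idk": "i do not know",
}

def expand_abbreviations(text: str) -> str:
    """Replace whole-word abbreviations before punctuation removal."""
    for abbr, replacement in custom_dict.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", replacement, text)
    return text

print(expand_abbreviations("my bf took me to a&e last night"))
# my boyfriend took me to accident and emergency last night
```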
We also need to convert numbers to words because SymSpell would otherwise remove them. I often see numbers removed outright with the justification that they add no value to the data. However, I prefer to be careful here and make sure data is irrelevant before discarding it, which is why I decided to convert numbers to words with the inflector package. The documentation can be found here. Some numbers were still not processed, so I added them manually to my initial custom dictionary.
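The sketch below illustrates the idea with the inflect library, used here only as a stand-in assumption rather than the exact package mentioned above:

```python
import re

import inflect

p = inflect.engine()

def numbers_to_words(text: str) -> str:
    """Replace standalone digits with their word form, e.g. '17' -> 'seventeen'."""
    return re.sub(r"\d+", lambda match: p.number_to_words(match.group()), text)

print(numbers_to_words("i am 17 and i have been sad for 2 years"))
# i am seventeen and i have been sad for two years
```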
Data Visualization
Visualizing data is essential to better understand the problem at hand. Plotting the distribution of word counts could not show, at a glance, the most important words in each category, and I found word clouds a better way to tell the story of each category. Below is the word cloud of the depression class:
And here is the word cloud for the SuicideWatch class:
The words appearing in the SuicideWatch class are, not surprisingly, overall darker and more brutal than those in the depression category. Both word clouds show words related to emotions and feelings, and both paint sadness and hurt, but the depression word cloud does show some positive words related to love and does not include words associated with the immediate end of one’s life.
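For reference, a word cloud like the ones above can be produced with the wordcloud package; this is a minimal sketch with a hypothetical DataFrame and column names:

```python
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

# Hypothetical DataFrame; the real one holds roughly 20,000 preprocessed posts.
df = pd.DataFrame({
    "class": ["depression", "depression"],
    "prep_txt": ["i feel tired and alone", "nothing makes me happy anymore"],
})

# Concatenate all posts of one class and let WordCloud compute word frequencies.
depression_text = " ".join(df.loc[df["class"] == "depression", "prep_txt"])
wc = WordCloud(width=800, height=400, background_color="white").generate(depression_text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```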
Feature Engineering
The next step is to perform feature engineering. I wondered if we could build relevant features based on the text itself; for example, whether the length of the text tends to differ between the depression and SuicideWatch categories. To explore this, I created a feature counting the number of words per post and plotted it to evaluate the feature.
The plot shows that posts from SuicideWatch tend to be shorter, but we also notice that the distribution is right-skewed, with many very short posts.
Therefore, I applied a logarithmic transformation to reduce the skewness of the distribution, which will help the model learn from our data.
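A minimal sketch of both steps, with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with the preprocessed posts.
df = pd.DataFrame({"prep_txt": ["i feel so tired", "i do not want to be here anymore"]})

# Number of words per post.
df["word_count"] = df["prep_txt"].str.split().str.len()

# log1p reduces the right skew (and handles very short posts gracefully).
df["log_word_count"] = np.log1p(df["word_count"])
print(df)
```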
There are probably more options to explore, but let’s experiment with this feature first. The difference between the two categories is not that large, so the weight of this feature as a predictor might not be significant.
Text Preprocessing With Stemming and Lemmatization
At this stage, the preprocessing is not entirely done yet. We need to apply stemming and lemmatization before vectorizing our data, that is, before converting the words into numbers so that the Machine Learning algorithms can learn from them.
For each of these steps, we will create a different column. The reason is that we will not feed our model with every preprocessing method; we will test each of them to see which one performs better. I discarded the n-gram approach because I do not see how arbitrary bi-grams or tri-grams would help build an accurate model, whereas associations obtained through correlations of words would be more suitable.
Stemming Methods
Stemming is the process of reducing a word to its base word or stem so that words of a similar kind fall under a common stem. It is an essential step of Natural Language Processing that allows a model to make sense of the words and learn something from the data.
Stemming With Porter Stemmer
Porter’s stemming method is one of the most common algorithms for reducing a word to its stem or root.
After using Porter’s stemming method and TF-IDF, we get 14,623 features.
Stemming With Snowball Stemmer
The Snowball stemmer is another way to perform stemming. It is also known as Porter2, as it is an improvement of the Porter stemming method, and it is computationally faster.
The Snowball method is often recommended because it is faster than the Porter stemmer and offers good results; Porter himself acknowledged that it is an improvement over his initial algorithm.
After using the Snowball stemmer and TF-IDF, we get 14,376 features.
Stemming With Lancaster Stemmer
The Lancaster stemmer is a more aggressive method than the classic Porter stemmer. My take is that it may increase the frequency of some keywords and offer higher accuracy, based on this. But it might also trim words down too much, so that many short words become obfuscated, which would lead to a decrease in accuracy. At this stage, it is only a hypothesis for which we will have the answer in the following article.
After performing Lancaster stemming and TF-IDF, we get 11,843 features.
The main drawback of stemming is that it does not consider whether a word is being used as a noun or a verb in context. For this reason, lemmatization might be a better option, as it takes this into account and returns a different output depending on whether the word is used as a verb or a noun.
For this project, I used the NLTK package to perform stemming. The documentation is available here.
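As a small illustration (not the project’s exact code), the three stemmers are all available in NLTK and can be compared side by side:

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Compare how aggressively each algorithm trims the same words.
for word in ["depressed", "already", "nothing", "particular"]:
    print(word, porter.stem(word), snowball.stem(word), lancaster.stem(word))
```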
Lemmatization
The lemmatization step usually brings higher accuracy than stemming because it considers the context and performs a morphological analysis of the words. Still, it is also a more computationally expensive solution; if speed is an issue, lemmatization might not be relevant for your project. In our case, however, we will explore the results to see if it helps build a good Machine Learning model.
We will use SpaCy, as it is superior to NLTK in terms of speed and accuracy. If you want to learn more about SpaCy, you can find the documentation here.
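A minimal sketch of lemmatization with SpaCy, assuming the small English model en_core_web_sm is installed (not necessarily the model used in the project):

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("he already told me he was depressed")
print(" ".join(token.lemma_ for token in doc))
```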
The lemmatization process, combined with TF-IDF, gives us 18,167 features.
To illustrate the principles mentioned above, let’s visualize the output of a single post for each method.
As input, let’s use an extract of a post from the depression Subreddit, after SymSpell correction and punctuation removal:
so i have been with my boyfriend for five months, and he already told me he was depressed to this week nothing particular happened but i can now feel he is bothered by it
After Porter Stemming:
so i have been with my boyfriend for five month and he alreadi told me he wa depress to thi week noth particular happen but i can now feel he is bother by it
After Snowball Stemming:
so i have been with my boyfriend for five month and he alreadi told me he was depress to this week noth particular happen but i can now feel he is bother by it
After Lancaster Stemming:
so i hav been with my boyfriend for fiv month and he already told me he was depress to thi week noth particul hap but i can now feel he is both by it
After Lemmatization:
so I have be with my boyfriend for five month and he already tell I he be depressed to this week nothing particular happen but I can now feel he be bother by it
In this example, the difference between the stemming methods is not that big, but as previously mentioned, the Lancaster stemming algorithm is the most aggressive one and can result in a loss of accuracy.
Note how the lemmatization method is able to track the conjugated verbs and their grammatical tenses with reasonable accuracy.
Vectorization With TF-IDF
Once the stemming and lemmatization steps are done, we need to convert the text to numbers.
To perform this step, we can use CountVectorizer or TF-IDF.
CountVectorizer transforms a given text into a vector based on the frequency (count) of each word across the entire text.
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document. It is computed by multiplying the number of times a word appears in a document by the inverse document frequency of that word across a set of documents.
In this project, we will use TF-IDF, as it is likely to increase the accuracy of our model. We will perform this step with the help of the Sklearn library.
As a parameter, we will also remove the stopwords, which helps reduce the dimensions of our model by removing words that are important for language purposes but do not carry much weight in terms of predictive power. We will create a variable for each column.
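A minimal sketch with Sklearn’s TfidfVectorizer, on a hypothetical mini-corpus; in the project, the corpus is one of the preprocessed columns:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for a preprocessed column.
corpus = [
    "i feel so tired and alone",
    "nothing makes me happy anymore",
    "my boyfriend told me he was depressed",
]

# stop_words="english" drops common words with little predictive power.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

print(X.shape)  # (number of posts, number of features kept)
```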
Interestingly, the more aggressive the stemming method, the fewer features we have, which is not necessarily a bad thing. It can be tempting to take advantage of our modern computational power and throw in as many features as possible in the hope of building a good model, but that is not how it works. For now, let’s see what model we can create with our current set of features. Eventually, we might try to simplify the model by algorithmically reducing the number of features during the model optimization stage.
The last step is to concatenate the vectorized DataFrames with the original ones and save them for modeling. However, exporting the DataFrames as CSV files takes a very long time due to the number of rows and columns. Therefore, we save them as feather files, a format optimized for large DataFrames: the file is processed quickly and is much smaller than a CSV file.
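A minimal sketch of the feather export, with a hypothetical DataFrame and file name (the real project saves one DataFrame per preprocessing method):

```python
import pandas as pd

# Hypothetical final DataFrame; in the project it is the concatenation of the
# original columns and the TF-IDF features for one preprocessing method.
df_final = pd.DataFrame({"prep_txt": ["i feel tired", "i am fine"], "label": [1, 0]})

# Feather (backed by pyarrow) is much faster and smaller than CSV for wide DataFrames.
df_final.reset_index(drop=True).to_feather("train_porter.feather")

# Reading it back later is just as quick.
df_final = pd.read_feather("train_porter.feather")
```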
Closing Thoughts
In this article, we explored how to clean text data, including ways to manage abbreviations and misspellings. We also went through different ways of normalizing text data and, finally, converted the text to numbers using TF-IDF.
If you haven’t read the first part of this project, you can check the article here.
Now that we have completed the cleaning and preprocessing of our text data, in the following article we will explore different Machine Learning models and gain some insights into our current hypothesis.
To take full advantage of this article, you can download the complete code here.