This article is the second part of a series started here in which we want to build a Machine Learning model to predict suicidal tendencies. The outcome will be a web app where users will be invited to describe how they feel. Depending on the prediction result, users will be shown empathy and invited to take appropriate action to seek help.
In this article, we will explore the choices I made to further clean and preprocess the data scraped in the first part. We will also get more familiar with our data.
Disclaimer: This blog post on suicidal tendency detection is for educational purposes only. It is not meant to be a reliable, highly accurate mental illness diagnosis system, nor has it been professionally or academically vetted.
The Data Science Process is made of several steps, as illustrated in the picture above, taken from Andrew Ng’s excellent specialization course on Machine Learning Engineering For Production.
In the previous article, we defined the scope of our project and aggregated data.
In this article, we will prepare our dataset before fitting a Machine Learning model.
To take full advantage of this article, you can download the complete code here.
Please note that the whole notebook might take some time to run in its entirety, close to 1 hour, depending on your hardware.
HTML Special Character Removal
When starting to work on this second part, I realized that despite using the Reddit Cleaner package previously, some HTML special characters remained in the dataset. We can remove them easily with the html Python package. If you are curious about the html package, you can find the documentation here.
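As a quick illustration, here is a minimal sketch of how the standard-library html module can unescape those special characters, assuming the posts live in a pandas DataFrame (the column names are hypothetical):

```python
import html

import pandas as pd

# Hypothetical DataFrame holding the scraped posts.
df = pd.DataFrame({"post": ["I can&#39;t sleep &amp; I feel alone"]})

# html.unescape turns HTML entities (&amp;, &#39;, ...) back into plain characters.
df["prep_txt"] = df["post"].apply(html.unescape)
print(df["prep_txt"].iloc[0])  # I can't sleep & I feel alone
```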
From this point onwards, all cleaning and preprocessing tasks will be done in a new column called prep_txt, so that we can check and improve the processed text against the original post.
Spelling Correction For Text Improvement
"Garbage in, garbage out" is often said to emphasize that even the most powerful, latest state-of-the-art Machine Learning model will not give you anything good if your data is not of good quality. This is why data cleaning and text preprocessing are essential steps before the modeling phase.
While looking at the data, I quickly noticed misspellings, non-standard abbreviations, and, in places, SMS-style spelling.
Leaving the text as it is would lead to a poor model and poor results, because the algorithm can only learn from the data it is fed. If a particular word is spelled in three different ways, including abbreviations, it will produce as many vectors as there are spellings instead of a single one.
To correct the spelling, I could brute force it manually, which would take a vast amount of time but is still doable on 20,000 rows of data. Another way is to do it algorithmically.
Spelling Correction With pyspellchecker
There is a lot of research on spelling correction and many exciting solutions to explore.
I first found Peter Norvig’s interesting solution and the pyspellchecker library, which was my initial choice. Data Science is an iterative process, and I wanted to try this approach before looking for the next one.
However, the algorithm was still running after more than two and a half hours, so I decided it would not be suitable for my use case. If it takes that long on 20,000 rows, it will not be practical on a dataset with millions of rows, especially if I want to experiment with different hypotheses quickly.
Spelling Correction With SymSpell
This is how I found out about Wolf Garbe’s SymSpell library. His solution is more accurate and a million times faster than pyspellchecker.
SymSpell is an algorithm for finding, in a short time, all strings within a fixed edit distance from an extensive list of strings.
The Symmetric Delete spelling correction algorithm reduces the complexity of generating edit candidates and of the dictionary lookup for a given edit distance.
It is six orders of magnitude faster than the standard approach (Peter Norvig’s, with deletes + transposes + replaces + inserts) and is language independent.
The SymSpell algorithm exploits the fact that the edit distance between two terms is symmetrical: we can meet in the middle by transforming the correct dictionary terms into erroneous strings and converting the incorrect input term into the right strings.
For more information, you can check the GitHub repository of the project here.
Another interesting point: TextBlob also has a spellcheck() function which, contrary to SymSpell, considers the context. Still, according to the benchmarks I found, the accuracy of TextBlob is lower than that of SymSpell.
In this project, I decided to use SymSpell.
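For reference, here is a minimal sketch of sentence-level correction, assuming the symspellpy Python port and the English frequency dictionary it ships with (not necessarily the exact setup used in the project):

```python
import pkg_resources
from symspellpy import SymSpell

# Load the English frequency dictionary bundled with symspellpy.
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# lookup_compound corrects a whole sentence, handling split and merged words.
text = "i cant beleive how tierd i am"
suggestions = sym_spell.lookup_compound(text, max_edit_distance=2)
print(suggestions[0].term)  # e.g. "i cant believe how tired i am"
```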
Spelling Correction With NeuSpell
I also found an interesting paper about the NeuSpell library, which combines different models and reaches higher accuracy than SymSpell. However, I decided to first explore the results with SymSpell, which is my solution of choice. If we eventually need to iterate through the data again, I might explore NeuSpell for comparison.
In the meantime, you can read the paper here and access the project on GitHub here.
Notes On Spelling Correction, Numbers, And Abbreviations
While SymSpell is an efficient algorithm, it is not a perfect solution, and manual correction is still required. For example, if bf appears in the text, SymSpell will correct it to of instead of boyfriend. This is because of is the closest word to bf in terms of spelling and, being a very common word, has the highest probability of being chosen.
Since SymSpell cannot interpret abbreviations, I decided to build a custom dictionary to correct abbreviations and words that were not fixed or processed as desired.
To avoid capital letters and as a step towards text normalization, I lowercased the text dataset.
I realized that removing the punctuation would also remove the & character, so a&e would become ae and would not be corrected to accident and emergency. I solved this by applying my custom dictionary before punctuation removal.
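Here is a minimal sketch of such a custom dictionary; the entries and helper function are hypothetical, and the replacement runs on the lowercased text before punctuation is stripped:

```python
import re

# Hypothetical entries: abbreviations SymSpell cannot interpret on its own.
custom_dict = {
    "bf": "boyfriend",
    "gf": "girlfriend",
    "a&e": "accident and emergency",
    "idk": "i do not know",
}

def expand_abbreviations(text: str) -> str:
    """Replace whole-word abbreviations before punctuation removal."""
    for abbr, replacement in custom_dict.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", replacement, text)
    return text

print(expand_abbreviations("my bf took me to a&e last night"))
# my boyfriend took me to accident and emergency last night
```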
We also need to convert numbers to words because SymSpell would otherwise remove them. I often see numbers removed outright with the justification that they add no value to the data. However, I prefer to be careful here and make sure data is irrelevant before discarding it, which is why I decided to convert numbers to words with the inflector package. The documentation can be found here. Some numbers were still not processed, so I added them manually to my initial custom dictionary.
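The sketch below illustrates the idea with the inflect library, used here only as a stand-in assumption rather than the exact package mentioned above:

```python
import re

import inflect

p = inflect.engine()

def numbers_to_words(text: str) -> str:
    """Replace standalone digits with their word form, e.g. '17' -> 'seventeen'."""
    return re.sub(r"\d+", lambda match: p.number_to_words(match.group()), text)

print(numbers_to_words("i am 17 and i have been sad for 2 years"))
# i am seventeen and i have been sad for two years
```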
Data Visualization
Visualizing data is essential to better understand the problem at hand. Plotting the distribution of word counts could not show, at a glance, the most important words in each category, and I found word clouds a better way to tell the story of each category. Below is the word cloud of the depression class:
And here is the word cloud for the SuicideWatch class:
The words appearing in the SuicideWatch class are, not surprisingly, overall darker and more brutal than those in the depression category. Both word clouds show words related to emotions and feelings, and both paint sadness and hurt, but the depression word cloud does show some positive words related to love and does not include words associated with the immediate end of one’s life.
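For reference, a word cloud like the ones above can be produced with the wordcloud package; this is a minimal sketch with a hypothetical DataFrame and column names:

```python
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

# Hypothetical DataFrame; the real one holds roughly 20,000 preprocessed posts.
df = pd.DataFrame({
    "class": ["depression", "depression"],
    "prep_txt": ["i feel tired and alone", "nothing makes me happy anymore"],
})

# Concatenate all posts of one class and let WordCloud compute word frequencies.
depression_text = " ".join(df.loc[df["class"] == "depression", "prep_txt"])
wc = WordCloud(width=800, height=400, background_color="white").generate(depression_text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```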
Feature Engineering
The next step is to perform feature engineering. I wondered if we could build relevant features based on the text itself; for example, whether the length of the text tends to differ between the depression and SuicideWatch categories. To explore this, I created a feature counting the number of words per post and plotted it to evaluate the feature.
The plot shows that posts from SuicideWatch tend to be shorter, but we also notice that the distribution is right-skewed, with many very short posts.
Therefore, I applied a logarithmic transformation to reduce the skewness of the distribution, which will help the model learn from our data.
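A minimal sketch of both steps, with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with the preprocessed posts.
df = pd.DataFrame({"prep_txt": ["i feel so tired", "i do not want to be here anymore"]})

# Number of words per post.
df["word_count"] = df["prep_txt"].str.split().str.len()

# log1p reduces the right skew (and handles very short posts gracefully).
df["log_word_count"] = np.log1p(df["word_count"])
print(df)
```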
There are probably more options to explore, but let’s experiment with this feature first. The difference between the two categories is not that large, so the weight of this feature as a predictor might not be significant.
Text Preprocessing With Stemming and Lemmatization
At this stage, the preprocessing is not entirely done yet. We need to apply stemming and lemmatization before vectorizing our data, that is, before converting the words into numbers so that the Machine Learning algorithms can learn from them.
For each of these steps, we will create a different column. The reason is that we will not feed our model with every preprocessing method; we will test each of them to see which one performs better. I discarded the n-gram approach because I do not see how arbitrary bi-grams or tri-grams would help build an accurate model, whereas associations obtained through correlations of words would be more suitable.
Stemming Methods
Stemming is the process of reducing a word to its base word or stem so that words of a similar kind fall under a common stem. It is an essential step of Natural Language Processing that allows a model to make sense of the words and learn something from the data.
Stemming With Porter Stemmer
Porter’s stemming method is one of the most common algorithms for reducing a word to its stem or root.
After using Porter’s stemming method and TF-IDF, we get 14,623 features.
Stemming With Snowball Stemmer
The Snowball stemmer is another way to perform stemming. It is also known as Porter2, as it is an improvement of the Porter stemming method, and it is computationally faster.
The Snowball method is often recommended because it is faster than the Porter stemmer and offers good results; Porter himself acknowledged that it is an improvement over his initial algorithm.
After using the Snowball stemmer and TF-IDF, we get 14,376 features.
Stemming With Lancaster Stemmer
The Lancaster stemmer is a more aggressive method than the classic Porter stemmer. My take is that it may increase the frequency of some keywords and offer higher accuracy, based on this. But it might also trim words down too much, so that many short words become obfuscated, which would lead to a decrease in accuracy. At this stage, it is only a hypothesis for which we will have the answer in the following article.
After performing Lancaster stemming and TF-IDF, we get 11,843 features.
The main drawback of stemming is that it does not consider whether a word is being used as a noun or a verb in context. For this reason, lemmatization might be a better option, as it takes this into account and returns a different output depending on whether the word is used as a verb or a noun.
For this project, I used the NLTK package to perform stemming. The documentation is available here.
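As a small illustration (not the project’s exact code), the three stemmers are all available in NLTK and can be compared side by side:

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Compare how aggressively each algorithm trims the same words.
for word in ["depressed", "already", "nothing", "particular"]:
    print(word, porter.stem(word), snowball.stem(word), lancaster.stem(word))
```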
Lemmatization
The lemmatization step usually brings higher accuracy than stemming because it considers the context and performs a morphological analysis of the words. Still, it is also a more computationally expensive solution; if speed is an issue, lemmatization might not be relevant for your project. In our case, however, we will explore the results to see if it helps build a good Machine Learning model.
We will use SpaCy, as it is superior to NLTK in terms of speed and accuracy. If you want to learn more about SpaCy, you can find the documentation here.
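A minimal sketch of lemmatization with SpaCy, assuming the small English model en_core_web_sm is installed (not necessarily the model used in the project):

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("he already told me he was depressed")
print(" ".join(token.lemma_ for token in doc))
```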
The lemmatization process, combined with TF-IDF, gives us 18,167 features.
To illustrate the principles mentioned above, let’s visualize the output of a single post for each method.
As input, let’s use an extract of a post from the depression Subreddit, after SymSpell correction and punctuation removal:
so i have been with my boyfriend for five months, and he already told me he was depressed to this week nothing particular happened but i can now feel he is bothered by it
After Porter Stemming:
so i have been with my boyfriend for five month and he alreadi told me he wa depress to thi week noth particular happen but i can now feel he is bother by it
After Snowball Stemming:
so i have been with my boyfriend for five month and he alreadi told me he was depress to this week noth particular happen but i can now feel he is bother by it
After Lancaster Stemming:
so i hav been with my boyfriend for fiv month and he already told me he was depress to thi week noth particul hap but i can now feel he is both by it
After Lemmatization:
so I have be with my boyfriend for five month and he already tell I he be depressed to this week nothing particular happen but I can now feel he be bother by it
In this example, the difference between the stemming methods is not that big, but as previously mentioned, the Lancaster stemming algorithm is the most aggressive one and can result in a loss of accuracy.
Note how the lemmatization method is able to track the conjugated verbs and their grammatical tenses with reasonable accuracy.
Vectorization With TF-IDF
Once the stemming and lemmatization steps are done, we need to convert the text to numbers.
To perform this step, we can use CountVectorizer or TF-IDF.
CountVectorizer transforms a given text into a vector based on the frequency (count) of each word across the entire text.
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document. It is computed by multiplying the number of times a word appears in a document by the inverse document frequency of that word across a set of documents.
In this project, we will use TF-IDF, as it is likely to increase the accuracy of our model. We will perform this step with the help of the Sklearn library.
As a parameter, we will also remove the stopwords, which helps reduce the dimensions of our model by removing words that are important for language purposes but do not carry much weight in terms of predictive power. We will create a variable for each column.
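A minimal sketch with Sklearn’s TfidfVectorizer, on a hypothetical mini-corpus; in the project, the corpus is one of the preprocessed columns:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for a preprocessed column.
corpus = [
    "i feel so tired and alone",
    "nothing makes me happy anymore",
    "my boyfriend told me he was depressed",
]

# stop_words="english" drops common words with little predictive power.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

print(X.shape)  # (number of posts, number of features kept)
```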
Interestingly, the more aggressive the stemming method, the fewer features we have, which is not necessarily a bad thing. It can be tempting to take advantage of our modern computational power and throw in as many features as possible in the hope of building a good model, but that is not how it works. For now, let’s see what model we can create with our current set of features. Eventually, we might try to simplify the model by algorithmically reducing the number of features during the model optimization stage.
The last step is to concatenate the vectorized DataFrames with the original ones and save them for modeling. However, exporting the DataFrames as CSV files takes a very long time due to the number of rows and columns. Therefore, we save them as feather files, a format optimized for large DataFrames: the file is processed quickly and is much smaller than a CSV file.
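A minimal sketch of the feather export, with a hypothetical DataFrame and file name (the real project saves one DataFrame per preprocessing method):

```python
import pandas as pd

# Hypothetical final DataFrame; in the project it is the concatenation of the
# original columns and the TF-IDF features for one preprocessing method.
df_final = pd.DataFrame({"prep_txt": ["i feel tired", "i am fine"], "label": [1, 0]})

# Feather (backed by pyarrow) is much faster and smaller than CSV for wide DataFrames.
df_final.reset_index(drop=True).to_feather("train_porter.feather")

# Reading it back later is just as quick.
df_final = pd.read_feather("train_porter.feather")
```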
Closing Thoughts
In this article, we explored how to clean text data, including ways to manage abbreviations and misspellings. We also went through different ways of normalizing text data and, finally, converted the text to numbers using TF-IDF.
If you haven’t read the first part of this project, you can check the article here.
Now that we have completed the cleaning and preprocessing of our text data, in the following article we will explore different Machine Learning models and gain some insights into our current hypothesis.
To take full advantage of this article, you can download the complete code here.