Deeja Chhabra

Text Classification: Naive Bayes from Scratch

Problem statement:

Our dataset consists of sentences of 6 types (Responsibility, Requirement, Softskill, Skill, Experience and Education). We have to classify the sentences given to us into these 6 categories. Our training dataset consists of sentences and their types, which we feed into the model to learn from; we then make our predictions on the test dataset.

[Image: https://i.stack.imgur.com/uaPM4.png]

Naive Bayes is a form of supervised learning, since we feed labeled data to the model for it to learn from.

Challenge: We will apply a Naive Bayes classifier, but not from a library; we will build the classifier from scratch.

Let us dive in.

Solution: The Naive Bayes classifier assumes that all features are conditionally independent given the class. This may not actually hold for text, but it still gives impressive accuracy in text classification.

Naive Bayes formula: P(A|B) = P(B|A) P(A) / P(B)

Here A is the class and B is the sentence: we want the probability that a sentence belongs to a specific class, P(class | sentence) = P(sentence | class) P(class) / P(sentence). Under the naive independence assumption, P(sentence | class) factors into a product over the sentence's words, so the classifier scores each class as P(class) × P(word1 | class) × … × P(wordn | class) and picks the class with the highest score.

Let us look at our code.

We begin with the necessary imports.

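The import cell appears only as images in the original post; here is a minimal sketch of the imports the later steps rely on (the exact list is an assumption):

    import re
    import numpy as np
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.preprocessing import LabelEncoder
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split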

Let's load our data into DataFrames:

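The loading step is shown as images in the original; a sketch assuming CSV files with the column names Sentence and Type (the file names are hypothetical):

    import pandas as pd

    # File names are assumptions; the original shows them only as screenshots.
    df_train = pd.read_csv("train.csv")   # expected columns: Sentence, Type
    df_test = pd.read_csv("test.csv")     # expected column: Sentence
    print(df_train.shape, df_test.shape)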

We see that our test dataset does not have a Type column, so we use it after building our model to see which sentence is predicted as which type.

We drop null values from our DataFrame and see that our dataset size has decreased:

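A sketch of the dropping step, assuming the DataFrame from the previous step:

    # Drop rows containing null values; the row count shrinks accordingly.
    df_train = df_train.dropna().reset_index(drop=True)
    print(df_train.shape)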

Plotting the sentence types and the number of sentences of each type:

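The plotting code and chart are images in the original; a sketch of one way to produce the same bar chart with pandas and matplotlib:

    import matplotlib.pyplot as plt

    # Bar chart: number of sentences per Type.
    df_train["Type"].value_counts().plot(kind="bar")
    plt.xlabel("Sentence Type")
    plt.ylabel("Number of sentences")
    plt.show()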

The type with the most sentences is Responsibility (15,257); the type with the fewest is Education (4,540).

Cleaning and pre-processing our data: The Remove_punt function removes HTML tags, URLs and non-alphanumeric characters, if any are present. It is called from our Extraction function on every sentence in the training dataset. In the Extraction function we create a stop-words list (imported from nltk) containing common English stop words, and we generate a new DataFrame consisting of only the sentences and their respective types. Sequentially, we apply the Remove_punt function mentioned above, convert all sentences to lower case ("And" would otherwise be interpreted differently than "and"), and then, most importantly, we lemmatize the words. Lemmatization converts different forms of a word into a single word carrying the same meaning (strong, strongly and stronger are all interpreted as strong).

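The original functions are shown only as images; here is a sketch of what Remove_punt and Extraction likely look like, based on the description above (the regular expressions are assumptions):

    import re
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # May require: nltk.download("stopwords"); nltk.download("wordnet")
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def remove_punt(text):
        # Strip HTML tags, URLs, and non-alphanumeric characters.
        text = re.sub(r"<[^>]+>", " ", text)
        text = re.sub(r"http\S+|www\.\S+", " ", text)
        return re.sub(r"[^a-zA-Z0-9\s]", " ", text)

    def extraction(df):
        # Keep only the Sentence and Type columns, then clean each sentence:
        # remove tags/URLs/punctuation, lowercase, drop stop words, lemmatize.
        out = df[["Sentence", "Type"]].copy()
        cleaned = []
        for sentence in out["Sentence"]:
            words = remove_punt(sentence).lower().split()
            words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
            cleaned.append(" ".join(words))
        out["Sentence"] = cleaned
        return out

    df_train = extraction(df_train)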

Our DataFrame's row count is not affected; it now has only two columns, since we kept only the sentences and their types.

Encoding Type into labels:

Our model understands numbers, not words.

We have used sklearn's LabelEncoder to make our work easier. Below is how each Type is encoded to a label in our model.

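A sketch of the encoding step; note that LabelEncoder assigns integer labels in alphabetical order of the class names:

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    df_train["label"] = le.fit_transform(df_train["Type"])

    # Alphabetical assignment implies: Education=0, Experience=1, Requirement=2,
    # Responsibility=3, Skill=4, Softskill=5.
    print(dict(zip(le.classes_, le.transform(le.classes_))))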

Splitting our training dataset:

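A sketch of the split using sklearn's train_test_split (the split ratio and seed are assumptions):

    from sklearn.model_selection import train_test_split

    X_train, X_val, y_train, y_val = train_test_split(
        df_train["Sentence"], df_train["label"],
        test_size=0.2, random_state=42)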

Transforming words into tokens and building a dictionary of them:

We import the CountVectorizer class from sklearn and use it to count the words and convert each sentence into a vector of word frequencies. vocab consists of only the actual words, and w_counts holds, for each label, the count of every word under that label.
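A sketch of this step based on the description above (max_features and the exact shape of w_counts are assumptions):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(max_features=2000)  # max_features is an assumption
    X = vectorizer.fit_transform(X_train)            # sparse document-term count matrix
    vocab = vectorizer.get_feature_names_out()       # the actual words

    # w_counts[label][word] = total count of that word over all sentences of that label
    labels = [0, 1, 2, 3, 4, 5]
    y_arr = np.asarray(y_train)
    w_counts = {}
    for label in labels:
        rows = np.where(y_arr == label)[0]
        totals = np.asarray(X[rows].sum(axis=0)).ravel()
        w_counts[label] = dict(zip(vocab, totals))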

Fitting our data:

We fit our data (X) according to the labels ([0, 1, 2, 3, 4, 5]), computing the prior probability of each label. Let us predict our validation samples and check the accuracy.
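A sketch of the prior computation, reusing labels and y_arr from the previous step:

    import numpy as np

    # Prior probability of each class = fraction of training sentences with that label.
    priors = {label: np.mean(y_arr == label) for label in labels}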

Predicting without Laplace smoothing:

WhitespaceTokenizer is used to split each string into tokens on whitespace. We build our predict method to predict the class of each sentence supplied (X_val), converting every sentence into word tokens. Without smoothing, any word not seen in the training data for a class gets a probability of 0, which zeroes out the whole product and caused a math error in our code. We received an accuracy of 25% on our validation set. That is low, so let us apply Laplace smoothing.
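A sketch of an unsmoothed predict function matching the description above; to keep it runnable it multiplies raw probabilities, so an unseen word silently zeroes the product rather than raising an error:

    import numpy as np
    from nltk.tokenize import WhitespaceTokenizer

    tok = WhitespaceTokenizer()
    class_totals = {label: sum(w_counts[label].values()) for label in labels}

    def predict_no_smoothing(sentences):
        preds = []
        for sentence in sentences:
            best_label, best_prob = labels[0], -1.0
            for label in labels:
                prob = priors[label]
                for word in tok.tokenize(sentence):
                    # A word unseen for this class has count 0, which zeroes
                    # out the whole product -- the source of the low accuracy.
                    prob *= w_counts[label].get(word, 0) / class_totals[label]
                if prob > best_prob:
                    best_label, best_prob = label, prob
            preds.append(best_label)
        return preds

    val_preds = predict_no_smoothing(X_val)
    print("accuracy:", np.mean(np.asarray(val_preds) == np.asarray(y_val)))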

Predicting with Laplace smoothing:

Here there is not much change from the last predict function. We add 1 to every word count, so words not seen for a class no longer produce a zero probability, saving us from the math error (log of zero / division by zero).
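A sketch of the smoothed version, done in log space with add-one (Laplace) counts, reusing the helpers from the previous block:

    import math

    def predict_laplace(sentences):
        V = len(vocab)  # vocabulary size, added to each denominator
        preds = []
        for sentence in sentences:
            best_label, best_score = labels[0], -math.inf
            for label in labels:
                score = math.log(priors[label])
                for word in tok.tokenize(sentence):
                    # Add-one smoothing: unseen words get count 1, not 0.
                    count = w_counts[label].get(word, 0) + 1
                    score += math.log(count / (class_totals[label] + V))
                if score > best_score:
                    best_label, best_score = label, score
            preds.append(best_label)
        return preds

    val_preds = predict_laplace(X_val)
    print("accuracy:", np.mean(np.asarray(val_preds) == np.asarray(y_val)))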

Comparison with sklearn's MultinomialNB: Now let us compare our model to the one from the library:

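A sketch of the library comparison, reusing the count matrix X and the vectorizer from earlier:

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    clf = MultinomialNB()
    clf.fit(X, y_train)                        # same count matrix built earlier
    X_val_vec = vectorizer.transform(X_val)    # vectorize the validation sentences
    print(accuracy_score(y_val, clf.predict(X_val_vec)))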

We get an accuracy of 68% from sklearn's multinomial Naive Bayes classifier. Let us now predict our test set on both classifiers (one built from scratch, one from the sklearn library).

Preprocessing Test Dataset:

Since we have only one column, this function is almost the same as the one above for the training dataset, with small details changed. Let us print the first 10 rows from our dataset.
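A sketch of the test-set version of the cleaning function, reusing remove_punt, stop_words and lemmatizer from earlier:

    def extraction_test(df):
        # Same cleaning as for the training set, minus the Type column.
        out = df[["Sentence"]].copy()
        cleaned = []
        for sentence in out["Sentence"]:
            words = remove_punt(sentence).lower().split()
            words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
            cleaned.append(" ".join(words))
        out["Sentence"] = cleaned
        return out

    df_test = extraction_test(df_test)
    print(df_test.head(10))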

Let us compare the predictions from both our models on the above 10 entries from the test set.

[The original shows two screenshots here: predictions from the model built from scratch on the left, and from the sklearn model on the right.]
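A sketch of one way to line the two sets of predictions up side by side (pure illustration; the helper names come from the earlier sketches):

    import pandas as pd

    first10 = df_test["Sentence"].head(10)
    comparison = pd.DataFrame({
        "Sentence": first10,
        "From scratch": le.inverse_transform(predict_laplace(first10)),
        "From sklearn": le.inverse_transform(clf.predict(vectorizer.transform(first10))),
    })
    print(comparison)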

We see that 7 out of 10 predictions are the same in both models.

Contribution:

  1. Comparison of model accuracy with and without Laplace smoothing.

  2. Pre-processing and cleaning the dataset, dropping NA values.

  3. Comparison of accuracy between the model built from scratch and the Naive Bayes classifier imported from sklearn.

  4. Comparison of the predictions of the two models.

  5. Increasing max_features in CountVectorizer may improve performance, but it takes almost 21 minutes to run.

Contribution explanation:

I have used Analytics Vidhya's Naive Bayes Classifier from Scratch to Perform Sentiment Analysis [3] as a reference for text classification. We begin by importing the dataset. I have used DataFrame referencing [2], lemmatization ([4], [1]) and punctuation removal [5] to perform data pre-processing and cleaning. Then I used sklearn's LabelEncoder to encode our categorical labels [6], a CountVectorizer [7] to build the vocabulary, and a WhitespaceTokenizer [8] to separate the words from sentences for our predictions.

References:

[1] https://www.analyticsvidhya.com/blog/2022/06/stemming-vs-lemmatization-in-nlp-must-know-differences/

[2] https://www.youtube.com/watch?v=8PN1eXQGZ9c&ab_channel=GeeksforGeeks

[3] https://www.analyticsvidhya.com/blog/2022/03/building-naive-bayes-classifier-from-scratch-to-perform-sentiment-analysis/

[4] https://www.geeksforgeeks.org/python-lemmatization-with-nltk/

[5] https://thewebdev.info/2021/10/23/how-to-remove-punctuation-with-python-pandas/

[6] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

[7] https://stackoverflow.com/questions/47898326/how-vectorizer-fit-transform-work-in-sklearn

[8] https://www.educative.io/answers/what-is-whitespacetokenizer-in-python