- Travis(Weixing) Li

NLP - Text Classifier

23 Apr 2019


Overview

In this project, I present a Naive Bayes Logistic Regression model that classifies questions from Quora.com as sincere or insincere, and I compare its performance (F1 and AUC scores) with that of a more complex RNN model. Before being fed into the model, the text data are transformed into a document-term matrix (bag of words). Related topics such as the TF-IDF matrix and lemmatization are also discussed.
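To make the bag-of-words step concrete, here is a minimal sketch of building a document-term matrix with only the Python standard library. The `tokenize` and `document_term_matrix` helpers and the two example questions are hypothetical illustrations, not the code used in the project, which may well rely on a library vectorizer instead.

```python
from collections import Counter

PUNCTUATION = ".,?!\"'"

def tokenize(text):
    # Lowercase, split on whitespace, and strip simple punctuation (an assumed, very basic tokenizer).
    tokens = (tok.strip(PUNCTUATION).lower() for tok in text.split())
    return [tok for tok in tokens if tok]

def document_term_matrix(docs):
    # One row per document, one column per vocabulary term, each cell a raw count.
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix

# Hypothetical example questions, not taken from the dataset.
docs = ["How do I learn Python?", "Why do people learn?"]
vocab, dtm = document_term_matrix(docs)
print(vocab)
print(dtm)
```

Each row of `dtm` counts how often every vocabulary term occurs in one question, so word order is discarded and only term frequencies remain.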

Challenges

The main challenge is to build a machine learning model that takes far less time to train than an RNN while performing as well as a complex RNN.

Solutions

I transformed the input features of the logistic regression model into theoretical prior probabilities using a Naive Bayes function, and then used L2 regularization to penalize weights that deviate from this theoretical inference.
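The idea can be sketched as follows, under some loudly stated assumptions: the Naive Bayes transform is taken to be the smoothed log-count ratio commonly used in NB-LR style models, and the L2 penalty is the standard one that shrinks weights toward zero, which may differ from the exact formulation used in the project. All function names, hyperparameters, and the toy data are hypothetical.

```python
import math

def nb_log_count_ratio(X, y, alpha=1.0):
    # Smoothed per-term counts for class 1 (p) and class 0 (q), then the log of their normalized ratio.
    n_terms = len(X[0])
    p = [alpha] * n_terms
    q = [alpha] * n_terms
    for row, label in zip(X, y):
        target = p if label == 1 else q
        for j, value in enumerate(row):
            target[j] += value
    p_total, q_total = sum(p), sum(q)
    return [math.log((p[j] / p_total) / (q[j] / q_total)) for j in range(n_terms)]

def scale_features(X, r):
    # Multiply every term count by that term's log-count ratio.
    return [[value * r_j for value, r_j in zip(row, r)] for row in X]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_l2(X, y, l2=0.1, lr=0.5, epochs=500):
    # Batch gradient descent on mean log loss plus (l2 / 2) * ||w||^2; the bias is not penalized.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [l2 * w_j for w_j in w]
        grad_b = 0.0
        for row, label in zip(X, y):
            err = sigmoid(b + sum(w_j * x_j for w_j, x_j in zip(w, row))) - label
            for j, x_j in enumerate(row):
                grad_w[j] += err * x_j / n
            grad_b += err / n
        w = [w_j - lr * g for w_j, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    return [1 if sigmoid(b + sum(w_j * x_j for w_j, x_j in zip(w, row))) > 0.5 else 0 for row in X]

# Toy document-term matrix with two terms; label 1 marks an insincere question (hypothetical data).
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [0, 0, 1, 1]
r = nb_log_count_ratio(X, y)
w, b = fit_logistic_l2(scale_features(X, r), y)
preds = predict(scale_features(X, r), w, b)
print(r, preds)
```

In this sketch a term that appears mostly in insincere questions gets a positive ratio and a term that appears mostly in sincere ones gets a negative ratio, so the logistic regression starts from features that already encode the Naive Bayes evidence.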

Details

Since I’ve already written an article about it on Medium.com, please check it out through the link above. I will show some basic data exploration here:
The data looks like this: the first column is a unique identifier, which is not useful for this problem. The second column is the sentence, and the third column is the binary target (0 for a sincere question, 1 for an insincere question).

The dataset is unbalanced, so the F1-score and AUC are more appropriate metrics than accuracy for evaluating the model. I also drew some word clouds to see which words and word pairs appear more often in sincere or insincere questions.
The distribution of the data
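A short sketch shows why accuracy can mislead on unbalanced data. The 95/5 class split and both sets of predictions below are hypothetical numbers chosen for illustration, not figures from the actual dataset.

```python
def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def precision_recall_f1(y_true, y_pred):
    # Count true positives, false positives, and false negatives for the positive class (label 1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical split: 95 sincere questions (0) and 5 insincere questions (1).
y_true = [0] * 95 + [1] * 5

# Model A always predicts "sincere".
pred_a = [0] * 100
# Model B flags 4 sincere questions by mistake but catches 4 of the 5 insincere ones.
pred_b = [1] * 4 + [0] * 91 + [1] * 4 + [0] * 1

acc_a, f1_a = accuracy(y_true, pred_a), precision_recall_f1(y_true, pred_a)[2]
acc_b, f1_b = accuracy(y_true, pred_b), precision_recall_f1(y_true, pred_b)[2]
print(acc_a, f1_a)
print(acc_b, f1_b)
```

Both models reach the same accuracy of 0.95, yet the model that never flags an insincere question has an F1-score of 0 while the other scores about 0.62, which is the distinction the F1-score is meant to expose.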