Keywords

Logistic Regression, Multinomial Naive Bayes, K-Nearest Neighbor, Extreme Gradient Boosting, Bag of Words, Term Frequency-Inverse Document Frequency

Abstract

This study compares several popular machine learning techniques (Logistic Regression, Multinomial Naive Bayes, K-Nearest Neighbor, and Extreme Gradient Boosting) for classifying tweets into three categories: cyberbullying based on religion, cyberbullying based on ethnicity, or no cyberbullying. First, various data-cleaning approaches are applied to the tweet data. Once the data is clean, word embedding techniques such as bag of words and term frequency-inverse document frequency are used to convert the words into numerical vectors. Finally, models are fitted using combinations of these word embedding techniques and machine learning algorithms.
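
As an illustration of the vectorization step described in the abstract, the following is a minimal sketch of bag of words and term frequency-inverse document frequency using only the Python standard library. It is not the study's implementation: the whitespace tokenizer, the smoothed IDF formula (one common variant), the function names, and the three sample texts are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    # Real tweet cleaning would also handle URLs, mentions, punctuation, etc.
    return text.lower().split()


def bag_of_words(docs):
    """Return a sorted vocabulary and one raw term-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Return the vocabulary and one TF-IDF vector per document.

    Assumed weighting: tf = count / document length,
    idf = ln((1 + N) / (1 + df)) + 1 (a smoothed variant).
    """
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    vectors = []
    for row in counts:
        total = sum(row) or 1  # guard against empty documents
        vectors.append([(c / total) * w for c, w in zip(row, idf)])
    return vocab, vectors


# Made-up sample texts, not taken from the study's data.
docs = ["good morning everyone", "good night", "morning run"]
vocab, bow = bag_of_words(docs)
_, tfidf = tf_idf(docs)
```

In this sketch, a term that appears in fewer documents receives a larger IDF weight, so within the second sample text "night" is weighted above "good". Either set of vectors could then be passed to a classifier such as those named in the abstract.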

Course Name

STA 5703 Data Mining 1

Instructor Name

Dr. Rui Xie

College

College of Sciences

Included in

Data Science Commons
