NLP, Natural Language Processing, Cyberbullying, Twitter, classification, TF-IDF, bag-of-words


Cyberbullying refers to bullying carried out through electronic means and the internet. In recent years, it has been identified as a major problem among young people and even adults. It can negatively affect a person's emotions and lead to adverse outcomes such as depression, anxiety, harassment, and suicide, among others. This has created a need for machine learning techniques that automatically detect cyberbullying and help prevent it on social media platforms. In this study, we analyze combinations of Natural Language Processing (NLP) feature-extraction techniques (Bag-of-Words and TF-IDF) with popular machine learning algorithms (Logistic Regression (LR), Naive Bayes (NB), K-Nearest Neighbor (KNN), and Extreme Gradient Boosting (XGBoost)) to detect cyberbullying on Twitter. The NLP methods were used to extract features from tweets and convert them into numerical vectors, and these vectors were then used as input to the machine learning algorithms. When their performance and accuracy were compared, the XGBoost model emerged as the best-performing classifier, regardless of whether it used bag-of-words or TF-IDF features.
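
The feature-extraction step described above can be illustrated with a minimal sketch. It is not the study's code: the three toy tweets are invented for illustration, the tokenizer is a simple assumption, and the TF-IDF weighting shown (term frequency times ln(N/df) + 1) is only one common variant, so the weights may differ from those of the library the authors used.

```python
import math
import re
from collections import Counter

# Hypothetical toy tweets, invented for illustration (not the study's data).
tweets = [
    "you are so dumb and ugly",
    "great game last night",
    "nobody likes you, you are dumb",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters as word tokens."""
    return re.findall(r"[a-z]+", text.lower())


docs = [tokenize(t) for t in tweets]

# The vocabulary fixes the meaning of each vector position.
vocab = sorted({word for doc in docs for word in doc})


def bag_of_words(doc):
    """Return the raw count of each vocabulary word in one document."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]


bow_vectors = [bag_of_words(doc) for doc in docs]

# Document frequency: number of tweets that contain each word.
n_docs = len(docs)
doc_freq = {word: sum(1 for doc in docs if word in doc) for word in vocab}

# Assumed IDF variant: ln(N / df) + 1, so rarer words get larger weights.
idf = {word: math.log(n_docs / doc_freq[word]) + 1.0 for word in vocab}


def tf_idf(doc):
    """Return relative term frequency times IDF for each vocabulary word."""
    counts = Counter(doc)
    return [(counts[word] / len(doc)) * idf[word] for word in vocab]


tfidf_vectors = [tf_idf(doc) for doc in docs]

for tweet, vector in zip(tweets, bow_vectors):
    print(tweet, "->", vector)
```

Either set of vectors could then be passed, together with cyberbullying labels, to a classifier such as those named in the abstract.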


Spring 2024

Course Name

STA 5703 Data Mining 1

Instructor Name

Xie, Rui


College of Sciences

Accessibility Status

PDF accessibility verified using Adobe Acrobat Pro Accessibility Checker