Keywords
Automated content moderation, Vision-Language models, CLIP, Prompt engineering
Abstract
Online video platforms receive hundreds of hours of uploads every minute, making manual moderation of inappropriate content impossible. The most vulnerable consumers of malicious video content are children aged one to five, whose attention is easily captured by bursts of color and sound. Prominent video hosting platforms like YouTube have taken measures to mitigate malicious content, but such videos often go undetected by current automated content moderation tools, which focus on removing explicit or copyrighted material. Scammers attempting to monetize their content may craft malicious children's videos that are superficially similar to educational videos but include scary or disgusting characters, violent motion, loud music, and disturbing noises. Robust classification of malicious videos requires audio representations in addition to visual features. However, recent content moderation approaches rarely employ multimodal architectures that explicitly consider non-speech audio cues, and there is a dearth of comprehensive datasets for content moderation tasks that include such audio-visual annotations. This dissertation addresses these challenges and makes several contributions to the problem of content moderation for children's videos. The first contribution is identifying a set of malicious features that are harmful to preschool children yet remain unaddressed by existing moderation tools, and publishing Malicious or Benign (MOB), a labeled dataset of cartoon video clips that contain these features. We also provide a user-friendly, web-based video annotation tool that can easily be customized for video classification tasks with any number of ground-truth classes. The second contribution is adapting state-of-the-art Vision-Language models to content moderation on the MOB benchmark. We perform prompt engineering and an in-depth analysis of how context-specific language prompts affect the content moderation performance of different CLIP (Contrastive Language-Image Pre-training) variants, and introduce new benchmark natural language prompt templates for cartoon videos that can be used with Vision-Language models. Finally, we introduce a multimodal framework that incorporates the audio modality for more robust content moderation of children's cartoon videos and extend our dataset with audio labels. The audio modality and prompt learning are incorporated while keeping the backbone modules of each modality frozen, and ablations demonstrate the performance gains from adding audio. Experiments were conducted on a multimodal version of the MOB dataset in both supervised and few-shot settings.
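As a rough illustration of the prompt-based approach summarized above (not the dissertation's actual implementation or its benchmark templates), the following Python sketch classifies a single sampled cartoon frame with a CLIP variant via the Hugging Face transformers API; the prompt strings and the file name sampled_frame.jpg are hypothetical placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical context-specific prompt templates for cartoon frames;
# the real MOB prompt templates are defined in the dissertation.
prompts = [
    "a frame from a safe, child-friendly cartoon video",
    "a frame from a disturbing cartoon with scary or violent content",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

frame = Image.open("sampled_frame.jpg")  # a frame sampled from a video clip
inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    # logits_per_image holds the frame's similarity to each text prompt
    probs = outputs.logits_per_image.softmax(dim=-1)

label = "benign" if probs[0, 0] > probs[0, 1] else "malicious"
print(label, probs.tolist())

In practice such frame-level scores would be aggregated over a clip, and the dissertation's multimodal framework additionally fuses audio features while keeping each modality's backbone frozen.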
Completion Date
2024
Semester
Summer
Committee Chair
Dr. Gita Sukthankar
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Department of Computer Science
Degree Program
Computer Science
Format
application/pdf
Identifier
DP0028589
URL
https://purls.library.ucf.edu/go/DP0028589
Language
English
Rights
In copyright
Release Date
August 2024
Length of Campus-only Access
None
Access Status
Doctoral Dissertation (Open Access)
Campus Location
Orlando (Main) Campus
STARS Citation
Ahmed, Syed Hammad, "A Multimodal Framework for Automated Content Moderation of Children's Videos" (2024). Graduate Thesis and Dissertation 2023-2024. 386.
https://stars.library.ucf.edu/etd2023/386
Accessibility Status
Meets minimum standards for ETDs/HUTs