Keywords

Automated content moderation, Vision-Language models, CLIP, Prompt engineering

Abstract

Online video platforms receive hundreds of hours of uploads every minute, making manual moderation of inappropriate content impossible. The most vulnerable consumers of malicious video content are children aged 1-5, whose attention is easily captured by bursts of color and sound. Prominent video hosting platforms like YouTube have taken measures to mitigate malicious content, but these videos often go undetected by current automated content moderation tools, which focus on removing explicit or copyrighted content. Scammers attempting to monetize their content may craft malicious children's videos that are superficially similar to educational videos but include scary and disgusting characters, violent motions, loud music, and disturbing noises. A robust classification of malicious videos requires audio representations in addition to video features. However, recent content moderation approaches rarely employ multimodal architectures that explicitly consider non-speech audio cues. Additionally, there is a dearth of comprehensive datasets for content moderation tasks that include these audio-visual feature annotations. This dissertation addresses these challenges and makes several contributions to the problem of content moderation for children's videos. The first contribution is identifying a set of malicious features that are harmful to preschool children but remain unaddressed, and publishing MOB (Malicious or Benign), a labeled dataset of cartoon video clips that include these features. We provide a user-friendly web-based video annotation tool that can easily be customized and used for video classification tasks with any number of ground truth classes. The second contribution is adapting state-of-the-art Vision-Language models to content moderation on the MOB benchmark.
We perform prompt engineering and an in-depth analysis of how context-specific language prompts affect the content moderation performance of different CLIP (Contrastive Language-Image Pre-training) variants. This dissertation introduces new benchmark natural language prompt templates for cartoon videos that can be used with Vision-Language models. Finally, we introduce a multimodal framework that includes the audio modality for more robust content moderation of children's cartoon videos and extend our dataset to include audio labels. We present ablations to demonstrate the performance gained by adding audio. The audio modality and prompt learning are incorporated while keeping the backbone modules of each modality frozen. Experiments were conducted on a multimodal version of the MOB (Malicious or Benign) dataset in both supervised and few-shot settings.

Completion Date

2024

Semester

Summer

Committee Chair

Dr. Gita Sukthankar

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Department of Computer Science

Degree Program

Computer Science

Format

application/pdf

Language

English

Rights

In copyright

Release Date

August 2024

Length of Campus-only Access

None

Access Status

Doctoral Dissertation (Open Access)

Campus Location

Orlando (Main) Campus

Accessibility Status

Meets minimum standards for ETDs/HUTs
