ORCID
https://orcid.org/0009-0000-0586-1916
Keywords
Large Language Model Cross-Layer Security, Backdoor Attacks, Jailbreak Attacks, Text Watermarking, Hardware-Aware, LLM-as-a-Judge Bias
Abstract
Transformer-based Large Language Models (LLMs) have achieved remarkable success across Natural Language Processing (NLP) tasks, yet their increasing adoption has exposed critical vulnerabilities across the input, model, and output layers. This dissertation systematically investigates these vulnerabilities and develops corresponding defenses, offering an end-to-end view of LLM security. First, we study model-level vulnerabilities through backdoor injection attacks that implant hidden malicious behaviors via hardware perturbations. We propose TrojBits, a hardware-aware attack that uses minimal bit-flips to insert trojans into models. Its three-stage design, comprising Vulnerable Parameter Ranking, Hardware-Aware Attack Optimization, and Vulnerable Bit Pruning, successfully compromises BERT and XLNet by modifying only 64 parameters with 90 bit-flips. To counter such threats, we introduce EmbedPerturb, a defense that perturbs susceptible trojan triggers to prevent exploitation of their original values with negligible overhead. Second, we examine prompt-level vulnerabilities through multilingual jailbreak attacks, in which adversarial prompts bypass safety guardrails. Unlike prior English-focused research, we show that transliteration and chatspeak prompts can trigger unsafe responses in GPT-4 and Claude 3 Sonnet, revealing overlooked multilingual risks. We further propose SysFilter, a multilingual prompt-filtering and semantic-consistency defense that neutralizes such attacks. Finally, we address output-level security through watermarking, which embeds detectable statistical patterns in LLM-generated text to deter misuse. Our evaluation of four watermarking schemes under multilingual translation attacks identifies significant weaknesses in existing approaches. Building on these insights, we outline design principles for future multilingual watermarking schemes that remain robust to paraphrasing and translation.
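As a toy illustration of the bit-flip threat model in which TrojBits operates (this sketch is our own and is not the dissertation's code), flipping a single bit in a weight's IEEE-754 float32 encoding can change its value drastically or negligibly, depending on which bit is targeted:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32's IEEE-754 encoding (bit 0 = mantissa LSB)."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    as_int ^= 1 << bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int))
    return flipped

w = 0.5
# Flipping a high exponent bit changes the weight's magnitude enormously,
# while flipping the mantissa LSB perturbs it by only one ULP.
print(flip_bit(w, 30))  # exponent bit: 0.5 -> 2**127
print(flip_bit(w, 0))   # mantissa LSB: 0.5 -> 0.5 + 2**-24
```

This asymmetry is why bit-flip attacks rank parameters and bit positions before flipping: a handful of well-chosen flips can implant a trojan while leaving most of the model numerically intact.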
Collectively, this work bridges attack and defense perspectives, providing a unified framework for understanding and mitigating vulnerabilities in modern LLMs.
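The specific watermarking schemes evaluated in the dissertation are not reproduced here, but the general "green-list" idea behind statistical text watermarking can be sketched as follows. The vocabulary, hash-based seeding, and threshold below are illustrative assumptions, not the evaluated schemes:

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary marked "green" at each step

def green_list(prev_token, vocab, gamma=GAMMA):
    """Pseudo-randomly partition the vocabulary, seeded by the previous token."""
    return {
        tok for tok in vocab
        if hashlib.sha256(f"{prev_token}|{tok}".encode()).digest()[0] < 256 * gamma
    }

def z_score(tokens, vocab, gamma=GAMMA):
    """z-statistic for the observed green-token count; a large value is
    statistical evidence that generation was biased toward green tokens."""
    n = len(tokens) - 1
    hits = sum(
        tok in green_list(prev, vocab, gamma)
        for prev, tok in zip(tokens, tokens[1:])
    )
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

A watermarked generator softly biases sampling toward each step's green list, so genuine LLM output yields a large z-score while human text does not. Translation or paraphrasing re-tokenizes the text and scrambles these per-position green lists, which is precisely the weakness the multilingual translation attacks studied here exploit.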
Completion Date
2025
Semester
Fall
Committee Chair
Lou, Qian
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Computer Science
Identifier
DP0029718
Document Type
Thesis
Campus Location
Orlando (Main) Campus
STARS Citation
Al Ghanim, Mansour, "Addressing Vulnerabilities and Defense Mechanisms in Transformer-Based Language Models" (2025). Graduate Thesis and Dissertation post-2024. 416.
https://stars.library.ucf.edu/etd2024/416