ORCID

https://orcid.org/0009-0000-0586-1916

Keywords

Large Language Models, Cross-Layer Security, Backdoor Attacks, Jailbreak Attacks, Text Watermarking, Hardware-Aware, LLM-as-a-Judge Bias

Abstract

Transformer-based Large Language Models (LLMs) have achieved remarkable success across Natural Language Processing (NLP) tasks, yet their increasing adoption has exposed critical vulnerabilities across the input, model, and output layers. This dissertation systematically investigates these vulnerabilities and develops corresponding defenses, offering an end-to-end view of LLM security. First, we study model-level vulnerabilities through backdoor injection attacks that implant hidden malicious behaviors via hardware perturbations. We propose TrojBits, a hardware-aware attack that uses minimal bit-flips to insert trojans into models. Its three-stage design (Vulnerable Parameter Ranking, Hardware-Aware Attack Optimization, and Vulnerable Bit Pruning) successfully compromises BERT and XLNet using only 64 parameters and 90 bit-flips. To counter such threats, we introduce EmbedPerturb, a defense that perturbs susceptible trojan triggers so that their original values can no longer be exploited, at negligible overhead. Second, we examine prompt-level vulnerabilities through multilingual jailbreak attacks, in which adversarial prompts bypass safety guardrails. Unlike prior English-focused research, we show that transliteration and chatspeak prompts can trigger unsafe responses in GPT-4 and Claude 3 Sonnet, revealing overlooked multilingual risks. We further propose SysFilter, a multilingual prompt-filtering and semantic-consistency defense that neutralizes such attacks. Finally, we address output-level security through watermarking, embedding detectable statistical patterns in LLM-generated text to deter misuse. Our evaluation of four watermarking schemes under multilingual translation attacks identifies significant weaknesses in existing approaches. Building on these insights, we outline design principles for future multilingual watermarking schemes that remain robust to paraphrasing and translation.
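The bit-flip threat model above turns on how little a weight must change: flipping a single bit of a parameter's IEEE-754 encoding can alter its value by orders of magnitude. A minimal standalone sketch of this effect (illustrative only, not TrojBits itself; the function name and chosen bit positions are assumptions):

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of a float32's IEEE-754 bit pattern (toy illustration).

    bit 0 is the least-significant mantissa bit; bits 23-30 are the
    exponent, where a single flip can rescale the value dramatically.
    """
    # Reinterpret the float32 as an unsigned 32-bit integer
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # XOR toggles exactly the selected bit, then reinterpret as float32
    (y,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return y
```

For example, flipping the lowest exponent bit (bit 23) of 1.0 halves it to 0.5, which is why ranking the most damage-prone parameters and bits, as in the attack's first and third stages, keeps the required flip count so small.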
Collectively, this work bridges attack and defense perspectives, providing a unified framework for understanding and mitigating vulnerabilities in modern LLMs.
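The output-level watermarking discussed above can be illustrated with a minimal "green-list" detector in the style of Kirchenbauer et al. (2023). This is a hedged sketch, not the dissertation's scheme: the toy vocabulary, hash-based seeding, green-list fraction, and z-threshold are all assumptions for demonstration.

```python
import hashlib
import math

VOCAB = [f"tok{i}" for i in range(1000)]  # toy vocabulary (assumption)
GAMMA = 0.5  # fraction of the vocabulary placed on the green list

def green_list(prev_token: str, vocab=VOCAB, gamma=GAMMA) -> set:
    """Pseudo-randomly partition the vocabulary, seeded by the previous token."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    # Deterministic keyed shuffle: same previous token -> same green list
    ranked = sorted(
        vocab,
        key=lambda t: hashlib.sha256(f"{seed}:{t}".encode()).hexdigest(),
    )
    return set(ranked[: int(gamma * len(vocab))])

def z_score(tokens: list, gamma=GAMMA) -> float:
    """z-statistic of observed green-token hits vs. the unwatermarked null rate."""
    hits = sum(1 for prev, cur in zip(tokens, tokens[1:])
               if cur in green_list(prev))
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

A watermarking generator biases sampling toward each step's green list, so watermarked text yields a large z-score while natural text stays near zero; translation and paraphrasing attacks of the kind evaluated in this work erode the signal by replacing tokens, which drags the z-score back toward the null.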

Completion Date

2025

Semester

Fall

Committee Chair

Lou, Qian

Degree

Doctor of Philosophy (Ph.D.)

College

College of Engineering and Computer Science

Department

Computer Science

Format

Print

Identifier

DP0029718

Document Type

Thesis

Campus Location

Orlando (Main) Campus
