ORCID

0000-0003-0603-0975

Keywords

Latent Representation Editing; Mechanistic Interpretability; Linear Probes; Chess Language Models; Artificial Intelligence

Abstract

This dissertation investigates the structures and mechanisms underpinning the latent space representations that emerge within Generative Pretrained Transformer (GPT) models. Addressing the broader goal of enhancing AI trustworthiness through transparency, accountability, and controllability, we focus on techniques to understand, quantify, and manipulate these latent space representations. Through a series of analyses, we examine several chess-playing GPT models as controlled testbeds, leveraging their structured decision space to explore emergent representations and decision-making processes.

Key contributions include a mechanistic analysis of attention heads and latent representations, the development of novel metrics for evaluating intervention outcomes, and the application of linear probe classifiers to decode and edit the model's internal world representations. Analysis of the probe weight vectors reveals that the chess-playing GPT developed an emergent world model of the game that includes pieces, positions, and movement rules, and provides empirical support for the linear representation hypothesis: the idea that abstract concepts are encoded as specific directions in the model's hidden state space. Complementary analysis of the hidden state vectors demonstrates that the model's internal representations honor the Markovian property of chess.
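To make the linear-probe idea concrete, the following is a minimal sketch, not the dissertation's actual code: a concept is planted as a single direction in synthetic "hidden states", and a logistic-regression probe trained on those states recovers that direction, illustrating the linear representation hypothesis. All dimensions, sample counts, and the training loop are illustrative assumptions.

```python
# Illustrative sketch: a linear probe recovering a planted concept direction.
# Dimensions and data are synthetic stand-ins for GPT hidden states.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64          # hidden-state dimensionality (illustrative)
n = 2000              # number of sampled hidden states

concept = rng.normal(size=d_model)
concept /= np.linalg.norm(concept)            # ground-truth concept direction

# Binary concept, e.g. "this square holds a white pawn" (illustrative label).
labels = rng.integers(0, 2, size=n)
signal = 2.0 * (2.0 * labels - 1.0)           # +2 / -2 along the concept axis
hidden = rng.normal(size=(n, d_model)) + signal[:, None] * concept

# Logistic-regression probe trained by plain gradient descent.
w = np.zeros(d_model)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(hidden @ w)))
    w -= 0.1 * hidden.T @ (p - labels) / n

probe_dir = w / np.linalg.norm(w)
cosine = abs(probe_dir @ concept)             # alignment with planted direction
accuracy = ((hidden @ w > 0) == labels).mean()
print(f"cosine similarity: {cosine:.3f}, probe accuracy: {accuracy:.3f}")
```

Because the noise is isotropic, the trained probe's weight vector aligns closely with the planted direction, which is the sense in which probe weights can be read as concept directions.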

Experimental results demonstrate that linear interventions can causally steer GPT outputs while preserving their semantic validity. Drawing on the dose-response analogy from medicine, we vary both the strength and position of interventions, showing that output quality is maximized when intervention strength follows an exponentially decaying schedule across token positions. Similar experiments using sparse autoencoders in place of linear probes yielded significantly poorer performance. These results highlight the effectiveness of simple linear probes as valuable tools for interpretability and control.
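The exponentially decaying intervention schedule can be sketched as follows. This is a hedged illustration, not the dissertation's implementation: the decay rate, base strength, and tensor shapes are placeholder assumptions, and `intervene` is a hypothetical helper name.

```python
# Illustrative sketch: add a probe-derived direction to each token's hidden
# state with exponentially decaying strength across token positions.
import numpy as np

def intervene(hidden_states, direction, base_strength=4.0, decay=0.5):
    """Add `direction` at every position t with strength base_strength * decay**t."""
    seq_len, _ = hidden_states.shape
    unit = direction / np.linalg.norm(direction)
    strengths = base_strength * decay ** np.arange(seq_len)  # exponential schedule
    return hidden_states + strengths[:, None] * unit

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))      # (seq_len, d_model), illustrative shapes
d = rng.normal(size=16)           # concept direction, e.g. from a linear probe
edited = intervene(h, d)

# The per-position edit magnitude halves at each successive token.
deltas = np.linalg.norm(edited - h, axis=1)
print(np.round(deltas, 3))
```

The per-position norms of the edit follow the geometric schedule 4.0, 2.0, 1.0, ..., mirroring the dose-response framing in which early positions receive the strongest intervention.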

Completion Date

2025

Semester

Summer

Committee Chair

Sukthankar, Gita

Degree

Doctor of Philosophy (Ph.D.)

College

College of Sciences

Department

Computer Science

Format

PDF

Identifier

DP0029528

Language

English

Document Type

Thesis

Campus Location

Orlando (Main) Campus
