A multi-language video game dialogue corpus
Proposal Type
Individual Talk
Location
Hypertexts & Fictions
Start Date
July 2026
End Date
July 2026
Abstract
Keywords: corpus linguistics, video games, translation, localization, sociolinguistics
Video game dialogue has recently gained recognition as a legitimate object of corpus-based academic study (Heritage, 2020). Rennick and Roberts (2023) created the Video Game Dialogue Corpus, a freely searchable online corpus consisting of 6.2 million words of dialogue representing Japanese- and Western-style role-playing genres. However, free online corpora of video game dialogue in languages other than English and Japanese (such as the “FIGS” languages: French, Italian, German, and Spanish) are lacking. Complete English-language scripts for dialogue-heavy games are often relatively easy to obtain (e.g., Final Fantasy VII Rebirth, https://finalfantasy.fandom.com/wiki/Final_Fantasy_VII_Rebirth_script); non-English corpora are comparatively difficult to obtain (requiring, e.g., transcription of YouTube “Let’s Plays”) or are otherwise not freely accessible (such as smaller corpora collected for academic study; e.g., Rivas Ginel & Theroine, 2025).
This project proposes a freely accessible and searchable online multi-language video game dialogue corpus, modeled on the Rennick & Roberts (2023) English corpus. Beyond its value as an online repository for preserving non-English born-digital texts, a multi-language corpus enables comparative analyses that a single-language corpus cannot support, such as examining translations of a game across different languages, comparing multiple translations of a single game into the same language (e.g., official translations vs. fan translations), or examining challenges in localizing into non-English languages (cf. Rivas Ginel & Theroine, 2025, who propose French dialogue alternatives to grammatically gendered speech).
In line with current practices in building spoken corpora (Knight & Adolphs, 2022), this multi-language corpus will include rich metadata in JSON format. This metadata will facilitate not only cross-linguistic analyses but also sociolinguistic analyses within a language, for example through tags for speaker age or for various interlocutor characteristics, in addition to tags for gendered vs. neutral speech (Rennick et al., 2023).
The talk will present an initial prototype demonstrating cross-linguistic analyses of a single game’s English and French localizations.
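To make the JSON metadata concrete, the following is a minimal sketch of what one dialogue record and two simple queries might look like. It is not the project's actual schema: every field name (`line_id`, `speaker_age`, `addressee`, `gender_marking`, `text`), the game title, and the sample lines are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical records: all field names and sample lines are assumptions
# made for illustration, not the corpus's actual schema or data.
RAW = """
[
  {"line_id": "0001", "game": "ExampleGame", "speaker": "A",
   "speaker_age": "adult", "addressee": "B", "gender_marking": "neutral",
   "text": {"en": "Welcome back.", "fr": "Bon retour."}},
  {"line_id": "0002", "game": "ExampleGame", "speaker": "B",
   "speaker_age": "child", "addressee": "A", "gender_marking": "gendered",
   "text": {"en": "You look tired.", "fr": "Tu as l'air fatiguée."}}
]
"""


def load_records(raw):
    """Parse a JSON array of dialogue records into a list of dicts."""
    return json.loads(raw)


def filter_by_tag(records, key, value):
    """Return the records whose metadata field `key` equals `value`."""
    return [r for r in records if r.get(key) == value]


def aligned_pairs(records, src, tgt):
    """Return (source, target) text pairs for lines present in both languages."""
    return [
        (r["text"][src], r["text"][tgt])
        for r in records
        if src in r["text"] and tgt in r["text"]
    ]


records = load_records(RAW)
gendered = filter_by_tag(records, "gender_marking", "gendered")
pairs = aligned_pairs(records, "en", "fr")
```

Storing all language versions of a line under one record keeps translations aligned by construction, so a sociolinguistic filter (e.g., by `gender_marking`) and a cross-linguistic comparison can be run over the same data.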
References:
Heritage, F. (2020). Applying corpus linguistics to videogame data: Exploring the representation of gender in videogames at a lexical level. Game Studies, 20(3). ISSN 1604-7982.
Knight, D., & Adolphs, S. (2022). Building a spoken corpus: what are the basics? In The Routledge Handbook of Corpus Linguistics (pp. 21-34). Routledge.
Rennick, S., Clinton, M., Ioannidou, E., Oh, L., Clooney, C., T., E., Healy, E., & Roberts, S. G. (2023). Gender bias in video game dialogue. Royal Society Open Science, 10, 221095. https://doi.org/10.1098/rsos.221095
Rennick, S., & Roberts, S. G. (2023). The Video Game Dialogue Corpus. Corpora, 19(1), 93-106. https://doi.org/10.3366/cor.2024.0299
Rivas Ginel, M. I., & Theroine, S. (2025). Getting Gender Right in Video Games, Simplifying for Inclusivity. MediAzioni, 47, A143-A156. https://doi.org/10.6092/issn.1974-4382/22591
