This corpus-based, qualitative descriptive study examines the vocabulary in digital science resources for middle school students in the United States. In brief, two corpora, each of approximately 2.7 million tokens, were assembled: the Digital Science Corpus (DSC) and the Digital Fiction Corpus (DFC). The 3,456 digital science resources included in the DSC were selected based on the findings of a detailed survey of 91 U.S.-based middle school teachers. In this study, AntWordProfiler (Anthony, 2021), AntConc (Anthony, 2019), and WordSmith Tools (Scott, 2020) were used to (a) lexically profile the corpus to determine its vocabulary load, (b) lexically profile the corpus to estimate the extent to which combinations of well-known word lists (GSL + AWL + EAP Science List; the top 570 AVL word families; GSL + MSVL for Science) might help students reach the text coverage needed for reasonable comprehension of the texts in the corpus (i.e., lexical coverage), and (c) create a Digital Science List (DSL) that captures the most frequent word types in the corpus. The word types in the DSL were validated against the DFC, a corpus of approximately the same number of tokens as the DSC but drawn from novels. The findings of this study show that the top 570 word families in the AVL (Gardner & Davies, 2014) provide 75% more lexical coverage in the digital corpus than the 570 word families in the older AWL (Coxhead, 2000) (10.07% vs. 5.76%). To reach the 95% coverage threshold that is conventionally deemed to facilitate minimal reading comprehension (Laufer, 2020), middle school (MS) students must recognize the 6,000 most frequent BNC/COCA (Nation, 2012) word families plus proper nouns, or the 11,000 most frequent BNC/COCA word families without proper nouns. 
Furthermore, reaching 98% coverage for optimal reading comprehension of digital science texts requires recognizing the 19,000 most frequent word families in the BNC/COCA plus proper nouns. In contrast, the GSL, AWL, and EAP Science List, with far fewer word families (< 3,000), offer a striking 88.35% lexical coverage across the corpus, while the GSL and the MSVL for Science, with fewer than 2,500 word families, offer a remarkable 87.79% lexical coverage across the corpus. The DSL produced from this research comprises 412 types selected using seven corpus-based and judgment-based criteria. Lexical profiling of the DSL across the DSC revealed that the DSL provides 8.64% lexical coverage. While the DSL can be used as a teaching and learning tool in middle school classrooms, it is especially helpful for second language (L2) learners because it contains 136 general high-frequency types with a specialized meaning (e.g., dating, work). The study addresses methodological, theoretical, and pedagogical implications so that middle school learners can gain better support in their science vocabulary development and achieve better reading comprehension of digital science texts.
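
As a rough illustration of the lexical coverage measure used throughout the abstract, the following Python sketch computes the percentage of running tokens in a text that appear in a word list, and the relative gain between two coverage figures. It is a minimal sketch under stated assumptions, not the study's procedure: the study used AntWordProfiler, AntConc, and WordSmith Tools and counted word families, whereas this example matches surface forms only. The function names, the tokenizer, the sample sentence, and the toy word list are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens.

    Assumption: a naive regex tokenizer, not the tokenization of any
    profiling tool named in the abstract.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_coverage(tokens, word_list):
    """Return the percentage of running tokens found in word_list.

    Assumption: exact surface-form matching; word-family or lemma
    grouping is not modelled here.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(n for word, n in counts.items() if word in word_list)
    return 100.0 * covered / len(tokens)


def relative_gain(new_pct, old_pct):
    """Return how much larger new_pct is than old_pct, in percent."""
    return 100.0 * (new_pct - old_pct) / old_pct


# Hypothetical sample text and toy word list, for illustration only.
sample = "The cell membrane controls what enters the cell and what leaves the cell."
tokens = tokenize(sample)
toy_list = {"the", "cell", "and", "what"}

print(f"Coverage of toy list: {lexical_coverage(tokens, toy_list):.2f}%")

# The two coverage figures reported in the abstract (AVL vs. AWL).
print(f"Relative gain: {relative_gain(10.07, 5.76):.1f}%")
```

Applied to the two figures reported above (10.07% vs. 5.76%), the relative gain comes to just under 75%, consistent with the abstract's rounded "75% more" claim.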



Graduation Date





Folse, Keith


Doctor of Philosophy (Ph.D.)


College of Community Innovation and Education


School of Teacher Education

Degree Program

Education; Teaching English to Speakers of Other Languages Track




CFE0008798; DP0026077





Release Date

December 2026

Length of Campus-only Access

5 years

Access Status

Doctoral Dissertation (Campus-only Access)

Restricted to the UCF community until December 2026; it will then be open access.