UVM Theses and Dissertations

Format: Online
Author: Gray, Tyler John
Dept./Program: Mathematics & Statistics
Year: 2019
Degree: Ph. D.
Abstract:
Written language provides a snapshot of linguistic, cultural, and current-events information for a given time period. Aggregating these snapshots by studying many texts over time reveals trends in the evolution of language, culture, and society. The ever-increasing amount of electronic text, both from the digitization of books and other paper documents and from the increasing frequency with which electronic text is used as a means of communication, has given us an unprecedented opportunity to study these trends. In this dissertation, we use hundreds of thousands of books spanning two centuries scanned by Google, and over 100 billion messages, or 'tweets', posted to the social media platform Twitter over the course of a decade, to study the English language, as well as the evolution of culture and society as inferred from changes in language.

We begin by studying the current state of verb regularization and how it compares between the more formal writing of books and the more colloquial writing of tweets on Twitter. We find that the extent of verb regularization is greater on Twitter, taken as a whole, than in English Fiction books, and also for tweets geotagged in the United States relative to American English books, but the opposite is true for tweets geotagged in the United Kingdom relative to British English books. We also find interesting regional variations in regularization across counties in the United States. However, once differences in population are accounted for, we do not identify strong correlations with socio-demographic variables.

Next, we study stretchable words, a fundamental aspect of spoken language that, until the advent of social media, was rarely observed within written language. We examine the frequency distributions of stretchable words and introduce two central parameters that capture their main characteristics of balance and stretch. We explore their dynamics by creating visual tools we call 'balance plots' and 'spelling trees'. We also discuss how the tools and methods we develop could be used to study mistypings and misspellings, and may have further applications both within and beyond language.

In the last part, we take a closer look at the English Fiction n-gram dataset created by Google. We begin by explaining why using token counts as a proxy for word, or more generally 'n-gram', importance is fundamentally flawed. We then devise a method to rebuild the Google Books corpus so that meaningful linguistic and cultural trends may be reliably discerned. We use book counts as the primary ranking for an n-gram and use subsampling to normalize across time, mitigating the extraneous results created by the underlying exponential increase in data volume over time. We also combine the subsampled data over a number of years as a method of smoothing. We then use these improved methods to study linguistic and cultural evolution across the last two centuries. We examine the dynamics of Zipf distributions for n-grams by measuring the churn of language reflected in the flux of n-grams across rank boundaries. Finally, we examine linguistic change using wordshift plots and a rank divergence measure with a tunable parameter to compare the language of two different time periods. Our results address several methodological shortcomings associated with the raw Google Books data, strengthening the potential for cultural inference from word changes.
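
As a rough illustration of two ideas mentioned in the abstract, ranking n-grams by book count on a fixed-size subsample and measuring flux across a rank boundary, the following Python sketch may help. It is not the dissertation's code: the function names `book_count_ranks` and `rank_flux`, the input format (one token list per book), and the alphabetical tie-breaking rule are all assumptions made for this example.

```python
import random
from collections import Counter


def book_count_ranks(books, sample_size, seed=0):
    """Rank 1-grams by the number of sampled books that contain them.

    `books` is a list of token lists, one per book (hypothetical format).
    """
    rng = random.Random(seed)
    # Subsample a fixed number of books so each period contributes equally,
    # regardless of how much raw text exists for it.
    sample = rng.sample(books, min(sample_size, len(books)))
    counts = Counter()
    for tokens in sample:
        # A book counts at most once per word, however often the word occurs.
        counts.update(set(tokens))
    # Descending book count; ties broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def rank_flux(ranks_a, ranks_b, boundary):
    """Return (entering, leaving) counts across the rank `boundary`."""
    top_a = {w for w, r in ranks_a.items() if r <= boundary}
    top_b = {w for w, r in ranks_b.items() if r <= boundary}
    return len(top_b - top_a), len(top_a - top_b)


# Toy data standing in for two time periods.
period_a = [["the", "cat", "sat"], ["the", "dog"], ["the", "cat"]]
period_b = [["the", "dog", "ran"], ["the", "dog"], ["a", "dog"]]

ranks_a = book_count_ranks(period_a, sample_size=3)
ranks_b = book_count_ranks(period_b, sample_size=3)
flux = rank_flux(ranks_a, ranks_b, boundary=2)
print(flux)  # (1, 1): 'dog' enters the top 2, 'cat' leaves it
```

Counting each word once per book keeps a single long or repetitive volume from dominating the ranking, and sampling the same number of books per period is one simple way to offset the growth in data volume over time that the abstract describes.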