Discrete sequences are the building blocks of many real-world problems in domains including genomics, e-commerce, and the social sciences. Machine learning methods can classify and cluster sequences, but they do not explain what makes groups of sequences distinguishable. A black box model is sufficient in some cases, but research on human behavior needs greater explainability. For example, psychologists are less interested in a model that predicts human behavior with high accuracy than in identifying the differences between actions that lead to divergent behavior.
This dissertation presents techniques for understanding differences between classes of discrete sequences. We applied these techniques to two online collaborative environments: GitHub, a software development platform, and Minecraft, a multiplayer online game. The first approach measures the differences between groups of sequences by comparing k-gram representations of the sequences using the silhouette score, and characterizes those differences by analyzing the distance matrix of subsequences. The second approach, contrast motif discovery, finds subsequences that are significantly more similar to one set of sequences than to the other sets. It first finds a set of motifs for each group of sequences and then refines each set to keep the motifs that distinguish that group from the other groups. Compared to existing methods, our technique is scalable and capable of handling long event sequences.
Our first case study is GitHub, a social coding platform that facilitates distributed, asynchronous collaboration in open source software development. It has an open API for collecting metadata about users, repositories, and the activities of users on repositories. To study the dynamics of teams on GitHub, we focused on the discrete event sequences generated when users perform actions on the platform. Specifically, we studied the effects that automated accounts (bots) have on software development processes and outcomes. We trained black box supervised learning methods to classify sequences of GitHub teams and then used our sequence analysis techniques to measure and characterize differences between the event sequences of teams with bots and teams without bots. The event sequences of teams with bots are relatively distinct from those of teams without bots in terms of the existence and frequency of short subsequences. Moreover, teams with bots have more novel and less repetitive sequences than teams without bots. In addition, we discovered contrast motifs for human-bot and human-only teams. Our analysis of these motifs shows that in human-bot teams discussions are scattered throughout other activities, while in human-only teams discussions tend to cluster together.
For our second case study, we applied our sequence mining approaches to analyze player behavior in Minecraft, a multiplayer online game that supports many forms of player collaboration. As a sandbox game, it gives players a large amount of flexibility in deciding how to complete tasks; this lack of goal orientation makes analyzing Minecraft event sequences more challenging than analyzing event sequences from more structured games. Using our approaches, we were able to measure and characterize differences between the low-level sequences of high-level actions, and despite variability in how different players accomplished the same tasks, we discovered contrast motifs for many player actions. Finally, we explored how the level of player collaboration affects the contrast motifs.
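To make the first approach concrete, here is a minimal sketch of how k-gram representations and a silhouette score can quantify how well two groups of event sequences separate. It is an illustration under stated assumptions, not the dissertation's implementation: the cosine distance, the helper names (`kgram_counts`, `cosine_distance`, `silhouette`), and the toy event names are all hypothetical choices made for this example.

```python
from collections import Counter
from math import sqrt


def kgram_counts(seq, k):
    """Count every contiguous run of k events in a sequence."""
    return Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse k-gram count vectors.

    Assumption: cosine distance is used here only for illustration.
    """
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def silhouette(vectors, labels):
    """Mean silhouette coefficient over all sequences (between -1 and 1)."""
    scores = []
    for i, (v, lab) in enumerate(zip(vectors, labels)):
        # Collect distances from sequence i to every other sequence, by group.
        dists = {}
        for j, (w, other) in enumerate(zip(vectors, labels)):
            if i != j:
                dists.setdefault(other, []).append(cosine_distance(v, w))
        if lab not in dists or len(dists) < 2:
            scores.append(0.0)  # singleton group, or only one group present
            continue
        a = sum(dists[lab]) / len(dists[lab])  # mean distance to own group
        b = min(sum(d) / len(d) for g, d in dists.items() if g != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two "bot" teams alternate events, two "human" teams
# cluster them. Event names are made up for this example.
sequences = [
    ["push", "comment", "push", "comment"],
    ["push", "comment", "push", "comment", "push"],
    ["comment", "comment", "push", "push"],
    ["comment", "comment", "comment", "push", "push"],
]
labels = ["bot", "bot", "human", "human"]

vectors = [kgram_counts(s, k=2) for s in sequences]
score = silhouette(vectors, labels)
print(f"silhouette score for k=2: {score:.2f}")
```

A score near 1 means the sequences within each group share k-grams that the other group lacks; a score near 0 or below means the grouping does not correspond to differences in short subsequences.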
Doctor of Philosophy (Ph.D.)
College of Engineering and Computer Science
Doctoral Dissertation (Open Access)
Saadat, Samaneh, "Analyzing User Behavior in Collaborative Environments" (2020). Electronic Theses and Dissertations, 2020-. 411.