word2vec Workshop – May 30, 2017


The “word embedding” approach to modeling semantic relations in language, especially in large collections of texts, is one of the newest and most exciting applications of computational methods to the study of culture as represented in texts. Word embedding is a so-called shallow “neural network” approach that mathematically models the semantic relations between words in a corpus of texts, maps those relations onto a visualizable graphical “space,” and makes it possible to understand the intricate relations of analogy and opposition that are the deep logic of a language and the culture it expresses. In an often-used example, word embedding is able to derive by algorithmic means such analogies as the following from a corpus of texts: “King” is to “man” as “Queen” is to “woman.” In essence, word embedding holds out the tantalizing promise of showing the relations of conceptual similarity and difference in language that can illuminate how a society “thinks” about things through its discourse.
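The analogy can be made concrete with a little vector arithmetic. The sketch below is a minimal illustration, not the workshop's own code: the three-dimensional vectors are hand-picked assumptions (a trained model learns hundreds of dimensions from a corpus), and the names `toy_vectors`, `cosine`, and `complete_analogy` are hypothetical.

```python
import math

# Hypothetical, hand-picked vectors for illustration only.
# The three dimensions loosely stand for: royalty, masculinity, femininity.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def complete_analogy(a, b, c, vectors):
    """Find d such that 'a is to b as d is to c', using a - b + c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words so the answer is a new word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# "King" is to "man" as ? is to "woman"
print(complete_analogy("king", "man", "woman", toy_vectors))  # queen
```

Subtracting “man” from “king” and adding “woman” lands closest to “queen” in this toy space. In a real model the same arithmetic is applied to vectors learned from the corpus, and the result depends on what the texts contain.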

Currently, “Word2vec” is the methodology that is most often used to produce word embeddings. A word vector provides “a spatial analogy to relationships between words” (Ben Schmidt), offering new ways to represent relationships within a corpus of texts and to understand the formation and operation of discourses.

Teddy Roland is a Ph.D. student in the UCSB English Dept. Formerly, he was an M.A. student at the University of Chicago and a research assistant in the Chicago Text Lab. He is also a past and present lecturer in the UC Berkeley Digital Humanities initiative’s Data Science program and summer institute. He is one of UCSB’s experts in word embedding, and has given workshops on word embedding in international venues.

Using word2vec, this workshop will demonstrate how both to create and to interpret word embedding models. By spatially representing textual relations, word2vec presents an opportunity to see connections across heterogeneous corpora anew, to generate statistical distributions of keywords, terms, and themes among a cohort of texts, and to chart as-yet-undiscovered word relations with predictive modeling.
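For a rough sense of what “creating” and “interpreting” a model involve, the sketch below builds word vectors from a four-sentence toy corpus and then asks which word sits nearest to “king.” It is a simplified, count-based stand-in and not word2vec itself: word2vec learns its vectors by training a shallow neural network, whereas this example only counts which words appear near each other. The corpus and the names `build_cooccurrence_vectors` and `most_similar` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# A hypothetical toy corpus; a real project would use thousands of texts.
corpus = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the cat chases the mouse",
    "the dog chases the cat",
]


def build_cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words found within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(word, vectors, topn=3):
    """Rank every other word by how similar its contexts are to `word`'s."""
    scores = [(other, cosine(vectors[word], vectors[other]))
              for other in vectors if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


vectors = build_cooccurrence_vectors(corpus)
for neighbor, score in most_similar("king", vectors):
    print(neighbor, round(score, 3))
```

Because “king” and “queen” occur in identical contexts in this corpus, “queen” comes out as the nearest neighbor. Interpreting a real model follows the same pattern at scale: query a word, inspect its neighbors, and ask what the corpus is doing to put them close together.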

Important note: this workshop assumes no prior experience with coding.

For participants with Python experience who wish to follow along and experiment on their own machines, the Jupyter Notebook and resources for the workshop can be downloaded here: https://github.com/teddyroland/BBB-Word2Vec/archive/master.zip

Note that all of the required packages are pre-installed with the Anaconda platform for Python 3.6 (https://www.continuum.io/downloads).