Topic Modeling in the Humanities

The cat is out of the bag: The Journal of Digital Humanities (2:1), special issue on topic modeling, has been released. It’s a fairly apt phrase, because the process of editing the issue felt a bit like stuffing a cat in a bag. When Elijah Meeks approached the JDH editors about him and me guest editing an issue on topic modeling, I don’t think either of us quite realized what that would entail. This post is not about the issue or its contents; Elijah and I already wrote that introduction, where we trace the history of topic modeling in the humanities and frame the articles in the issue. Instead, I’d like to spend a short post being a bit more reflexive than is usual for this blog, discussing my first experience guest editing a journal and how it all came together. Elijah’s similar post can be found here.

We began with the idea that topic modeling’s relationship to the humanities was just then reaching an important historical moment. Discussions were fast-paced, interesting, and spread across a wide array of media. Better still, humanists were contributing to the understanding of a machine learning algorithm! If that isn’t exciting to you, then… well, you’re probably a normal, well-functioning human being. But we found it exciting, and we thought the JDH, with its catch-the-good post-publication model, would be the perfect place to bring it all together. We quickly realized the difficulty of stuffing the DH/topic modeling cat into the JDH bag.

Firstly, there was just so much of it out there. Discussions meandered between Twitter and blogs and conferences; no snapshot of the conversation could ever be fully inclusive. We threw around a bunch of ideas, including a 20-person Google+ Hangout panel discussing the benefits and pitfalls of the approach, but most of our ideas proved fairly untenable. Help came from the editors of the JDH, particularly Joan Fragaszy Troyano, who tirelessly worked with us and helped us get everything organized, while allowing us the freedom to take the issue where we wanted it to go. She also helped us set up something new to the journal: a space aggregating tweets and comments about the issue in the month following its release, which Elijah and I will put together and release as a community appendix in May, hoping to capture some of the rich interchange on topic modeling.

Topic model from Graham & Milligan’s review of MALLET.

One particularly troublesome difficulty, which we never resolved to our liking, was one of gender and representation. It has been pointed out before that the JDH was not as diverse or gender-balanced as we might want it to be, despite most of its staff being women. The editors have pointed out that DH is unfortunately homogeneous, and have worked to increase representation in their issues. Even after realizing the homogeneity in our issue (only two of our initially selected contributors were women, and all were white), we were unable to find other authors who both fit within the theme of the issue and were interested in contributing. I’m certain we must have missed someone crucial, for which I humbly apologize, but I honestly don’t know the best way to remedy this situation. Others have spoken much more eloquently on the subject and have had much better ideas than I ever could. If we had more time and space in the issue, diversity is the one area I would hope to improve.

Once the contributors were selected, the process of getting everything perfect began. Some articles, like Goldstone and Underwood’s piece on topic modeling the PMLA, were complete enough that we were happy to put them up as-is. One of our contributors was a bit worried, due to the post-publication process and the lack of standard peer review, that this was more akin to a vanity press than a scholarly publication. I disagree (and hopefully we convinced the contributor to disagree as well); the JDH has several layers of peer review, as the editors and the DH community filter the best available pieces through increasingly fine steps, until the selected articles represent the best of what was recently and publicly released. The pieces then went through a rigorous review process with the editorial staff. The original and greatly expanded posts, in particular, went through several iterations over a matter of months so they would fit as well as possible, and be the best they could be. Because of this process we fell a bit behind schedule, but the resulting quality made the delays worth it.

I cannot stress enough how supportive the JDH editorial staff has been in making this issue work, particularly Joan, who helped Elijah and me figure out what we were doing and nudged us when we needed to be nudged, which was more frequently than I like admitting. I hope you all like the issue as much as we do, and will contribute to the conversation on Twitter or in blogs. If you post anything about the issue, just share a link in a tweet or comment and we’ll be sure to include you in the appendix.

Happy modeling!

p.s. I am sad that my favorite line of my and Elijah’s editorial was edited, though it was for good reason. The end of the first paragraph now reads “Were a critic of digital humanities to dream up the worst stereotype of the field, he or she would likely create something very much like this, and then name a popular implementation of it after a hammer.” The line (written by Elijah) originally read “Were Stanley Fish [emphasis added] to dream up the worst stereotype of the field, he would likely create something very much like this, and then name a popular implementation of it after a hammer.” The new version is more understandable to a wider audience, but I know some of my readers will appreciate this one more.

Topic nets

I’m sorry. I love you (you know who you are, all of you). I really do. I love your work, I think it’s groundbreaking and transformative, but the network scientist / statistician in me twitches uncontrollably whenever he sees someone creating a network out of a topic model by picking the top topics associated with each document and using those as edges in a topic-document network. This is meant to be a short methodology post for people already familiar with LDA and already analyzing the networks it produces, so I won’t bend over backwards trying to re-explain networks and topic modeling. Most of my posts are written assuming no expert knowledge, so I apologize if, in the interest of brevity, this one isn’t immediately accessible.

MALLET, the go-to tool for topic modeling with LDA, outputs a comma-separated file where each row represents a document and each pair of columns is a topic that document is associated with. The output looks something like

        Topic 1 | Topic 2 | Topic 3  | ...
Doc 1 | 0.5 , 1 | 0.2 , 5 | 0.1  , 2 | ...
Doc 2 | 0.4 , 6 | 0.3 , 1 | 0.06 , 3 | ...
Doc 3 | 0.6 , 2 | 0.4 , 3 | 0.2  , 1 | ...
Doc 4 | 0.5 , 5 | 0.3 , 2 | 0.01 , 6 | ...

Each pair is the amount a document is associated with a certain topic followed by the topic of that association. Given a list like this, it’s pretty easy to generate a bimodal/bipartite network (a network of two types of nodes) where one variety of node is the document, and another variety of node is a topic. You connect each document to the top three (or n) topics associated with that document and, voila, a network!
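To make that concrete, here is a minimal sketch of the top-n approach in Python with networkx (the doc_topics dictionary is hypothetical, standing in for whatever you parse out of MALLET’s file):

# A sketch of the top-n bipartite approach: each document is linked to its
# n highest-weighted topics, and everything else is thrown away.
import networkx as nx

doc_topics = {
    "Doc1": [(0.5, 1), (0.2, 5), (0.1, 2)],   # (weight, topic) pairs,
    "Doc2": [(0.4, 6), (0.3, 1), (0.06, 3)],  # sorted by weight
}

n = 3
G = nx.Graph()
for doc, pairs in doc_topics.items():
    G.add_node(doc, kind="document")
    for weight, topic in pairs[:n]:
        G.add_node("Topic %d" % topic, kind="topic")
        G.add_edge(doc, "Topic %d" % topic, weight=weight)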

The problem here isn’t that a giant chunk of the data is just being thrown away (although there are more elegant ways to handle that too), but the way in which a portion of the data is kept. By using the top-n approach, you lose the rich topic-weight data that shows how some documents are really only closely associated with one or two topics, whereas others are closely associated with many. In practice, the network graph generated by this approach will severely skew the results, artificially connecting documents which are topical outliers toward the center of the graph, and preventing documents in the topical core from being represented as such.

In order to account for this skewing, an equally simple (and equally arbitrary) approach can be taken whereby you only keep connections whose weight exceeds some threshold m (0.2, say). Now some documents are related to one or two topics and some are related to several, which more accurately represents the data and doesn’t artificially skew network measurements like centrality.

The real trouble comes when a top-n topic network is converted from a bimodal to a unimodal network, where you connect documents to one another based on the topics they share. That is, if Document 1 and Document 4 are both connected to Topics 4, 2, and 7, they get a connection to each other of weight 3 (if they were only connected to 2 of the same topics, they’d get a connection of weight 2, and so forth). In this situation, the resulting network will be as much an artifact of the choice of n as of the underlying document similarity network. If you choose different values of n, you’ll often get very different results.

Bimodal to unimodal network.

In this case, the solution is to treat every document as a vector of topics with associated weights, making sure to use all the topics, such that you’d have a list that looks somewhat like the original topic CSV, except this time ordered by topic number rather than individually for each document by topic weight.

      T1, T2, T3,...
Doc4(0.2,0.3,0.1,...)
Doc5(0.6,0.2,0.1,...)
...

From here you can use your favorite correlation or distance finding algorithm (cosine similarity, for example) to find the distance from every document to every other document. Whatever you use, you’ll come up with a (generally) symmetric matrix from every document to every other document, looking a bit like this.

      Doc1|Doc2|Doc3,...
Doc1  1   |0.3 |0.1
Doc2  0.3 |1   |0.4 
Doc3  0.1 |0.4 |1
...

If you chop off the bottom left or top right triangle of the matrix, you now have a network of document similarity which takes the entire topic model into account, not just the first few topics. From here you can set whatever arbitrary m thresholds seem legitimate to visually represent the network in an uncluttered way, for example only showing documents that are more than 50% topically similar to one another, while still being sure that the entire richness of the underlying topic model is preserved, not just the first handful of topical associations.
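Here is a rough sketch of that full-vector approach (the labels and weights are invented, and scikit-learn’s cosine similarity is just one reasonable choice of correlation or distance measure):

# Every document is a vector over *all* topics; documents are compared
# with cosine similarity; only the visualization threshold m is arbitrary.
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Doc4", "Doc5", "Doc6"]
vectors = np.array([[0.2, 0.3, 0.1, 0.4],   # one row per document,
                    [0.6, 0.2, 0.1, 0.1],   # one column per topic
                    [0.1, 0.1, 0.7, 0.1]])

sim = cosine_similarity(vectors)            # symmetric doc-by-doc matrix

m = 0.5                                     # threshold for drawing an edge
G = nx.Graph()
G.add_nodes_from(docs)
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):       # upper triangle only
        if sim[i, j] > m:
            G.add_edge(docs[i], docs[j], weight=sim[i, j])

The threshold m only affects what gets drawn; the underlying similarity scores are still computed from the entire model.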

Of course, whether this method is any more useful than something like LSA in clustering documents is debatable, but I just had to throw my 2¢ in the ring regarding topical networks. Hope it’s useful.

Topic Modeling for Humanists: A Guided Tour

It’s that time again! Somebody else posted a really clear and enlightening description of topic modeling on the internet. This time it was Allen Riddell, and it’s so good that it inspired me to write this post about topic modeling that includes no actual new information, but combines a lot of old information in a way that will hopefully be useful. If there’s anything I’ve missed, by all means let me know and I’ll update accordingly.

Introducing Topic Modeling

Topic models represent a class of computer programs that automagically extract topics from texts. What a topic actually is will be revealed shortly, but the crux of the matter is that if I feed the computer, say, the last few speeches of President Barack Obama, it’ll come back telling me that the president mainly talks about the economy, jobs, the Middle East, the upcoming election, and so forth. It’s a fairly clever and exceptionally versatile little algorithm that can be customized to all sorts of applications, and a tool that many digital humanists would do well to have in their toolbox.

From the outset it’s worth clarifying some vocabulary, and mentioning what topic models can and cannot do. “LDA” and “topic model” are often thrown around synonymously, but LDA is actually a special case of topic modeling in general, produced by David Blei and friends in 2002. It was not the first topic modeling tool, but it is by far the most popular, and has enjoyed copious extensions and revisions in the years since. The myriad variations of topic modeling have resulted in an alphabet soup of names that might be confusing or overwhelming to the uninitiated; ignore them for now. They all pretty much work the same way.

When you run your text through a standard topic modeling tool, what comes out the other end first is several lists of words. Each of these lists is supposed to be a “topic.” Using the example from before of presidential addresses, the list might look like:

  1. Job Jobs Loss Unemployment Growth
  2. Economy Sector Economics Stock Banks
  3. Afghanistan War Troops Middle-East Taliban Terror
  4. Election Romney Upcoming President
  5. … etc.

The computer gets a bunch of texts and spits out several lists of words, and we are meant to think those lists represent the relevant “topics” of a corpus. The algorithm is constrained by the words used in the text; if Freudian psychoanalysis is your thing, and you feed the algorithm a transcription of your dream of bear-fights and big caves, the algorithm will tell you nothing about your father and your mother; it’ll only tell you things about bears and caves. It’s all text and no subtext. Ultimately, LDA is an attempt to inject semantic meaning into vocabulary; it’s a bridge, and often a helpful one. Many dangers face those who use this bridge without fully understanding it, which is exactly what the rest of this post will help you avoid.

Network generated by Elijah Meeks to show how digital humanities documents relate to one another via the topics they share.

Learning About Topic Modeling

The pathways to topic modeling are many and more, and those with different backgrounds and different expertise will start at different places. This guide is for those who’ve started out in traditional humanities disciplines and have little background in programming or statistics, although the path becomes more strenuous as we get closer to Blei’s original paper on LDA (as that is our goal). I will try to point to relevant training assistance where appropriate. A lot of the following posts repeat information, but there are often little gems in each which make them all worth reading.

No Experience Necessary

The following posts, read in order, should be completely understandable to pretty much everyone.

The Fable

Perhaps the most interesting place to start is the stylized account of topic modeling by Matt Jockers, who weaves a tale of authors sitting around the LDA buffet, taking from it topics with which to write their novels. According to Jockers, the story begins in a quaint town, . . .

somewhere in New England perhaps. The town is a writer’s retreat, a place they come in the summer months to seek inspiration. Melville is there, Hemingway, Joyce, and Jane Austen just fresh from across the pond. In this mythical town there is a spot popular among the inhabitants; it is a little place called the “LDA Buffet.” Sooner or later all the writers go there to find themes for their novels. . .

The blog post is a fun read, and gets at the general idea behind the process of a topic model without delving into any of the math involved. Start here if you are a humanist who’s never had the chance to interact with topic models.

A Short Overview

Clay Templeton over at MITH wrote a short, less-stylized overview of topic modeling which does a good job discussing the trio of issues currently of importance: the process of the model, the software itself, and applications in the humanities.

In this post I map out a basic genealogy of topic modeling in the humanities, from the highly cited paper that first articulated Latent Dirichlet Allocation (LDA) to recent work at MITH.

Templeton’s piece is concise, to the point, and offers good examples of topic models used for applications you’ll actually care about. It won’t tell you any more about the process of topic modeling than Jockers’ article did, but it’ll get you further into the world of topic modeling as it is applied in the humanities.

An Example: The American Political Science Review

Now that you know the basics of what a topic model actually is, perhaps the best thing is to look at an actual example to ground these abstract concepts. David Blei’s team shoved all of the journal articles from The American Political Science Review into a topic model, resulting in a list of 20 topics that represent the content of that journal. Click around on the page; when you click one of the topics, it sends you to a page listing many of the words in that topic, and many of the documents associated with it. When you click on one of the document titles, you’ll get a list of topics related to that document, as well as a list of other documents that share similar topics.

This page is indicative of the sort of output topic modeling will yield on a corpus. It is a simple and powerful tool, but notice that none of the automated topics have labels associated with them. The model requires us to make meaning out of them; they require interpretation, and without fully understanding the underlying algorithm, one cannot hope to properly interpret the results.

First Foray into Formal Description

Written by yours truly, this next description of topic modeling begins to get into the formal process the computer goes through to create the topic model, rather than simply the conceptual process behind it. The blog post begins with a discussion of the predecessors to LDA in an attempt to show a simplified version of how LDA works, and then uses those examples to show what LDA does differently. There’s no math or programming, but the post does attempt to bring up the relevant vocabulary and define it in terms familiar to those without programming experience.

With this matrix, LSA uses singular value decomposition to figure out how each word is related to every other word. Basically, the more often words are used together within a document, the more related they are to one another. It’s worth noting that a “document” is defined somewhat flexibly. For example, we can call every paragraph in a book its own “document,” and run LSA over the individual paragraphs.

Only the first half of this post is relevant to our topic modeling guided tour. The second half, a section on topic modeling and network analysis, discusses various extended uses that are best left for later.

Computational Process

Ted Underwood provides the next step in understanding what the computer goes through when topic modeling a text.

. . . it’s a long step up from those posts to the computer-science articles that explain “Latent Dirichlet Allocation” mathematically. My goal in this post is to provide a bridge between those two levels of difficulty.

Computer scientists make LDA seem complicated because they care about proving that their algorithms work. And the proof is indeed brain-squashingly hard. But the practice of topic modeling makes good sense on its own, without proof, and does not require you to spend even a second thinking about “Dirichlet distributions.” When the math is approached in a practical way, I think humanists will find it easy, intuitive, and empowering. This post focuses on LDA as shorthand for a broader family of “probabilistic” techniques. I’m going to ask how they work, what they’re for, and what their limits are.

His is the first post that talks in any detail about the iterative process going into algorithms like LDA, as well as some of the assumptions those algorithms make. He also shows the first formula appearing in this guided tour, although those uncomfortable with formulas need not fret. The formula is not essential to understanding the post, but for those curious, later posts will explicate it. And really, Underwood does a great job of explaining a bit about it there.

Be sure to read to the very end of the post. It discusses some of the important limitations of topic modeling, and trepidations that humanists would be wise to heed.  He also recommends reading Blei’s recent article on Probabilistic Topic Models, which will be coming up shortly in this tour.

Computational Process From Another Angle

It may not matter whether you read this or the last article by Underwood first; they’re both first passes at what the computer goes through to generate topics, and they explain the process in slightly different ways. The highlight of Edwin Chen’s blog post is his section on “Learning,” followed by a section expanding on that concept.

And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability).

This post both explains the meaning of these statistical notations, and tries to actually step the reader through the process using a metaphor as an example, a bit like Jockers’ post from earlier but more closely resembling what the computer is going through. It’s also worth reading through the comments on this post if there are parts that are difficult to understand.
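For readers who happen to know a little Python, the passage above can be made concrete with a heavily simplified sketch of that resampling step (feel free to skip it; the count structures are hypothetical, small smoothing terms are added so no probability is ever exactly zero, and a real sampler also removes the current token’s assignment before recomputing and sweeps the corpus many times):

# Pick a new topic for one word token with probability proportional to
# p(topic | document) * p(word | topic), as in Chen's description.
import random

def resample_topic(word, doc, doc_topic_counts, topic_word_counts,
                   topic_totals, doc_length, num_topics,
                   alpha=0.1, beta=0.01, vocab_size=10000):
    weights = []
    for t in range(num_topics):
        # proportion of words in this document currently assigned to topic t
        p_topic_given_doc = (doc_topic_counts[doc][t] + alpha) / \
                            (doc_length + num_topics * alpha)
        # proportion of assignments to topic t, across all documents,
        # that come from this particular word
        p_word_given_topic = (topic_word_counts[t][word] + beta) / \
                             (topic_totals[t] + vocab_size * beta)
        weights.append(p_topic_given_doc * p_word_given_topic)
    return random.choices(range(num_topics), weights=weights, k=1)[0]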

This ends the list of articles and posts that require pretty much no prior knowledge. Reading all of these should give you a great overview of topic modeling, but you should by no means stop here. The following section requires a very little bit of familiarity with statistical notation, most of which can be found at this Wikipedia article on Bayesian Statistics.

Some Experience Required

Not much experience! You can probably ignore most of the formulae in these posts and still get a great deal out of them. Still, you’ll get the most out of the following articles if you can read signs related to probability and summation, both of which are fairly easy to look up on Wikipedia. The dirty little secret of most papers that include statistics is that you don’t actually need to understand all of the formulae to get the gist of the article. If you want to fully understand everything below, however, I’d highly suggest taking an introductory course or reading a textbook on Bayesian statistics. I second Allen Riddell in suggesting Hoff’s A First Course in Bayesian Statistical Methods (2009), Kruschke’s Doing Bayesian Data Analysis (2010), or Lee’s Bayesian Statistics: An Introduction (2004). My own favorite is Kruschke’s; there are puppies on the cover.

Return to Blei

David Blei co-wrote the original LDA article, and his descriptions are always informative. He recently published a great introduction to probabilistic topic models for those not terribly familiar with them, and although it has a few formulae, it is the fullest computational description of the algorithm, gives a brief overview of Bayesian statistics, and provides a great framework with which to read the following posts in this series. Of particular interest are the sections on “LDA and Probabilistic Models” and “Posterior Computation for LDA.”

LDA and other topic models are part of the larger field of probabilistic modeling. In generative probabilistic modeling, we treat our data as arising from a generative process that includes hidden variables. This generative process defines a joint probability distribution over both the observed and hidden random variables. We perform data analysis by using that joint distribution to compute the conditional distribution of the hidden variables given the observed variables. This conditional distribution is also called the posterior distribution.

Really, read this first. Even if you don’t understand all of it, it will make the following reads easier to understand.

Back to Basics

The post that inspired this one, by Allen Riddell, explains the mixture of unigrams model rather than the LDA model, which allows Riddell to back up and explain some important concepts. The intended audience is those with an introductory background in Bayesian statistics, but the post offers a lot even to those without one. Of particular interest is the concrete example he uses, articles from German Studies journals, and how he actually walks you through the updating procedure of the algorithm as it infers topic and document distributions.

The second move swaps the position of our ignorance. Now we guess which documents are associated with which topics, making the assumption that we know both the makeup of each topic distribution and the overall prevalence of topics in the corpus. If we continue with our example from the previous paragraph, in which we had guessed that “literary” was more strongly associated with topic two than topic one, we would likely guess that the seventh article, with ten occurrences of the word “literary”, is probably associated with topic two rather than topic one (of course we will consider all the words, not just “literary”). This would change our topic assignment vector to z=(1,1,1,1,1,1,2,1,1,1,2,2,2,2,2,2,2,2,2,2). We take each article in turn and guess a new topic assignment (in many cases it will keep its existing assignment).

The last section, discussing the choice of number of topics, is not essential reading but is really useful for those who want to delve further.

Some Necessary Concepts in Text Mining

Both a case study and a helpful description, David Mimno’s recent article on Computational Historiography from ACM Transactions on Computational Logic goes through a hundred years of Classics journals to learn something about the field (very similar to Riddell’s article on German Studies). While the article should be read as a good example of topic modeling in the wild, of specific interest to this guide is his “Methods” section, which includes an important discussion about preparing text for this sort of analysis.

In order for computational methods to be applied to text collections, it is first necessary to represent text in a way that is understandable to the computer. The fundamental unit of text is the word, which we here define as a sequence of (unicode) letter characters. It is important to distinguish two uses of word: a word type is a distinct sequence of characters, equivalent to a dictionary headword or lemma; while a word token is a specific instance of a word type in a document. For example, the string “dog cat dog” contains three tokens, but only two types (dog and cat).

What follows is a description of the primitive objects of a text analysis, and how to deal with variations in words, spelling, various languages, and so forth. Mimno also discusses smoothed distributions and word distance, both important concepts when dealing with these sorts of analyses.
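In code, Mimno’s type/token distinction amounts to this:

# "dog cat dog" contains three tokens but only two types.
text = "dog cat dog"
tokens = text.split()           # ['dog', 'cat', 'dog']
types = set(tokens)             # {'dog', 'cat'}
print(len(tokens), len(types))  # 3 2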

Further Reading

By now, those who managed to get through all of this can probably understand most of the original LDA paper by Blei, Ng, and Jordan (most of it will be review!), but there’s a lot more out there than that original article. Mimno has a wonderful bibliography of topic modeling articles, and they’re tagged by topic to make finding the right one for a particular application that much easier.

Applications: How To Actually Do This Yourself

David Blei’s website on topic modeling has a list of available software, as does a section of Mimno’s Bibliography. Unfortunately, almost everything in those lists requires some knowledge of programming, and as yet I know of no really simple implementation of topic modeling. There are a few implementations for humanists that are supposed to be released soon, but to my knowledge, at the time of this writing the simplest tool to run your text through is called MALLET.

MALLET is a tool that does require a bit of comfort with the command line, though it’s really just the same four commands or so over and over again. It’s fairly simple software to run once you’ve gotten the hang of it, but that first part of the learning curve can be a bit more like a learning cliff.

On their website, MALLET has a link called “Tutorial” – don’t click it. Instead, after downloading and installing the software, follow the directions on the “Importing Data” page. Then, follow the directions on the “Topic Modeling” page. If you’re a Windows user, Shawn Graham, Ian Milligan, and I wrote a tutorial on how to get it running when you run into a problem (and if this is your first time, you will), and it also includes directions for Macs. Unfortunately, a more detailed tutorial is beyond the scope of this tour, but between these links you’ve got a good chance of getting your first topic model up and running.

Examples in the DH World

There are a lot of examples of topic modeling out there, and here are some that I feel are representative of the various uses it can be put to. I’ve already mentioned David Mimno’s computational historiography of classics journals, as well as Allen Riddell’s similar study of German Studies publications. Both papers are good examples of using topic modeling as a meta-analysis of a discipline. Turning the gaze towards our collective navels, Matt Jockers used LDA to find what’s hot in the Digital Humanities, and Elijah Meeks has a great process piece looking at topics in definitions of digital humanities and humanities computing.

Lisa Rhody has an interesting exploratory topical analysis of poetry, and Rob Nelson likewise discusses (briefly) making an argument via topic modeling applied to poetry, which he expands on in this New York Times blog post. Continuing in the literary vein, Ted Underwood talks a bit about the relationship of words to topics, as well as a curious find linking topic models and family relations.

One of the great and oft-cited examples of topic modeling in the humanities is Rob Nelson’s Mining the Dispatch, which looks at the changing discussion during the American Civil War through an analysis of primary texts. Just as Nelson looks at changing topics in the news over time, so too do Newman and Block in an analysis of eighteenth-century newspapers, as do Yang, Torget, and Mihalcea in a more general look at topic modeling and newspapers. In another application using primary texts, Cameron Blevins uses MALLET to run an in-depth analysis of an eighteenth-century diary.

Future Directions

This is not actually another section of the post. This is your conscience telling you to go try topic modeling for yourself.

Topic Modeling and Network Analysis

According to Google Scholar, David Blei’s first topic modeling paper has received 3,540 citations since 2003. Everybody’s talking about topic models. Seriously, I’m afraid of visiting my parents this Hanukkah and hearing them ask “Scott… what’s this topic modeling I keep hearing all about?” They’re powerful, widely applicable, easy to use, and difficult to understand — a dangerous combination.

Since shortly after Blei’s first publication, researchers have been looking into the interplay between networks and topic models. This post will be about that interplay, looking at how they’ve been combined, what sorts of research those combinations can drive, and a few pitfalls to watch out for. I’ll bracket the big elephant in the room, whether these sorts of models capture the semantic meaning for which they’re often used, until a later discussion. This post also attempts to introduce topic modeling to those not yet aware of its potential.

Citations to Blei (2003) from ISI Web of Science. There are even two citations already from 2012; where can I get my time machine?

A brief history of topic modeling

In my recent post on IU’s awesome alchemy project, I briefly mentioned Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) during the discussion of topic models. They’re intimately related, though LSA has been around for quite a bit longer. Without getting into too much technical detail, we should start with a brief history of LSA/LDA.

The story starts, more or less, with a tf-idf matrix. Basically, tf-idf ranks words based on how important they are to a document within a larger corpus. Let’s say we want a list of the most important words for each article in an encyclopedia.

Our first pass is simple: for each article, just attach a list of words sorted by how frequently they’re used. The problem with this is immediately obvious to anyone who has looked at word frequencies; the top words in the entry on the History of Computing would be “the,” “and,” “is,” and so forth, rather than “turing,” “computer,” “machines,” etc. The problem is solved by tf-idf, which scores words based on how special they are to a particular document within the larger corpus. “Turing” is rarely used elsewhere, but used exceptionally frequently in our computer history article, so it bubbles up to the top.
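If you want to see the mechanism rather than take my word for it, here is a tiny sketch using scikit-learn’s tf-idf implementation (the toy “articles” are invented):

# Words appearing in every article get the lowest possible idf weight,
# which drags their scores down; words unique to one article get the highest.
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [
    "the turing machine and the history of computing",
    "the history of cheese and the cows that make it",
    "the history of saucers and the people who see them",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(articles)   # articles x vocabulary matrix

for word, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print("%-10s %.2f" % (word, idf))        # "the" and "history" score low,
                                             # "turing" and "cheese" score high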

LSA and pLSA

LSA utilizes these tf-idf scores 1 within a larger term-document matrix. Every word in the corpus is a different row in the matrix, each document has its own column, and the tf-idf score lies at the intersection of every document and word. Our computing history document will probably have a lot of zeroes next to words like “cow,” “shakespeare,” and “saucer,” and high marks next to words like “computation,” “artificial,” and “digital.” This is called a sparse matrix because it’s mostly filled with zeroes; most documents use very few words related to the entire corpus.

With this matrix, LSA uses singular value decomposition to figure out how each word is related to every other word. Basically, the more often words are used together within a document, the more related they are to one another. 2 It’s worth noting that a “document” is defined somewhat flexibly. For example, we can call every paragraph in a book its own “document,” and run LSA over the individual paragraphs.
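As a toy sketch of that step, here is scikit-learn’s truncated SVD run over the hypothetical tfidf matrix from the previous sketch:

# Reduce the documents x words matrix to a couple of latent dimensions;
# documents that use words together end up close in that reduced space.
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

svd = TruncatedSVD(n_components=2)
doc_vectors = svd.fit_transform(tfidf)     # one dense 2-d vector per document
print(cosine_similarity(doc_vectors))      # document-document relatedness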

To get an idea of the sort of fantastic outputs you can get with LSA, do check out the implementation over at The Chymistry of Isaac Newton.

Newton Project LSA

The method was significantly improved by Puzicha and Hofmann (1999), who did away with the linear algebra approach of LSA in favor of a more statistically sound probabilistic model, called probabilistic latent semantic analysis (pLSA). Now is the part of the blog post where I start getting hand-wavy, because explaining the math is more trouble than I care to take on in this introduction.

Essentially, pLSA imagines an additional layer between words and documents: topics. What if every document isn’t just a set of words, but a set of topics? In this model, our encyclopedia article about computing history might be drawn from several topics. It primarily draws from the big platonic computing topic in the sky, but it also draws from the topics of history, cryptography, lambda calculus, and all sorts of other topics to a greater or lesser degree.

Now, these topics don’t actually exist anywhere. Nobody sat down with the encyclopedia, read every entry, and decided to come up with the 200 topics from which every article draws. pLSA infers topics based on what will hereafter be referred to as black magic. Using the dark arts, pLSA “discovers” a bunch of topics, attaches them to a list of words, and classifies the documents based on those topics.

LDA

Blei et al. (2003) vastly improved upon this idea by turning it into a generative model of documents, calling the model Latent Dirichlet allocation (LDA). By this time, as well, some sounder assumptions were being made about the distribution of words and document length — but we won’t get into that. What’s important here is the generative model.

Imagine you wanted to write a new encyclopedia entry, let’s say about digital humanities. Well, we now know there are three elements that make up that process, right? Words, topics, and documents. Using these elements, how would you go about writing this new article on digital humanities?

First off, let’s figure out what topics our article will consist of. It probably draws heavily from topics about history, digitization, text analysis, and so forth. It also probably draws more weakly from a slew of other topics, concerning interdisciplinarity, the academy, and all sorts of other subjects. Let’s go a bit further and assign weights to these topics; 22% of the document will be about digitization, 19% about history, 5% about the academy, and so on. Okay, the first step is done!

Now it’s time to pull out the topics and start writing. It’s an easy process; each topic is a bag filled with words. Lots of words. All sorts of words. Let’s look in the “digitization” topic bag. It includes words like “israel” and “cheese” and “favoritism,” but they only appear once or twice, and mostly by accident. More importantly, the bag also contains 157 appearances of the word “TEI,” 210 of “OCR,” and 73 of “scanner.”

LDA Model from Blei (2011)

So here you are, you’ve dragged out your digitization bag and your history bag and your academy bag and all sorts of other bags as well. You start writing the digital humanities article by reaching into the digitization bag (remember, you’re going to reach into that bag for 22% of your words), and you pull out “OCR.” You put it on the page. You then reach for the academy bag and reach for a word in there (it happens to be “teaching,”) and you throw that on the page as well. Keep doing that. By the end, you’ve got a document that’s all about the digital humanities. It’s beautiful. Send it in for publication.
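If you prefer your buffets in code, here is a toy version of that generative story (numpy assumed; the topic bags and weights are invented, and each bag is flattened to a short word list rather than a full distribution over the vocabulary):

# Step one: draw the document's topic proportions. Step two: for every
# word slot, pick a bag according to those proportions, then pull a word out.
import numpy as np

rng = np.random.default_rng()

topics = {
    "digitization": ["TEI", "OCR", "scanner", "image", "archive"],
    "history":      ["century", "war", "period", "source", "archive"],
    "academy":      ["teaching", "tenure", "department", "journal", "grant"],
}
topic_names = list(topics)

doc_topic_mix = rng.dirichlet([1.0] * len(topic_names))

document = []
for _ in range(20):
    bag = rng.choice(topic_names, p=doc_topic_mix)
    document.append(rng.choice(topics[bag]))

print(" ".join(document))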

Alright, what now?

So why is the generative nature of the model so important? One of the key reasons is the ability to work backwards. If I can generate an (admittedly nonsensical) document using this model, I can also reverse the process and infer, given any new document and a topic model I’ve already generated, which topics that new document draws from.

Another factor contributing to the success of LDA is the ability to extend the model. In this case, we assume there are only documents, topics, and words, but we could also make a model that assumes authors who like particular topics, or assumes that certain documents are influenced by previous documents, or that topics change over time. The possibilities are endless, as evidenced by the absurd number of topic modeling variations that have appeared in the past decade. David Mimno has compiled a wonderful bibliography of many such models.

While the generative model introduced by Blei might seem simplistic, it has been shown to be extremely powerful. When a newcomer sees the results of LDA for the first time, they are immediately taken by how intuitive they seem. People sometimes ask me “but didn’t it take forever to sit down and make all the topics?” thinking that some of the magic had to be done by hand. It wasn’t. Topic modeling yields intuitive results, generating what really feels like topics as we know them 3, with virtually no effort on the human side. Perhaps it is the intuitive utility that appeals so much to humanists.

Topic Modeling and Networks

Topic models can interact with networks in multiple ways. While a lot of the recent interest in digital humanities has surrounded using networks to visualize how documents or topics relate to one another, the interfacing of networks and topic modeling initially worked in the other direction. Instead of inferring networks from topic models, many early (and recent) papers attempt to infer topic models from networks.

Topic Models from Networks

The first research I’m aware of in this niche was from McCallum et al. (2005). Their model is itself an extension of an earlier LDA-based model called the Author-Topic Model (Steyvers et al., 2004), which assumes topics are formed based on the mixtures of authors writing a paper. McCallum et al. extended that model for directed messages in their Author-Recipient-Topic (ART) Model. In ART, it is assumed that topics of letters, e-mails or direct messages between people can be inferred from knowledge of both the author and the recipient. Thus, ART takes into account the social structure of a communication network in order to generate topics. In a later paper (McCallum et al., 2007), they extend this model to one that infers the roles of authors within the social network.

Dietz et al. (2007) created a model that looks at citation networks, where documents are generated by topical innovation and topical inheritance via citations. Nallapati et al. (2008) similarly create a model that finds topical similarity in citing and cited documents, with the added ability of predicting citations that are not present. Blei himself joined the fray in 2009, creating the Relational Topic Model (RTM) with Jonathan Chang, which can summarize a network of documents, predict links between them, and predict words within them. Wang et al. (2011) created a model that allows for “the joint analysis of text and links between [people] in a time-evolving social network.” Their model is able to handle situations where links exist even when there is no similarity between the associated texts.

Networks from Topic Models

Some models have been made that infer networks from non-networked text. Broniatowski and Magee (2010 & 2011) extended the Author-Topic Model, building a model that would infer social networks from meeting transcripts. They later added temporal information, which allowed them to infer status hierarchies and individual influence within those social networks.

Many times, however, rather than creating new models, researchers create networks out of topic models that have already been run over a set of data. There are a lot of benefits to this approach, as exemplified by the Chymistry of Isaac Newton project highlighted earlier. Using networks, we can see how documents relate to one another, how they relate to topics, how topics are related to each other, and how all of those are related to words.

Elijah Meeks created a wonderful example combining topic models with networks in Comprehending the Digital Humanities. Using fifty texts that discuss humanities computing, Elijah created a topic model of those documents and used networks to show how documents, topics, and words interacted with one another within the context of the digital humanities.

Network generated by Elijah Meeks to show how digital humanities documents relate to one another via the topics they share.

Jeff Drouin has also created networks of topic models in Proust, as reported by Elijah.

Peter Leonard recently directed me to TopicNets, a project that combines topic modeling and network analysis in order to create an intuitive and informative navigation interface for documents and topics. This is a great example of an interface that turns topic modeling into a useful scholarly tool, even for those who know little-to-nothing about networks or topic models.

If you want to do something like this yourself, Shawn Graham recently posted a great tutorial on how to create networks using MALLET and Gephi quickly and easily. Prepare your corpus of text, get topics with MALLET, prune the CSV, make a network, visualize it! Easy as pie.

Networks can be a great way to represent topic models. Beyond simple uses of navigation and relatedness as were just displayed, combining the two will put the whole battalion of network analysis tools at the researcher’s disposal. We can use them to find communities of similar documents, pinpoint those documents that were most influential to the rest, or perform any of a number of other workflows designed for network analysis.

As with anything, however, there are a few setbacks. Topic models are rich with data. Every document is related to every other document, if some only barely. Similarly, every topic is related to every other topic. By deciding to represent document similarity over a network, you must make the decision of precisely how similar you want a set of documents to be if they are to be linked. Having a network with every document connected to every other document is scarcely useful, so generally we’ll make our decision such that each document is linked to only a handful of others. This allows for easier visualization and analysis, but it also destroys much of the rich data that went into the topic model to begin with. This information can be more fully preserved using other techniques, such as multidimensional scaling.
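As a sketch of what that multidimensional-scaling alternative might look like (scikit-learn assumed; the document-by-topic weights are invented):

# Instead of cutting edges below a threshold, lay the documents out in two
# dimensions so that their pairwise topic distances are preserved as well
# as possible.
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances

vectors = np.array([[0.2, 0.3, 0.1, 0.4],
                    [0.6, 0.2, 0.1, 0.1],
                    [0.1, 0.1, 0.7, 0.1]])

distances = cosine_distances(vectors)          # full doc-doc distance matrix
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(distances)
print(coords)                                  # one (x, y) point per document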

A somewhat more theoretical complication leaves these network representations useful as tools for navigation, discovery, and exploration, but not necessarily as evidentiary support: creating a network of a topic model of a set of documents piles abstraction on abstraction. Each of these systems comes with very different assumptions, and it is unclear what complications arise when the methods are combined ad hoc.

Getting Started

Although there may be issues with the process, the combination of topic models and networks is sure to yield much fruitful research in the digital humanities. There are some fantastic tutorials out there for getting started with topic modeling in the humanities, such as Shawn Graham’s post on Getting Started with MALLET and Topic Modeling, as well as on combining them with networks, such as this post from the same blog. Shawn is right to point out MALLET, a great tool for starting out, but you can also find the code used for various models on many of the model-makers’ academic websites. One code package that stands out is Chang’s implementation of LDA and related models in R.


Notes:

  1. Ted Underwood rightly points out in the comments that other scoring systems are often used in lieu of tf-idf, most frequently log entropy.
  2. Yes yes, this is a simplification of actual LSA, but it’s pretty much how it works. SVD reduces the size of the matrix to filter out noise, and then each word row is treated as a vector shooting off in some direction. The vector of each word is compared to every other word, so that every pair of words has a relatedness score between them. Ted Underwood has a great blog post about why humanists should avoid the SVD step.
  3. They’re not, of course. We’ll worry about that later.

Alchemy, Text Analysis, and Networks! Oh my!

“Newton wrote and transcribed about a million words on the subject of alchemy.” —chymistry.org

 

Besides bringing us things like calculus, universal gravitation, and perhaps the inspiration for certain Pink Floyd albums, Isaac Newton spent many years researching what was then known as “chymistry,” a multifaceted precursor to, among other things, what we now call chemistry, pharmacology, and alchemy.

Pink Floyd and the Occult: Discuss.

Researchers at Indiana University, notably William R. Newman, John A. Walsh, Dot Porter, and Wallace Hooper, have spent the last several years developing The Chymistry of Isaac Newton, an absolutely wonderful history of science resource which, as of this past month, has digitized all 59 of Newton’s alchemical manuscripts assembled by John Maynard Keynes in 1936. Among the site’s features are heavily annotated transcriptions, manuscript images, often scholarly synopses, and examples of alchemical experiments. That you can try at home. That’s right, you can do alchemy with this website. They also managed to introduce alchemical symbols into Unicode (U+1F700 – U+1F77F), which is just indescribably cool.

Alchemical experiments at home! http://webapp1.dlib.indiana.edu/newton/reference/mineral.do

What I really want to highlight, though, is a brand new feature introduced by Wallace Hooper: automated Latent Semantic Analysis (LSA) of the entire corpus. For those who are not familiar with it, LSA is somewhat similar to LDA, the algorithm driving the increasingly popular topic models used in digital humanities. They both have their strengths and weaknesses, but essentially what they do is show how documents and terms relate to one another.

Newton Project LSA

In this case, the entire corpus of Newton’s alchemical texts is fed into the LSA implementation (try it for yourself), and then based on the user’s preferences, the algorithm spits out a network of terms, documents, or both together. That is, if the user chooses document-document correlations, a list is produced of the documents that are most similar to one another based on similar word use within them. That list includes weights – how similar are they to one another? – and those weights can be used to create a network of document similarity.

Similar Documents using LSA

One of the really cool features of this new service is that it can export the network either as CSV for the technical among us, or as an nwb file to be loaded into the Network Workbench or the Sci² Tool. From there, you can analyze or visualize the alchemical networks, or you can export the files into a network format of your choice.

Network of how Newton’s alchemical documents relate to one another, visualized using NWB.

It’s great to see more sophisticated textual analyses being automated and actually used. Amber Welch recently posted on Moving Beyond the Word Cloud using the wonderful TAPoR, and Michael Widner just posted a thought-provoking article on using Voyeur Tools for the process of paper revision. With tools this easy to use, it won’t be long now before the first thing a humanist does when approaching a text (or a million texts) is to glance at all the high-level semantic features and various document visualizations before digging in for the close read.