I’m sorry. I love you (you know who you are, all of you). I really do. I love your work, I think it’s groundbreaking and transformative, but the network scientist / statistician in me twitches uncontrollably whenever he sees someone creating a network out of a topic model by picking the top-*n *topics associated with each document and using those as edges in a topic-document network. This is meant to be a short methodology post for people already familiar with LDA and already analyzing networks it produces, so I won’t bend over backwards trying to re-explain networks and topic modeling. Most of my posts are written assuming no expert knowledge, so I apologize if in the interest of brevity this one isn’t immediately accessible.

MALLET, the go-to tool for topic modeling with LDA, outputs a comma separated file where each row represents a document, and each pair of columns is a topic that document is associated with. The output looks something like

Topic 1 | Topic 2 | Topic 3 | ... Doc 1 | 0.5 , 1 | 0.2 , 5 | 0.1 , 2 | ... Doc 2 | 0.4 , 6 | 0.3 , 1 | 0.06 , 3 | ... Doc 3 | 0.6 , 2 | 0.4 , 3 | 0.2 , 1 | ... Doc 4 | 0.5 , 5 | 0.3 , 2 | 0.01 , 6 | ...

Each pair is the amount a document is associated with a certain topic followed by the topic of that association. Given a list like this, it’s pretty easy to generate a bimodal/bipartite network (a network of two types of nodes) where one variety of node is the document, and another variety of node is a topic. You connect each document to the top three (or *n*) topics associated with that document and, voila, a network!

The problem here isn’t that a giant chunk of the data is just being thrown away (although there are more elegant ways to handle that too), but the way in which a portion of the data is kept. By using the top-*n* approach, you lose the rich topic-weight data that shows how *some* documents are really only closely associated with one or two documents, whereas others are closely associated with many. In practice, the network graph generated by this approach will severely skew the results, artificially connecting documents which are topical outliers toward the center of the graph, and preventing documents in the topical core from being represented as such.

In order to account for this skewing, an equally simple (and equally arbitrary) approach can be taken whereby you only take connections that are over weight 0.2 (or whatever, *m*). Now, some documents are related to one or two topics and some are related to several, which more accurately represents the data and doesn’t artificially skew network measurements like centrality.

The real trouble comes when a top-*n* topic network is converted from a bimodal to a unimodal network, where you connect documents to one another based on the topics they share. That is, if Document 1 and Document 4 are both connected to Topics 4, 2, and 7, they get a connection to each other of weight 3 (if they were only connected to 2 of the same topics, they’d get a connection of weight 2, and so forth). In this situation, the resulting network will be as much an artifact of the choice of *n* as of the underlying document similarity network. If you choose different values of *n*, you’ll often get very different results.

In this case, the solution is to treat every document as a vector of topics with associated weights, making sure to use

*all the topics*, such that you’d have a list that looks somewhat like the original topic CSV, except this time ordered by topic number rather than individually for each document by topic weight.

T1, T2, T3,... Doc4(0.2,0.3,0.1,...) Doc5(0.6,0.2,0.1,...) ...

From here you can use your favorite correlation or distance finding algorithm (cosine similarity, for example) to find the distance from every document to every other document. Whatever you use, you’ll come up with a (generally) symmetric matrix from every document to every other document, looking a bit like this.

Doc1|Doc2|Doc3,... Doc1 1 |0.3 |0.1 Doc2 0.3 |1 |0.4 Doc3 0.1 |0.4 |1 ...

If you chop off the bottom left or top right triangle of the matrix, you now have a network of document similarity which takes the entire topic model into account, not just the first few topics. From here you can set whatever arbitrary *m* thresholds seem legitimate to visually represent the network in an uncluttered way, for example only showing documents that are more than 50% topically similar to one another, while still being sure that the entire richness of the underlying topic model is preserved, not just the first handful of topical associations.

Of course, whether this method is any more useful than something like LSA in clustering documents is debatable, but I just had to throw my 2*¢* in the ring regarding topical networks. Hope it’s useful.

Hi Scott –

So how would you suggest transforming from the standard topic-in-docs.csv output to something that can get the vectors, etc? I follow the argument, I’m just looking for a shortcut 😉

Well, a script can be written to do it automagically, which I suppose I’ll try to get to one of these days… alternatively if you have few enough topics, it’s easy enough to use excel to create an edge list for each document-topic connection (so if there are 20 topics, there are 20 edges for each document of varying weight). Then you can import that into something like UCINet or Pajek and export the network as a matrix, such that each row is a document’s topic vector, 1-20. Once you have that list of vectors you can play with them in your favorite statistical package to do some sort of dimensionality reduction, similarity measurement, or the like.

Very useful post, exploring some important questions.

I’ve been playing around with topic networks, in collaboration with Michael Simeone, and I have to say I’m starting to understand why people take the top-n-edges approach. Networks just come out prettier when you do some version of that. I totally agree with you that it doesn’t make sense analytically; because it makes an arbitrary decision about “degree.” But it does allow you to construct networks that look like a sort of “map” of topic space, whereas the alternate approach tends to produce a pretty dense cluster of multiply interconnected nodes at the center, surrounded by a few wandering “satellites.”

In cases where I’m mainly interested in visualization rather than centrality/betweenness analysis, I confess this sort of “cheat” is a pretty big temptation.

I’m currently using a totally ad hoc version of it, where I always take the top edge for each node, plus a second edge if it’s above a .25 correlation, and then additional edges if they’re above .4 correlation (which is pretty rare in my topic-topic correlation matrix). Goofy and ad-hoc, but it produces maps that I find visually intelligible.

I certainly agree with you that for the purpose of visualizations, edge-cutting is necessary. For the sake of the best visual map possible, one both legible and accurate, my take for the m x n doc-topic matrix would be to threshold all edge weights above a certain number. For the n x n topic matrix, I’d say the most important step would be to calculate the full n x n matrix before cutting, and then do your edge cutting just before visualization (rather than edge cutting the m x n and then just generating the co-occurrence network).

In the long run the best approach would be to use one of the topic models out there specifically designed to generate a topic similarity matrix, or finding another way to generate the network (if you’re using publications, via citations, if you’re using emails, the email exchange network), and then mapping the topics atop the already existing network. A bunch of the really big topic-based network visualizations actually look at the m x m document network, and then label clusters of documents with their corresponding topics.

Interesting. I agree that citation and correspondence are a more natural fit for a network model. Topic-topic relations are probably a more natural fit for something like PCA. And, in either case, I definitely agree that it makes sense to calculate topic similarity using all your data.

Somehow in my dissertation-submitting, conference-going frenzy, I missed this and would have loved to have talked to you about it at the workshop.

I just wanted to say, very briefly, that another metric used for unimodal social network graphs is K-L divergence, which is not a representation of shared topic assignment but a representation of shared distribution over a set vocabulary. As a result, network graphs that use K-L Divergence to map topic to topic and document to document networks are very different beasts from what you’ve described above, because they focus on the shared vocabularies not shared topics. True, still not a perfect measure as it is a vector algorithm flattened into a 2 dimensional space, but it’s not the same way of calculating degree centrality as you describe above. Therefore, edge “weights” for example in the graphs are not represented by distance or spatial orientation, but by edge width. I am, however, very invested in having these visualization conversations… the challenges, the usefulness, and hope that a more formal forum for such conversations comes to pass. Great to meet you last weekend…