Friends don’t let friends calculate p-values (without fully understanding them)

I wrote this in 2012 in response to a twitter conversation with Mike Taylor, who was patient enough to read a draft of this post in late November and point out all the various ways it could be changed and improved. No doubt if I had taken the time to take his advice and change the post accordingly, this post would be a beautiful thing and well-worth reading. I’d like to thank him for his patient comments, and apologize for not acting on them, as I don’t foresee being able to take the time in the near future to get back to this post. I don’t really like seeing a post sit in my draft folder for months, so I’ll release it out to the world as naked and ugly as the day it was born, with the hopes that I’ll eventually sit down and write a better one that takes all of Mike’s wonderful suggestions into account. Also, John Kruschke apparently published a paper with a similar title; as it’s a paper I’ve read before but forgot about, I’m guessing I inadvertently stole his fantastic phrase. Apologies to John!

——————–

I recently [okay, at one point this was 'recent'] got in a few discussions on Twitter stemming from a tweet about p-values. For the lucky and uninitiated, the p-value is the go-to statistic used by giant swaths of the academic world to show whether or not their results are statistically significant. Generally, the idea is that if the value of p is less than 0.05, you have results worth publishing.

Introducing p-values

There's a lot to unpack in p-values: a lot of history, a lot of baggage, and a lot of reasons why they're often a pretty poor choice for quantitative analysis. I won't unpack all of it, but I will give a brief introduction to how the statistic works. At its simplest, according to Wikipedia, "the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true." No, don't run away! That actually makes sense, just let me explain it.

We'll start with a coin toss. Heads. Another: tails. Repeat: heads, heads, tails, heads, heads, heads, heads, tails. Tally up the score: of the 10 tosses, 7 were heads and 3 were tails. At this point, if you were a gambler, you might start guessing the coin is weighted or rigged. I mean, seriously, you toss the coin 10 times and only 3 of those were tails? What's the probability of that, if the coin is fair?

That's exactly what the p-value is supposed to tell us: what's the probability of getting 3 or fewer tails out of 10 coin tosses, assuming the coin is fair? That assumption that the coin is fair is what statisticians call the null hypothesis; if we can show that it's less than 5% likely (p < 0.05) that we'd see 3 or fewer tails in 10 flips of a fair coin, statisticians tell us we can safely reject the null hypothesis. If this were a scientific experiment and the p-value did come in under 0.05, we'd say our results are significant; we could reject the null hypothesis that the coin is a fair one, because it would be pretty unlikely for the experiment to have turned out this way if the coin were fair. Huzzah! Callooh! Callay! We could publish our finding that this coin just isn't fair in the world-renowned Journal of Trivial Results and Bad Bar Bets (JTRB3).

So how is this calculated? Well, the idea is that we have to look at the universe of possible results given the null hypothesis; that is, if I had a coin that was truly fair, how often would flipping it 10 times yield a result at least as lopsided as 7 heads and 3 tails? Obviously, given a fair coin, most runs will land near a 50/50 split of heads and tails. It is, however, not impossible to get our 7/3 result. What we have to calculate is: if we flipped a fair coin in an infinite number of 10-flip experiments, what percent of those experiments would turn out at least as extreme as 7 heads and 3 tails? If that percentage is under 5% (p < 0.05), we can be reasonably confident that our result implies a stacked coin, because fewer than 5% of experiments on a fair coin would look like the one we just ran.
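For the curious, here's what that calculation looks like in practice. This is a quick R sketch of my own (the numbers aren't part of the original example), using the "flip exactly 10 times" design:

# One-sided p-value for the coin example: the probability of seeing a result
# at least as extreme as 7 heads (i.e., 3 or fewer tails) in 10 flips of a
# fair coin, under the fixed "flip exactly 10 times" design.
p.value <- sum(dbinom(7:10, size = 10, prob = 0.5))
p.value   # 0.171875

# The same number from R's built-in exact binomial test:
binom.test(7, 10, p = 0.5, alternative = "greater")$p.value

As it happens, a split at least that lopsided turns up about 17% of the time with a fair coin, so by the conventional 0.05 standard this particular result wouldn't be enough to convince anyone the coin is rigged; you'd need more flips or a more extreme split.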

We use 0.05, or 5%, mostly out of a long-standing social agreement. The 5% mark is low enough that we can reject the null hypothesis beyond most reasonable doubts. That said, it's worth remembering that the difference between a p of 4.9% and one of 5.1% is itself often not statistically significant, and though 5% is our agreed-upon convention, it's still pretty arbitrary. The best thing anyone's ever written on this point was by Rosnow & Rosenthal (1989):

“We are not interested in the logic itself, nor will we argue for replacing the .05 alpha with another level of alpha, but at this point in our discussion we only wish to emphasize that dichotomous significance testing has no ontological basis. That is, we want to underscore that, surely, God loves the .06 nearly as much as the .05. Can there be any doubt that God views the strength of evidence for or against the null as a fairly continuous function of the magnitude of p?”

How p-values work. via Wikipedia.

I’m really nailing this point home because p-values are so often misunderstood. Wikipedia lists seven common misunderstandings of p-values, and I can almost guarantee that a majority of scientists who use the statistic are guilty of at least some of them. I’m going to quote the misconceptions here, because they’re really important.

  1. The p-value is not the probability that the null hypothesis is true.
    In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very close to unity (if there is no alternative hypothesis with a large enough a priori probability and which would explain the results more easily). This is Lindley’s paradox.
  2. The p-value is not the probability that a finding is “merely a fluke.”
    As the calculation of a p-value is based on the assumption that a finding is the product of chance alone, it patently cannot also be used to gauge the probability of that assumption being true. This is different from the real meaning which is that the p-value is the chance of obtaining such results if the null hypothesis is true.
  3. The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called prosecutor’s fallacy.
  4. The p-value is not the probability that a replicating experiment would not yield the same conclusion. Quantifying the replicability of an experiment was attempted through the concept of P-rep (which is heavily criticized)
  5. 1 − (p-value) is not the probability of the alternative hypothesis being true (see (1)).
  6. The significance level of the test (denoted as alpha) is not determined by the p-value.
    The significance level of a test is a value that should be decided upon by the agent interpreting the data before the data are viewed, and is compared against the p-value or any other statistic calculated after the test has been performed. (However, reporting a p-value is more useful than simply saying that the results were or were not significant at a given level, and allows readers to decide for themselves whether to consider the results significant.)
  7. The p-value does not indicate the size or importance of the observed effect (compare with effect size). The two do vary together however – the larger the effect, the smaller sample size will be required to get a significant p-value.

One problem with p-values (and it often surprises people)

Okay, I'm glad that explanation is over, because now we can get to the interesting stuff, and it all comes back to assumptions. I worded the above coin-toss experiment pretty carefully; you keep flipping a coin until you wind up with 7 heads and 3 tails, 10 flips in all. The experimental process seems pretty straightforward: you flip a coin a bunch of times, and record what you get. Not much in the way of underlying assumptions to get in your way, right?

Wrong. There are actually a number of ways this experiment could have been designed, each yielding the same exact observed data, but each yielding totally different p-values. I'll focus on two of those possible designs here. Let's say we went into the experiment saying "Alright, coin, I'm going to flip you 10 times, and then count how often you land on heads and how often you land on tails." That's probably how you assumed the experiment was run, but I never actually described it like that; all I said was that you wound up flipping a coin 10 times, seeing 7 heads and 3 tails. Given that information, you could just as easily have said "I'm going to flip this coin until I see 3 tails, and then stop." If you remember the sequence above (heads, tails, heads, heads, tails, heads, heads, heads, heads, tails), that experimental design also fits our data perfectly. We kept flipping a coin until we saw 3 tails, and in the interim, we also observed 7 heads.

The problem here is in how p-values work. When you ask what the probability is that your result came from chance alone, a bunch of assumptions underlie this question, the key one here being the halting conditions of your experiment. Calculating how often 10 coin-tosses of a fair coin will result in a 7/3 split (or 8/2 or 9/1 or 10/0) will give you a different result than if you calculated how often waiting until the third tails will give you a 7/3 split (or 8/3 or 9/3 or 10/3 or 11/3 or 12/3 or…). The space of possibilities changes, the actual p-value of your experiment changes, based on assumptions built into your experimental design. If you collect data with one set of halting conditions, and it turns out your p-value isn’t as low as you’d like it to be, you can just pretend you went into your experiment with different halting conditions and, voila!, your results become significant.
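To see how much the halting conditions matter, here's a quick R sketch of my own comparing the two designs on the very same 7-heads-and-3-tails data. Neither design happens to clear 0.05 in this toy case, but identical observations yield noticeably different p-values, which is exactly the problem:

# Design 1: flip exactly 10 times. How often does a fair coin give
# 7 or more heads? (binomial)
p.fixed <- sum(dbinom(7:10, size = 10, prob = 0.5))             # 0.171875

# Design 2: flip until the 3rd tail. How often does a fair coin rack up
# 7 or more heads before that 3rd tail arrives? (negative binomial)
p.stop <- pnbinom(6, size = 3, prob = 0.5, lower.tail = FALSE)  # 0.08984375

c(fixed.design = p.fixed, stop.at.third.tail = p.stop)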

The moral of the story is that p-values can be pretty sensitive to assumptions that are often left unspoken in an experiment. The same observed data can yield wildly different p-values. With enough mental somersaults regarding experimental assumptions, you can convince yourself that a 4/6 split in a coin toss is evidence of a rigged coin. Much of statistics is rife with unspoken assumptions about how data are distributed, how the experiment is run, etc. It is for this reason that I'm trying desperately to get quantitative humanists using non-parametric and Bayesian methods from the very beginning, before our methodology becomes canonized and set. Bayesian methods are also, of course, rife with assumptions, but at least most of those assumptions are made explicit at the outset. Really, the most important thing is to learn not only how to run a statistic, but also what it means to do so. There are appropriate times to use p-values, but those times are probably not as frequent as is often assumed.

On the importance of a single historical author

I have a dirty admission to make: I think yesterday happened. Actually. Objectively. Stuff happened. I hear that’s still a controversial statement in some corners of the humanities, but I can’t say for sure; I generally avoid those corners. And I think descriptions of the historical evidence can vary in degrees of accuracy, separating logically coherent but historically implausible conspiracy theories from more likely narratives.

At the same time, what we all think of as the past is a construct. A bunch of people – historians, cosmologists, evolutionary biologists, your grandmother who loves to tell stories, you – have all worked together to construct and reconstruct the past. Lots of pasts, actually, because no two people can ever wholly agree; everybody sees the evidence through the lens of their own historical baggage.

I’d like to preface this post with the cautious claim that I am an outsider explaining something I know less about than I should. The hats I wear are information/data scientist and historian of science, and through some accident of the past, historians and historians of science have followed largely separate cultural paths. Which is to say, neither the historian of science in me nor the information scientist has a legitimate claim to the narrative of general history and the general historical process, but I’m attempting to write one anyway. I welcome any corrections or admonishments.

The Narrativist Individual

I use in this post (and in life in general) the vocabulary definitions of Aviezer Tucker, who is doing groundbreaking work on simple stuff like defining “history” and asking questions about what we can know about the past. 1 “History,” Tucker defines, is simply stuff that happened: the past itself. “Historians” are anybody who inquires about the past, from cosmologists to historical linguists. A “historiography” is a knowledge of the past, or more concretely, something a historian has written about the past. “Historiographic research” is what we historians do when we try to find out about the past, and a “historiographic narrative” is generally the result of a lot of that research strung together. 2

Narratives are important. In the 1970s, a bunch of historians began realizing that historians create narratives when they collect their historiographic research 3; that is, people tell stories about the past, using the same sorts of literary and rhetorical devices used in many other places. History itself is a giant jumble of events and causal connections, and representing it as it actually happened would be completely unintelligible and philosophically impossible, without recreating the universe from scratch. Historians look at evidence of the past and then impose an order, a pattern, in reconstructing the events to create their own unique historiographic narratives.

The narratives historians write are inescapably linked to their own historical baggage. Historians are biased and imperfect, and they all read history through the filter of themselves. Historiographic reconstructions, then, are as much windows into the historians themselves as they are windows into the past. The narrativist turn in historiography did a lot to situate the historian herself as a primary figure in her narrative, and it became widely accepted that instead of getting closer to some ground truth of history, historians were in the business of building consistent and legible narratives, their own readings of the past, so long as they were also consistent with the evidence. Those narratives became king, both epistemologically and in practice; historical knowledge is narrative knowledge.

Because narrative knowledge is a knowledge derived from lived experience – the historian sees the past in his own unique light – this emphasized the importance of the individual in historiographic research. Because historians neither could reach, nor by and large were attempting to reach, an objective ground-truth about the past, any claim to knowledge rested in the lone historian: how he read the past and how he presented his narrative. What resulted was a (fairly justified, given their conceptualization of historiographic knowledge) fetishization of the individual, the autonomous historian.

When multiple authors write a historiographic narrative, something almost ineffable is lost: the individual perspective which drives the narrative argument, part of the essential claim-to-knowledge. In a recent discussion with Ben Schmidt about autonomous humanities work vs. collaboration (the original post; my post; Ben’s reply), Ben pointed out “all the Stanley Fishes out there have reason to be discomfited that DHers revel so much in doing away with not only the printed monograph, traditional peer review, and close reading, but also the very institution of autonomous, individual scholarship. Erasmus could have read the biblical translations out there or hired a translator, but he went out and learned Greek [emphasis added].” I think a large part of that drive for autonomy (beyond the same institutional that’s-how-we’ve-always-done-it inertia that lone natural scientists felt at the turn of the last century) is the situatedness-as-a-way-of-knowing that imbues historiographic research, and humanistic research in general.

I’m inclined to believe that historians need to move away from an almost purely narrative epistemology; keeping in sight that all historiographic knowledge is individually constructed, remaining aware that our overarching cultural knowledge of the past is socio-technically constructed, but not letting that cripple our efforts at coordinating research, at reaching for some larger consilience with the other historical research programs out there, like paleontology and cosmology and geology. Computational methodologies will pave the way for collaborative research both because they allow it, and because they require it.

Collaboratively Constructing Paris

This is a map of Paris.

Map of Paris with dots representing photos taken and posted on Flickr. Red dots are pictures taken by tourists, blue are by locals, and yellow are unknown. via Eric Fischer.

On top of this map of Paris are red, blue, and yellow dots. The red dots are the locations of pictures taken and posted to Flickr by tourists to Paris; blue dots are where locals took pictures; yellow dots are unknown. The resulting image maps and differentiates touristic and local space by popularity, at least among Flickr users. It is a representation that would have been staggeringly difficult for an outsider to create without this sort of data-analytic approach, and yet someone intimately familiar with the city could look at this map and not be surprised. Perhaps they could even recreate it themselves.

What is this knowledge of Paris? It’s surely not a subjective representation of the city, not unless we stretch the limits of the word beyond the internally experienced and toward the collective. Neither is it an objective map 4 of the city, external to the whims of the people milling about within. The map represents an aggregate of individual experiences, a kind of hazy middle ground within the usual boundaries we draw between subjective and objective truth. This is an epistemological and ontological problem I’ve been wondering about for some time, without being able to come up with a good word for it until a conversation with a colleague last year.

“This is my problem,” I told Charles van den Heuvel, explaining my difficulties in placing these maps and other similar projects on the -jectivity scale. “They’re not quite intersubjective, not in the way the word is usually used,” I said, and Charles just looked at me like I was missing something excruciatingly obvious. “What is it when a group of people believe or do or think common things in aggregate?”—Charles asked—”isn’t that just called culture?” I initially disagreed, but mostly because it was so obvious that I couldn’t believe I’d just passed it over entirely.

In 1976, the infamous Stanley Milgram and co-author Denise Jodelet 5 responded to Durkheim’s emphasis on “the objectivity of social facts” by suggesting “that we understand things from the actor’s point of view.” To drive this point home, Milgram decides to map Paris. People “have a map of the city [they live in] in their minds,” Milgram suggests, and their individual memories and attitudes flavor those internal representations.

This is a good example to use, for Milgram, because cities themselves are socially constructed entities; what is a city without its people who live in and build it? Milgram goes on to suggest that people’s internal representations of cities are similarly socially constructed, that “such representations are themselves the products of social interaction with the physical environment.” In the ensuing study, Milgram asks 218 subjects to draw a non-tourist map of Paris as it seems to them, including whatever features they feel relevant. “Through selection, emphasis and distortion, the maps became projections of life styles.”

Milgram then compares all the maps together, seeking what unifies them: first and foremost, the city limits and the Seine. The river is distorted in a very particular way in nearly all maps, bypassing two districts entirely and suggesting they are of little importance to those who drew the maps. The center of the city, Notre Dame and the Île de la Cité, also remains constant. Milgram opposes this to a city like New York, the subject of a later similar study, whose center shifts slowly northward as the years roll by. Many who drew maps of either New York or Paris included elements they were not intimately familiar with, but they knew were socially popular, or were frequent spots of those in their social circles. Milgram concludes “the social representations of the city are more than disembodied maps; they are mechanisms whereby the bricks, streets, and physical geography of a place are endowed with social meaning.”

It’s worth posting a large chunk of Milgram’s earlier article on the matter:

A city is a social fact. We would all agree to that. But we need to add an important corollary: the perception of a city is also a social fact, and as such needs to be studied in its collective as well as its individual aspect. It is not only what exists but what is highlighted by the community that acquires salience in the mind of the person. A city is as much a collective representation as it is an assemblage of streets, squares, and buildings. We discern the major ingredients of that representation by studying not only the mental map in a specific individual, but by seeing what is shared among individuals.

Collaboratively Constructing History

Which brings us back to the past. 6 Can collaborating historians create legitimate narratives if they are not well-founded in personal experience? What sort of historical knowledge is collective historical knowledge? To this question, I turn to blogger Alice Bell, who wrote a delightfully short post discussing the social construction of science.  She writes about scientific knowledge, simply, “Saying science is a social construction does not amount to saying science is make believe.” Alice compares knowledge not to a city, like Paris, but to a building like St. Paul’s Cathedral or a scientific compound like CERN; socially constructed, but physically there. Real. Scientific ideas are part of a similar complex.

The social construction of historiographic narratives is painfully clear even without co-authorships, in our endless circles of acknowledgements and references. Still, there seems to be a good deal of push-back against explicit collaboration, where the entire academic edifice no longer lies solely in the mind of one historian (if indeed it ever had). In some cases, this push-back is against the epistemological infrastructure that requires the person in personal narrative. In others, it is because without full knowledge of each of the moving parts in a work of scholarship, that work is more prone to failure due to theories or methodologies not adequately aligning.

Building historiography together. via the Smithsonian.

I fear this is a dangerous viewpoint, one that will likely harm both our historiographic research and our cultural relevancy, as other areas of academia become more comfortable with large-scale collaboration. Single authorship for its own sake is as dangerous as collaboration for its own sake, but it has the advantage of being a tradition. We must become comfortable with the hazy middle ground between an unattainable absolute objectivity and an unscalable personal subjectivity, willing to learn how to construct our knowledge as Parisians construct their city. The individual experiences of Parisians are without a doubt interesting and poignant, but it is the combined experiences of the locals and the tourists that makes the city what it is. Moving beyond the small and individual isn’t just getting past the rut of microhistories that historiography is still trying to escape—it is also getting past the rut of individually driven narratives and toward unified collective historiographies. We have to work together.

 

 

Notes:

  1. Tucker, Aviezer. 2004. Our Knowledge of the Past: A Philosophy of Historiography. Cambridge University Press.
  2. Tucker, Aviezer, ed. 2009. A Companion to the Philosophy of History and Historiography. http://www.wiley.com/WileyCDA/WileyTitle/productCd-1405149086.html.
  3. Kuukkanen, Jouni-Matti. 2012. “The Missing Narrativist Turn in the Historiography of Science.” History and Theory 51 (3): 340–363. doi:10.1111/j.1468-2303.2012.00632.x.
  4. of anything besides the geolocations of Flickr pictures, in and of itself not particularly interesting
  5. Milgram, Stanley. 1976. “Psychological Maps of Paris.” In Environmental Psychology: People and Their Physical Settings, ed. Proshansky, Ittelson, and Rivlin, 104–124. New York.
    Milgram, Stanley. 1982. “Cities as Social Representations.” In Social Representations, ed. R. Farr and S. Moscovici, 289–309.
  6. As opposed to bringing us Back to the Future, which would probably be more fun.

Analyzing submissions to Digital Humanities 2013

Digital Humanities 2013 is on its way; submissions are closed, peers will be reviewing them shortly, and (most importantly for this post) the people behind the conference are experimenting with a new method of matching submissions to reviewers. It's a bidding process; reviewers take a look at the many submissions and state their reviewing preferences or, when necessary, conflicts of interest. It's unclear to what extent these preferences will be accommodated, as this is an experiment on their part. Bethany Nowviskie describes it here. As a potential reviewer, I just went through the process of listing my preferences, and managed to do some data scraping while I was there. How could I not? All 348 submission titles were available to me, as well as their authors, topic selections, and keywords, and given that my submission for this year is all about quantitatively analyzing DH, it was an opportunity I could not pass up. Given that these data are sensitive, and those who submitted did so under the assumption that rejected submissions would remain private, I'm opting not to release the data or any non-aggregated information. I'm also doing my best not to actually read the data, in the interest of the privacy of my peers; I suppose you'll all just have to trust me on that one, though.

So what are people submitting? According to the topics authors assigned to their 348 submissions, 65 submitted articles related to “literary studies,” trailed closely by 64 submissions which pertained to “data mining/ text mining.” Work on archives and visualizations are also up near the top, and only about half as many authors submitted historical studies (37) as those who submitted literary ones (65). This confirms my long suspicion that our current wave of DH (that is, what’s trending and exciting) focuses quite a bit more on literature than history. This makes me sad.  You can see the breakdown in Figure 1 below, and further analysis can be found after.

Figure 1: Number of documents with each topic authors assigned to submissions for DH2013 (click to enlarge).

The majority of authors attached fewer than five topics to their submissions; a small handful included over 15.  Figure 2 shows the number of topics assigned to each document.

Figure 2: The number of topics attached to each document, in order of rank.

I was curious how strongly each topic coupled with other topics, and how topics tended to cluster together in general, so I extracted a topic co-occurrence network. That is, whenever two topics appear on the same document, they are connected by an edge (see Networks Demystified Pt. 1 for a brief introduction to this sort of network); the more times two topics co-occur, the stronger the weight of the edge between them.
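In case it's useful to anyone building a similar network, a minimal sketch of that extraction in R might look like the code below. Since I'm not releasing the submission data, doc.topics here is a purely hypothetical stand-in: one entry per submission, listing the topics its authors attached.

# doc.topics: one character vector of topics per submission (hypothetical data).
doc.topics <- list(
  c("Literary Studies", "Text Analysis", "Data Mining/ Text Mining"),
  c("Visualization", "Data Mining/ Text Mining"),
  c("Teaching and Pedagogy", "Programming")
)

# Every unordered pair of topics that appears together on a document.
topic.pairs <- do.call(rbind, lapply(doc.topics, function(topics) {
  if (length(topics) < 2) return(NULL)
  t(combn(sort(topics), 2))
}))

# Edge weights: the number of documents on which each pair of topics co-occurs.
edges <- aggregate(list(weight = rep(1, nrow(topic.pairs))),
                   by = list(source = topic.pairs[, 1], target = topic.pairs[, 2]),
                   FUN = sum)
edges[order(-edges$weight), ]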

Topping off the list at 34 co-occurrences were “Data Mining/ Text Mining” and “Text Analysis,” not terrifically surprising as the latter generally requires the former, followed by “Data Mining/ Text Mining” and “Content Analysis” at 23 co-occurrences, “Literary Studies” and “Text Analysis” at 22 co-occurrences, “Content Analysis” and “Text Analysis” at 20 co-occurrences, and “Data Mining/ Text Mining” and “Literary Studies” at 19 co-occurrences. Basically what I’m saying here is that Literary Studies, Mining, and Analysis seem to go hand-in-hand.

Knowing my readers, about half of you are already angry with me for counting raw co-occurrences, and rightly so. That measurement is heavily biased by the sheer number of times a topic is used; if “literary studies” is attached to 65 submissions, it’s much more likely to co-occur with any particular topic than a topic (like “teaching and pedagogy”) which simply appears less frequently. The highest frequency topics will co-occur with one another simply by an accident of magnitude.

To account for this, I measured the neighborhood overlap of each node on the topic network. This involves first finding the number of other topics a pair of topics shares. For example, “teaching and pedagogy” and “digital humanities – pedagogy and curriculum” each co-occur with several of the same topics, including “programming,” “interdisciplinary collaboration,” and “project design, organization, management.” I summed up the number of co-occurring topics shared by each pair of topics, and then divided that total by the number of co-occurrences each node in the pair had individually. In short, I looked at which pairs of topics tended to share similar other topics, making sure to take into account that some topics are used much more frequently than others and need some normalization. There are better normalization algorithms out there, but I opted for this one for its simplicity, for pedagogical reasons. The method does a great job leveling the playing field between pairs of infrequently-used topics and pairs of frequently-used topics, but doesn’t fare so well when looking at a pair where one topic is popular and the other is not. The idea is illustrated in Figure 3, where the darker the edge, the higher the neighborhood overlap.

Figure 3: The neighborhood overlap between two nodes is how many neighbors (or connections) that pair of nodes shares. As such, A and B share very few connections, so their overlap is low, whereas D and E have quite a high overlap. Via Jaroslav Kuchar.
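In code, a Jaccard-style version of that overlap looks something like the sketch below, reusing the hypothetical edge list from the earlier sketch; the exact normalization behind the figures here differs slightly, but the idea is the same.

# Neighborhood overlap of two topics: the neighbors they share, divided by
# all the neighbors either of them has (a Jaccard-style normalization).
neighborhood.overlap <- function(edges, a, b) {
  neighbors <- function(x) {
    unique(c(edges$target[edges$source == x], edges$source[edges$target == x]))
  }
  na <- setdiff(neighbors(a), b)
  nb <- setdiff(neighbors(b), a)
  length(intersect(na, nb)) / length(union(na, nb))
}

# e.g., neighborhood.overlap(edges, "Text Analysis", "Visualization")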

Neighborhood overlap paints a slightly different picture of the network. The pair of topics with the largest overlap was “Internet / World Wide Web” and “Visualization,” with 90% of their neighbors overlapping. Unsurprisingly, the next-strongest pair was “Teaching and Pedagogy” and “Digital Humanities – Pedagogy and Curriculum.” The data might be used to suggest multiple topics that could be merged into one, and this pair seems to be a pretty good candidate. “Visualization” also closely overlaps “Data Mining/ Text Mining”, which itself (as we saw before) overlaps with “Cultural Studies” and “Literary Studies.” What we see in this close clustering, both in overlap and in connection strength, are the traces of a fairly coherent subfield within DH: quantitative literary studies. We see a similarly tight-knit cluster among topics concerning archives, databases, analysis, the web, visualizations, and interface design, which suggests another genre in the DH community: the (relatively) recent boom of user interfaces as workbenches for humanists exploring their archives. Figure 4 represents the pairs of topics which overlap to the highest degree; topics without high degrees of pair correspondence don’t appear on the network graph.

Figure 4: Network of topical neighborhood overlap. Edges between topics are weighted according to how structurally similar the two topics are. Topics that are structurally isolated are not represented in this network visualization.

The topics authors chose for each submission were from a controlled vocabulary. Authors also had the opportunity to attach their own keywords to submissions, which unsurprisingly yielded a much more diverse (and often redundant) network of co-occurrences. The resulting network revealed a few surprises: for example, “topic modeling” appears to be much more closely coupled with “visualization” than with “text analysis” or “text mining.” Of course some pairs are not terribly surprising, as with the close connection between “Interdisciplinary” and “Collaboration.” The graph also shows that the organizers have done a pretty good job putting the curated topic list together, as a significant chunk of the keywords that survive a high co-occurrence threshold are also available in the topic list, with a few notable exceptions. “Scholarly Communication,” for example, is a frequently used keyword but not available as a topic – perhaps next year, this sort of analysis can be used to help augment the curated topic list. The keyword network appears in Figure 5. I’ve opted not to include a truly high resolution image, to dissuade readers from trying to infer individual documents from the keyword associations.

Figure 5: Which keywords are used together on documents submitted to DH2013? Nodes are colored by cluster, and edges are weighted by number of co-occurrences. Click to enlarge.

There’s quite a bit of rich data here to be explored, and anyone who does have access to the bidding can easily see that the entire point of my group’s submission is exploring the landscape of DH, so there’s definitely more to come on the subject from this blog. I especially look forward to seeing what decisions wind up being made in the peer review process, and whether or how that skews the scholarly landscape at the conference.

On a more reflexive note, looking at the data makes it pretty clear that DH isn’t as fractured as some occasionally suggest (New Media vs. Archives vs. Analysis, etc.). Every document is related to a few others, and they are all of them together connected in a rich family, a network, of Digital Humanities. There are no islands or isolates. While there might be no “The” Digital Humanities, no unifying factor connecting all research, there are Wittgensteinian family resemblances  connecting all of these submissions together, in a cohesive enough whole to suggest that yes, we can reasonably continue to call our confederation a single community. Certainly, there are many sub-communities, but there still exists an internal cohesiveness that allows us to differentiate ourselves from, say, geology or philosophy of mind, which themselves have their own internal cohesiveness.

Topic nets

I’m sorry. I love you (you know who you are, all of you). I really do. I love your work, I think it’s groundbreaking and transformative, but the network scientist / statistician in me twitches uncontrollably whenever he sees someone creating a network out of a topic model by picking the top-topics associated with each document and using those as edges in a topic-document network. This is meant to be a short methodology post for people already familiar with LDA and already analyzing networks it produces, so I won’t bend over backwards trying to re-explain networks and topic modeling. Most of my posts are written assuming no expert knowledge, so I apologize if in the interest of brevity this one isn’t immediately accessible.

MALLET, the go-to tool for topic modeling with LDA, outputs a comma-separated file where each row represents a document, and each subsequent pair of columns gives how strongly the document is associated with a topic, and which topic that is. The output looks something like

        Topic 1 | Topic 2 | Topic 3  | ...
Doc 1 | 0.5 , 1 | 0.2 , 5 | 0.1  , 2 | ...
Doc 2 | 0.4 , 6 | 0.3 , 1 | 0.06 , 3 | ...
Doc 3 | 0.6 , 2 | 0.2 , 3 | 0.1  , 1 | ...
Doc 4 | 0.5 , 5 | 0.3 , 2 | 0.01 , 6 | ...

Each pair is the amount a document is associated with a certain topic followed by the topic of that association. Given a list like this, it’s pretty easy to generate a bimodal/bipartite network (a network of two types of nodes) where one variety of node is the document, and another variety of node is a topic. You connect each document to the top three (or n) topics associated with that document and, voila, a network!
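To make the approach concrete before picking it apart, here's roughly what that construction looks like in R; the little data frame below is a made-up stand-in mimicking the toy output above, not real MALLET output.

# Hypothetical doc-topic table: columns alternate (weight, topic), sorted by weight.
doc.topics <- data.frame(w1 = c(0.5, 0.4),  t1 = c(1, 6),
                         w2 = c(0.2, 0.3),  t2 = c(5, 1),
                         w3 = c(0.1, 0.06), t3 = c(2, 3),
                         row.names = c("Doc 1", "Doc 2"))

# The top-n approach: connect each document to its n highest-weighted topics,
# throwing the weights themselves away.
n <- 3
edges <- do.call(rbind, lapply(rownames(doc.topics), function(doc) {
  data.frame(document = doc,
             topic = unlist(doc.topics[doc, paste0("t", 1:n)]),
             stringsAsFactors = FALSE)
}))
edges   # a bipartite document-topic edge list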

The problem here isn't that a giant chunk of the data is just being thrown away (although there are more elegant ways to handle that too), but the way in which the kept portion is chosen. By using the top-n approach, you lose the rich topic-weight data showing that some documents are really only closely associated with one or two topics, whereas others are closely associated with many. In practice, the network graph generated by this approach will severely skew the results, artificially pulling documents which are topical outliers toward the center of the graph, and preventing documents in the topical core from being represented as such.

In order to account for this skewing, an equally simple (and equally arbitrary) approach can be taken whereby you only keep connections with a weight over 0.2 (or whatever threshold m you choose). Now, some documents are related to one or two topics and some are related to several, which more accurately represents the data and doesn't artificially skew network measurements like centrality.

The real trouble comes when a top-n topic network is converted from a bimodal to a unimodal network, where you connect documents to one another based on the topics they share. That is, if Document 1 and Document 4 are both connected to Topics 4, 2, and 7, they get a connection to each other of weight 3 (if they were only connected to 2 of the same topics, they’d get a connection of weight 2, and so forth). In this situation, the resulting network will be as much an artifact of the choice of n as of the underlying document similarity network. If you choose different values of n, you’ll often get very different results.

Converting a bimodal network to a unimodal network.
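Continuing the sketch above, the projection step itself looks something like this: every pair of documents gets an edge weighted by the number of top-n topics they share, which is precisely where the choice of n sneaks into the final network.

# Project the hypothetical bipartite edge list down to a document-document network.
shared <- merge(edges, edges, by = "topic")
shared <- shared[as.character(shared$document.x) < as.character(shared$document.y), ]
unimodal <- aggregate(list(weight = rep(1, nrow(shared))),
                      by = list(doc1 = shared$document.x,
                                doc2 = shared$document.y),
                      FUN = sum)
unimodal   # re-run with a different n above and the edge weights change with it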

In this case, the solution is to treat every document as a vector of topics with associated weights, making sure to use all the topics, such that you’d have a list that looks somewhat like the original topic CSV, except this time ordered by topic number rather than individually for each document by topic weight.

      T1, T2, T3,...
Doc4(0.2,0.3,0.1,...)
Doc5(0.6,0.2,0.1,...)
...

From here you can use your favorite correlation or distance finding algorithm (cosine similarity, for example) to find the distance from every document to every other document. Whatever you use, you’ll come up with a (generally) symmetric matrix from every document to every other document, looking a bit like this.

      Doc1|Doc2|Doc3,...
Doc1  1   |0.3 |0.1
Doc2  0.3 |1   |0.4 
Doc3  0.1 |0.4 |1
...

If you chop off the bottom left or top right triangle of the matrix, you now have a network of document similarity which takes the entire topic model into account, not just the first few topics. From here you can set whatever arbitrary m thresholds seem legitimate to visually represent the network in an uncluttered way, for example only showing documents that are more than 50% topically similar to one another, while still being sure that the entire richness of the underlying topic model is preserved, not just the first handful of topical associations.
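A compact R sketch of that whole pipeline, from a full document-topic matrix to a thresholded edge list, might look like the following; the weights are invented for illustration, and the 0.5 cutoff is as arbitrary as any other m.

# Rows are documents, columns are topics, values are the full topic weights.
doc.topic <- rbind(Doc4 = c(0.2, 0.3, 0.1, 0.4),
                   Doc5 = c(0.6, 0.2, 0.1, 0.1),
                   Doc6 = c(0.1, 0.3, 0.2, 0.4))

# Cosine similarity between every pair of document vectors.
norms <- sqrt(rowSums(doc.topic^2))
similarity <- (doc.topic %*% t(doc.topic)) / (norms %o% norms)

# Keep one triangle of the matrix, then threshold it (m = 0.5 here) to get an
# undirected edge list suitable for visualization.
similarity[lower.tri(similarity, diag = TRUE)] <- NA
strong <- which(similarity > 0.5, arr.ind = TRUE)
data.frame(source = rownames(similarity)[strong[, "row"]],
           target = colnames(similarity)[strong[, "col"]],
           weight = similarity[strong])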

Of course, whether this method is any more useful than something like LSA in clustering documents is debatable, but I just had to throw my 2¢ in the ring regarding topical networks. Hope it’s useful.

Making pretty things with R and ggplot2

This isn’t going to be a long tutorial. I’ve just had three people asking how I made the pretty graphs on my last post about counting citations, and I’m almost ashamed to admit how easy it was. Somebody with no experience coding can (I hope) follow these steps and make themselves a pretty picture with the data provided, and understand how it was created.

library(ggplot2)
sci=read.csv("scicites.csv")
qplot(Cited, data = sci, geom="density", fill=YearRange, log="x", xlab="Number of Citations", ylab="Density", main="Density of citations per 8 years", alpha=I(.5))

That’s the whole program. Oh, also this table, saved as a csv:

[table id=3 /]

 

And that was everything I used to produce this graph:

Density graph made with R.

Quick Walkthrough

Installation

The first thing you need to make this yourself is the programming language R (an awesome language for statistical analysis) installed on your machine, which you can get here. Download it and install it; it's okay, I'll wait. Now, R by itself is not fun to code in, so my favorite program to use when writing R code is called RStudio; go install that too. Next you're going to have to install the visualization package, which is called ggplot2. You do this from within RStudio itself, so open up the newly installed program. If you're running Windows Vista or 7, don't open it up the usual way; right click on the icon and click 'Run as administrator' – you need to do this so it'll actually let you install the package. Once you've opened up RStudio, at the bottom of the program there's a section of your screen labeled 'Console', with a blinking text cursor. In the console, type install.packages("ggplot2") and hit enter. Congratulations, ggplot2 is now installed.

Now download this R file ('Save as') that I showed you before and open it in RStudio ('File -> Open File'). It should look a lot like that code at the beginning of the post. Now go ahead and download the csv shown above as well, and be sure to put it in the same directory 1 you put the R code. Once you've done that, in RStudio click 'Tools -> Set Working Directory -> To Source File Location', which will help R figure out where the csv is that you just downloaded.

Before I go on explaining what each line of the code does, run it and see what happens! Near the top right of the code pane, there's a row of buttons: one that says 'Run' and, to its right, one that says 'Source'. Click 'Source'. Voila, a pretty picture!

Code

Now to go through the code itself, we'll start with line 1. library(ggplot2) just means that we're going to be using ggplot2 to make the visualization, and lets R know to look for it when it's about to put out the graphics.

Line 2 is fairly short as well, sci=read.csv("scicites.csv"), and it creates a new variable called sci which contains the entire csv file you downloaded earlier. read.csv("scicites.csv") is a command that tells R to read the csv file in the parentheses, and setting the variable sci as equal to that read file just saves it.

Line 3 is where the magic happens.

qplot(Cited, data = sci, geom="density", fill=YearRange, log="x", xlab="Number of Citations", ylab="Density", main="Density of citations per 8 years", alpha=I(.5))

The entire line is surrounded by the parenthetical command qplot() which is just our way of telling R "hey, plot this bit in here!" The first thing inside the parentheses is Cited, which you might recall was one of the columns in the CSV file. This is telling qplot() what column of data it's going to be plotting, in this case, the number of citations that papers have received. Then, we tell qplot() where that data is coming from with the command data = sci, which sets what table the data column is coming from. After that geom="density" appears. geom is short for 'Geometric Object' and it sets what the graph will look like. In this case we're making a density graph, so we give it "density", but we could just as easily have used something like "histogram" or "line".

The next bit is fill=YearRange, which you might recall was another column in the csv. This is a way of breaking the data we're using into categories; in this case, the data are categorized by which year range they fall into. fill is one way of categorizing the data by filling in the density blobs with automatically assigned colors; another way would be to replace fill with color. Try it and see what happens. After the next comma is log="x", which puts the x-axis on a log scale, making the graph a bit easier to read. Take a look at what the graph looks like if you delete that part of it.

Now we have a big chunk of code devoted to labels: xlab="Number of Citations", ylab="Density", main="Density of citations per 8 years". As can probably be surmised, xlab corresponds to the label on the x-axis, ylab to the label on the y-axis, and main to the title of the graph. The very last part, before the closing parenthesis, is alpha=I(.5). alpha sets the transparency of the basic graph elements, which is why the colored density blobs all look a little bit transparent. I set their transparency to .5 so they'd each still be visible behind the others. You can set the value between 0 and 1, with the former being completely transparent and the latter completely opaque.

There you have it, easy-peasy. Play around with the csv, try adding your own data, and take a look at this chapter from “ggplot2: Elegant Graphics for Data Analysis” to see what other options are available to you.

Notes:

  1. thanks Andy for the correction!

How many citations does a paper have to get before it’s significantly above baseline impact for the field?

[Note: This blog post was originally hidden because it’s not aimed at my usual audience. I decided to open it up because, hey, I guess it’s okay for all you humanists and data scientists to know that one of the other hats I wear is that of an informetrician. Another reason I kept it hidden is because I’m pretty scared of how people use citation impact ratings to evaluate research for things like funding and tenure, often at the expense of other methods that ought be used when human livelihoods are at stake. So please don’t do that.]

It depends on the field, and field is defined pretty loosely. This post is in response to a twitter conversation between mrgunn, myself, and some others. mrgunn thinks citation data ought to be freely available, and I agree with him, although I believe data is difficult enough to gather and maintain that a service charge for access is fair, if a clever free alternative is lacking. I’d love to make a clever free alternative (CiteSeerX already is getting there), but the best data still comes from expensive sources like ISI’s Web of Science or Scopus.

At any rate, the question is an empirical one, and one that lots of scientometricians have answered in a number of ways. I’m going to perform my own SSA (Super-Stupid Analysis) here, and I won’t bother taking statistical regression models or Bayesian inferences into account, because you can get a pretty good sense of “impact” (if you take citations to be a good proxy for impact, which is debatable – I won’t even get into using citations as a proxy for quality) using some fairly simple statistics. For the mathy and interested, a forthcoming paper by Evans, Hopkins, and Kaube treats the subject more seriously in Universality of Performance Indicators based on Citation and Reference Counts.

I decided to use the field of Scientometrics, because it’s fairly self-contained (and I love being meta), and I drew my data from ISI’s Web of Science. I retrieved all articles published in the journal Scientometrics up until 2009, which is a nicely representative sample of the field, and then counted the number of citations to each article in a given year. Keep in mind that if you’re wondering how much your Scientometrics paper stood out above its peers in citations with this chart, you have to use ISI’s citation count to your paper, otherwise you’re comparing apples to something else that isn’t apples.

Figure 1. Histogram of citations to papers, with the height of each bar representing the number of papers cited x times. The colors break down the bars by year. (Click to enlarge)
Figure 2. Same as Figure 1, but with the x axis on a log scale.

According to Figure 1 and Figure 2 (Fig. 2 is the same as Fig. 1, but with the x axis on a log scale to make the data a bit easier to read), it's immediately clear that citations aren't normally distributed. This tells us right away that some basic statistics simply won't tell us much about these data. For example, if we take the average number of citations per paper, by adding up each paper's citation count and dividing by the total number of papers, we get 7.8 citations per paper. However, because the data are so skewed to one side, over 70% of the papers in the set fall below that average (that is, over 70% of papers have been cited seven or fewer times). In this case, a slightly better measurement is the median, which is 4. That is, about half the papers have fewer than four citations. About a fifth of the papers have no citations at all.
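For anyone playing along at home, those summary numbers fall out in a couple of lines of R, assuming a per-paper table like the scicites.csv from the ggplot2 walkthrough elsewhere on this blog (one row per paper, with its Cited count and YearRange):

sci <- read.csv("scicites.csv")     # assumed per-paper export: Cited, YearRange

mean(sci$Cited)                     # works out to roughly 7.8 citations per paper
median(sci$Cited)                   # roughly 4
mean(sci$Cited < mean(sci$Cited))   # share of papers below the average, just over 70%
mean(sci$Cited == 0)                # share of papers never cited, about a fifth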

If we look at the colors of Figure 1, which breaks down each bar by year, we can see that the data aren’t really evenly distributed by years, either. Figure 3 breaks this down a bit better.

Figure 3. Number of papers published in the journal Scientometrics each year, colored by the number of citations each paper received.

In Figure 3, you can see the number of papers published in a given year, with the colors representing how many citations each paper has received; the red end of the spectrum marks papers cited very little, and the violet end marks highly cited papers. Immediately we see that the most recent years don't have many highly cited articles, so the first thing we should do is normalize by year. That is, an article published this year shouldn't be held to the same standard as an article that's had twenty years to slowly accrue citations.

To make these data a bit easier to deal with, I’ve sliced the set into 8-year chunks. There are smarter ways to do this, but like I said, we’re keeping the analysis simple for the sake of presentation. Figure 4 is the same as Figure 3, but separated out into the appropriate time slices.

Figure 4. Same as figure 3, but separated into 8 year time slices.

Now, to get back to the original question, mrgunn asked how many citations a paper needs to be above the fold. Intuitively, we'd probably call a paper highly impactful if it's in the blue or violet sections of its time slice (sorry for those of you who are colorblind, I just mean the small top-most area). There's another way to look at these data that'll make it a bit easier to eyeball how many more citations a paper has received than its peers; a density graph. Figure 5 shows just that.

Figure 5. Each color blob represents a time slice, with the height at any given point representing the proportion of papers in that chunk of time which have x citations. The x axis is on a log scale.

Looking at Figure 5, it’s easy to see that a paper published before 2008 with fewer than half a dozen citations is clearly below the norm. If the paper were published after 2008, it could be above the norm even if it had only a small handful of citations. A hundred citations is clearly “highly impactful” regardless of the year the paper was published. To get a better sense of papers that are above the baseline, we can take a look at the actual numbers.

The table below (excuse the crappy formatting, I’ve never tried to embed a big table in WP before) shows the percent of papers which have x citations or fewer in a given time slice. That is, 24% of papers published before 1984 have no citations to them, 31% of papers published before 1984 have 0 or 1 citations to them, 40% of papers published before 1984 have 0, 1, or 2 citations to them, and so forth. That means if you published a paper in Scientometrics  in 1999 and ISI’s Web of Science says you’ve received 15 citations, it means your paper has received more citations than 80% of the other papers published between 1992 and 2000.

[table id=2 /]
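If you want to reproduce the flavor of that table yourself, an empirical CDF per time slice does the trick; again assuming the per-paper scicites.csv described in the ggplot2 walkthrough:

sci <- read.csv("scicites.csv")     # assumed per-paper export: Cited, YearRange

# Percent of papers in each time slice with x citations or fewer, for x = 0..20.
cum.table <- round(100 * sapply(split(sci$Cited, sci$YearRange),
                                function(x) ecdf(x)(0:20)))
rownames(cum.table) <- 0:20
cum.table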

 

The conversation also brought up the point of whether this should be a clear binary at the ends of the spectrum (paper A is low impact because it received only a handful of citations, paper B is high impact because it received 150, but we can't really tell anything in between), or whether we could get a more nuanced view of the spectrum. A combined qualitative/quantitative analysis would be required for a really good answer to that question, but looking at the numbers in the table above, we can see pretty quickly that while 1 citation is pretty different from 2 citations, 38 citations is pretty much the same as 39. That is, the "jitter" of precision probably increases exponentially with the number of citations: at very low citation counts the "impact" precision is quite high, and it gets exponentially lower the more citations a paper has received.

All this being said, I do agree with mrgunn that a free and easy to use resource for this sort of analysis would be good. However, because citations often don’t equate to quality, I’d be afraid this tool would just make it easier and more likely for people to make sweeping and inaccurate quality measurements for the purpose of individual evaluations.

On Simplicity

You can build complex arguments on a very simple foundation
Ted Underwood

Celestial Navigation

The world is full of very complex algorithms honed to solve even more complex problems. When you use your phone as a GPS, it’s not a simple matter of triangulating signals from towers or satellites. Because your GPS receiver has to know the precise time (in nanoseconds) at the satellites it’s receiving signals from, and because the satellites are moving at various speeds and orbiting at an altitude where the force of gravity is significantly different, calculating times gets quite complicated due to the effects of relativity. The algorithms that allow the GPS to work have to take relativity into account, often on the fly, and without those complex algorithms the GPS would not be nearly so precise.

Precision and complexity go hand-in-hand, and often the relationship between the two is non-linear. That is, a little more precision often requires lot more complexity. Ever-higher levels of precision get exponentially more difficult to achieve. The traditional humanities is a great example of this; a Wikipedia article can sum up most of what one needs to know regarding, say, World War I, but it takes many scholars many lifetimes to learn  everything. And the more that’s already been figured out, the more work we need to do to find the smallest next piece to understand.

This level of precision is often important, insightful, and necessary to make strides in a field. Whereas before an earth-centered view of the universe was good enough to aid in navigation and predict the zodiac and the seasons, a heliocentric model was required for more precise predictions of the movements of the planets and stars. However, these complex models are not always the best ones for a given situation, and sometimes simplicity and a bit of cleverness can go much further than whatever convoluted equation yields the most precise possible results.

Sticking with the example of astronomy, many marine navigation schools still teach the geocentric model; not because they don’t realize the earth moves, but because navigation is simply easier when you pretend the earth is fixed and everything moves around it. They don’t need to tell you exactly when the next eclipse will be, they just need to figure out where they are. Similarly, your phone can usually pinpoint you within a few blocks by triangulating itself between cellphone towers, without ever worrying about satellites or Einstein.

Geocentric celestial navigation chart from a class assignment.

Whether you need to spend the extra time figuring out relativistic physics or heliocentric astronomical models depends largely on your purpose. If you’re just trying to find your way from Miami to New York City, and for some absurd reason you can only rely on technology you’ve created yourself for navigation, the simpler solution is probably the best way to go.

Simplicity and Macroanalysis

If I’ve written over-long on navigation, it’s because I believe it to be a particularly useful metaphor for large-scale computational humanities. Franco Moretti calls it distant reading, Matthew Jockers calls it macroanalysis, and historians call it… well, I don’t think we’ve come up with a name for it yet. I’d like to think we large-scale computational historians share a lot in spirit with big history, though I rarely see that connection touched on. My advisor likes shifting the focus from what we’re looking at to what we’re looking through, calling tools that help us see the whole of something macroscopes, as opposed to the microscopes which help us reach ever-greater precision.

Whatever you choose to call it, the important point is the shifting focus from precision to contextualization. Rather than exploring a particular subject with ever-increasing detail and care, it’s important to sometimes take a step back and look at how everything fits together. It’s a tough job because ‘everything’ is really quite a lot of things, and it’s easy to get mired in the details. It’s easy to say “well, we shouldn’t look at the data this way because it’s an oversimplification, and doesn’t capture the nuance of the text,” but capturing the nuance of the text isn’t the point. I have to admit, I sent an email to that effect to Matthew Jockers regarding his recent DH2012 presentation,  suggesting that time and similarity were a poor proxy for influence. But that’s not the point, and I was wrong in doing so, because the data still support his argument of authors clustering stylistically and thematically by gender, regardless of whether he calls the edges ‘influence’ or ‘similarity.’

I wrote a post a few months back on halting conditions, figuring out that point when adding more and more detailed data stops actually adding to the insight and instead just contributes to the complexity of the problem. I wrote

Herein lies the problem of humanities big data. We’re trying to measure the length of a coastline by sitting on the beach with a ruler, rather than flying over it with a helicopter and a camera. And humanists know that, like the sandy coastline shifting with the tides, our data are constantly changing with each new context or interpretation. Cartographers are aware of this problem, too, but they’re still able to make fairly accurate maps.

And this is the crux of the matter. If we’re trying to contextualize our data, if we’re trying to open our arms to collect everything that’s available, we need to keep it simple. We need a map, a simple way to navigate the deluge that is human history and creativity. This map will not be hand-drawn with rulers and yard-sticks, it will be taken via satellite, where only the broadest of strokes are clear. Academia, and especially the humanities, fetishizes the particular at the expense of the general. General knowledge is overview knowledge, is elementary knowledge. Generalist is a dirty word lobbed at popular authors who wouldn’t know a primary source if it fell on their head from the top shelf, covering them in dust and the smell of old paper.

Generality is not a vice. Simplicity can, at times, be a virtue. Sometimes you just want to know where the hell you are.

For these maps, a reasonable approximation is often good enough for most situations. Simple triangulation is good enough to get you from Florida to New York, and simply counting the number of dissertations published at ProQuest in a given year for a particular discipline is good enough to approximate the size of one discipline compared to another. Both lack nuance and are sure to run you into some trouble at the small scale, but often that scale is not necessary for the task at hand.

Two situations clearly shout for reasonable approximations: general overviews and contextualization. In the image below, Matthew Jockers showed that formal properties of novels tend to split around the genders of their authors; that is, men wrote differently and about different things than women.

Network graph of 19th century novels, with nodes (novels) colored according to the gender of their authors.

Of course this macroanalysis lacks a lot of nuance, and one can argue for years over which set of measurements might yield the best proxy for novel similarity, but as a base approximation the split is so striking that there is little doubt it is indicative of something interesting actually going on. Jockers has successfully separated signal from noise. This is a great example of how a simple approximation is good enough to provide a general overview, a map offering one useful view of the literary landscape.

Beyond general overviews and contextualizations, simple models and quantifications can lead to surprisingly concrete and particular results. Take Strato, a clever observer who died around 2,300 years ago. There’s a lot going on after a rainstorm. The sun glistens off the moist grass, little insects crawl their way out of the ground, water pours from the corners of the roof. Each one of these events are themselves incredibly complex and can be described in a multitude of ways; with water pouring from a roof, for example, you can describe the thickness of the stream, or the impression the splash makes on the ground below, or the various murky colors it might take depending on where it’s dripping from. By isolating one property of the pouring rainwater, the fact that it tends to separate into droplets as it gets closer to the ground, Strato figured out that the water moved faster the further it had fallen. That is, falling bodies accelerate. Exactly measuring that acceleration, and quite how it worked, would elude humankind for over a thousand years, but a very simple observation that tended to hold true in many situations was good enough to discover a profound property of physics.

A great example of using simple observations to look at specific historical developments is Ben Schmidt’s Poor man’s sentiment analysis. By looking at words that occur frequently after the word ‘free’ in millions of books, Ben is able to show the decreasing centrality of ‘freedom of movement’ after its initial importance in the 1840s, or the drop in the use of ‘free men’ after the fall of slavery. Interesting changes are also apparent in the language of markets and labor, which both fit well with our understanding of the history of the concepts and offer new pathways to study, especially around inflection points.

Ben Schmidt looks at the words that appear most frequently following the word ‘free’ in the google ngrams corpus.

Toward Simplicity

Complex algorithms and representations are alluring. Ben Fry says of network graphs

Even though a graph of a few hundred nodes quickly becomes unreadable, it is often satisfying for the creator because the resulting figure is elegant and complex and may be subjectively beautiful, and the notion that the creator’s data is “complex” fits just fine with the creator’s own interpretation of it.

And the same is often true of algorithms; the more complex the algorithm, the more we feel it somehow ‘fits’ the data, because we know our data are so complex to begin with. Oftentimes, however, the simplest solution is good enough (and often really good) for the broadest number of applications.

If there is any take-home message of this post, as a follow-up to my previous one on Halting Conditions, it’s that diminishing returns doesn’t just apply to the amount of data you’ve collected, it also applies to the analysis you plan on running them through. More data aren’t always better, and newer, more precise, more complex algorithms aren’t always better.

Spend your time coming up with clever, simpler solutions so you have more time to interpret your results soundly.

Topic Modeling for Humanists: A Guided Tour

It’s that time again! Somebody else posted a really clear and enlightening description of topic modeling on the internet. This time it was Allen Riddell, and it’s so good that it inspired me to write this post about topic modeling that includes no actual new information, but combines a lot of old information in a way that will hopefully be useful. If there’s anything I’ve missed, by all means let me know and I’ll update accordingly.

Introducing Topic Modeling

Topic models represent a class of computer programs that automagically extract topics from texts. What a topic actually is will be revealed shortly, but the crux of the matter is that if I feed the computer, say, the last few speeches of President Barack Obama, it’ll come back telling me that the president mainly talks about the economy, jobs, the Middle East, the upcoming election, and so forth. It’s a fairly clever and exceptionally versatile little algorithm that can be customized to all sorts of applications, and a tool that many digital humanists would do well to have in their toolbox.

From the outset it’s worth clarifying some vocabulary, and mentioning what topic models can and cannot do. “LDA” and “Topic Model” are often thrown around synonymously, but LDA is actually a special case of topic modeling in general, produced by David Blei and friends in 2002. It was not the first topic modeling tool, but it is by far the most popular, and has enjoyed copious extensions and revisions in the years since. The myriad variations of topic modeling have resulted in an alphabet soup of names that might be confusing or overwhelming to the uninitiated; ignore them for now. They all pretty much work the same way.

When you run your text through a standard topic modeling tool, what comes out the other end first is several lists of words. Each of these lists is supposed to be a “topic.” Using the example from before of presidential addresses, the list might look like:

  1. Job Jobs Loss Unemployment Growth
  2. Economy Sector Economics Stock Banks
  3. Afghanistan War  Troops Middle-East Taliban Terror
  4. Election Romney Upcoming President
  5. … etc.

The computer gets a bunch of texts and spits out several lists of words, and we are meant to think those lists represent the relevant “topics” of a corpus. The algorithm is constrained by the words used in the text; if Freudian psychoanalysis is your thing, and you feed the algorithm a transcription of your dream of bear-fights and big caves, the algorithm will tell you nothing about your father and your mother; it’ll only tell you things about bears and caves. It’s all text and no subtext. Ultimately, LDA is an attempt to inject semantic meaning into vocabulary; it’s a bridge, and often a helpful one. Many dangers face those who use this bridge without fully understanding it, which is exactly what the rest of this post will help you avoid.

Network generated by Elijah Meeks to show how digital humanities documents relate to one another via the topics they share.

Learning About Topic Modeling

The pathways to topic modeling are many and more, and those with different backgrounds and different expertise will start at different places. This guide is for those who’ve started out in traditional humanities disciplines and have little background in programming or statistics, although the path becomes more strenuous as we get closer to Blei’s original paper on LDA (as that is our goal). I will try to point to relevant training assistance where appropriate. A lot of the following posts repeat information, but there are often little gems in each which make them all worth reading.

No Experience Necessary

The following posts, read in order, should be completely understandable to pretty much everyone.

The Fable

Perhaps the most interesting place to start is the stylized account of topic modeling by Matt Jockers, who weaves a tale of authors sitting around the LDA buffet, taking from it topics with which to write their novels. According to Jockers, the story begins in a quaint town, . . .

somewhere in New England perhaps. The town is a writer’s retreat, a place they come in the summer months to seek inspiration. Melville is there, Hemingway, Joyce, and Jane Austen just fresh from across the pond. In this mythical town there is a spot popular among the inhabitants; it is a little place called the “LDA Buffet.” Sooner or later all the writers go there to find themes for their novels. . .

The blog post is a fun read, and gets at the general idea behind the process of a topic model without delving into any of the math involved. Start here if you are a humanist who’s never had the chance to interact with topic models.

A Short Overview

Clay Templeton over at MITH wrote a short, less-stylized overview of topic modeling which does a good job discussing the trio of issues currently of importance: the process of the model, the software itself, and applications in the humanities.

In this post I map out a basic genealogy of topic modeling in the humanities, from the highly cited paper that first articulated Latent Dirichlet Allocation (LDA) to recent work at MITH.

Templeton’s piece is concise, to the point, and offers good examples of topic models used for applications you’ll actually care about. It won’t tell you any more about the process of topic modeling than Jockers’ article did, but it’ll get you further into the world of topic modeling as it is applied in the humanities.

An Example: The American Political Science Review

Now that you know the basics of what a topic model actually is, perhaps the best thing is to look at an actual example to ground these abstract concepts. David Blei’s team shoved all of the journal articles from The American Political Science Review into a topic model, resulting in a list of 20 topics that represent the content of that journal. Click around on the page; when you click one of the topics, it sends you to a page listing many of the words in that topic, and many of the documents associated with it. When you click on one of the document titles, you’ll get a list of topics related to that document, as well as a list of other documents that share similar topics.

This page is indicative of the sort of output topic modeling will yield on a corpus. It is a simple and powerful tool, but notice that none of the automated topics have labels associated with them. The model requires us to make meaning out of them; they require interpretation, and without fully understanding the underlying algorithm, one cannot hope to properly interpret the results.

First Foray into Formal Description

Written by yours truly, this next description of topic modeling begins to get into the formal process the computer goes through to create the topic model, rather than simply the conceptual process behind it. The blog post begins with a discussion of the predecessors to LDA in an attempt to show a simplified version of how LDA works, and then uses those examples to show what LDA does differently. There’s no math or programming, but the post does attempt to bring up the relevant vocabulary and define it in terms familiar to those without programming experience.

With this matrix, LSA uses singular value decomposition to figure out how each word is related to every other word. Basically, the more often words are used together within a document, the more related they are to one another. It’s worth noting that a “document” is defined somewhat flexibly. For example, we can call every paragraph in a book its own “document,” and run LSA over the individual paragraphs.

Only the first half of this post is relevant to our topic modeling guided tour. The second half, a section on topic modeling and network analysis, discusses various extended uses that are best left for later.
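
To make the LSA idea quoted above a bit more concrete, here is a minimal sketch in Python using scikit-learn; the toy documents, the choice of two latent dimensions, and the library itself are my additions, not anything from the original post.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

documents = ["the economy and jobs and unemployment",
             "jobs growth and the stock market",
             "troops deployed in the war",
             "the war and the election"]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(documents)   # documents-by-words count matrix

svd = TruncatedSVD(n_components=2)               # keep two latent "concept" dimensions
doc_concepts = svd.fit_transform(doc_term)       # each document's position in concept space

print(doc_concepts)
print(svd.components_)                           # each concept's loading on every word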

Computational Process

Ted Underwood provides the next step in understanding what the computer goes through when topic modeling a text.

. . . it’s a long step up from those posts to the computer-science articles that explain “Latent Dirichlet Allocation” mathematically. My goal in this post is to provide a bridge between those two levels of difficulty.

Computer scientists make LDA seem complicated because they care about proving that their algorithms work. And the proof is indeed brain-squashingly hard. But the practice of topic modeling makes good sense on its own, without proof, and does not require you to spend even a second thinking about “Dirichlet distributions.” When the math is approached in a practical way, I think humanists will find it easy, intuitive, and empowering. This post focuses on LDA as shorthand for a broader family of “probabilistic” techniques. I’m going to ask how they work, what they’re for, and what their limits are.

His is the first post that talks in any detail about the iterative process going into algorithms like LDA, as well as some of the assumptions those algorithms make. He also shows the first formula appearing in this guided tour, although those uncomfortable with formulas need not fret. The formula is not essential to understanding the post, but for those curious, later posts will explicate it. And really, Underwood does a great job of explaining a bit about it there.

Be sure to read to the very end of the post. It discusses some of the important limitations of topic modeling, and trepidations that humanists would be wise to heed.  He also recommends reading Blei’s recent article on Probabilistic Topic Models, which will be coming up shortly in this tour.

Computational Process From Another Angle

It may not matter whether you read this or the last article by Underwood first; they’re both first passes at what the computer goes through to generate topics, and they explain the process in slightly different ways. The highlight of Edwin Chen’s blog post is his section on “Learning,” followed by a section expanding on that concept.

And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability).

This post both explains the meaning of these statistical notations, and tries to actually step the reader through the process using a metaphor as an example, a bit like Jockers’ post from earlier but more closely resembling what the computer is going through. It’s also worth reading through the comments on this post if there are parts that are difficult to understand.
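
For readers who like to see procedures as code, here is a rough sketch of the resampling step Chen describes, written with NumPy. The toy corpus, variable names, and hyperparameter values are all my own inventions; a real implementation would add conveniences like hyperparameter optimization and convergence checks.

import numpy as np

def gibbs_sweep(docs, assignments, n_dk, n_kw, n_k, alpha, beta):
    """One pass over every token: remove it from the counts, compute
    p(topic | everything else), sample a new topic, and add it back."""
    n_topics, vocab_size = n_kw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = assignments[d][i]
            n_dk[d, z] -= 1; n_kw[z, w] -= 1; n_k[z] -= 1      # take the token out
            # proportional to p(topic t | document d) * p(word w | topic t)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + beta * vocab_size)
            p /= p.sum()
            z = np.random.choice(n_topics, p=p)                # resample this token's topic
            assignments[d][i] = z
            n_dk[d, z] += 1; n_kw[z, w] += 1; n_k[z] += 1      # put the token back

# toy corpus: three documents over a vocabulary of five word IDs
docs = [[0, 1, 1, 2], [2, 3, 3, 4], [0, 4, 1, 3]]
n_topics, vocab_size, alpha, beta = 2, 5, 0.1, 0.01
assignments = [[np.random.randint(n_topics) for _ in doc] for doc in docs]
n_dk = np.zeros((len(docs), n_topics))
n_kw = np.zeros((n_topics, vocab_size))
n_k = np.zeros(n_topics)
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        z = assignments[d][i]
        n_dk[d, z] += 1; n_kw[z, w] += 1; n_k[z] += 1
for _ in range(100):
    gibbs_sweep(docs, assignments, n_dk, n_kw, n_k, alpha, beta)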

This ends the list of articles and posts that require pretty much no prior knowledge. Reading all of these should give you a great overview of topic modeling, but you should by no means stop here. The following section requires a very little bit of familiarity with statistical notation, most of which can be found at this Wikipedia article on Bayesian Statistics.

Some Experience Required

Not much experience! You can probably even ignore most of the formulae in these posts and still get a great deal out of them. Still, you’ll get the most out of the following articles if you can read signs related to probability and summation, both of which are fairly easy to look up on Wikipedia. The dirty little secret of most papers that include statistics is that you don’t actually need to understand all of the formulae to get the gist of the article. If you want to fully understand everything below, however, I’d highly suggest taking an introductory course or reading a textbook on Bayesian statistics. I second Allen Riddell in suggesting Hoff’s A First Course in Bayesian Statistical Methods (2009), Kruschke’s Doing Bayesian Data Analysis (2010), or Lee’s Bayesian Statistics: An Introduction (2004). My own favorite is Kruschke’s; there are puppies on the cover.

Return to Blei

David Blei co-wrote the original LDA article, and his descriptions are always informative. He recently published a great introduction to probabilistic topic models for those not terribly familiar with it, and although it has a few formulae, it is the fullest computational description of the algorithm, gives a brief overview of Bayesian statistics, and provides a great framework with which to read the following posts in this series. Of particular interest are the sections on “LDA and Probabilistic Models” and “Posterior Computation for LDA.”

LDA and other topic models are part of the larger field of probabilistic modeling. In generative probabilistic modeling, we treat our data as arising from a generative process that includes hidden variables. This generative process defines a joint probability distribution over both the observed and hidden random variables. We perform data analysis by using that joint distribution to compute the conditional distribution of the hidden variables given the observed variables. This conditional distribution is also called the posterior distribution.

Really, read this first. Even if you don’t understand all of it, it will make the following reads easier to understand.
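
If it helps to see the generative story Blei describes laid out as a procedure, here is a toy sketch in NumPy. The corpus sizes and hyperparameters are arbitrary choices of mine, and real LDA runs this process in reverse (inference), rather than forward as shown here.

import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, n_docs, doc_length = 3, 20, 5, 30
alpha, beta = 0.5, 0.1

# hidden variables: a word distribution per topic, a topic distribution per document
topic_word = rng.dirichlet([beta] * vocab_size, size=n_topics)
doc_topic = rng.dirichlet([alpha] * n_topics, size=n_docs)

documents = []
for d in range(n_docs):
    words = []
    for _ in range(doc_length):
        z = rng.choice(n_topics, p=doc_topic[d])      # draw a topic for this token
        w = rng.choice(vocab_size, p=topic_word[z])   # draw a word from that topic
        words.append(w)
    documents.append(words)

# Inference (what LDA actually does) runs this story backwards: given only
# `documents`, recover plausible topic_word and doc_topic -- the posterior.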

Back to Basics

The post that inspired this one, by Allen Riddell, explains the mixture of unigrams model rather than the LDA model, which allows Riddell to back up and explain some important concepts. The intended audience of the post is those with an introductory background in Bayesian statistics, but it offers a lot even to those who lack it. Of particular interest is the concrete example he uses, articles from German Studies journals, and how he actually walks you through the updating procedure of the algorithm as it infers topic and document distributions.

The second move swaps the position of our ignorance. Now we guess which documents are associated with which topics, making the assumption that we know both the makeup of each topic distribution and the overall prevalence of topics in the corpus. If we continue with our example from the previous paragraph, in which we had guessed that “literary” was more strongly associated with topic two than topic one, we would likely guess that the seventh article, with ten occurrences of the word “literary”, is probably associated with topic two rather than topic one (of course we will consider all the words, not just “literary”). This would change our topic assignment vector to z=(1,1,1,1,1,1,2,1,1,1,2,2,2,2,2,2,2,2,2,2). We take each article in turn and guess a new topic assignment (in many cases it will keep its existing assignment).

The last section, discussing the choice of number of topics, is not essential reading but is really useful for those who want to delve further.
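
As a rough illustration of the document-reassignment move Riddell walks through, here is a tiny NumPy sketch. The toy numbers are mine, and where Riddell’s post samples a new topic in proportion to its probability, this sketch just grabs the single most likely one to keep things short.

import numpy as np

# guessed word distributions for two topics over a four-word vocabulary
topic_word = np.array([[0.4, 0.3, 0.2, 0.1],
                       [0.1, 0.1, 0.3, 0.5]])
topic_prevalence = np.array([0.5, 0.5])         # guessed overall prevalence of each topic

# each row: how often each vocabulary word appears in one document
doc_word_counts = np.array([[5, 3, 1, 0],
                            [0, 1, 4, 6]])

# log p(topic) + sum over words of count * log p(word | topic)
log_scores = np.log(topic_prevalence) + doc_word_counts @ np.log(topic_word).T
new_assignments = log_scores.argmax(axis=1)
print(new_assignments)   # [0 1]: the first document goes to topic one, the second to topic two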

Some Necessary Concepts in Text Mining

Both a case study and a helpful description, David Mimno’s recent article on Computational Historiography from ACM Transactions on Computational Logic goes through a hundred years of Classics journals to learn something about the field (very similar to Riddell’s article on German Studies). While the article should be read as a good example of topic modeling in the wild, of specific interest to this guide is his “Methods” section, which includes an important discussion about preparing text for this sort of analysis.

In order for computational methods to be applied to text collections, it is first necessary to represent text in a way that is understandable to the computer. The fundamental unit of text is the word, which we here define as a sequence of (unicode) letter characters. It is important to distinguish two uses of word: a word type is a distinct sequence of characters, equivalent to a dictionary headword or lemma; while a word token is a specific instance of a word type in a document. For example, the string “dog cat dog” contains three tokens, but only two types (dog and cat).

What follows is a description of the primitive objects of a text analysis, and how to deal with variations in words, spelling, various languages, and so forth. Mimno also discusses smoothed distributions and word distance, both important concepts when dealing with these sorts of analyses.
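
Mimno’s type/token distinction is easy to see in a couple of lines of Python (the example string comes from the quote above):

from collections import Counter

tokens = "dog cat dog".split()    # word tokens: every occurrence counts
types = Counter(tokens)           # word types: distinct character sequences

print(len(tokens))                # 3 tokens
print(len(types))                 # 2 types
print(types)                      # Counter({'dog': 2, 'cat': 1})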

Further Reading

By now, those who managed to get through all of this can probably understand most of the original LDA paper by Blei, Ng, and Jordan (most of it will be review!), but there’s a lot more out there than that original article. Mimno has a wonderful bibliography of topic modeling articles, and they’re tagged by topic to make finding the right one for a particular application that much easier.

Applications: How To Actually Do This Yourself

David Blei’s website on topic modeling has a list of available software, as does a section of Mimno’s Bibliography. Unfortunately, almost everything in those lists requires some knowledge of programming, and as yet I know of no really simple implementation of topic modeling. There are a few implementations for humanists that are supposed to be released soon, but to my knowledge, at the time of this writing the simplest tool to run your text through is called MALLET.

MALLET is a tool that does require a bit of comfort with the command line, though it’s really just the same four commands or so over and over again. It’s fairly simple software to run once you’ve gotten the hang of it, but that first part of the learning curve can be a bit more like a learning cliff.

On their website, MALLET has a link called “Tutorial” – don’t click it. Instead, after downloading and installing the software, follow the directions on the “Importing Data” page. Then, follow the directions on the “Topic Modeling” page. If you’re a Windows user, Shawn Graham, Ian Milligan, and I wrote a tutorial on how to get it running when you run into a problem (and if this is your first time, you will), and it also includes directions for Macs. Unfortunately, a more detailed tutorial is beyond the scope of this tour, but between these links you’ve got a good chance of getting your first topic model up and running.
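
As an aside of my own, and not part of the MALLET instructions above: if the command line really isn’t your thing, Python libraries such as gensim expose roughly the same workflow. A minimal sketch, assuming gensim is installed and your documents are already read in as strings (real projects would want proper tokenizing and stopword removal):

from gensim import corpora, models

raw_docs = ["the economy and jobs report",
            "troops in the middle east",
            "the upcoming election and the president"]
tokenized = [doc.lower().split() for doc in raw_docs]        # crude tokenizing for the sketch

dictionary = corpora.Dictionary(tokenized)                   # maps each word to an integer ID
corpus = [dictionary.doc2bow(doc) for doc in tokenized]      # bag-of-words counts per document

lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic in lda.print_topics(num_words=5):                  # the "lists of words" discussed earlier
    print(topic)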

Examples in the DH World

There are a lot of examples of topic modeling out there, and here are some that I feel are representative of the various uses it can be put to. I’ve already mentioned David Mimno’s computational historiography of classics journals, as well as Allen Riddell’s similar study of German Studies publications. Both papers are good examples of using topic modeling as a meta-analysis of a discipline. Turning the gaze towards our collective navels, Matt Jockers used LDA to find what’s hot in the Digital Humanities, and Elijah Meeks has a great process piece looking at topics in definitions of digital humanities and humanities computing.

Lisa Rhody has an interesting exploratory topical analysis of poetry, and Rob Nelson also discusses (briefly) making an argument via topic modeling applied to poetry, which he expands in this New York Times blog post. Continuing in the literary vein, Ted Underwood talks a bit about the relationship of words to topics, as well as a curious find linking topic models and family relations.

One of the great and oft-cited examples of topic modeling in the humanities is Rob Nelson’s Mining the Dispatch, which looks at the changing discussion during the American Civil War through an analysis of primary texts. Just as Nelson looks at changing topics in the news over time, so too do Newman and Block in an analysis of eighteenth century newspapers, as well as Yang, Torget, and Mihalcea in a more general look at topic modeling and newspapers. In another application using primary texts, Cameron Blevins uses MALLET to run an in-depth analysis of an eighteenth century diary.

Future Directions

This is not actually another section of the post. This is your conscience telling you to go try topic modeling for yourself.

Networks Demystified 3: The Power Law Rant

Dear humanists, scientists, music-makers, and dreamers of dreams,

Stop. Please, please, please stop telling me about the power law / Pareto distribution / Zipf’s law you found in your data.  I know it’s exciting, and I know power laws are sexy these days, but really the only excuse to show off your power law is if it’s followed by:

  1. How it helps you predict/explain something you couldn’t before.
  2. Why a power law is so very strange as opposed to some predicted distribution (e.g., normal, uniform, totally bodacious, etc.).
  3. A comparison against other power law distributions, and how parameters differ across them.
  4. Anything else that a large chunk of the scholarly community hasn’t already said about power laws.

Alright, I know not all of you have broken this rule, and many of you may not be familiar enough with what I’m talking about to care, so I’m going to give a quick primer here on power law and scale-free distributions (they’re the same thing). If you want to know more, read Newman’s (2005) seminal paper.

The take-home message of this rant will be that the universe counts in powers rather than linear progressions, and thus in most cases a power law is not so much surprising as it is overwhelmingly expected. Reporting power laws in your data is a bit like reporting furry ears on your puppy; often true, but not terribly useful. Going further to describe the color, shape, and floppiness of those ears might be just enough to recognize the breed.

The impetus for this rant is my having had to repeat it at nearly every conference I’ve attended over the last year, and it’s getting exhausting.

Even though the content here looks kind of mathy, anybody should be able to approach this with very minimal math experience. Even so, the content is kind of abstract, and it’s aimed mostly at those people who will eventually look for and find power laws in their data. It’s an early intervention sort of rant. 

I will warn that I conflate some terms below relating to probability distributions and power laws [1] in order to ease the reader into the concept. In later posts I’ll tease apart the differences between log-normal and power laws, but for the moment let’s just consider them similar beasts.

This is actually a fairly basic and necessary concept for network analysis, which is why I’m lumping this into Networks Demystified; you need to know about this before learning about a lot of recent network analysis research.

Introducing Power Laws

The Function

One of the things I’ve been thanked for with regards to this blog is keeping math and programming out of it; I’ll try to stick to that in this rant, but you’ll (I hope) forgive the occasional small formula to help us along the way. The first one is this:

f(x) = x^n

The exponent, n, is what’s called a parameter in this function. It’ll be held constant, so let’s arbitrarily set n = -2 to get:

f(x) = x^(-2)

When n = -2, if we make x the set of integers from 1 to 10, we wind up getting a table of values that looks like this:

x  |  f(x) = x^(-2)
--------------------
1  |  1
2  |  0.25
3  |  0.1111
4  |  0.0625
5  |  0.04
6  |  0.0277
7  |  0.0204
8  |  0.0156
9  |  0.0123
10 |  0.01

When plotted with x values along the horizontal axis and f(x) values along the vertical axis, the result looks like this:

Figure 1

Follow so far? This is actually all the math you need, so if you’re not familiar with it, take another second to make sure you have it down before going forward. That curve the points make in Figure 1, starting up and to the left and dipping down quickly before it shoots to the right, is the shape that people see right before they shout, joyously and from the rooftops, “We have a power law!” It can be an exciting moment, but hold your enthusiasm for now.

There’s actually a more convenient way of looking at these data, and that’s on a log-log plot. What I’ve made above is a linear plot, so-called because the numbers on the horizontal and vertical axes increase linearly. For example, on the horizontal axis, 1 is just as far away from 2 as 2 is from 3, and 3 is just as far away from 4 as 9 is from 10. We can transform the graph by stretching the low numbers on the axes really wide and pressing the higher numbers closer together, such that 1 is further from 2 than 2 is from 3, 3 is further from 4 than 4 is from 5, and so forth.

If you’re not familiar with logs, that’s okay, the take-home message is that what we’re doing is not changing the data, we’re just changing the graph. We stretch and squish certain parts of the graph to make the data easier to read, and the result looks like this:

Figure 2

This is called a log-log plot because the horizontal and vertical axes grow logarithmically rather than linearly; the lower numbers are further apart than the higher ones. Note that this new graph shows the same data as before but, instead of having that steep curve, the data appear on a perfectly straight diagonal line. It turns out that data which follow a power function appear straight on a log-log plot. This is really helpful, because it allows us to eyeball any dataset thrown on a log-log plot, and if the points look like they’re on a straight line, it’s pretty likely that the best-fit line for the data is some function involving an exponent. That is, a graph that looks like Figure 2 usually has one of those equations I showed above behind it; a “power law.” We generally care about the best-fit line because it either tells us something about the underlying dataset, or it allows us to predict new values that we haven’t observed yet. Why precisely you’d want to fit a line to data is beyond the scope of this post.

To sum up: a function – f(x) = x^(-2) – produces a set of points along a curve that falls off rapidly. That set of points is, by definition, described by a power law; a power function produced it, so it must be. When plotted on a log-log scale, data that fit a power law fall along a straight diagonal line.
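
If you’d like to reproduce the table and the two plots for yourself, a short sketch with NumPy and matplotlib (both assumed installed; neither appears in the original post) will do it:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 11, dtype=float)
y = x ** -2.0                        # f(x) = x^(-2)

for xi, yi in zip(x, y):
    print(int(xi), round(yi, 4))     # reproduces the table above

fig, (linear_ax, loglog_ax) = plt.subplots(1, 2)
linear_ax.plot(x, y, "o")            # the steep curve of Figure 1
loglog_ax.loglog(x, y, "o")          # the straight diagonal of Figure 2
plt.show()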

The Distribution

When people talk about power laws, they’re usually not actually talking about simple functions like the one I showed above. In the above function (f(x) = x^(-2)), every x has a single corresponding value. For example, as we saw in the table, when x = 1, f(x) = 1. When x = 2, f(x) = 0.25. Rinse, repeat. Each x has one and only one f(x). When people invoke power laws, however, they usually say the distribution of data follows a power law.

Stats people will get angry with me for fudging some vocabulary in the interest of clarity; you should probably listen to them instead of me. Pushing forward, though, it’s worth dwelling on the subject of probability distributions. If you don’t think you’re familiar with probability distributions, you’re wrong. Let’s start with an example most of my readers will probably be familiar with:

Figure 3

The graph is fairly straightforward. Out of a hundred students, it looks like one student has a grade between 50 and 55, three students have a grade between 55 and 60, four students have a grade between 60 and 65, about eight students have a grade between 65 and 70, etc. The astute reader will note that a low B average is most common, with other grades decreasingly common on either side of the peak. This is indicative of a normal distribution, something I’ll touch on again momentarily.

Instead of saying there are a hundred students represented in this graph, let’s say instead that the numbers on the left represent a percentage. That is, 20% of students have a grade between 80 and 85. If we stack up the bars on top of each other, then, they would just reach 100%. And, if we were to pick a student at random from a hat (advice: always keep your students in a hat), the height of each bar corresponds to the probability that the student you picked at random would have the corresponding grade.

More concretely, when I reach into my student hat, 20% of the uncomfortably-packed students have a low B average, so there’s a 20% chance that the student I pick has a low B. There’s about a 5% chance that the student has a high A. That’s why this is a graph of a probability distribution: the height of each bar represents the probability that a particular value will be seen when data are chosen at random. Notice particularly how different the above graph is from the first one we discussed, Figure 1. For example, every x value (the grade group) corresponds to multiple other data points; that is, about 20% of the students correspond to a low B. That’s because the height of the data points, the value on the vertical axis, is a measure of the frequency of a particular value rather than just two bits of information about one data point.

The grade distribution above can be visualized much like the function from before (remember that f(x) = x^(-2) bit?); take a look:

Figure 4

This graph and Figure 3 show the same data, except this one is broken up into individual data points. The horizontal axis is just a list of 100 students, in order of their grades, pretending that each number on that axis is their student ID number. They’re in order of their rank, which is generally standard practice.  The vertical axis measures each student’s grade.

We can see near the bottom left that only four students scored below sixty, which is the same thing we saw in the grade distribution graph above. The entire point of this is to say that, when we talk about power laws, we’re talking about graphs of distributions rather than graphs of data points. While Figures 3 and 4 show the same data, Figure 3 is the interesting one here, and we say that Figure 3 is described by a normal distribution. A normal distribution looks like a bell curve, with a central value with very high frequency (say, students receiving a low B), and decreasing frequencies on either side (few high As and low Fs).

There are other datasets which present power law probability distributions. For example, we know that city population sizes tend to be distributed along a power law [2]. That is, city populations are not normally distributed, with most cities having around the same population and a few cities having some more or some less than the average. Instead, most cities have quite a small population, and very few cities have quite high populations. There are only a few New Yorks, but there are plenty of Bloomington, Indianas.

If we were to plot the probability distribution of cities – that is, essentially, the percent of cities of various population sizes – we’d get a graph like Figure 5.

Figure 5

This graph should be seen as analogous to Figure 3, the probability distribution of student grades. We see that nearly 20% of cities have a population of about a thousand, another 10% of cities have populations around two thousand, about 1% of cities have a population of twenty thousand, and the tiniest fraction of cities have more than a hundred thousand residents.

If I were to reach into my giant hat filled with cities in the fictional nation graphed here (let’s call it Narnia, because it seems to be able to fit into small places), there’d be a 20% chance that the city I picked up had only a thousand residents, and a negligible chance that the city would be the hugely populated one on the far right. Figure 5 shows that cities, unlike grades, are not normally distributed but instead are distributed along some sort of power function. This can be seen more clearly when the distribution is graphed on a log-log plot like Figure 2, stretching the axes exponentially.

Figure 6

A straight line! A power law! Time to sing from the heavens, right? I mean hey, that’s really cool. We expect things to be normally distributed, with a lot of cities being medium-sized and fewer cities being smaller or larger on either side, and yet city sizes are distributed exponentially, where a few cities have huge populations and increasing numbers of cities hold increasingly smaller portions of the population.

Not so fast. Exponentially shrinking things appear pretty much everywhere, and it’s often far more surprising that something is normally or uniformly distributed. Power laws are normal, in that they appear everywhere. When dealing with differences of size and volume, like city populations  or human wealth, the universe tends to enjoy numbers increasing exponentially rather than linearly.
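
To see the difference for yourself, here is a small toy comparison of my own (not the Narnia data): draw samples from a normal distribution and from a heavy-tailed Pareto distribution, then plot both histograms on log-log axes. Only the second comes out looking anything like a straight line.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
grades = rng.normal(loc=80, scale=8, size=100_000)       # bell-curve-ish "grades"
cities = (rng.pareto(a=1.5, size=100_000) + 1) * 1000    # heavy-tailed "city populations"

fig, axes = plt.subplots(1, 2)
for ax, data, title in zip(axes, [grades, cities], ["normal-ish grades", "power-law-ish cities"]):
    counts, edges = np.histogram(data, bins=np.logspace(1, 7, 60))
    centers = (edges[:-1] + edges[1:]) / 2
    keep = counts > 0                                    # log axes can't show empty bins
    ax.loglog(centers[keep], counts[keep], "o")
    ax.set_title(title)
plt.show()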

Orders of Magnitude and The Universe

I hope everyone in the whole world has seen the phenomenal 1977 video, Powers of Ten. It’s a great perspective on our place in the universe, zooming out from a picnic to the known universe in about five minutes before zooming back down to DNA and protons. Every ten seconds the video zooms another order of magnitude smaller or greater, going from ten to a hundred to a thousand meters. This is by no means the same as a power law (it’s the difference of 10^x rather than x^10), but the point in either case is that understanding the scale of the universe is a lot easier in terms of exponents than in terms of addition or multiplication.

Zipf’s Law

Zipf’s law is pretty cool, guys. It says that most languages are dominated by the use of a few words over and over again, with rarer words used exponentially less frequently. The top-ranking word in English, “the,” is used twice as much as the second-most-frequent word, “of,” which itself is used nearly twice as much as “and.” In fact, in one particularly large corpus, only 135 words comprise half of all words used. That is, if we took a favorite book and removed all but those top 135 words, half the book would still exist. This law holds true for most languages. Power law!
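
You can check Zipf’s law on any long text you happen to have lying around; a sketch, where the filename is just a placeholder for whatever plain-text book you use:

import re
from collections import Counter

import matplotlib.pyplot as plt

with open("some_favorite_book.txt", encoding="utf-8") as f:    # placeholder filename
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)
frequencies = sorted(counts.values(), reverse=True)            # word frequency by rank

top_share = sum(c for _, c in counts.most_common(135)) / len(words)
print(f"the top 135 words cover {top_share:.0%} of all tokens")  # roughly half, for English prose

plt.loglog(range(1, len(frequencies) + 1), frequencies, ".")   # roughly a straight line
plt.xlabel("rank")
plt.ylabel("frequency")
plt.show()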

The Long Tail

A few years ago, Wired editor Chris Anderson wrote an article (then a blog, then a book) called “The Long Tail,” which basically talked about the business model of internet giants like Amazon. Because Amazon serves such a wide audience, it can afford to carry books that almost nobody reads, those niche market books that appeal solely to underwater basket weavers or space taxidermists or Twilight readers (that’s a niche market, right? People aren’t really reading those books?).

Local booksellers could never stock those books, because the cost of getting them and storing them would overwhelm the number of people in the general vicinity who would actually buy them. However, according to Anderson, because Amazon’s storage space and reach are nigh-infinite, having these niche market books for sale actually pays off tremendously.

Take a look at Figure 7. It’s not actually visualizing a distribution like in Figure 5; it’s more like Figure 4 with the students and the grades. Pretend each tick on the horizontal axis is a different book, and the height at each tick corresponds to how many times that book is bought. They’re ordered by rank, so the bestselling books are over at the left, and as you go further to the right of the graph, the books clearly don’t do as well.

Figure 7: http://en.wikipedia.org/wiki/File:Long_tail.svg

One feature worth noting is that, in the graph above, the area of green is equal to the area of yellow. That means a few best-sellers comprise 50% of the books bought on Amazon. A handful of books dominate the market, and they’re the green side of Figure 7.

However, that means the other 50% of the market is books that are rarely purchased. Because Amazon is able to store and sell books from that yellow section, it can double its sales. Anderson popularized the term “long tail” to describe that yellow part of the graph. To recap, we now have cities, languages, and markets following power laws.

Pareto Principle

The Pareto Principle, also called the 80/20 rule, is brought up in an absurd number of contexts (often improperly). It says that 80% of land in Italy is held by 20% of the population; 20% of pea pods contain 80% of the peas; the richest 20% control 80% of the income (as of 1989…); 80% of complaints in business come from 20% of the clients; the list goes on. Looking at Figure 7, it’s easy to see how such a principle plays out.

Scale-Free Networks

Yay! We finally get to the networks part of this Networks Demystified post. Unfortunately it’ll be rather small, but parts 4 and 5 of Networks Demystified will be about Small World and Scale Free networks more extensively, and this post sort of has to come first.

If you’ve read Parts I and II of Networks Demystified, then you already know the basics. Here’s what you need in brief: networks are stuff and relationships between them. I like to call the stuff nodes and the relationships edges. A node’s degree is the number of edges connected to it. Read the beginning of Demystified Part II for more detail, but that should be all you need to know for the next example. It turns out that the node degree distribution of many networks follows a power law (surprise!), and scholarly citation networks are no exception.

Citation Networks

If you look at scholarly articles, you can pretend each paper is a node and each citation is an edge going from the citing to the cited article. In a turn of events that will shock no-one, some papers are cited very heavily while most are, if they’re lucky, cited once or twice. A few superstar papers attract the majority of citations. In network terms, a very few nodes have a huge degree – are attached to many edges – and the number of edges per node decreases exponentially as you get to less popular papers. Think of it like Figure 5, where cities with exponentially higher population (horizontal axis) are exponentially more rare (vertical axis). It turns out this law holds true for almost all citation networks.

Preferential Attachment

The concept of preferential attachment is very old and was independently discovered by many, although the term itself is recent. I’ll get into the history of it and relevant citations in my post on scale-free networks; what matters here is the effect itself. In a network, the idea goes, nodes that are already linked to many others will be more likely to collect even more links. The rich get richer. In terms of citation networks, the mechanism by which nodes preferentially attach to other nodes is fairly simple to discern.

A few papers, shortly after publication, happen to get a lot of people very excited; the rest of the papers published at the same time are largely ignored. Those few initial papers are cited within months of their publication and then researchers come across the new papers in which the original exciting papers were cited. That is, papers that are heavily cited have a greater chance of being noticed, which in turn increases their chances of being even more heavily cited as time goes on. The rich get richer.

This is also an example of a positive feedback loop, a point which I’ll touch on in some greater detail in the next section. This preferential attachment appears in all sorts of evolving systems, from social networks to the world wide web. And, in each of these cases, power laws appear present in the networks.
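
NetworkX (my suggestion, assumed installed) ships with a preferential-attachment generator, the Barabási–Albert model, so you can watch a heavy-tailed degree distribution emerge in a few lines:

from collections import Counter

import networkx as nx
import matplotlib.pyplot as plt

# grow a 10,000-node network where new nodes prefer to attach to well-connected ones
G = nx.barabasi_albert_graph(n=10_000, m=2, seed=0)

degree_counts = Counter(dict(G.degree()).values())   # how many nodes have each degree
degrees, counts = zip(*sorted(degree_counts.items()))

plt.loglog(degrees, counts, "o")                     # a roughly straight, power-law-ish tail
plt.xlabel("degree")
plt.ylabel("number of nodes")
plt.show()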

Feedback Loops

Wikipedia defines positive feedback as “a process in which the effects of a small disturbance on a system include an increase in the magnitude of the perturbation.” We’ve all heard what happens when somebody takes a microphone too close to the speaker it’s projecting on. The speaker picks up some random signal from the mic and projects it back into the mic, which is then re-amplified through the speaker, picked up again by the mic, and so on until everyone in the room has gone deaf or the speaker explodes.

Wikipedia seemingly-innocuously adds “[w]hen there is more positive feedback than there are stabilizing tendencies, there will usually be exponential growth of any oscillations or divergences from equilibrium.” In short, feedback tends to lead to exponential changes. And systems that are not sufficiently stabilized (hint: most of them aren’t) will almost inevitably fall into positive feedback loops, which will manifest as (you guessed it) power laws.

Once more with the chorus: power laws.

Benford’s Law

By this point I’m sure you all believe me about the ubiquity (and thus impotence) of power laws. Given that, what I’m about to say shouldn’t surprise you but, if you haven’t heard of it before, I promise you it will still come as a shock. It’s called Benford’s Law, and what it says is equal parts phenomenal and absurd.

Consult your local post office records (it’s okay, I’ll wait) and find a list of every street address in your county. Got it? Good. Now, get rid of everything but the street numbers; that is, if one address reads “1600 Pennsylvania Avenue, Northwest Washington, DC 20500,” get rid of everything but the “1600.” We’re actually going to go even further, get rid of everything but the first digit of the street address.

Now, you’re left with a long list of single-digits that used to be the first digits of street addresses. If you were to count all of those digits, you would find that digit “1” is used significantly more frequently than digit “2.” Further, digit “2” appears significantly more frequently than digit “3.” This trend continues down to digit “9.”

That’s odd, but not terribly shocking. It gets more shocking when you find out that the same process (looking at the first digits of numbers and seeing that 1 is used more than 2, 2 more than 3, 3 more than 4, etc.) holds true for the area of rivers, physical constants, socio-economic data, the heights of skyscrapers, death rates, numbers picked at random from issues of Readers’ Digest, etc., etc., etc. In an absurdly wide range of cases, the probability distribution of leading digits in lists of numbers shows lower digits appear exponentially more frequently than higher digits. In fact, when looking at the heights of the world’s tallest buildings, it holds true no matter the scale; that is, if heights are measured in yards, miles, kilometers, or furlongs, Benford’s law still holds.

So what’s going on here? It turns out that if you take the logarithm of all the numbers in these sets, those logarithms are uniformly distributed [3]. I won’t unpack logs here (I’ve already gone on too long, haven’t I?), but this basically means that if you take into account powers and exponents, all of these numbers are actually uniformly distributed. The universe tends to filter random numbers into exponential equations before we have a chance to count them, but once we do count them, all we see are power laws.
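
You don’t need the post office to convince yourself of Benford’s law; anything that grows by repeated multiplication already shows it. A minimal sketch of my own, using the leading digits of the first thousand powers of two:

from collections import Counter

# leading digits of the first thousand powers of two
leading = [int(str(2 ** n)[0]) for n in range(1, 1001)]
counts = Counter(leading)

for digit in range(1, 10):
    print(digit, f"{counts[digit] / len(leading):.1%}")
# prints roughly 30% for 1, 18% for 2, tapering to about 5% for 9 -- Benford's law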

Conclusion

About four thousand words ago, I wrote that reporting power laws in your data is about as useful as reporting furry ears on your dog. That means neither that power laws are useless nor that they should never be reported. It’s just that they’re not particularly exciting by themselves. If you’ve found a power law in your data, that’s great! Now comes the time when you have to actually describe the line that you’ve found. Finding the parameters for the power law and comparing them against other similar datasets can be exceptionally enlightening, as can using the trends you’ve found for prediction or comparison.

Too often (at least once in every conference I’ve attended in the last year), I see people mentioning that their data followed a power law and leaving it at that. Oy vey. Alright, now that that’s out of the way, I promise the next Networks Demystified post will actually be about networks again.

Notes:

  1. I opt not to use words like “mass,” “log-normal,” or “exponential,” but don’t worry, I have a rant about people confusing those as well. It’ll come out soon enough.
  2. Actually it’s log-normal, but it’s close enough for our purposes here.
  3. The log-uniform distribution is actually a special case of a power law with a parameter of -1. Nifty, huh?

The Myth of Text Analytics and Unobtrusive Measurement

Text analytics are often used in the social sciences as a way of unobtrusively observing people and their interactions. Humanists tend to approach the supporting algorithms with skepticism, and with good reason. This post is about the difficulties of using words or counts as a proxy for some secondary or deeper meaning. Although I offer no solutions here, readers of the blog will know I am hopeful of the promise of these sorts of measurements if used appropriately, and right now, we’re still too close to the cutting edge to know exactly what that means. There are, however, copious examples of text analytics used well in the humanities (most recently, for example, Joanna Guldi’s  publication on the history of walking).

The Confusion

Klout is a web service which ranks your social influence based on your internet activity. I don’t know how Klout’s algorithm works (and I doubt they’d be terribly forthcoming if I asked), but one of the products of that algorithm is a list of topics about which you are influential. For instance, Klout believes me to be quite influential with regards to Money (really? I don’t even have any of that.) and Journalism (uhmm.. no.), somewhat influential in Juggling (spot on.), Pizza (I guess I am from New York…), Scholarship (Sure!), and iPads (I’ve never touched an iPad.), and vaguely influential on the topic of Cars (nope) and Mining (do they mean text mining?).

My pizza expertise is clear.
Thankfully careers don’t ride on this measurement (we have other metrics for that), but the danger is still fairly clear: the confusion of vocabulary and syntax for semantics and pragmatics. There are clear layers between the written word and its intended meaning, and those layers often depend on context and prior knowledge. Further, regardless of the intended meaning of the author, how her words are interpreted in the larger world can vary wildly. She may talk about money and pizza until she is blue in the face, but if the whole world disagrees with her, that is no measure of expertise or influence (even if angry pizza-lovers frequently shout at her about her pizza opinions).

We see very simple examples of this in sentiment analysis, a way to extract the attitude of the writer toward whatever it is he has written. An old friend who recently dipped his fingers in sentiment analysis wrote this:

According to his algorithm, that sentence was a positive one. Unless I seriously misunderstand my social cues (which I suppose wouldn’t be too unlikely), I very much doubt the intended positivity of the author. However, most decent algorithms would pick up that this was a tweet from somebody who was positive about Sarah Jessica Parker.
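
To see how easily word-counting goes wrong, here is a deliberately naive sentiment scorer of my own invention (it has nothing to do with my friend’s actual algorithm, and the example sentence is mine); it happily scores a sarcastic sentence as positive because it sees only the vocabulary, never the intent:

import re

POSITIVE = {"love", "great", "wonderful", "amazing", "best"}
NEGATIVE = {"hate", "terrible", "awful", "worst", "boring"}

def naive_sentiment(text):
    """Count positive words minus negative words; no context, no sarcasm, no subtext."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# a sarcastic sentence of my own scores as glowingly positive
print(naive_sentiment("Oh great, yet another wonderful sequel. The best news all week."))  # 3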

Unobtrusive Measurements

This particular approach to understanding humans belongs to the larger methodological class of unobtrusive measurements. Generally speaking, this topic is discussed in the context of the social sciences and is contrasted with more ‘obtrusive’ measurements along the lines of interviews or sticking people in labs. Historians generally don’t need to talk about unobtrusive measurements because, hey, the only way we could be obtrusive to our subjects would require exhuming bodies. It’s the idea that you can cleverly infer things about people from a distance, without them knowing that they are being studied.

Notice the disconnect between what I just said, and the word itself. ‘Unobtrusive’ against “without them knowing that they are being studied.” These are clearly not the same thing, and that distinction between definition and word is fairly important – and not merely in the context of this discussion. One classic example (Doob and Gross, 1968) asks how somebody’s social status determines whether someone might take aggressive action against them. It specifically measured a driver’s likelihood to honk his horn in frustration based on the perceived social status of the driver in front of them. Using a new luxury car and an old rusty station wagon, the researchers would stop at traffic lights that had turned green and would wait to see whether the car behind them honked. In the end, significantly more people honked at the low-status car. More succinctly: status affects decisions of aggression. Honking and the perceived worth of the car were used as proxies for aggression and perceptions of status, much like vocabulary is used as a proxy for meaning.

In no world would this be considered unobtrusive from the subject’s point of view. The experimenters intruded on their world, and their actions and lives changed because of it. All it says is that the subjects won’t change their behavior based on the knowledge that they are being studied. However, when an unobtrusive experiment becomes large enough, even one as innocuous as counting words, that advantage no longer holds. Take, for example, citation analysis and the h-index. Citation analysis was initially construed as an unobtrusive measurement; we can say things about scholars and scholarly communication by looking at their citation patterns rather than interviewing them directly. However, now that entire nations (like Australia or the UK) use quantitative analysis to distribute funding to scholarship, the measurements are no longer unobtrusive. Scholars know how the new scholarly economy works, and have no problem changing their practices to get tenure, funding, etc.

The Measurement and The Effect: Untested Proxies

A paper was recently published (O’Boyle Jr. and Aguinis, 2012) on the non-normality of individual performance. The idea is that we assume people’s performance (for example, students in a classroom) is normally distributed along a bell curve. A few kids get really good grades, a few kids get really bad grades, but most are ‘C’ students. The authors challenge this view, suggesting performance takes on more of a power-law distribution, where very few people perform very well, and the majority perform very poorly, with 80% of people performing worse than the statistical average. If that’s hard to imagine, it’s because people are trained to think of averages on a bell curve, where 50% are greater than average and 50% are worse than average. Instead, imagine one person gets a score of 100, and another five people get scores of 10. The average is (100 + (10 * 5)) / 6 = 25, which means five out of the six people performed worse than average.

It’s an interesting hypothesis, and (in my opinion) probably a correct one, but their paper does not do a great job showing that. The reason is (you guessed it) they use scores as a proxy for performance. For example, they look at the number of published papers individuals have in top-tier journals, and show that some authors are very productive whereas most are not. However, it’s a fairly widely-known phenomenon that in science, famous names are more likely to be published than obscure ones (there are many anecdotes about anonymous papers being rejected until the original, famous author is revealed, at which point the paper is magically accepted). The number of accepted papers may be as much a proxy for fame as it is for performance, so the results do not support their hypothesis. The authors then look at awards given to actors and writers; however, those awards suffer from the same issues: the more well-known an actor, the more likely they’ll be cast in good movies, the more likely they’ll be visible to award-givers, etc. Again, awards are not a clean proxy for the quality of a performance. The paper then goes on to measure elected officials based on votes in elections. I don’t think I need to go on about how votes might not map one-to-one onto the performance and prowess of an elected official.

I blogged a review of the most recent culturomics paper, which used google ngrams to look at the frequency of recurring natural disasters (earthquakes, floods, etc.) vs. the frequency of recurring social events (war, unemployment, etc.). The paper concludes that, because of differences in the frequency of word-use for words like ‘war’ or ‘earthquake’, the phenomena themselves are subject to different laws. The authors use word frequency as a proxy for the frequency of the events themselves, much in the same way that Klout seems to measure influence based on word-usage and counting. The problem, of course, is that the processes which govern what people decide to write down do not enjoy a one-to-one relationship to what people experience. Using words as proxies for events is just as problematic as using them for proxies of expertise, influence, or performance. The underlying processes are simply far more complicated than these algorithms give them credit for.

It should be noted, however, that the counts are not meaningless; they just don’t necessarily work as proxies for what these ngram scholars are trying to measure. Further, although the underlying processes are quite complex, the effect size of social or political pressure on word-use may be negligible, to the point that their hypothesis is actually correct. The point isn’t that one cannot use one measurement as a proxy for something else; rather, it’s that the effectiveness of that proxy is assumed rather than actually explored or tested in any way. We need to do a better job, especially as humanists, of figuring out exactly how certain measurements map onto the effects we seek.

A beautiful case study that exemplifies this point was written by famous statistician Andrew Gelman, and it aims to use unobtrusive and indirect measurements to find alien attacks and zombie outbreaks. He uses Google Trends to show that the number of zombies in the world is growing at a frightening rate.

Zombies will soon take over!