Appreciability & Experimental Digital Humanities

Operationalize: to express or define (something) in terms of the operations used to determine or prove it.

Precision deceives. Quantification projects an illusion of certainty and solidity no matter the provenance of the underlying data. It is a black box, through which uncertain estimations become sterile observations. The process involves several steps: a cookie cutter to make sure the data are all shaped the same way, an equation to aggregate the inherently unique, a visualization to display exact values from a process that was anything but.

In this post, I suggest that Moretti’s discussion of operationalization leaves out an integral discussion on precision, and I introduce a new term, appreciability, as a constraint on both accuracy and precision in the humanities. This conceptual constraint paves the way for an experimental digital humanities.

Operationalizing and the Natural Sciences

An operationalization is the use of definition and measurement to create meaningful data. It is an incredibly important aspect of quantitative research, and it has served the Western world well for at least 400 years. Franco Moretti recently published a LitLab Pamphlet and a nearly identical article in the New Left Review about operationalization, focusing on how it can bridge theory and text in literary theory. Interestingly, his description blurs the line between the operationalization of his variables (what shape he makes the cookie cutters that he takes to his text) and the operationalization of his theories (how the variables interact to form a proxy for his theory).

Moretti’s account anchors the practice in its scientific origin, citing primarily physicists and historians of physics. This is a deft move, but an unexpected one in a recent DH environment which attempts to distance itself from a narrative of humanists just playing with scientists’ toys. Johanna Drucker, for example, commented on such practices:

[H]umanists have adopted many applications […] that were developed in other disciplines. But, I will argue, such […] tools are a kind of intellectual Trojan horse, a vehicle through which assumptions about what constitutes information swarm with potent force. These assumptions are cloaked in a rhetoric taken wholesale from the techniques of the empirical sciences that conceals their epistemological biases under a guise of familiarity.

[…]

Rendering observation (the act of creating a statistical, empirical, or subjective account or image) as if it were the same as the phenomena observed collapses the critical distance between the phenomenal world and its interpretation, undoing the basis of interpretation on which humanistic knowledge production is based.

But what Drucker does not acknowledge here is that this positivist account is a century-old caricature of the fundamental assumptions of the sciences. Moretti’s account of operationalization as it percolates through physics is evidence of this. The operational view very much agrees with Drucker’s thesis: the phenomena observed take a back seat to a definition steeped in the nature of measurement itself. Indeed, Einstein’s introduction of relativity relied on an understanding that our physical laws, and our observations of them, depend not on the things themselves but on our ability to measure them in various circumstances. The prevailing theory of the universe on a large scale is a theory of measurement, not of matter. Moretti’s reliance on natural scientific roots, then, is not antithetical to his humanistic goals.

I’m a bit horrified to see myself typing this, but I believe Moretti doesn’t go far enough in appropriating natural scientific conceptual frameworks. When describing what formal operationalization brings to the table that was not there before, he lists precision as the primary addition. “It’s new because it’s precise,” Moretti claims, “Phaedra is allocated 29 percent of the word-space, not 25, or 39.” But he asks himself: is this precision useful? Sometimes, he concludes, “It adds detail, but it doesn’t change what we already knew.”

From Moretti, ‘Operationalizing’, New Left Review.

I believe Moretti is asking the wrong first question here, and he’s asking it because he does not steal enough from the natural sciences. The question, instead, should be: is this precision meaningful? Only after we’ve assessed the reliability of new-found precision can we understand its utility, and here we can take some inspiration from the scientists, in their notions of accuracy, precision, uncertainty, and significant figures.

Terminology

First, some definitions. The accuracy of a measurement is how close it is to the true value you are trying to capture, whereas the precision of a measurement is how often a repeated measurement produces the same result. The number of significant figures indicates how precise the measuring instrument can possibly be. False precision is the illusion that one’s measurement is more precise than is warranted given the significant figures. Propagation of uncertainty is the pesky habit of false precision to weasel its way into the conclusions of a study, lending them more certainty than the data warrant.

Accuracy and Precision. [via]
Accuracy roughly corresponds to how well-suited your operationalization is to finding the answer you’re looking for. For example, if you’re interested in the importance of Gulliver in Gulliver’s Travels, and your measurement is based on how often the character name is mentioned (12 times, by the way), you can be reasonably certain your measurement is inaccurate for your purposes.

Precision roughly corresponds to how fine-tuned your operationalization is, and how likely it is that slight changes in measurement will affect the outcomes of the measurement. For example, if you’re attempting to produce a network of interacting characters from The Three Musketeers, and your measuring “instrument” is to increase the strength of connection between two characters every time they appear in the same 100-word block, then you might be subject to difficulties of precision. That is, your network might look different if you start your sliding 100-word window from the 1st word, the 15th word, or the 50th word. The amount of variation in the resulting network is the degree of imprecision of your operationalization.
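To make this concrete, here is a rough sketch in R of one way to check that kind of precision. The “novel” here is entirely made up (random filler words with four character names sprinkled in), and the counting function is my own invention for illustration, not anyone’s published method; the point is only that the same text, cut into 100-word blocks from different starting offsets, yields different co-occurrence counts.

set.seed(42)
characters <- c("athos", "porthos", "aramis", "dartagnan")
# a toy "novel": 10,000 tokens of filler with the four character names sprinkled throughout
words <- sample(c(rep("filler", 50), characters), 10000, replace = TRUE)

count_pairs <- function(words, offset, window = 100) {
  # drop the first `offset` words, then cut the rest into consecutive 100-word blocks
  w <- words[(offset + 1):length(words)]
  block <- ceiling(seq_along(w) / window)
  total <- 0
  for (b in unique(block)) {
    present <- intersect(characters, w[block == b])   # which characters appear in this block
    total <- total + choose(length(present), 2)       # count the co-occurring pairs
  }
  total
}

# the same text, measured with windows starting at the 1st, 15th, and 50th word
sapply(c(0, 14, 49), function(o) count_pairs(words, o))

If those three totals (or, in a real study, the resulting networks) differ noticeably, that difference is exactly the imprecision described above.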

Significant figures are a bit tricky to port to DH use. When you’re sitting at home, measuring some space for a new couch, you may find that your meter stick only has tick marks to the centimeter, but nothing smaller. This is your highest threshold for precision; if you eyeballed and guessed your space was actually 250.5cm, you’ll have reported a falsely precise number. Others looking at your measurement may have assumed your meter stick was more fine-grained than it was, and any calculations you make from that number will propagate that falsely precise number.

Significant Figures. [via]
Uncertainty propagation is especially tricky when you wind up combining two measurements, one more precise and the other less. The rule of thumb is that your results can only be as precise as the least precise measurement that made its way into your equation. The final reported number is then generally in the form of 250 cm (±1 cm). Thankfully, for our couch, the difference of a centimeter isn’t particularly appreciable. In DH research, I have rarely seen any form of precision calculated, and I believe some of those projects would have reported different results had they accurately represented their significant figures.

Precision, Accuracy, and Appreciability in DH

Moretti’s discussion of the increase of precision granted by operationalization leaves out any discussion of the certainty of that precision. Let’s assume for a moment that his operationalization is accurate (that is, his measurement is a perfect conversion between data and theory). Are his measurements precise? In the case of Phaedra, the answer at first glance is yes, words-per-character in a play would be pretty robust against slight changes in the measurement process.

And yet, I imagine, that answer will probably not sit well with some humanists. They may ask themselves: Is Oenone’s 12%  appreciably different from Theseus’s 13% of the word-space of the play? In the eyes of the author? Of the actors? Of the audience? Does the difference make a difference?

The mechanisms by which people produce and consume literature are not precise. Surely Jean Racine did not sit down intending to give Theseus a fraction more words than Oenone. Perhaps in DH we need a measurement of precision, not of the measuring device, but of our ability to interact with the object we are studying. In a sense, I’m arguing, we are not limited to the precision of the ruler when measuring humanities objects, but to the precision of the human.

In the natural sciences, accuracy is constrained by precision: you can only have as accurate a measurement as your measuring device is precise. In the corners of the humanities where we study how people interact with each other and with cultural objects, we need a new measurement that constrains both precision and accuracy: appreciability. A humanities quantification can only be as precise as that precision is appreciable by the people who interact with the matter at hand. If two characters differ by a single percent of the word-space, and that difference is impossible to register on a conscious or subconscious level, what is the meaning of additional levels of precision (and, consequently, additional levels of accuracy)?

Experimental Digital Humanities

Which brings us to experimental DH. How does one evaluate the appreciability of an operationalization except by devising clever experiments to test the extent of granularity a person can register? Without such understanding, we will continue to create formulae and visualizations which portray a false sense of precision. Without visual cues to suggest uncertainty, graphs present a world that is exact and whose small differentiations appear meaningful or deliberate.

Experimental DH is not without precedent. In Reading Tea Leaves (Chang et al., 2009), for example, the authors assessed the quality of certain topic modeling tweaks based on how a large number of people judged the coherence of certain topics. If this approach were to catch on, alongside more careful acknowledgements of accuracy, precision, and appreciability, then those of us making claims to knowledge in DH could seriously bolster our cases.

There are some who present the formal nature of DH as antithetical to the highly contingent and interpretative nature of the larger humanities. I believe appreciability and experimentation can go some way toward alleviating the tension between the two schools, building one into the other. Along the way, it might build some trust among humanists who think we sacrifice experience for certainty, and among natural scientists who are skeptical of our ability to apply quantitative methods.

Right now, DH seems to find its most fruitful collaborations in computer science or statistics departments. Experimental DH would open the doors to new types of collaborations, especially with psychologists and sociologists.

I’m at an extremely early stage in developing these ideas, and would welcome all comments (especially those along the lines of “You dolt! Appreciability already exists, we call it x.”) Let’s see where this goes.

Submissions to Digital Humanities 2014

Submissions for the 2014 Digital Humanities conference just closed. It’ll be in Switzerland this time around, which unfortunately means I won’t be able to make it, but I’ll be eagerly following along from afar. Like last year, reviewers are allowed to preview the submitted abstracts. Also like last year, I’m going to be a reviewer, which means I’ll have the opportunity to compare this year’s submissions against those from DH2013. No doubt when the reviews are in and the accepted articles are revealed, I’ll also revisit my analysis of DH conference acceptances.

To start with, the conference organizers received a record number of submissions this year: 589. Last year’s Nebraska conference only received 348 submissions. The general scope of the submissions hasn’t changed much; authors were still supposed to tag their submissions using a controlled vocabulary of 95 topics, and were also allowed to submit keywords of their own making. Like last year, authors could submit long papers, short papers, panels, or posters, but unlike last year, multilingual submissions were encouraged (English, French, German, Italian, or Spanish). [edit: Bethany Nowviskie, patient awesome person that she is, has noticed yet another mistake I’ve made in this series of posts. Apparently last year they also welcomed multilingual submissions, and it is standard practice.]

Digital Humanities is known for its collaborative nature, and not much has changed in that respect between 2013 and 2014 (Figure 1). Submissions had, on average, between two and three authors, with 60% of submissions in both years having at least two authors. This year, a few fewer papers have single authors, and a few more have two authors, but the difference is too small to be attributable to anything but noise.

Figure 1. Number of authors per paper.

The distribution of topics being written about has changed mildly, though rarely in extreme ways. Any changes visible should also be taken with a grain of salt, because a trend over a single year is hardly statistically robust to small changes, say, in the location of the event.

The grey bars in Figure 2 show what percentage of DH2014 submissions are tagged with a certain topic, and the red dotted outlines show what the percentages were in 2013. The upward trends to note this year are text analysis, historical studies, cultural studies, semantic analysis, and corpora and corpus activities. Text analysis was tagged to 15% of submissions in 2013 and is now tagged to 20% of submissions, or one out of every five. Corpus analysis similarly bumped from 9% to 13%. Clearly this is an important pillar of modern DH.

Figure 2. Topics from DH2014 ordered by the percent of submissions which fall in that category. The red dotted outlines represent the percentage from DH2013.

I’ve pointed out before that History is secondary compared to Literary Studies in DH (although Ted Underwood has convincingly argued, using Ben Schmidt’s data, that the numbers may merely be due to fewer people studying history). This year, however, historical studies nearly doubled in presence, from 10% to 17%. I haven’t yet collected enough years of DH conference data to see if this is a trend in the discipline at large, or more of a difference between European and North American DH. Semantic analysis jumped from 1% to 7% of the submissions, cultural studies went from 10% to 14%, and literary studies stayed roughly equivalent. Visualization, one of the hottest topics of DH2013, has become even hotter in 2014 (14% to 16%).

The most visible drops in coverage came in pedagogy, scholarly editions, user interfaces, and research involving social media and the web. At DH2013, submissions on pedagogy had a surprisingly low acceptance rate, which, combined with the drop in pedagogy submissions this year (11% to 8% in “Digital Humanities – Pedagogy and Curriculum” and 7% to 4% in “Teaching and Pedagogy”), might suggest a general decline of interest in pedagogy in the DH world. “Scholarly Editing” went from 11% to 7% of the submissions, and “Interface and User Experience Design” from 13% to 8%, which is yet more evidence for the lack of research going into the creation of scholarly editions compared to several years ago. The most surprising drops for me were those in “Internet / World Wide Web” (12% to 8%) and “Social Media” (8.5% to 5%), which I would have guessed would be growing rather than shrinking.

The last thing I’ll cover in this post is the author-chosen keywords. While authors needed to tag their submissions from a list of 95 controlled vocabulary words, they were also encouraged to tag their entries with keywords they could choose themselves. In all they chose nearly 1,700 keywords to describe their 589 submissions. In last year’s analysis of these keywords, I showed that visualization seemed to be the glue that held the DH world together; whether discussing TEI, history, network analysis, or archiving, all the disparate communities seemed to share visualization as a primary method. The 2014 keyword map (Figure 3) reveals the same trend: visualization is squarely in the middle. In this graph, two keywords are linked if they appear together on the same submission, thus creating a network of keywords as they co-occur with one another. Words appear bigger when they span communities.

Figure 3. Co-occurrence of DH2014 author-submitted keywords.

Despite the multilingual conference, the largest component of the graph is still English. We can see some fairly predictable patterns: TEI is coupled quite closely with XML; collaboration is another keyword that binds the community together, as is (obviously) “Digital Humanities.” Linguistics and literature are tightly coupled, much more so than, say, linguistics and history. It appears the distant reading of poetry is becoming popular, which I’d guess is a relatively new phenomenon, although I haven’t gone back and checked.

This work has been supported by an ACH microgrant to analyze DH conferences and the trends of DH through them, so keep an eye out for more of these posts forthcoming that look through the last 15 years. Though I usually share all my data, I’ll be keeping these to myself, as the submitters to the conference did so under an expectation of privacy if their proposals were not accepted.

[edit: there was some interest on twitter last night for a raw frequency of keywords. Because keywords are author-chosen and I’m trying to maintain some privacy on the data, I’m only going to list those keywords used at least twice. Here you go (Figure 4)!]

Figure 4. Keywords used in DH2014 submissions ordered by frequency.

Friends don’t let friends calculate p-values (without fully understanding them)

I wrote this in 2012 in response to a twitter conversation with Mike Taylor, who was patient enough to read a draft of this post in late November and point out all the various ways it could be changed and improved. No doubt if I had taken the time to take his advice and change the post accordingly, this post would be a beautiful thing and well-worth reading. I’d like to thank him for his patient comments, and apologize for not acting on them, as I don’t foresee being able to take the time in the near future to get back to this post. I don’t really like seeing a post sit in my draft folder for months, so I’ll release it out to the world as naked and ugly as the day it was born, with the hopes that I’ll eventually sit down and write a better one that takes all of Mike’s wonderful suggestions into account. Also, John Kruschke apparently published a paper with a similar title; as it’s a paper I’ve read before but forgot about, I’m guessing I inadvertently stole his fantastic phrase. Apologies to John!

——————–

I recently [okay, at one point this was ‘recent’] got in a few discussions on Twitter stemming from a tweet about p-values. For the lucky and uninitiated, p-values are basically the go-to statistics used by giant swaths of the academic world to show whether or not their data are statistically significant. Generally, the idea is that if the value of p is less than 0.05, you have results worth publishing.

Introducing p-values

There’s a lot to unpack in p-values: a lot of history, a lot of baggage, and a lot of reasons why it’s a pretty poor choice for quantitative analysis. I won’t unpack all of it, but I will give a brief introduction to how the statistic works. At its simplest, according to Wikipedia, “the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.” No, don’t run away! That actually makes sense, just let me explain it.

We’ll start with a coin toss. Heads. Another: tails. Repeat: heads, heads, tails, heads, heads, heads, heads, tails. Tally up the score: of the 10 tosses, 7 were heads and 3 were tails. At this point, if you were a gambler, you might start guessing the coin is weighted or rigged. I mean, seriously, you toss the coin 10 times and only 3 of those were tails? What’s the probability of that, if the coin is fair?

That’s exactly what the p-value is supposed to tell us: what’s the probability of getting 3 or fewer tails out of 10 coin tosses, assuming the coin is fair? That assumption that the coin is fair is what statisticians call the null hypothesis; if we show that it’s less than 5% likely (p < 0.05) that we’d see 3 or fewer tails in 10 coin flips, assuming a fair coin, statisticians tell us we can safely reject the null hypothesis. If this were a scientific experiment, we’d say that because p < 0.05, our results are significant; we can reject the null hypothesis that the coin is a fair one, because it’s pretty unlikely that the experiment would have turned out this way if the coin were fair. Huzzah! Callooh! Callay! We can publish our findings that this coin just isn’t fair in the world-renowned Journal of Trivial Results and Bad Bar Bets (JTRB3).

So how is this calculated? Well, the idea is that we have to look at the universe of possible results given the null hypothesis; that is, if I had a coin that was truly fair, how often would flipping it yield a result at least as lopsided as 7 heads and 3 tails? Obviously, given a fair coin, most scenarios will yield about a 50/50 split on heads and tails. It is, however, not completely impossible for us to get our 7/3 result. What we have to calculate is, if we flipped a fair coin in an infinite number of experiments, what percent of those experiments would give us 7 or more heads out of 10? If it’s under 5%, p < 0.05, we can be reasonably certain that our result probably implies a stacked coin, because in fewer than 5% of experiments on a fair coin would the results look like the ones we see.
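For the curious, that calculation is a one-liner in R; for this particular 7-heads-and-3-tails split, the one-sided value works out to about 0.17 (keep that number in mind for the next section).

# probability of a result at least as extreme as 7 heads in 10 flips of a fair coin
sum(dbinom(7:10, size = 10, prob = 0.5))                      # 0.171875
# or, equivalently, the built-in one-sided binomial test
binom.test(7, 10, p = 0.5, alternative = "greater")$p.value   # same number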

We use 0.05, or 5%, mostly out of a long-standing social agreement. The 5% mark is low enough that we can reject the null hypothesis beyond most reasonable doubts. That said, it’s worth remembering that the difference between 4.9% and 5.1% is itself often not statistically significant, and though 5% is our agreed upon convention, it’s still pretty arbitrary. The best thing anyone’s ever written to this point was by Rosnow & Rosenthal (1989):

“We are not interested in the logic itself, nor will we argue for replacing the .05 alpha with another level of alpha, but at this point in our discussion we only wish to emphasize that dichotomous significance testing has no ontological basis. That is, we want to underscore that, surely, God loves the .06 nearly as much as the .05. Can there be any doubt that God views the strength of evidence for or against the null as a fairly continuous function of the magnitude of p?”

How p-values work. via wikipedia.

I’m really nailing this point home because p-values are so often misunderstood. Wikipedia lists seven common misunderstandings of p-values, and I can almost guarantee that a majority of scientists who use the statistic are guilty of at least some of them. I’m going to quote the misconceptions here, because they’re really important.

  1. The p-value is not the probability that the null hypothesis is true.
    In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very close to unity (if there is no alternative hypothesis with a large enough a priori probability and which would explain the results more easily). This is Lindley’s paradox.
  2. The p-value is not the probability that a finding is “merely a fluke.”
    As the calculation of a p-value is based on the assumption that a finding is the product of chance alone, it patently cannot also be used to gauge the probability of that assumption being true. This is different from the real meaning which is that the p-value is the chance of obtaining such results if the null hypothesis is true.
  3. The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called prosecutor’s fallacy.
  4. The p-value is not the probability that a replicating experiment would not yield the same conclusion. Quantifying the replicability of an experiment was attempted through the concept of P-rep (which is heavily criticized)
  5. 1 − (p-value) is not the probability of the alternative hypothesis being true (see (1)).
  6. The significance level of the test (denoted as alpha) is not determined by the p-value.
    The significance level of a test is a value that should be decided upon by the agent interpreting the data before the data are viewed, and is compared against the p-value or any other statistic calculated after the test has been performed. (However, reporting a p-value is more useful than simply saying that the results were or were not significant at a given level, and allows readers to decide for themselves whether to consider the results significant.)
  7. The p-value does not indicate the size or importance of the observed effect (compare with effect size). The two do vary together however – the larger the effect, the smaller sample size will be required to get a significant p-value.

One problem with p-values (and it often surprises people)

Okay, I’m glad that explanation is over, because now we can get to the interesting stuff, all of which revolves around assumptions. I worded the above coin-toss experiment pretty carefully; you keep flipping a coin until you wind up with 7 heads and 3 tails, 10 flips in all. The experimental process seems pretty straightforward: you flip a coin a bunch of times, and record what you get. Not much in the way of underlying assumptions to get in your way, right?

Wrong. There are actually a number of ways this experiment could have been designed, each yielding the exact same observed data, but each yielding totally different p-values. I’ll focus on two of those possible experiments here. Let’s say we went into the experiment saying “Alright, coin, I’m going to flip you 10 times, and then count how often you land on heads and how often you land on tails.” That’s probably how you assumed the experiment was run, but I never actually described it like that; all I said was you wound up flipping a coin 10 times, seeing 7 heads and 3 tails. With that information, you could just as well have said “I’m going to flip this coin until I see 3 tails, and then stop.” If you remember the sequence above (heads, tails, heads, heads, tails, heads, heads, heads, heads, tails), that experimental design also fits our data perfectly. We kept flipping a coin until we saw 3 tails, and in the interim, we also observed 7 heads.

The problem here is in how p-values work. When you ask what the probability is that your result came from chance alone, a bunch of assumptions underlie this question, the key one here being the halting conditions of your experiment. Calculating how often 10 coin-tosses of a fair coin will result in a 7/3 split (or 8/2 or 9/1 or 10/0) will give you a different result than calculating how often waiting until the third tail will give you a 7/3 split (or 8/3 or 9/3 or 10/3 or 11/3 or 12/3 or…). The space of possibilities changes, and the actual p-value of your experiment changes, based on assumptions built into your experimental design. If you collect data with one set of halting conditions, and it turns out your p-value isn’t as low as you’d like it to be, you can just pretend you went into your experiment with different halting conditions and, voila!, your results become significant.
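To see just how much those halting conditions matter, here is a small R sketch of the two designs described above, applied to the very same 7 heads and 3 tails. These are the standard one-sided calculations (binomial for a fixed number of flips, negative binomial for flipping until the third tail); the particular numbers are only there to make the point vivid.

# Design 1: flip exactly 10 times, then count heads.
# p = probability of 7 or more heads in 10 flips of a fair coin.
p_fixed_flips <- sum(dbinom(7:10, size = 10, prob = 0.5))

# Design 2: flip until the 3rd tail appears, then stop.
# p = probability of seeing 7 or more heads before the 3rd tail of a fair coin.
p_fixed_tails <- pnbinom(6, size = 3, prob = 0.5, lower.tail = FALSE)

c(p_fixed_flips, p_fixed_tails)   # roughly 0.17 versus 0.09: same data, different p-values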

The moral of the story is that p-values can be pretty sensitive to assumptions that are often left unspoken in an experiment. The same observed data can yield wildly different values. With enough mental somersaults regarding experimental assumptions, you can find that a 4/6 split in a coin toss actually resulted from flipping a coin that isn’t fair. Much of statistics is rife with unspoken assumptions about how data are distributed, how the experiment is run, etc. It is for this reason that I’m trying desperately to get quantitative humanists using non-parametric and Bayesian methods from the very beginning, before our methodology becomes canonized and set. Bayesian methods are also, of course, rife with assumptions, but at least most of those assumptions are made explicit at the outset. Really, the most important thing is to learn not only how to run a statistic, but also what it means to do so. There are appropriate times to use p-values, but those times are probably not as frequent as is often assumed.

Liveblogged Review of Macroanalysis by Matthew L. Jockers, Part 2

I just got Matthew L. Jockers’ Macroanalysis in the mail, and I’m excited enough about it to liveblog my review. Here’s the review of part II (Analysis), chapter 5 (metadata). Read Part 1, Part 3, …

Part II: Analysis

Part II of Macroanalysis moves from framing the discussion to presenting a series of case studies around a theme, starting fairly simply in claims and types of analyses and moving into the complex. This section takes up 130 of the 200 pages; in a discipline (or whatever DH is) which has coasted too long on claims that the proof of its utility will be in the pudding (eventually), it’s refreshing to see a book that is at least 65% pudding. That said, with so much substance – particularly with so much new substance – Jockers opens his arguments up for specific critiques.

Aiming for more pudding-based scholarly capital in DH. via brenthor.

Quantitative arguments must by their nature be particularly explicit, without the circuitous language humanists might use to sidestep critiques. Elijah Meeks and others have been arguing for some time now that the requirement to solidify an argument in such a way will ultimately be a benefit to the humanities, allowing faster iteration and improvement on theories. In that spirit, for this section, I offer my critiques of Jockers’ mathematical arguments not because I think they are poor quality, but because I think they are particularly good, and further fine-tuning can only improve them. The review will now proceed one chapter at a time.

Metadata

Jockers begins his analysis exploring what he calls the “lowest hanging fruit of literary history.” Low hanging fruit can be pretty amazing, as Ted Underwood says, and Jockers wields some fairly simple data in impressive ways. The aim of this chapter is to show that powerful insights can be achieved using long-existing collections of library metadata, using a collection of nearly 800 Irish American works over 250 years as a sample dataset for analysis. Jockers introduces and offsets his results against the work of Charles Fanning, whom he describes as the expert in Irish American fiction in aggregate. A pre-DH scholar, Fanning was limited to looking through only the books he had time to read; an impressive many, according to Jockers, but perhaps not enough. He profiles 300 works, fewer than half of those represented in Jockers’ database.

The first claim made in this chapter is one that argues against a primary assumption of Fanning’s. Fanning expends considerable effort explaining why there was a dearth of Irish American literature between 1900 and 1930; Jockers’ data show this dearth barely existed. Instead, the data suggest, it was only eastern Irish men who had stopped writing. The vacuum did not exist west of the Mississippi, among men or women. Five charts are shown as evidence: one of books published over time, and the other four breaking publication down by gender and location.

Jockers is careful many times to make the point that, with so few data, the results are suggestive rather than conclusive. This, to my mind, is too understated. For the majority of dates in question, the database holds fewer than 6 books per year. When breaking down by gender and location, that number is twice cut in half. Though the explanations of the effects in the graphs are plausible, the likelihood of noise outweighing signal at this granularity is a bit too high to be able to distinguish a just-so story from a credible explanation. Had the data been aggregated in five- or ten-year intervals (as they are in a later figure 5.6), rather than simply averaged across them, the results may have been more credible. The argument may be brought up that, when aggregating across larger intervals, the question of where to break up the data becomes important; however, cutting the data into yearly chunks from January to December is no more arbitrary than cutting them into decades.

There are at least two confounding factors one needs to take into account when doing a temporal analysis like this. The first is that what actually happened in history may be causally contingent, which is to say, there’s no particularly useful causal explanation or historical narrative for a trend. It’s just accidental; the right authors were in the right place at the right time, and all happened to publish books in the same year. Generally speaking, if only around five books are published a year, though sometimes that number is zero and sometimes that number is ten, any trends that we see (say, five years with only a book or two) may credibly be considered due to chance alone, rather than some underlying effect of gender or culture bias.

The second confound is the representativeness of the data sample to some underlying ground truth. Datasets are not necessarily representative of anything; however, as defined by Jockers, his dataset ought to be representative of all Irish American literature within a 250-year timespan. That’s his gold standard. The dataset obviously does not represent all books published under these criteria, so the question is how well his publication numbers match up with the actual numbers he’s interested in. Jockers is in a bit of luck here, because what he’s interested in is whether or not there was a resounding silence among Irish authors; thus, no matter what number his charts show, if they’re more than one or two, it’s enough to disprove Fanning’s hypothesized silence. Any dearth in his data may be accidental; any large publication numbers are not.

This example chart compares a potential “real” underlying publication rate against several simulated potential sample datasets Jockers might have, created by multiplying the “real” dataset by some random number between 0 and 1.

I created the above graphic to better explain the second confounding factor of problematic samples. The thick black line, we can pretend, is the actual number of books published by Irish American authors between 1900 and 1925. As mentioned, Jockers would only know about a subset of those books, so each of the four dotted lines represents a possible dataset that he could be looking at in his database instead of the real, underlying data. I created these four different dotted lines by just multiplying the underlying real data by a random number between 0 and 1 1. From this chart it should be clear that it would not be possible for him to report an influx of books when there was a dearth (for example, in 1910, no potential sample dataset would show more than two books published). However, if Jockers wanted to make any other claims besides whether or not there was a dearth (as he tentatively does later on), his available data may be entirely misleading. For example, looking at the red line, Run 4, would suggest that ever-more books were being published between 1910 and 1918, when in fact that number should have decreased rapidly after about 1912.
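For anyone who wants to play along at home, a simulation like the one behind that chart takes only a few lines of R. The “real” counts below are invented for the example, just as they were for the figure; only the thinning-by-a-random-factor trick matches what I described above.

set.seed(1)
years <- 1900:1925
real  <- sample(2:8, length(years), replace = TRUE)   # a pretend "true" count of books per year

# four possible observed datasets: the true counts thinned by a random factor between 0 and 1
runs <- replicate(4, round(real * runif(length(years))))

matplot(years, cbind(real, runs), type = "l",
        lty = c(1, rep(2, 4)), lwd = c(3, rep(1, 4)), col = c("black", 2:5),
        xlab = "Year", ylab = "Books published")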

The correction included in Macroanalysis for this potential difficulty was to use 5-year moving averages rather than just showing the raw counts. I would suggest that, because the actual numbers are so small and a change of a small handful of books would look like a huge shift on the graph, this method of aggregation is insufficient to represent the uncertainty of the data. Though his charts show moving averages, they still show small changes year by year, which creates a false sense of precision. Jockers’ chart 5.6, which aggregates by decade and does not show these little changes, does a much better job reflecting the uncertainty. Had the data shown hundreds of books per year, the earlier visualizations would have been more justifiable, as small changes would have amounted to less emphasized shifts in the graph.

It’s worth spending extra time on choices of visual representation, because we have not collectively arrived at a good visual language for humanities data, uncertain as they often are. Nor do we have a set of standard practices in place, as quantitative scientists often do, to represent our data. That lack of standard practice is clear in Macroanalysis; the graphs all have subtitles but no titles, which makes immediate reading difficult. Similarly, axis labels (“count” or “5-year average”) are unclear, and should more accurately reflect the data (“books published per year”), putting the aggregation-level in either an axis subtitle or the legend. Some graphs have no axis labels at all (e.g., 5.12-5.17). Their meanings are clear enough to those who read the text, or those familiar with ngram-style analyses, but should be more clear at-a-glance.

Questions of visual representation and certainty aside, Jockers still provides several powerful observations and insights in this chapter. Figure 5.6, which shows Irish American fiction per capita, reveals that westerners published at a much higher relative rate than easterners, which is a trend worth explaining (and Jockers does) that would not have been visible without this sort of quantitative analysis. The chapter goes on to list many other credible assessments and claims in light of the available data, as well as a litany of potential further questions that might be explored with this sort of analysis.  He also makes the important point that, without quantitative analysis, “cherry-picking of evidence in support of a broad hypothesis seems inevitable in the close-reading scholarly traditions.” Jockers does not go so far as to point out the extension of that rule in data analysis; with so many visible correlations in a quantitative study, one could also cherry-pick those which support one’s hypothesis. That said, cherry-picking no longer seems inevitable. Jockers makes the point that Fanning’s dearth thesis was false because his study was anecdotal, an issue Jockers’ dataset did not suffer from. Quantitative evidence, he claims, is not in competition with evidence from close reading; both together will result in a “more accurate picture of our subject.”

The second half of the chapter moves from publication counting to word analysis. Jockers shows, for example, that eastern authors are less likely than western authors to use words in book titles that identify their work as ‘Irish’, suggesting lower prejudicial pressures west of the Mississippi may be the cause. He then complexifies the analysis further, looking at “lexical diversity” across titles in any given year – that is, a year is more lexically diverse if the titles of books published that year are more unique and dissimilar from one another. Fanning suggests the years of the famine were marked by a lack of imagination in Irish literature; Jockers’ data support this claim by showing those years had a lower lexical diversity among book titles. Without getting too much into the math, as this review of a single chapter has already gone on too long, it’s worth pointing out that both the number of titles and the average length of titles in a given year can affect the lexical diversity metric. Jockers points this out in a footnote, but there should have been a graph comparing number of titles per year, length per year, and lexical diversity, to let the readers decide whether the first two variables accounted for the third, or whether to trust the graph as evidence for Fanning’s lack-of-imagination thesis.
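I don’t know precisely which diversity measure the book uses, but a simple type-token ratio stand-in is enough to show the dependence on the number of titles: pooling more titles drawn from the very same vocabulary drives the measured diversity down, with no change in imagination required.

# toy demonstration with an invented, fixed title vocabulary
set.seed(7)
vocab <- paste0("word", 1:200)
ttr <- function(n_titles, words_per_title = 4) {
  tokens <- sample(vocab, n_titles * words_per_title, replace = TRUE)
  length(unique(tokens)) / length(tokens)   # type-token ratio of all title words that "year"
}
sapply(c(5, 20, 80), function(n) mean(replicate(100, ttr(n))))   # diversity falls as titles increase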

One of the particularly fantastic qualities of this sort of research is that readers can follow along at home, exploring on their own if they get some idea from what was brought up in the text. For example, Jockers shows that the word ‘century’ in British novel titles is popular leading up to and shortly after the turn of the nineteenth century. Oddly, in the larger corpus of literature (and, it seems, English language books in general), we can use bookworm.culturomics.org to see that, rather than losing steam around 1830, use of ‘century’ in most novel titles actually increases until about 1860, before dipping briefly. Moving past titles (and fiction in general) to full text search, google ngrams shows us a small dip around 1810 followed by continued growth of the word ‘century’ in the full text of published books. These different patterns are interesting particularly because they suggest there was something unique about the British novelists’ use of the word ‘century’ that is worth explaining. Contrast this with Jockers’ chart of the word ‘castle’ in British book titles, whose trend actually corresponds quite well to the bookworm trend until the end of the chart, around 1830. [edit: Ben Schmidt points out in the comments that bookworm searches full text, not just metadata as I assumed, so this comparison is much less credible.]

Use of the word ‘castle’ in the metadata of books provided by OpenLibrary.org. Compare with figure 5.14. via bookworm.

Jockers closes the chapter suggesting that factors including gender, geography, and time help determine what authors write about. That this idea is trivial makes it no less powerful within the context of this book: the chapter is framed by the hypothesis that certain factors influence Irish American literature, and then uses quantitative, empirical evidence to support those claims. It was oddly satisfying reading such a straightforward approach in the humanities. It’s possible, I suppose, to quibble over whether geography determines what’s written about or whether the sort of person who would write about certain things is also the sort of person more likely to go west, but there can be little doubt over the causal direction of the influence of gender. The idea also fits well with the current complex systems approach to understanding the world, which mathematically suggests that environmental and situational constraints (like gender and location) will steer the unfolding of events in one direction or another. It is not a reductionist environmental determinism so much as a set of probabilities, where certain environments or situations make certain outcomes more likely.

Stay tuned for Part the Third!

Notes:

  1. If this were a more serious study, I’d have multiplied by a more credible pseudo-random value keeping the dataset a bit closer to the source, but this example works fine for explanatory value

Topic nets

I’m sorry. I love you (you know who you are, all of you). I really do. I love your work, I think it’s groundbreaking and transformative, but the network scientist / statistician in me twitches uncontrollably whenever he sees someone creating a network out of a topic model by picking the top-topics associated with each document and using those as edges in a topic-document network. This is meant to be a short methodology post for people already familiar with LDA and already analyzing networks it produces, so I won’t bend over backwards trying to re-explain networks and topic modeling. Most of my posts are written assuming no expert knowledge, so I apologize if in the interest of brevity this one isn’t immediately accessible.

MALLET, the go-to tool for topic modeling with LDA, outputs a comma separated file where each row represents a document, and each pair of columns is a topic that document is associated with. The output looks something like

        Topic 1 | Topic 2 | Topic 3  | ...
Doc 1 | 0.5 , 1 | 0.2 , 5 | 0.1  , 2 | ...
Doc 2 | 0.4 , 6 | 0.3 , 1 | 0.06 , 3 | ...
Doc 3 | 0.6 , 2 | 0.4 , 3 | 0.2  , 1 | ...
Doc 4 | 0.5 , 5 | 0.3 , 2 | 0.01 , 6 | ...

Each pair is the amount a document is associated with a certain topic followed by the topic of that association. Given a list like this, it’s pretty easy to generate a bimodal/bipartite network (a network of two types of nodes) where one variety of node is the document, and another variety of node is a topic. You connect each document to the top three (or n) topics associated with that document and, voila, a network!
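For concreteness, here is roughly what that top-n construction looks like in R. The document-topic weights are invented stand-ins for MALLET’s real composition file, and the code is a sketch rather than anyone’s canonical recipe.

# hypothetical document-topic weights standing in for MALLET's output
set.seed(3)
weights <- matrix(runif(4 * 6), nrow = 4,
                  dimnames = list(paste0("Doc", 1:4), paste0("Topic", 1:6)))
weights <- weights / rowSums(weights)       # each document's weights sum to 1

# the top-3 approach: keep only each document's three heaviest topics as edges
top_n <- 3
edges <- do.call(rbind, lapply(rownames(weights), function(d) {
  top <- order(weights[d, ], decreasing = TRUE)[1:top_n]
  data.frame(doc = d, topic = colnames(weights)[top], weight = weights[d, top])
}))
edges   # a bimodal document-topic edge list; everything below each document's top three is discarded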

The problem here isn’t that a giant chunk of the data is just being thrown away (although there are more elegant ways to handle that too), but the way in which a portion of the data is kept. By using the top-n approach, you lose the rich topic-weight data that shows how some documents are really only closely associated with one or two topics, whereas others are closely associated with many. In practice, the network graph generated by this approach will severely skew the results, artificially pulling documents which are topical outliers toward the center of the graph, and preventing documents in the topical core from being represented as such.

In order to account for this skewing, an equally simple (and equally arbitrary) approach can be taken whereby you only take connections that are over weight 0.2 (or whatever, m). Now, some documents are related to one or two topics and some are related to several, which more accurately represents the data and doesn’t artificially skew network measurements like centrality.

The real trouble comes when a top-n topic network is converted from a bimodal to a unimodal network, where you connect documents to one another based on the topics they share. That is, if Document 1 and Document 4 are both connected to Topics 4, 2, and 7, they get a connection to each other of weight 3 (if they were only connected to 2 of the same topics, they’d get a connection of weight 2, and so forth). In this situation, the resulting network will be as much an artifact of the choice of n as of the underlying document similarity network. If you choose different values of n, you’ll often get very different results.

bimodal to unimodal network. via.

In this case, the solution is to treat every document as a vector of topics with associated weights, making sure to use all the topics, such that you’d have a list that looks somewhat like the original topic CSV, except this time ordered by topic number rather than individually for each document by topic weight.

      T1, T2, T3,...
Doc4(0.2,0.3,0.1,...)
Doc5(0.6,0.2,0.1,...)
...

From here you can use your favorite correlation or distance finding algorithm (cosine similarity, for example) to find the distance from every document to every other document. Whatever you use, you’ll come up with a (generally) symmetric matrix from every document to every other document, looking a bit like this.

      Doc1|Doc2|Doc3,...
Doc1  1   |0.3 |0.1
Doc2  0.3 |1   |0.4 
Doc3  0.1 |0.4 |1
...

If you chop off the bottom left or top right triangle of the matrix, you now have a network of document similarity which takes the entire topic model into account, not just the first few topics. From here you can set whatever arbitrary m thresholds seem legitimate to visually represent the network in an uncluttered way, for example only showing documents that are more than 50% topically similar to one another, while still being sure that the entire richness of the underlying topic model is preserved, not just the first handful of topical associations.
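Here is a sketch of that full-vector version in R, continuing with the same sort of invented weights as above; the cosine similarity is written out by hand so the example doesn’t lean on any particular package.

# the same hypothetical document-topic matrix as in the earlier sketch
set.seed(3)
weights <- matrix(runif(4 * 6), nrow = 4,
                  dimnames = list(paste0("Doc", 1:4), paste0("Topic", 1:6)))
weights <- weights / rowSums(weights)

# cosine similarity between every pair of document vectors, using all the topics
cosine_sim <- function(mat) {
  norms <- sqrt(rowSums(mat^2))
  (mat %*% t(mat)) / (norms %o% norms)
}
sim <- cosine_sim(weights)                  # symmetric document-by-document similarity matrix

# keep one triangle of the matrix, thresholded at some m (0.5 here), as the edge list to visualize
idx <- which(upper.tri(sim) & sim > 0.5, arr.ind = TRUE)
data.frame(from = rownames(sim)[idx[, 1]], to = colnames(sim)[idx[, 2]], weight = sim[idx])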

Of course, whether this method is any more useful than something like LSA in clustering documents is debatable, but I just had to throw my 2¢ in the ring regarding topical networks. Hope it’s useful.

Making pretty things with R and ggplot2

This isn’t going to be a long tutorial. I’ve just had three people asking how I made the pretty graphs on my last post about counting citations, and I’m almost ashamed to admit how easy it was. Somebody with no experience coding can (I hope) follow these steps and make themselves a pretty picture with the data provided, and understand how it was created.

library(ggplot2) # load the ggplot2 plotting package
sci=read.csv("scicites.csv") # read the citation table into a data frame called sci
qplot(Cited, data = sci, geom="density", fill=YearRange, log="x", xlab="Number of Citations", ylab="Density", main="Density of citations per 8 years", alpha=I(.5)) # plot one semi-transparent density curve of citation counts per year range

That’s the whole program. Oh, also this table, saved as a csv:

[Table: the scicites.csv data, with citation counts (Cited) grouped by year range (YearRange)]

And that was everything I used to produce this graph:

Density graph made with R.

Quick Walkthrough

Installation

The first thing you need to make this yourself is the programming language R (an awesome language for statistical analysis) installed on your machine, which you can get here. Download it and install it; it’s okay, I’ll wait. Now, R by itself is not fun to code in, so my favorite program to use when writing R code is called RStudio, so go install that too. Now you’re going to have to install the visualization package, which is called ggplot2. You do this from within RStudio itself, so open up the newly installed program. If you’re running Windows Vista or 7, don’t open it up the usual way; right click on the icon and click ‘Run as administrator’ – you need to do this so it’ll actually let you install the package. Once you’ve opened up RStudio, at the bottom of the program there’s a section of your screen labeled ‘Console’, with a blinking text cursor. In the console, type install.packages("ggplot2") and hit enter. Congratulations, ggplot2 is now installed.

Now download this R file (‘Save as’) that I showed you before and open it in RStudio (‘File -> Open File’). It should look a lot like the code at the beginning of the post. Now go ahead and download the csv shown above as well, and be sure to put it in the same directory 1 you put the R code. Once you’ve done that, in RStudio click ‘Tools -> Set Working Directory -> To Source File Location’, which will help R figure out where the csv is that you just downloaded.

Before I go on explaining what each line of the code does, run it and see what happens! Near the top of your code, on the right side, there should be a row of buttons, one that says ‘Run’ and another that says ‘Source’. Click the button that says ‘Source’. Voila, a pretty picture!

Code

Now to go through the code itself, we’ll start with line 1. library(ggplot2) just means that we’re going to be using ggplot2 to make the visualization, and lets R know to look for it when it’s about to put out the graphics.

Line 2 is fairly short as well, sci=read.csv(“scicites.csv”), and it creates a new variable called sci which contains the entire csv file you downloaded earlier. read.csv(“scicites.csv”) is a command that tells R to read the csv file in the parentheses, and setting the variable sci as equal to that read file just saves it.

Line 3 is where the magic happens.

qplot(Cited, data = sci, geom="density", fill=YearRange, log="x", xlab="Number of Citations", ylab="Density", main="Density of citations per 8 years", alpha=I(.5))

The entire line is surrounded by the parenthetical command qplot() which is just our way of telling R “hey, plot this bit in here!” The first thing inside the parentheses is Cited, which you might recall was one of the columns in the CSV file. This is telling qplot() what column of data it’s going to be plotting, in this case, the number of citations that papers have received. Then, we tell qplot() where that data is coming from with the command data = sci, which sets what table the data column is coming from. After that geom=”density” appears. geom is short for ‘Geometric Object’ and it sets what the graph will look like. In this case we’re making a density graph, so we give it “density”, but we could just as easily have used something like “histogram” or “line”.

The next bit is fill=YearRange, which you might recall was another column in the csv. This is a way of breaking the data we’re using into categories; in this case, the data are categorized by which year range they fall into. fill is one way of categorizing the data by filling in the density blobs with automatically assigned colors; another way would be to replace fill with color. Try it and see what happens. After the next comma is log=”x”, which puts the x-axis on a log scale, making the graph a bit easier to read. Take a look at what the graph looks like if you delete that part of it.

Now we have a big chunk of code devoted to labels: xlab=”Number of Citations”, ylab=”Density”, main=”Density of citations per 8 years”. As can probably be surmised, xlab corresponds to the label on the x-axis, ylab corresponds to the label on the y-axis, and main corresponds to the title of the graph. The very last part, before the closing parenthesis, is alpha=I(.5). alpha sets the transparency of the basic graph elements, which is why the colored density blobs all look a little bit transparent. I set their transparency to .5 so they’d each still be visible behind the others. You can set the value between 0 and 1, with the former being completely transparent and the latter being completely opaque.
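If you want to try the variations mentioned above, each is a one-line change; these assume you’ve already run the three lines from the top of the post so that sci is loaded.

qplot(Cited, data = sci, geom="density", color=YearRange, log="x", alpha=I(.5))   # outlines instead of fills
qplot(Cited, data = sci, geom="density", fill=YearRange, alpha=I(.5))             # without the log scale, the blobs bunch up on the left
qplot(Cited, data = sci, geom="histogram", fill=YearRange, log="x")               # a histogram instead of a density plot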

There you have it, easy-peasy. Play around with the csv, try adding your own data, and take a look at this chapter from “ggplot2: Elegant Graphics for Data Analysis” to see what other options are available to you.

Notes:

  1. thanks Andy for the correction!

Networks Demystified 3: The Power Law Rant

Dear humanists, scientists, music-makers, and dreamers of dreams,

Stop. Please, please, please stop telling me about the power law / Pareto distribution / Zipf’s law you found in your data.  I know it’s exciting, and I know power laws are sexy these days, but really the only excuse to show off your power law is if it’s followed by:

  1. How it helps you predict/explain something you couldn’t before.
  2. Why a power law is so very strange as opposed to some predicted distribution (e.g., normal, uniform, totally bodacious, etc.).
  3. A comparison against other power law distributions, and how parameters differ across them.
  4. Anything else that a large chunk of the scholarly community hasn’t already said about power laws.

Alright, I know not all of you have broken this rule, and many of you may not be familiar enough with what I’m talking about to care, so I’m going to give a quick primer here on power law and scale-free distributions (they’re the same thing). If you want to know more, read Newman’s (2005) seminal paper.

The take-home message of this rant will be that the universe counts in powers rather than linear progressions, and thus in most cases a power law is not so much surprising as it is overwhelmingly expected. Reporting power laws in your data is a bit like reporting furry ears on your puppy; often true, but not terribly useful. Going further to describe the color, shape, and floppiness of those ears might be just enough to recognize the breed.

The impetus for this rant is my having had to repeat it at nearly every conference I’ve attended over the last year, and it’s getting exhausting.

Even though the content here looks kind of mathy, anybody should be able to approach this with very minimal math experience. Even so, the content is kind of abstract, and it’s aimed mostly at those people who will eventually look for and find power laws in their data. It’s an early intervention sort of rant. 

I will warn that I conflate some terms below relating to probability distributions and power laws 1 in order to ease the reader into the concept. In later posts I’ll tease apart the differences between log-normal and power laws, but for the moment let’s just consider them similar beasts.

This is actually a fairly basic and necessary concept for network analysis, which is why I’m lumping this into Networks Demystified; you need to know about this before learning about a lot of recent network analysis research.

Introducing Power Laws

The Function

One of the things I’ve been thanked for with regards to this blog is keeping math and programming out of it; I’ll try to stick to that in this rant, but you’ll (I hope) forgive the occasional small formula to help us along the way. The first one is this:

f(x) = x^n

The exponent, n, is what’s called a parameter in this function. It’ll be held constant, so let’s arbitrarily set n = -2 to get:

f(x) = x^(-2)

When n = -2, if we make x the set of integers from 1 to 10, we wind up getting a table of values that looks like this:

x  |  f(x) = x^(-2)
--------------------
1  |  1
2  |  0.25
3  |  0.1111
4  |  0.0625
5  |  0.04
6  |  0.0277
7  |  0.0204
8  |  0.0156
9  |  0.0123
10 |  0.01

When plotted with x values along the horizontal axis and f(x) values along the vertical axis, the result looks like this:

Figure 1

Follow so far? This is actually all the math you need, so if you’re not familiar with it, take another second to make sure you have it down before going forward. That curve the points make in Figure 1, starting up and to the left and dipping down quickly before it shoots to the right, is the shape that people see right before they shout, joyously and from the rooftops, “We have a power law!” It can be an exciting moment, but hold your enthusiasm for now.

There’s actually a more convenient way of looking at these data, and that’s on a log-log plot. What I’ve made above is a linear plot, so-called because the numbers on the horizontal and vertical axes increase linearly. For example, on the horizontal axis, 1 is just as far away from 2 as 2 is from 3, and 3 is just as far away from 4 as 9 is from 10. We can transform the graph by stretching the low numbers on the axes really wide and pressing the higher numbers closer together, such that 1 is further from 2 than 2 is from 3, 3 is further from 4 than 4 is from 5, and so forth.

If you’re not familiar with logs, that’s okay, the take-home message is that what we’re doing is not changing the data, we’re just changing the graph. We stretch and squish certain parts of the graph to make the data easier to read, and the result looks like this:

Figure 2

This is called a log-log plot because the horizontal and vertical axes grow logarithmically rather than linearly; the lower numbers are further apart than the higher ones. Note that this new graph shows the same data as before but, instead of having that steep curve, the data appear on a perfectly straight diagonal line. It turns out that data produced by a power function appear straight on a log-log plot. This is really helpful, because it allows us to eyeball any dataset thrown on a log-log plot, and if the points look like they’re on a straight line, it’s pretty likely that the best-fit line for the data is some function involving an exponent. That is, a graph that looks like Figure 2 usually has one of those equations I showed above behind it; a “power law.” We generally care about the best-fit line because it either tells us something about the underlying dataset, or it allows us to predict new values that we haven’t observed yet. Why precisely you’d want to fit a line to data is beyond the scope of this post.

To sum up: a function – f(x) = x^(-2) – produces a set of points along a curve that falls off very rapidly. That set of points is, by definition, described by a power law; a power function produced it, so it must be. When plotted on a log-log scale, data that fit a power law look like a straight diagonal line.
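
If you'd like to reproduce the table and both figures yourself, here's a minimal sketch in base R (not necessarily how I made the figures above):

x  <- 1:10
fx <- x^(-2)
data.frame(x, fx)                # the table from above

plot(x, fx)                      # linear axes: the steep curve of Figure 1
plot(x, fx, log = "xy")          # log-log axes: the straight line of Figure 2

# On log-log axes the slope of that line is the exponent; fitting a line to
# the logged values recovers it (here, exactly -2).
coef(lm(log(fx) ~ log(x)))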

The Distribution

When people talk about power laws, they’re usually not actually talking about simple functions like the one I showed above. In the above function ( f(x) = x^(-2) ), every x has a single corresponding value. For example, as we saw in the table, when x = 1, f(x) = 1. When x = 2, f(x) = 0.25. Rinse, repeat. Each x has one and only one f(x). When people invoke power laws, however, they usually say the distribution of data follows a power law.

Stats people will get angry with me for fudging some vocabulary in the interest of clarity; you should probably listen to them instead of me. Pushing forward, though, it’s worth dwelling over the subject of probability distributions. If you don’t think you’re familiar with probability distributions, you’re wrong. Let’s start with an example most of my readers will probably be familiar with:

Figure 3

The graph is fairly straightforward. Out of a hundred students, it looks like one student has a grade between 50 and 55, three students have a grade between 55 and 60, four students have a grade between 60 and 65, about eight students have a grade between 65 and 70, etc. The astute reader will note that a low B average is most common, with other grades decreasingly common on either side of the peak. This is indicative of a normal distribution, something I’ll touch on again momentarily.

Instead of saying there are a hundred students represented in this graph, let’s say instead that the numbers on the left represent a percentage. That is, 20% of students have a grade between 80 and 85. If we stacked the bars on top of each other, they would reach exactly 100%. And, if we were to pick a student at random from a hat (advice: always keep your students in a hat), the height of each bar corresponds to the probability that the student you picked at random would have the corresponding grade.

More concretely, when I reach into my student hat, 20% of the uncomfortably-packed students have a low B average, so there’s a 20% chance that the student I pick has a low B. There’s about a 5% chance that the student has a high A. That’s why this is a graph of a probability distribution: the height of each bar represents the probability that a particular value will be seen when data are chosen at random. Notice particularly how different the above graph is from the first one we discussed, Figure 1. For example, every x value (the grade group) corresponds to multiple data points; that is, about 20% of the students correspond to a low B. That’s because the height of each bar, the value on the vertical axis, is a measure of the frequency of a particular value rather than just two bits of information about one data point.
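
If you'd rather see this on your own screen than take my word for it, here's a quick sketch with made-up grades drawn from a normal distribution (the mean of 82 and the spread are invented, not real students):

set.seed(42)
grades <- rnorm(100, mean = 82, sd = 8)   # 100 fake students, clustered at a low B

# Bin the grades and count how many students fall into each bin; the bar
# heights are the frequencies, just as in Figure 3.
hist(grades, breaks = 10,
     xlab = "Grade", ylab = "Number of students",
     main = "A (simulated) normal distribution of grades")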

The grade distribution above can be visualized much like the function from before (remember that f(x) = x^(-2) bit?), take a look:

Figure 4

This graph and Figure 3 show the same data, except this one is broken up into individual data points. The horizontal axis is just a list of 100 students, in order of their grades, pretending that each number on that axis is their student ID number. They’re in order of their rank, which is generally standard practice.  The vertical axis measures each student’s grade.

We can see near the bottom left that only four students scored below sixty, which is the same thing we saw in the grade distribution graph above. The entire point of this is to say that, when we talk about power laws, we’re talking about graphs of distributions rather than graphs of individual data points. While figures 3 and 4 show the same data, figure 3 is the interesting one here, and we say that figure 3 is described by a normal distribution. A normal distribution looks like a bell curve, with a central value with very high frequency (say, students receiving a low B), and decreasing frequencies on either side (few high As and low Fs).

There are other datasets which present power law probability distributions. For example, we know that city population sizes tend to be distributed along a power law 2. That is, city populations are not normally distributed, with most cities having around the same population and a few having somewhat more or somewhat less than the average. Instead, most cities have quite a small population, and very few cities have quite high populations. There are only a few New Yorks, but there are plenty of Bloomington, Indianas.

If we were to plot the probability distribution of cities – that is, essentially, the percentage of cities that have populations of various sizes – we’d get a graph like Figure 5.

Figure 5

This graph should be seen as analogous to Figure 3, the probability distribution of student grades. We see that nearly 20% of cities have a population of about a thousand, another 10% of cities have populations around two thousand, about 1% of cities have a population of twenty thousand, and the tiniest fraction of cities have more than a hundred thousand residents.

If I were to reach into my giant hat filled with cities in the fictional nation graphed here (let’s call it Narnia, because it seems to be able to fit into small places), there’d be a 20% chance that the city I picked up had only a thousand residents, and a negligible chance that the city would be the hugely populated one on the far right. Figure 5 shows that cities, unlike grades, are not normally distributed but instead are distributed along some sort of power function. This can be seen more clearly when the distribution is graphed on a log-log plot like Figure 2, stretching and squishing the axes logarithmically.

Figure 6

A straight line! A power law! Time to sing from the heavens, right? I mean hey, that’s really cool. We expect things to be normally distributed, with a lot of cities being medium-sized and fewer cities being smaller or larger on either side, and yet city sizes turn out to follow a power law, where a few cities have huge populations and increasing numbers of cities hold increasingly smaller portions of the population.

Not so fast. Exponentially shrinking things appear pretty much everywhere, and it’s often far more surprising that something is normally or uniformly distributed. Power laws are normal, in that they appear everywhere. When dealing with differences of size and volume, like city populations  or human wealth, the universe tends to enjoy numbers increasing exponentially rather than linearly.
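
If you want to generate something Narnia-shaped yourself, here's a sketch that samples fake city populations from a pure power law. The exponent of 2.5 and the minimum size of 1,000 are arbitrary choices of mine, and, per note 2 below, real city sizes are closer to log-normal anyway.

# Inverse-transform sampling: if u is uniform on (0,1), then
# xmin * (1 - u)^(-1 / (alpha - 1)) follows a power law with exponent alpha.
set.seed(42)
alpha <- 2.5
xmin  <- 1000
pops  <- xmin * (1 - runif(10000))^(-1 / (alpha - 1))

summary(pops)    # most cities are tiny, a handful are enormous

# Bin the populations logarithmically and plot the density of each bin on
# log-log axes; the points fall along a rough straight line, as in Figure 6.
breaks <- 10^seq(3, ceiling(log10(max(pops))), by = 0.25)
h      <- hist(pops, breaks = breaks, plot = FALSE)
keep   <- h$density > 0
plot(h$mids[keep], h$density[keep], log = "xy",
     xlab = "City population", ylab = "Density")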

Orders of Magnitude and The Universe

I hope everyone in the whole world has seen the phenomenal 1977 video, Powers of Ten. It’s a great perspective on our place in the universe, zooming out from a picnic to the known universe in about five minutes before zooming back down to DNA and protons. Every ten seconds the video zooms another order of magnitude out or in, going from ten to a hundred to a thousand meters. This is by no means the same as a power law (it’s the difference of 10^x rather than x^10), but the point in either case is that understanding the scale of the universe is a lot easier in terms of exponents than in terms of addition or multiplication.

Zipf’s Law

Zipf’s law is pretty cool, guys. It says that most languages are dominated by the use of a few words over and over again, with rarer words used exponentially less frequently. The top-ranking word in English, “the,” is used about twice as much as the second-most-frequent word, “of,” which itself is used nearly twice as much as “and.” In fact, in one particularly large corpus, only 135 words comprise half of all words used. That is, if we took a favorite book and removed all but those 135 most frequent words, half the book would still exist. This law holds true for most languages. Power law!
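
If you want to check this on a text of your own, here's a rough sketch; the filename is a placeholder, and the crude regular expression will mangle contractions and accented words, but it's enough to see the shape.

# Count word frequencies in any plain-text file and plot frequency against
# rank on log-log axes; Zipf's law predicts a roughly straight, falling line.
text  <- tolower(readLines("favorite_book.txt"))
words <- unlist(strsplit(text, "[^a-z]+"))
words <- words[words != ""]

freqs <- sort(table(words), decreasing = TRUE)
head(freqs, 10)                      # "the", "of", "and" and friends, most likely

plot(as.numeric(freqs), log = "xy",
     xlab = "Rank", ylab = "Frequency")

# How many of the most frequent words does it take to cover half the text?
sum(cumsum(freqs) < sum(freqs) / 2) + 1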

The Long Tail

A few years ago, Wired editor Chris Anderson wrote an article (then a blog, then a book) called “The Long Tail,” which basically talked about the business model of internet giants like Amazon. Because Amazon serves such a wide audience, it can afford to carry books that almost nobody reads, those niche market books that appeal solely to underwater basket weavers or space taxidermists or Twilight readers (that’s a niche market, right? People aren’t really reading those books?).

Local booksellers could never stock those books, because the cost of getting them and storing them would overwhelm the number of people in the general vicinity who would actually buy them. However, according to Anderson, because Amazon’s storage space and reach is nigh-infinite, having these niche market books for sale actually pays off tremendously.

Take a look at Figure 7. It’s not actually visualizing a distribution like in Figure 5; it’s more like Figure 4 with the students and the grades. Pretend each tick on the horizontal axis is a different book, and the height at each tick corresponds to how many times that book is bought. They’re ordered by rank, so the bestselling books are over at the left, and as you go further to the right of the graph, the books clearly don’t do as well.

Figure 7: http://en.wikipedia.org/wiki/File:Long_tail.svg

One feature worth noting is that, in the graph above, the area of green is equal to the area of yellow. That means a few best-sellers comprise 50% of the books bought on Amazon. A handful of books dominate the market, and they’re the green side of Figure 7.

However, that means the other 50% of the market is books that are rarely purchased. Because Amazon is able to store and sell books from that yellow section, it can double its sales. Anderson popularized the term “long tail” as that yellow part of the graph. To recap, we now have cities, languages, and markets following power laws.

Pareto Principle

The Pareto Principle, also called the 80/20 rule, is brought up in an absurd amount of contexts (often improperly). It says that 80% of land in Italy is held by 20% of the population; 20% of pea pods contain 80% of the peas; the richest 20% control 80% of the income (as of 1989…); 80% of complaints in business come from 20% of the clients; the list goes on. Looking at Figure 7, it’s easy to see how such a principle plays out.
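
A quick way to convince yourself of how this kind of imbalance falls out of a heavy-tailed distribution (this is a made-up sample, not Pareto's land data):

set.seed(42)
wealth <- (1 - runif(10000))^(-1 / 1.2)   # a heavy-tailed fake "wealth" sample

richest20 <- sort(wealth, decreasing = TRUE)[1:2000]
sum(richest20) / sum(wealth)              # the top 20% hold the lion's share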

Scale-Free Networks

Yay! We finally get to the networks part of this Networks Demystified post. Unfortunately it’ll be rather small, but parts 4 and 5 of Networks Demystified will be about Small World and Scale Free networks more extensively, and this post sort of has to come first.

If you’ve read Parts I and II of Networks Demystified, then you already know the basics. Here’s what you need in brief: networks are stuff and relationships between them. I like to call the stuff nodes and the relationships edges. A node’s degree is how many edges it’s connected to. Read the beginning of Demystified Part II for more detail, but that should be all you need to know for the next example. It turns out that the node degree distribution of many networks follows a power law (surprise!), and scholarly citation networks are no exception.

Citation Networks

If you look at scholarly articles, you can pretend each paper is a node and each citation is an edge going from the citing to the cited article. In a turn of events that will shock no-one, some papers are cited very heavily while most are, if they’re lucky, cited once or twice. A few superstar papers attract the majority of citations. In network terms, a very few nodes have a huge degree – are attached to many edges – and the number of edges per node drops off sharply as you get to less popular papers. Think of it like Figure 5, where cities with exponentially higher populations (horizontal axis) are exponentially more rare (vertical axis). It turns out this law holds true for almost all citation networks.

Preferential Attachment

The concept of preferential attachment is very old and was independently discovered by many, although the term itself is recent. I’ll get into the history of it and relevant citations in my post on scale-free networks; what matters here is the effect itself. In a network, the idea goes, nodes that are already linked to many others will be more likely to collect even more links. The rich get richer. In terms of citation networks, the mechanism by which nodes preferentially attach to other nodes is fairly simple to discern.

A few papers, shortly after publication, happen to get a lot of people very excited; the rest of the papers published at the same time are largely ignored. Those few initial papers are cited within months of their publication and then researchers come across the new papers in which the original exciting papers were cited. That is, papers that are heavily cited have a greater chance of being noticed, which in turn increases their chances of being even more heavily cited as time goes on. The rich get richer.

This is also an example of a positive feedback loop, a point which I’ll touch on in some greater detail in the next section. This preferential attachment appears in all sorts of evolving systems, from social networks to the world wide web. And, in each of these cases, power laws appear present in the networks.
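
If you have R and the igraph package handy, you can watch this happen. The sketch below uses igraph's built-in preferential attachment model (sample_pa, called barabasi.game in older versions); the network size and settings are arbitrary choices of mine.

library(igraph)

# Grow a 10,000-node network by preferential attachment and inspect its
# degree distribution on log-log axes.
set.seed(42)
g <- sample_pa(10000, power = 1, m = 1, directed = FALSE)

dd  <- degree_distribution(g)        # fraction of nodes with degree 0, 1, 2, ...
deg <- which(dd > 0) - 1             # the degrees that actually occur

plot(deg, dd[deg + 1], log = "xy",
     xlab = "Degree", ylab = "Fraction of nodes")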

Feedback Loops

Wikipedia defines positive feedback as “a process in which the effects of a small disturbance on a system include an increase in the magnitude of the perturbation.” We’ve all heard what happens when somebody takes a microphone too close to the speaker it’s projecting on. The speaker picks up some random signal from the mic and projects it back into the mic, which is then re-amplified through the speaker, picked up again by the mic, and so on until everyone in the room has gone deaf or the speaker explodes.

Wikipedia seemingly-innocuously adds “[w]hen there is more positive feedback than there are stabilizing tendencies, there will usually be exponential growth of any oscillations or divergences from equilibrium.” In short, feedback tends to lead to exponential changes. And systems that are not sufficiently stabilized (hint: most of them aren’t) will almost inevitably fall into positive feedback loops, which will manifest as (you guessed it) power laws.

Once more with the chorus: power laws.

Benford’s Law

By this point I’m sure you all believe me about the ubiquity (and thus impotence) of power laws. Given that, what I’m about to say shouldn’t surprise you but, if you haven’t heard of it before, I promise you it will still come as a shock. It’s called Benford’s Law, and what it says is equal parts phenomenal and absurd.

Consult your local post office records (it’s okay, I’ll wait) and find a list of every street address in your county. Got it? Good. Now, get rid of everything but the street numbers; that is, if one address reads “1600 Pennsylvania Avenue, Northwest Washington, DC 20500,” get rid of everything but the “1600.” We’re actually going to go even further, get rid of everything but the first digit of the street address.

Now, you’re left with a long list of single-digits that used to be the first digits of street addresses. If you were to count all of those digits, you would find that digit “1” is used significantly more frequently than digit “2.” Further, digit “2” appears significantly more frequently than digit “3.” This trend continues down to digit “9.”

That’s odd, but not terribly shocking. It gets more shocking when you find out that the same process (looking at the first digits of numbers and seeing that 1 is used more than 2, 2 more than 3, 3 more than 4, etc.) holds true for the area of rivers, physical constants, socio-economic data, the heights of skyscrapers, death rates, numbers picked at random from issues of Readers’ Digest, etc., etc., etc. In an absurdly wide range of cases, the probability distribution of leading digits in lists of numbers shows lower digits appear exponentially more frequently than higher digits. In fact, when looking at the heights of the world’s tallest buildings, it holds true no matter the scale; that is, if heights are measured in yards, miles, kilometers, or furlongs, Benford’s law still holds.

So what’s going on here? It turns out that if you take the logarithm of all the numbers in these sets, the numbers turn out to be uniformly distributed 3. I won’t unpack logs here (I’ve already gone on too long, haven’t I?), but this basically means that if you take into account powers and exponents, all of these numbers are actually uniformly distributed. The universe tends to filter random numbers into exponential equations before we have a chance to count them, but once we do count them, all we see are power laws.
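
One classic way to see Benford's law without leaving your desk: the leading digits of the powers of two follow it almost exactly, and Benford's predicted share for digit d is log10(1 + 1/d), about 30.1% for 1s down to 4.6% for 9s. A minimal sketch:

powers <- 2^(1:1000)                 # stays within double precision

first_digit <- function(v) floor(v / 10^floor(log10(v)))

observed <- as.numeric(table(factor(first_digit(powers), levels = 1:9))) / 1000
benford  <- log10(1 + 1 / (1:9))     # Benford's predicted frequencies

round(rbind(observed, benford), 3)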

Conclusion

About four thousand words ago, I wrote that reporting power laws in your data is about as useful as reporting furry ears on your dog. That neither means that power laws are useless nor that they should never be reported. It’s just that they’re not particularly exciting by themselves. If you’ve found a power law in your data, that’s great! Now comes the time when you have to actually describe the line that you’ve found. Finding the parameters for the power law and comparing them against other similar datasets can be exceptionally enlightening, as would using the trends found for prediction or comparison.

Too often (at least once in every conference I’ve attended in the last year), I see people mentioning that their data followed a power law and leaving it at that. Oy vey. Alright, now that that’s out of the way, I promise the next Networks Demystified post will actually be about networks again.

Notes:

  1. I opt not to use words like “mass,” “log-normal,” or “exponential,” but don’t worry, I have a rant about people confusing those as well. It’ll come out soon enough.
  2. Actually it’s log-normal, but it’s close enough for our purposes here.
  3. The log-uniform distribution is actually a special case of a power law with a parameter of -1. Nifty, huh?

The Myth of Text Analytics and Unobtrusive Measurement

Text analytics are often used in the social sciences as a way of unobtrusively observing people and their interactions. Humanists tend to approach the supporting algorithms with skepticism, and with good reason. This post is about the difficulties of using words or counts as a proxy for some secondary or deeper meaning. Although I offer no solutions here, readers of the blog will know I am hopeful of the promise of these sorts of measurements if used appropriately, and right now, we’re still too close to the cutting edge to know exactly what that means. There are, however, copious examples of text analytics used well in the humanities (most recently, for example, Joanna Guldi’s  publication on the history of walking).

The Confusion

Klout is a web service which ranks your social influence based on your internet activity. I don’t know how Klout’s algorithm works (and I doubt they’d be terribly forthcoming if I asked), but one of the products of that algorithm is a list of topics about which you are influential. For instance, Klout believes me to be quite influential with regards to Money (really? I don’t even have any of that.) and Journalism (uhmm.. no.), somewhat influential in Juggling (spot on.), Pizza (I guess I am from New York…), Scholarship (Sure!), and iPads (I’ve never touched an iPad.), and vaguely influential on the topic of Cars (nope) and Mining (do they mean text mining?).

My pizza expertise is clear. (Image by Ildar Sagdejev (Specious), CC BY-SA, via Wikimedia Commons)

Thankfully careers don’t ride on this measurement (we have other metrics for that), but the danger is still fairly clear: the confusion of vocabulary and syntax for semantics and pragmatics. There are clear layers between the written word and its intended meaning, and those layers often depend on context and prior knowledge. Further, regardless of the intended meaning of the author, how her words are interpreted in the larger world can vary wildly. She may talk about money and pizza until she is blue in the face, but if the whole world disagrees with her, that is no measurement of expertise nor influence (even if angry pizza-lovers frequently shout at her about her pizza opinions).

We see very simple examples of this in sentiment  analysis, a way to extract the attitude of the writer toward whatever it was he’s written. An old friend who recently dipped his fingers in sentiment analysis wrote this:

According to his algorithm, that sentence was a positive one. Unless I seriously misunderstand my social cues (which I suppose wouldn’t be too unlikely), I very much doubt the intended positivity of the author. However, most decent algorithms would pick up that this was a tweet from somebody who was positive about Sarah Jessica Parker.

Unobtrusive Measurements

This particular approach to understanding humans belongs to the larger methodological class of unobtrusive measurements. Generally speaking, this topic is discussed in the context of the social sciences and is contrasted with more ‘obtrusive’ measurements along the lines of interviews or sticking people in labs. Historians generally don’t need to talk about unobtrusive measurements because, hey, the only way we could be obtrusive to our subjects would require exhuming bodies. It’s the idea that you can cleverly infer things about people from a distance, without them knowing that they are being studied.

Notice the disconnect between what I just said, and the word itself. ‘Unobtrusive’ against “without them knowing that they are being studied.” These are clearly not the same thing, and that distinction between definition and word is fairly important – and not merely in the context of this discussion. One classic example (Doob and Gross, 1968) asks how somebody’s social status determines whether someone might take aggressive action against them. They specifically measured a driver’s likelihood to honk his horn in frustration based on the perceived social status of the driver in front of him. Using a new luxury car and an old rusty station wagon, the researchers would stop at traffic lights that had turned green and would wait to see whether the car behind them honked. In the end, significantly more people honked at the low status car. More succinctly: status affects decisions of aggression. Honking and the perceived worth of the car were used as proxies for aggression and perceptions of status, much like vocabulary is used as a proxy for meaning.

In no world would this be considered unobtrusive from the subject’s point of view. The experimenters intruded on their world, and their actions and lives changed because of it. All it says is that the subjects won’t change their behavior based on the knowledge that they are being studied. However, when an unobtrusive experiment becomes large enough, even one as innocuous as counting words, even that advantage no longer holds. Take, for example, citation analysis and the h-index. Citation analysis was initially construed as an unobtrusive measurement; we can say things about scholars and scholarly communication by looking at their citation patterns rather than interviewing them directly. However, now that entire nations (like Australia or the UK) use quantitative analysis to distribute funding to scholarship, the measurements are no longer unobtrusive. Scholars know how the new scholarly economy works, and have no problem changing their practices to get tenure, funding, etc.

The Measurement and The Effect: Untested Proxies

A paper was recently published (O’Boyle Jr. and Aguinis, 2012) on the non-normality of individual performance. The idea is that we assume people’s performance (for example, students in a classroom) is normally distributed along a bell curve. A few kids get really good grades, a few kids get really bad grades, but most are ‘C’ students. The authors challenge this view, suggesting performance takes on more of a power-law distribution, where very few people perform very well, and the majority perform very poorly, with 80% of people performing worse than the statistical average. If that’s hard to imagine, it’s because people are trained to think of averages on a bell curve, where 50% are greater than average and 50% are worse than average. Instead, imagine one person gets a score of 100, and another five people get scores of 10. The average is (100 + (10 * 5)) / 6 = 25, which means five out of the six people performed worse than average.
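
That toy example takes two lines to check, and a slightly larger heavy-tailed sample (invented numbers, not the paper's data) behaves the same way:

scores <- c(100, 10, 10, 10, 10, 10)
mean(scores)                          # 25
mean(scores < mean(scores))           # 5 of the 6 people are below average

set.seed(42)
perf <- (1 - runif(1000))^(-1 / 1.5)  # a heavy-tailed fake "performance" sample
mean(perf < mean(perf))               # typically somewhere around 0.8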

It’s an interesting hypothesis, and (in my opinion) probably a correct one, but their paper does not do a great job showing that. The reason is (you guessed it) they use scores as a proxy for performance. For example, they look at the number of published papers individuals have in top-tier journals, and show that some authors are very productive whereas most are not. However, it’s a fairly widely-known phenomenon that in science, famous names are more likely to be published than obscure ones (there are many anecdotes about anonymous papers being rejected until the original, famous author is revealed, at which point the paper is magically accepted). The number of accepted papers may be as much a proxy for fame as it is for performance, so the results do not support their hypothesis. The authors then look at awards given to actors and writers; however, those awards suffer the same issues: the more well-known an actor, the more likely they’ll be cast in good movies, the more likely they’ll be visible to award-givers, etc. Again, awards are not a clean proxy for the quality of a performance. The paper then goes on to measure elected officials based on votes in elections. I don’t think I need to go on about how votes might not map one-to-one onto the performance and prowess of an elected official.

I blogged a review of the most recent culturomics paper, which used google ngrams to look at the frequency of recurring natural disasters (earthquakes, floods, etc.) vs. the frequency of recurring social events (war, unemployment, etc.). The paper concludes that, because of differences in the frequency of word-use for words like ‘war’ or ‘earthquake’, the phenomena themselves are subject to different laws. The authors use word frequency as a proxy for the frequency of the events themselves, much in the same way that Klout seems to measure influence based on word-usage and counting. The problem, of course, is that the processes which govern what people decide to write down do not enjoy a one-to-one relationship to what people experience. Using words as proxies for events is just as problematic as using them for proxies of expertise, influence, or performance. The underlying processes are simply far more complicated than these algorithms give them credit for.

It should be noted, however, that the counts are not meaningless; they just don’t necessarily work as proxies for what these ngram scholars are trying to measure. Further, although the underlying processes are quite complex, the effect size of social or political pressure on word-use may be negligible to the point that their hypothesis is actually correct. The point isn’t that one cannot use one measurement as a proxy for something else; rather, it’s that the effectiveness of that proxy is assumed rather than actually explored or tested in any way. We need to do a better job, especially as humanists, of figuring out exactly how certain measurements map onto the effects we seek.

A beautiful case study that exemplifies this point was written by the famous statistician Andrew Gelman, and it aims to use unobtrusive and indirect measurements to find alien attacks and zombie outbreaks. He uses Google Trends to show that the number of zombies in the world is growing at a frightening rate.

Zombies will soon take over!

 

Quick Followup to Avoiding Traps

So apparently yesterday was a big day for hypothesis testing and discovery. Stanley Fish’s third post on Digital Humanities also brought up the issue of fishing for correlations, although his post was… slightly more polemic. Rather than going over it on this blog, I’ll let Ted Underwood describe it. Anybody who read my post on Avoiding Traps should also read Underwood’s post; it highlights the role of discovery in the humanities as a continuous process of appraisal and re-appraisal, both on the quantitative and qualitative side.

…the significance of any single test is reduced when it’s run as part of a large battery.

That’s a valid observation, but it’s also a problem that people who do data mining are quite self-conscious about. It’s why I never stop linking to this xkcd comic about “significance.” And it’s why Matt Wilkens (targeted by Fish as an emblem of this interpretive sin) goes through a deliberately iterative process of first framing hypotheses about nineteenth-century geographical imagination and then testing them more stringently. (For instance, after noticing that coastal states initially seem more prominent in American fiction than the Midwest, he tests whether this remains true after you compensate for differences in population size, and then proposes a hypothesis that he suggests will need to be confirmed by additional “test cases.”)

It’s important to keep in mind that Reichenbach’s old distinction between discovery and justification is not so clear-cut as it was originally conceived. How we generate our hypotheses, and how we support them to ourselves and the world at large, is part of the ongoing process of research. In my last post, I suggested people keep clear ideas of what they plan on testing before they begin testing; let me qualify that slightly. One of the amazing benefits of Big Data has been the ability to spot trends we were not looking for; an unexpected trend in the data can lead us to a new hypothesis, one which might be fruitful and interesting. The task, then, is to be clever enough to devise further tests to confirm the hypothesis in a way that isn’t circular, relying on the initial evidence that led you toward it.

… I like books with pictures. When I started this blog, I promised myself I’d have a picture in every post. I can’t think of one that’s relevant, so here’s an angry cupcake:

http://melivillosa.deviantart.com/

Avoiding traps

We have the advantage of arriving late to the game.

In the cut-throat world of high-tech venture capitalism, the first company with a good idea often finds itself at the mercy of latecomers. The latecomer’s product might be better-thought-out, advertised to a more appropriate market, or simply prettier, but in each case that improvement comes through hindsight. Trailblazers might get there first, but their going is slowest, and their way the most dangerous.

Digital humanities finds itself teetering on the methodological edge of many existing disciplines, boldly going where quite a few have gone before. When I’ve blogged before about the dangers of methodology appropriation, it was in the spirit of guarding against our misunderstanding of foundational aspects of various methodologies. This post is instead about avoiding the monsters already encountered (and occasionally vanquished) by other disciplines.

If a map already exists with all the dragons' hideouts, we should probably use it. (Image from the Carta Marina)

Everything Old Is New Again

A collective guffaw probably accompanied my defining digital humanities as a “new” discipline. Digital humanities itself has a rich history dating back to big iron computers in the 1950s, and the humanities in general, well… they’re old. Probably older than my grandparents.

The important point, however, is that we find ourselves in a state of re-definition. While this is not the first time, and it certainly will not be the last, this state is exceptionally useful in planning against future problems. Our blogosphere cup overfloweth with definitions of and guides to the digital humanities, many of our journals are still in their infancy, and our curricula are over-ready for massive reconstruction. Generally (from what I’ve seen), everyone involved in these processes is really excited and open to new ideas, which should ease the process of avoiding monsters.

Most of the below examples, and possible solutions, are drawn from the same issues of bias I’ve previously discussed. Also, the majority are meta-difficulties. While some of the listed dangers are avoidable when writing papers and doing research, most are systematic at the level of the discipline. That is, despite any researcher’s best efforts, the aggregate knowledge we gain while reading the newest exciting articles might fundamentally mislead us. While these dangers have never been wholly absent from the humanities, our recent love of big data profoundly increases their effect sizes.

An architect from Florida might not be great at designing earthquake-proof housing, and while earthquakes are still a distant danger, this shouldn’t really affect how he does his job at home. If the same architect moves to California, odds are he’ll need to learn some extra precautions. The same is true for a digital humanist attempting to make inferences from lots of data, or from a bunch of studies which all utilize lots of data. Traditionally, when looking at the concrete and particular, evidence for something is necessary and (with enough evidence) sufficient to believe in that thing. In aggregate, evidence for is necessary but not sufficient to identify a trend, because that trend may be dwarfed by or correlated to some other data that are not available.

Don't let Florida architects design your California home. (Image by Claudio Núñez, through Wikimedia Commons)

The below lessons are not all applicable to DH as it exists today, and of course we need to adapt them to our own research (their meaning changes in light of our different material of study); still, they’re worth pointing out and, perhaps, guarding against. Many traditional sciences still struggle with these issues due to institutional inertia. Their journals have acted in such a way for so long, so why change it now? Their tenure processes have acted in such a way for so long, so why change it now? We’re already restructuring, and we have a great many rules that are still in flux, so we can change it now.

Anyway, I’ve been dancing around the examples for way too long, so here’s the meat:

Sampling and Selection Bias

The problem here is actually two-fold, both for the author of a study, and for the reader of several studies. We’ll start with the author-centric issues.

Sampling and Selection Bias in Experimental Design

People talk about sampling and selection biases in different ways, but for the purpose of this post we’ll use wikipedia’s definition:

Selection bias is a statistical bias in which there is an error in choosing the individuals or groups to take part in a scientific study.

A distinction, albeit not universally accepted, of sampling bias [from selection bias] is that it undermines the external validity of a test (the ability of its results to be generalized to the rest of the population), while selection bias mainly addresses internal validity for differences or similarities found in the sample at hand. In this sense, errors occurring in the process of gathering the sample or cohort cause sampling bias, while errors in any process thereafter cause selection bias.

In this case, we’ll say a study exhibits a sampling error if the conclusions drawn from the data at hand, while internally valid, do not actually hold true for the world around it. Let’s say I’m analyzing the prevalence of certain grievances in the cahiers de doléances from the French Revolution. One study showed that, of all the lists written, those from urban areas were significantly more likely to survive to today. Any content analysis I perform on those lists will be biased toward the grievances of people from urban areas, because my sample is not representative. Conclusions I draw about grievances in general will be inaccurate, unless I explicitly take into account which sorts of documents I’m missing.

Selection bias can be insidious, and many varieties can be harder to spot than sampling bias. I’ll discuss two related phenomena of selection bias which lead to false positives, those pesky statistical effects which leave us believing we’ve found something exciting when all we really have is hot air.

Data Dredging

The first issue, probably the most relevant to big-data digital humanists, is data dredging. When you have a lot of data (and increasingly more of us have just that), it’s very tempting to just try to find correlations between absolutely everything. In fact, as exploratory humanists, that’s what we often do: get a lot of stuff, try to understand it by looking at it from every angle, and then write anything interesting we notice. This is a problem. The more data you have, the more statistically likely it is that it will contain false-positive correlations.

Google has lots of data, let’s use them as an example! We can look at search frequencies over time to try to learn something about the world. For example, people search for “Christmas” around and leading up to December, but that search term declines sharply once January hits. Comparing that search with searches for “Santa”, we see the two results are pretty well correlated, with both spiking around the same time. From that, we might infer that the two are somehow related, and would do some further studies.

Unfortunately, Google has a lot of data, and a lot of searches, and if we just looked for every search term that correlated well with any other over time, well, we’d come up with a lot of nonsense. Apparently searches for “losing weight” and “2 bedroom” are 93.6% correlated over time. Perhaps there is a good reason, perhaps there is not, but this is a good cautionary tale that the more data you have, the more seemingly nonsensical correlations will appear. It is then very easy to cherry pick only the ones that seem interesting to you, or which support your hypothesis, and to publish those.

Comparing searches for "losing weight" (blue) against "2 bedroom" (red) over time, using Google Trends.
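
You don't need Google's data to fall into this trap. Here's a toy simulation: fifty random walks standing in for unrelated search terms; even though nothing is genuinely connected, the single best pairwise correlation usually looks spectacular.

set.seed(42)
walks <- replicate(50, cumsum(rnorm(100)))   # one column per fake "search term"

cors <- cor(walks)                           # every pairwise correlation
diag(cors) <- NA
max(abs(cors), na.rm = TRUE)                 # often well above 0.9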

Cherry Picking

The other type of selection bias leading to false positives I’d like to discuss is cherry picking. This is selective use of evidence, cutting data away until the desired hypothesis appears to be the correct one. The humanities, not really known for their hypothesis testing, are not quite as likely to be bothered by this issue, but it’s still something to watch out for. This is also related to confirmation bias, the tendency for people to only notice evidence for that which they already believe.

Much like data dredging, cherry picking is often done without the knowledge or intent of the researcher. It arises out of what Simmons, Nelson, and Simonsohn (2011) call researcher degrees of freedom. Researchers often make decisions on the fly:

Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding [that is significant] is [itself necessarily significant]. This exploratory behavior is not the by-product of malicious intent, but rather the result of two factors: (a) ambiguity in how best to make these decisions and (b) the researcher’s desire to find a statistically significant result.

When faced with decisions of how to proceed with analysis, we will almost invariably (and inadvertently) favor the decision that results in our hypothesis seeming more plausible.

If I go into my favorite dataset (The Republic of Letters!) trying to show that Scholar A was very similar to Scholar B in many ways, odds are I could do that no matter who the scholars were, so long as I had enough data. If you take a cookie-cutter to your data, don’t be surprised when cookie-shaped bits come out the other side.

Sampling and Selection Bias in Meta-Analysis

There are copious examples of problems with meta-analysis. Meta-analysis is, essentially, a quantitative review of studies on a particular subject. For example, a medical meta-analysis could review data from hundreds of small studies testing the side-effects of a particular medicine, bringing them all together and drawing new or more certain conclusions via the combination of data. Sometimes these are done to gain a larger sample size, or to show how effects change across different samples, or to provide evidence that one non-conforming study was indeed a statistical anomaly.

A meta-analysis is the quantitative alternative to something every one of us in academia does frequently: read a lot of papers or books, find connections, draw inferences, explore new avenues, and publish novel conclusions. Because quantitative meta-analysis is so similar to what we do, we can use the problems it faces to learn more about the problems we face, but which are more difficult to see. A criticism oft-lobbed at meta-analyses is that of garbage in – garbage out; the data used for the meta-analysis is not representative (or otherwise flawed), so the conclusions as well are flawed.

There are a number of reasons why the data in might be garbage, some of which I’ll cover below. It’s worth pointing out that the issues above (cherry-picking and data dredging) also play a role, because if the majority of studies are biased toward larger effect sizes, then the overall perceived effect across papers will appear systematically larger. This is not only true of quantitative meta-analysis; when every day we read about trends and connections that may not be there, no matter how discerning we are, some of those connections will stick and our impressions of the world will be affected. Correlation might not imply anything.

Before we get into publication bias, I will write a short aside that I was really hoping to avoid but that really needs to be discussed. I’ll dedicate a post to it eventually, when I feel like punishing myself, but for now, here’s my summary of

The Problems with P

Most of you have heard of p-values. A lucky few of you have never heard of them, and so do not need to be untrained and retrained. A majority of you probably hold a view similar to a high-ranking, well-published, and well-learned professor I met recently. “All I know about statistics,” he said, “is that p-value formula you need to show whether or not your hypothesis is correct. It needs to be under .05.” Many of you (more and more these days) are aware of the problems with that statement, and I thank you from the bottom of my heart.

Let’s talk about statistics.

The problems with p-values are innumerable (let me count the ways), and I will not get into most of them here. Essentially, though, the p-value is meant to tell you how likely it is that results at least as strong as yours would appear by random chance alone. In many studies which rely on statistics, the process works like this: begin with a hypothesis, run an experiment, analyze the data, calculate the p-value. The researcher then publishes something along the lines of “my hypothesis is correct because p is under 0.05.”

Most people working with p-values know that it has something to do with the null hypothesis (that is, the default position; the position that there is no correlation between the measured phenomena). They work under the assumption that the p-value is the likelihood that the null hypothesis is true. That is, if the p-value is 0.75, it’s 75% likely that the null hypothesis is true, and there is no correlation between the variables being studied. Generally, the cut-off to get published is 0.05; you can only publish your results if it’s less than 5% likely that the null hypothesis is true, or more than 95% likely that your hypothesis is true. That means you’re pretty darn certain of your result.

Unfortunately, most of that isn’t actually how p-values work. Wikipedia writes:

The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.

In a nutshell: assuming there is no correlation between two variables, what’s the likelihood that they’ll appear as correlated as you observed in your experiment by chance alone? If your p-value is .05, that means that, if there really were no correlation, data as correlated as yours would show up by chance only 5% of the time. That is, out of every twenty studies run on variables with no real relationship, about one (5%) will turn up a correlation that doesn’t really exist and still clear the p = 0.05 bar.

Wikipedia's image explaining p-values.

To recap: p-values say nothing about your hypothesis. They say, assuming there is no real correlation, what’s the likelihood that your data show one anyway? Also, in the scholarly community, a result is considered “significant” if p is less than or equal to 0.05. Alright, I’m glad that’s out of the way, now we’re all on the same footing.
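
If you'd like to see that 5% in action, here's a quick simulation: thousands of experiments comparing two groups drawn from exactly the same distribution, so the null hypothesis is true every single time.

set.seed(42)
p_values <- replicate(10000, t.test(rnorm(30), rnorm(30))$p.value)

mean(p_values < 0.05)    # roughly 0.05: "significant" results with nothing there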

Publication Biases

The positive results bias, the first of many interrelated publication biases, simply states that positive results are more likely to get published than negative or inconclusive ones. Authors and editors will be more likely to submit and accept work if the results are significant (p < .05). The file drawer problem is the opposite effect: negative results are more likely to be stuck in somebody’s file drawer, never to see the light of day. HARKing (Hypothesizing After the Results Are Known), much like cherry-picking above, is when, if during the course of a study many trials and analyses occur, only the “significant” ones are ever published.

Let’s begin with HARKing. Recall that a p-value is (basically) the likelihood that an apparent effect occurred by chance alone. If one research project consists of 100 different trials and analyses, and only 5 of them yield significant results pointing toward the author’s hypothesis, those 5 results quite likely occurred by chance. They could still be published (often without the researcher even realizing they were cherry-picking, because obviously non-fruitful analyses might be stopped before they’re even finished). Thus, again, more positive results are published than perhaps there ought to be.

Let’s assume some people are perfect in every way, shape, and form. Every single one of their studies is performed with perfect statistical rigor, and all of their results are sound. Again, however, they only publish their positive results – the negative ones are kept in the file drawer. Again, the published record contains a far higher share of positive results than the research actually performed.

Who cares? So what that we’re only seeing the good stuff?

The problem is that, using the common significance threshold of p < 0.05, some published positive results will have occurred by chance alone. However, since we cannot see the studies that haven’t been published because their results were negative, those chance findings are given all the scholarly weight. One hundred small studies are done on the efficacy of some medicine for some disease; only five, by chance, find some correlation – they are published. Let’s be liberal, and say another three are published saying there was no correlation between treatment and cure. Thus, an outside observer will see the evidence stacked in favor of the (ineffectual) medication.

xkcd's take on significance testing. (comic 882)
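
The medicine example above is easy to simulate. In the sketch below the treatment does nothing at all, one hundred small studies are run, and only the "significant" ones escape the file drawer; the published record ends up looking like evidence for an effect that does not exist.

set.seed(42)
run_study <- function() {
  treated <- rnorm(30)               # "treated" and "control" are drawn from
  control <- rnorm(30)               # the same distribution: zero true effect
  c(effect = mean(treated) - mean(control),
    p      = t.test(treated, control)$p.value)
}

studies   <- t(replicate(100, run_study()))
published <- studies[studies[, "p"] < 0.05, , drop = FALSE]

nrow(published)                      # a handful of chance "positive" findings
round(published[, "effect"], 2)      # and their effect sizes look respectable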

The Decline Effect

A recent much-discussed article by Jonah Lehrer, as well as countless studies by John Ioannidis and others, show two things: (1) a large portion of published findings are false (some of the reasons are shown above), and (2) the effects reported in scientific findings seem to decline over time. A study is published showing a very noticeable effect of some medicine curing a disease, and further tests tend to show that very noticeable effect declining sharply. (2) is mostly caused by (1). Much ink (or blood) could be spilled discussing this topic, but this is not the place for it.

Biases! Everywhere!

So there are a lot of biases in rigorous quantitative studies. Why should humanists care? We’re aware that people are not perfect, that research is contingent, that we each bring our own subjective experiences to the table, and they shape our publications and our outlooks, and none of those are necessarily bad things.

The issues arise when we start using statistics, or algorithms derived using statistics, and other methods used by our quantitative brethren. Make no mistake, our qualitative assessments are often subject to the same biases, but it’s easier to write reflexively on your own position when you are only one person, one data-point. In the age of Big Data, with multiplying uncertainties for any bit of data we collect, it is far easier to lose track of small unknowns in the larger picture. We have the opportunity of learning from past mistakes so we can be free to make mistakes of our own.

Solutions?

Ioannidis’ most famous article is, undoubtedly, the polemic “Why Most Published Research Findings Are False.” With a statement like that, what hope is there? Ioannidis himself has some good suggestions, and there are many floating around out there; as with anything, the first step is becoming cognizant of the problems, and the next step is fixing them. Digital humanities may be able to avoid inheriting these problems entirely, if we’re careful.

We’re already a big step ahead of the game, actually, because of the nearly nonsensical volumes of tweets and blog posts on nascent research. In response to publication bias and the file drawer problem, many people suggest that authors submit their experiments to a registry before they begin their research. That way, it’s completely visible what experiments on a subject have been run that did not yield positive results, regardless of whether they eventually became published. Digital humanists are constantly throwing out ideas and preliminary results, which should help guard against misunderstandings through publication bias. We have to talk about all the effort we put into something, especially when nothing interesting comes out of it. The fact that some scholar felt there should be something interesting, and there wasn’t, is itself interesting.

At this point, “replication studies” means very little in the humanities; however, if we begin heading down a road where replication studies become more feasible, our journals will need to be willing to accept them just as they accept novel research. Funding agencies should also be just as willing to fund old, non-risky continuation research as they are the new exciting stuff.

Other institutional changes needed to guard against this sort of thing include open access publication (so everyone draws inferences from the same base set of research), tenure boards that accept negative and exploratory research (again, not as large of an issue for the humanities), and restructured curricula that teach quantitative methods and their pitfalls, especially statistics.

On the ground level, a good knowledge of statistics (especially Bayesian statistics, doing away with p-values entirely) will be essential as more data becomes available to us. When running analysis on data, to guard against coming up with results that appear by random chance, we have to design an experiment before running it, stick to the plan, and publish all results, not just ones that fit our hypotheses. The false-positive psychology paper I mentioned above actually has a lot of good suggestions to guard against this effect:

For authors:

  1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.
  2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.
  3. Authors must list all variables collected in a study.
  4. Authors must report all experimental conditions, including failed manipulations.
  5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.
  6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.

For reviewers:

  1. Reviewers should ensure that authors follow the requirements.
  2. Reviewers should be more tolerant of imperfections in results.
  3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
  4. If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.

Going Forward

This list of problems and solutions is neither exhaustive nor representative. That is, there are a lot of biases out there unlisted, and not all the ones listed are the most prevalent. Gender and power biases come to mind; however, they are well beyond anything I could intelligently argue, and there are issues of peer review and retraction rates that are an entirely different can of worms.

Also, the humanities are simply different. We don’t exactly test hypotheses, we’re not looking for ground truths, and our publication criteria are very different from those of the natural and social sciences. It seems clear that the issues listed above will have some mapping onto our own research going forward, but I make no claims at understanding exactly how or where. My hope in this blog post is to raise awareness of some of the more pressing concerns in quantitative studies that might have bearing on our own studies, so we can try to understand how they will be relevant to our own research, and how we might guard against them.