
Acceptances to Digital Humanities 2013 (part 1)

The 2013 Digital Humanities conference in Nebraska just released its program with a list of papers and participants. As some readers may recall, when the initial round of reviews went out for the conference, I tried my hand at analyzing submissions to DH2013. Now that the schedule has been released, the available data put us in a unique position to compare proposed against accepted submissions, potentially revealing how the research being done compares with the research the DH community (through reviews) finds good or interesting. In my last post, I showed that literary studies and data/text mining submissions were at the top of the list; only half as many studies were historical rather than literary. Archive work and visualizations were also near the top of the list, above multimedia, web, and content analyses, though each of those was high as well.

A keyword analysis showed that while Visualization wasn’t necessarily at the top of the list, it was the most central concept connecting the rest of the conference together. Nobody knows (and few care) what DH really means; however, these analyses present the factors that bind together those who call themselves digital humanists and submit to its main conference. The post below explores to what extent submissions and acceptances align. I preserve anonymity wherever possible, as authors did not submit with the expectation that data on turned-down submissions would be public.

It’s worth starting out with a few basic acceptance summary statistics. As I don’t have access to poster data yet, nor do I have access to withdrawals, I can’t calculate the full acceptance rate, but there are a few numbers worth mentioning. Just take all of the percentages as lower bounds, since withdrawals or posters might make the acceptance rate higher. Of the 144 long papers submitted, 66.7% (96) were accepted, although only 57.6% (83) were accepted as long papers; another 13 were accepted as short papers instead. Half of the submitted panels were accepted, although curiously, one of the panels was accepted instead as a long paper. For short papers, only 55.9% of those submitted were accepted. There were 66 poster submissions, but I do not know how many of those were accepted, or how many other submissions were accepted as posters instead. In all, excluding posters, 60.9% of submitted proposals were accepted. More long papers than short papers were submitted, but roughly equal numbers of both were accepted. People who were turned down should feel comforted by the fact that they faced some stiff competition.
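For anyone who wants to check the long-paper arithmetic, here is a minimal sketch that uses only the counts reported above. The variable and function names are my own, made up for illustration.

```python
# Long-paper counts as reported in this post; nothing else is assumed.
long_submitted = 144
long_accepted_as_long = 83
long_accepted_as_short = 13


def rate(accepted, submitted):
    """Return the acceptance rate as a percentage."""
    return 100 * accepted / submitted


# Long-paper submissions accepted in any form (as long or as short papers).
long_accepted_any = long_accepted_as_long + long_accepted_as_short

rate_any = rate(long_accepted_any, long_submitted)
rate_long = rate(long_accepted_as_long, long_submitted)

print(f"accepted in any form: {long_accepted_any} ({rate_any:.1f}%)")
print(f"accepted as long papers: {long_accepted_as_long} ({rate_long:.1f}%)")
```

Because posters and withdrawals are missing, these are still lower bounds on the true acceptance rate.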

As with most quantitative analyses, the interesting bits come more when comparing internal data than when looking at everything in aggregate. The first three graphs do just that, and are in fact the same data, but ordered differently. When authors submitted their papers to the conference, they could pick any number of keywords from a controlled vocabulary. Looking at how many times each keyword was submitted with a paper (Figure 1) can give us a basic sense of what people are doing in the digital humanities. From Figure 1 we see (again, as a version of this viz appeared in the last post) that “Literary Studies” and “Text Mining” are the most popular keywords among those who submitted to DH2013; the rest you can see for yourself. The total height of the bar (red + yellow) represents the number of total submissions to the conference.

Figure 1: Acceptance rates of DH2013 by Keywords attached to submissions, sorted by number of submissions.

Figure 2 shows the same data as Figure 1, but sorted by acceptance rates rather than the total number of submissions. As before, because we don’t know about poster acceptance rates or withdrawals, you should take these data with a grain of salt, but assuming a fairly uniform withdrawal/poster rate, we can still make some basic observations. It’s also worth pointing out that the fewer overall submissions to the conference with a certain keyword, the less statistically meaningful the acceptance rate; with only one submission, whether or not it’s accepted could as much be due to chance as due to some trend in the minds of DH reviewers.
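To put rough numbers on how little a small count tells us, here is a sketch using the Wilson score interval. It is a standard way to put an interval around a proportion, but it is not something I used for the figures in this post. The function name is my own, and the one-submission keyword is hypothetical.

```python
import math


def wilson_interval(accepted, submitted, z=1.96):
    """Approximate 95% Wilson score interval for an acceptance proportion."""
    if submitted <= 0:
        raise ValueError("need at least one submission")
    p = accepted / submitted
    denom = 1 + z ** 2 / submitted
    centre = (p + z ** 2 / (2 * submitted)) / denom
    half = z * math.sqrt(
        p * (1 - p) / submitted + z ** 2 / (4 * submitted ** 2)
    ) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


# Long papers overall (counts from this post) versus a hypothetical
# keyword with a single submission that happened to be accepted.
for label, accepted, submitted in [("long papers", 96, 144),
                                   ("one-off keyword", 1, 1)]:
    low, high = wilson_interval(accepted, submitted)
    print(f"{label}: {accepted}/{submitted}, interval {low:.2f} to {high:.2f}")
```

The interval for the single-submission keyword spans most of the range between 0 and 1, which is the point: its 100% acceptance rate says almost nothing about reviewer preferences.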

With those caveats in mind, Figure 2 can be explored. One thing that immediately pops out is that “Literary Studies” and “Text Mining” both have higher than average acceptance rates, suggesting that not only are a lot of DHers doing that kind of research, but that it is still interesting enough that a large portion of it is getting accepted as well. Contrast this with the topic of “Visualization,” whose acceptance rate is closer to 40%, well below the average acceptance rate of 60%. Perhaps this means that most reviewers thought visualizations worked better as posters, the data for which we do not have, or perhaps it means that the relatively low barrier to entry on visualizations and their ensuing proliferation make them more fun to do than interesting to read or review.

“Digitisation – Theory and Practice” has a nearly 60% acceptance rate, yet “Digitisation; Resource Creation; and Discovery” has around 40%, suggesting that perhaps reviewers are more interested in discussions about digitisation than the actual projects themselves, even though far more “Digitisation; Resource Creation; and Discovery” papers were submitted than “Digitisation – Theory and Practice.” The imbalance between what was submitted and what was accepted on that front is particularly telling, and worth a more in-depth exploration by those who are closer to the subject. Also tucked at the bottom of the acceptance rate list are three related keywords: “Digital Humanities – Institutional Support,” “Digital Humanities – Facilities,” & “Glam: Galleries; Libraries; Archives; Museums,” each with a 25% acceptance rate. It’s clear the reviewers were not nearly as interested in digital humanities infrastructure as they were in digital humanities research. As I’ve noted a few times before, “Historical Studies” is also not well-represented, with both a lower acceptance rate than average and a lower submission rate than average. Modern digital humanities, at least as it is represented by this conference, appears far more literary than historical.

Figure 2. Acceptance rates of DH2013 by Keywords attached to submissions, sorted by acceptance rate.

Figure 3, once again, has the same data as Figures 2 and 1, but is this time sorted simply by accepted papers and panels. This is the front face of DH2013; the landscape of the conference (and by proxy the discipline) as seen by those attending. While this reorientation of the graph doesn’t show us much we haven’t already seen, it does emphasize the oddly low acceptance rates of infrastructural submissions (facilities, libraries, museums, institutions, etc.) While visualization acceptance rates were a bit low, attendees of the conference will still see a great number of them, because the initial submission rate was so high. Conference goers will see that DH maintains a heavy focus on the many aspects of text: its analysis, its preservation, its interfaces, and so forth. The web also appears well-represented, both in the study of it and development on it. Metadata is perhaps not as strong a focus as it once was (historical DH conference analysis would help in confirming this speculation on my part), and reflexivity, while high (nearly 20 “Digital Humanities – Nature and Significance” submissions), is far from overwhelming.

A few dozen papers will be presented on multimedia beyond simple text – a small but not insignificant subgroup. Fewer still are papers on maps, stylometry, or medieval studies, three subgroups I imagine once had greater representation. They currently each show about the same force as gender studies, which had a surprisingly high acceptance rate of 85% and is likely up-and-coming in the DH world. Pedagogy was much better represented in submissions than acceptances, and a newcomer to the field coming to the conference for the first time would be forgiven for thinking pedagogy was a less important subject in DH than veterans might think it is.

Figure 3. Acceptance rates of DH2013 by Keywords attached to submissions, sorted by number of accepted papers.

As what’s written so far is already a giant wall of text, I’ll go ahead and leave it at this for now. When next I have some time I’ll start analyzing some networks of keywords and titles to find which keywords tend to be used together, and whatever other interesting things might pop up. Suggestions and requests, as always, are welcome.



Liveblogged Review of Macroanalysis by Matthew L. Jockers, Part 2

I just got Matthew L. Jockers’ Macroanalysis in the mail, and I’m excited enough about it to liveblog my review. Here’s the review of part II (Analysis), chapter 5 (metadata). Read Part 1, Part 3, …

Part II: Analysis

Part II of Macroanalysis moves from framing the discussion to presenting a series of case studies around a theme, starting fairly simply in claims and types of analyses and moving into the complex. This section takes up 130 of the 200 pages; in a discipline (or whatever DH is) which has coasted too long on claims that the proof of its utility will be in the pudding (eventually), it’s refreshing to see a book that is at least 65% pudding. That said, with so much substance – particularly with so much new substance – Jockers opens his arguments up for specific critiques.

Aiming for more pudding-based scholarly capital in DH. via brenthor.

Quantitative arguments must by their nature be particularly explicit, without the circuitous language humanists might use to sidestep critiques. Elijah Meeks and others have been arguing for some time now that the requirement to solidify an argument in such a way will ultimately be a benefit to the humanities, allowing faster iteration and improvement on theories. In that spirit, for this section, I offer my critiques of Jockers’ mathematical arguments not because I think they are poor quality, but because I think they are particularly good, and further fine-tuning can only improve them. The review will now proceed one chapter at a time.


Jockers begins his analysis exploring what he calls the “lowest hanging fruit of literary history.” Low hanging fruit can be pretty amazing, as Ted Underwood says, and Jockers wields some fairly simple data in impressive ways. The aim of this chapter is to show that powerful insights can be achieved using long-existing collections of library metadata, using a collection of nearly 800 Irish American works over 250 years as a sample dataset for analysis. Jockers introduces and offsets his results against the work of Charles Fanning, whom he describes as the expert in Irish American fiction in aggregate. A pre-DH scholar, Fanning was limited to looking through only the books he had time to read; an impressive many, according to Jockers, but perhaps not enough. He profiles 300 works, fewer than half of those represented in Jockers’ database.

The first claim made in this chapter is one that argues against a primary assumption of Fanning’s. Fanning expends considerable effort explaining why there was a dearth of Irish American literature between 1900 and 1930; Jockers’ data show this dearth barely existed. Instead, the data suggest, it was only eastern Irish men who had stopped writing. The vacuum did not exist west of the Mississippi, among men or women. Five charts are shown as evidence, one of books published over time, and the other four breaking publication down by gender and location.

Jockers is careful many times to make the point that, with so few data, the results are suggestive rather than conclusive. This, to my mind, is too understated. For the majority of dates in question, the database holds fewer than 6 books per year. When breaking down by gender and location, that number is twice cut in half. Though the explanations of the effects in the graphs are plausible, the likelihood of noise outweighing signal at this granularity is a bit too high to be able to distinguish a just-so story from a credible explanation. Had the data been aggregated in five- or ten-year intervals (as they are in a later figure 5.6), rather than simply averaged across them, the results may have been more credible. The argument may be brought up that, when aggregating across larger intervals, the question of where to break up the data becomes important; however, cutting the data into yearly chunks from January to December is no more arbitrary than cutting them into decades.

There are at least two confounding factors one needs to take into account when doing a temporal analysis like this. The first is that what actually happened in history may be causally contingent, which is to say, there’s no particularly useful causal explanation or historical narrative for a trend. It’s just accidental; the right authors were in the right place at the right time, and all happened to publish books in the same year. Generally speaking, if only around five books are published a year, though sometimes that number is zero and sometimes that number is ten, any trends that we see (say, five years with only a book or two) may credibly be considered due to chance alone, rather than some underlying effect of gender or culture bias.

The second confound is the representativeness of the data sample to some underlying ground truth. Datasets are not necessarily representative of anything; however, as defined by Jockers, his dataset ought to be representative of all Irish American literature within a 250-year timespan. That’s his gold standard. The dataset obviously does not contain every book published under these criteria, so the question is how well his publication numbers match up with the actual numbers he’s interested in. Jockers is in a bit of luck here, because what he’s interested in is whether or not there was a resounding silence among Irish authors; thus, no matter what number his charts show, if they’re more than one or two, it’s enough to disprove Fanning’s hypothesized silence. Any dearth in his data may be accidental; any large publication numbers are not.

This example chart compares a potential “real” underlying publication rate against several simulated potential sample datasets Jockers might have, created by multiplying the “real” dataset by some random number between 0 and 1.

I created the above graphic to better explain the second confounding factor of problematic samples. The thick black line, we can pretend, is the actual number of books published by Irish American authors between 1900 and 1925. As mentioned, Jockers would only know about a subset of those books, so each of the four dotted lines represents a possible dataset that he could be looking at in his database instead of the real, underlying data. I created these four different dotted lines by just multiplying the underlying real data by a random number between 0 and 1 [1]. From this chart it should be clear that it would not be possible for him to report an influx of books when there was a dearth (for example, in 1910, no potential sample dataset would show more than two books published). However, if Jockers wanted to make any other claims besides whether or not there was a dearth (as he tentatively does later on), his available data may be entirely misleading. For example, looking at the red line, Run 4, would suggest that ever-more books were being published between 1910 and 1918, when in fact that number should have decreased rapidly after about 1912.
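For readers who want to play with this at home, here is a minimal sketch of the kind of simulation described above. It is not the script behind the chart. The "real" counts are invented for illustration, and I am assuming an independent random multiplier for each year, rounded down to whole books.

```python
import random

# Hypothetical "real" yearly publication counts for 1900-1925.
# These numbers are invented purely for illustration.
years = list(range(1900, 1926))
real = [6, 5, 7, 6, 4, 3, 3, 2, 2, 1,
        2, 5, 9, 7, 5, 4, 3, 3, 2, 4,
        5, 6, 5, 6, 7, 6]


def simulate_sample(real_counts, rng):
    """Scale each year's count by its own random factor in [0, 1)."""
    return [int(count * rng.random()) for count in real_counts]


rng = random.Random(2013)  # fixed seed so the runs are repeatable
runs = [simulate_sample(real, rng) for _ in range(4)]

for number, run in enumerate(runs, start=1):
    print(f"Run {number}: {run}")
```

By construction no run can ever exceed the real count in any year, which is why a sample can hide books but cannot invent an influx. The shape of a run between any two years, however, can differ arbitrarily from the shape of the real series.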

The correction included in Macroanalysis for this potential difficulty was to use 5-year moving averages for the numbers rather than just showing the raw counts. I would suggest that, because the actual numbers are so small and a change of a small handful of books would look like a huge shift on the graph, this method of aggregation is insufficient to represent the uncertainty of the data. Though his charts show moving averages, they still show small changes year-by-year, which creates a false sense of precision. Jockers’ chart 5.6, which aggregates by decade and does not show these little changes, does a much better job reflecting the uncertainty. Had the data shown hundreds of books per year, the earlier visualizations would have been more justifiable, as small changes would have amounted to less emphasized shifts in the graph.
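The difference between the two kinds of aggregation is easy to see in a small sketch. The counts below are invented, and I am assuming a trailing window for the moving average; I don't know which window alignment Macroanalysis uses.

```python
def moving_average(counts, window=5):
    """Trailing moving average: one value per year once the window is full."""
    return [sum(counts[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(counts))]


def bin_totals(counts, width=10):
    """Total count within consecutive, non-overlapping bins of `width` years."""
    return [sum(counts[i:i + width]) for i in range(0, len(counts), width)]


# Twenty hypothetical years of small, noisy publication counts.
counts = [5, 0, 10, 4, 6, 1, 2, 8, 3, 5,
          2, 1, 0, 2, 1, 6, 9, 4, 7, 5]

print("5-year moving average:", moving_average(counts))
print("decade totals:", bin_totals(counts))
```

The moving average still produces a new point every year, so a chart of it keeps wiggling with each book added or dropped. The decade totals collapse the same twenty years into two numbers, which is less detailed but more honest about how much the data can support.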

It’s worth spending extra time on choices of visual representation, because we have not collectively arrived at a good visual language for humanities data, uncertain as they often are. Nor do we have a set of standard practices in place, as quantitative scientists often do, to represent our data. That lack of standard practice is clear in Macroanalysis; the graphs all have subtitles but no titles, which makes immediate reading difficult. Similarly, axis labels (“count” or “5-year average”) are unclear, and should more accurately reflect the data (“books published per year”), putting the aggregation-level in either an axis subtitle or the legend. Some graphs have no axis labels at all (e.g., 5.12-5.17). Their meanings are clear enough to those who read the text, or those familiar with ngram-style analyses, but should be more clear at-a-glance.

Questions of visual representation and certainty aside, Jockers still provides several powerful observations and insights in this chapter. Figure 5.6, which shows Irish American fiction per capita, reveals that westerners published at a much higher relative rate than easterners, which is a trend worth explaining (and Jockers does) that would not have been visible without this sort of quantitative analysis. The chapter goes on to list many other credible assessments and claims in light of the available data, as well as a litany of potential further questions that might be explored with this sort of analysis.  He also makes the important point that, without quantitative analysis, “cherry-picking of evidence in support of a broad hypothesis seems inevitable in the close-reading scholarly traditions.” Jockers does not go so far as to point out the extension of that rule in data analysis; with so many visible correlations in a quantitative study, one could also cherry-pick those which support one’s hypothesis. That said, cherry-picking no longer seems inevitable. Jockers makes the point that Fanning’s dearth thesis was false because his study was anecdotal, an issue Jockers’ dataset did not suffer from. Quantitative evidence, he claims, is not in competition with evidence from close reading; both together will result in a “more accurate picture of our subject.”

The second half of the chapter moves from publication counting to word analysis. Jockers shows, for example, that eastern authors are less likely to use words in book titles that identify their work as ‘Irish’ than western authors, suggesting lower prejudicial pressures west of the Mississippi may be the cause. He then complexifies the analysis further, looking at “lexical diversity” across titles in any given year – that is, a year is more lexically diverse if the titles of books published that year are more unique and dissimilar from one another. Fanning suggests the years of the famine were marked by a lack of imagination in Irish literature; Jockers’ data supports this claim by showing those years had a lower lexical diversity among book titles. Without getting too much into the math, as this review of a single chapter has already gone on too long, it’s worth pointing out that both the number of titles and the average length of titles in a given year can affect the lexical diversity metric. Jockers points this out in a footnote, but there should have been a graph comparing number of titles per year, length per year, and lexical diversity, to let the readers decide whether the first two variables accounted for the third, or whether to trust the graph as evidence for Fanning’s lack-of-imagination thesis.
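To see why the number and length of titles matter, here is a sketch using a plain type-token ratio (unique words divided by total words) as a stand-in for lexical diversity. I am not claiming this is Jockers' exact metric, and the titles are made up.

```python
def type_token_ratio(titles):
    """Unique words divided by total words across all titles in a year."""
    tokens = [word for title in titles for word in title.lower().split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical titles: a sparse year and a busier year.
few_titles = ["The Exile", "A Western Home"]
more_titles = few_titles + ["The Exile Returns", "A Home in the West"]

print(f"2 titles: {type_token_ratio(few_titles):.2f}")
print(f"4 titles: {type_token_ratio(more_titles):.2f}")
```

Simply adding more (and longer) titles pulls the ratio down, because common words like "the" and "a" repeat. A year can therefore look less lexically diverse just because more was published in it, which is why a side-by-side graph of title counts, title lengths, and diversity would help.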

One of the particularly fantastic qualities about this sort of research is that readers can follow along at home, exploring on their own if they get some idea from what was brought up in the text. For example, Jockers shows that the word ‘century’ in British novel titles is popular leading up to and shortly after the turn of the nineteenth century. Oddly, in the larger corpus of literature (and it seems English language books in general), we can use Bookworm to see that, rather than losing steam around 1830, use of ‘century’ in most novel titles actually increases until about 1860, before dipping briefly. Moving past titles (and fiction in general) to full text search, Google Ngrams shows us a small dip around 1810 followed by continued growth of the word ‘century’ in the full text of published books. These different patterns are interesting particularly because they suggest there was something unique about the British novelists’ use of the word ‘century’ that is worth explaining. Contrast this with Jockers’ chart of the word ‘castle’ in British book titles, whose trends actually correspond quite well to the Bookworm trend until the end of the chart, around 1830. [edit: Ben Schmidt points out in the comments that Bookworm searches full text, not just metadata as I assumed, so this comparison is much less credible.]

Use of the word ‘castle’ in the metadata of books provided by Compare with figure 5.14. via bookworm.

Jockers closes the chapter suggesting that factors including gender, geography, and time help determine what authors write about. That this idea is trivial makes it no less powerful within the context of this book: the chapter is framed by the hypothesis that certain factors influence Irish American literature, and then uses quantitative, empirical evidence to support those claims. It was oddly satisfying reading such a straight-forward approach in the humanities. It’s possible, I suppose, to quibble over whether geography determines what’s written about or whether the sort of person who would write about certain things is also the sort of person more likely to go west, but there can be little doubt over the causal direction of the influence of gender. The idea also fits well with the current complex systems approach to understanding the world, which mathematically suggests that environmental and situational constraints (like gender and location) will steer the unfolding of events in one direction or another. It is not a reductionist environmental determinism so much as a set of probabilities, where certain environments or situations make certain outcomes more likely.

Stay tuned for Part the Third!


  1. If this were a more serious study, I’d have multiplied by a more credible pseudo-random value keeping the dataset a bit closer to the source, but this example works fine for explanatory value

Liveblogged Review of Macroanalysis by Matthew L. Jockers, Part 1

I just got Matthew L. Jockers’ Macroanalysis in the mail, and I’m excited enough about it to liveblog my review. Here’s my review of part I (Foundation), all chapters. Read Part 2, Part 3, …

Macroanalysis: Digital Methods & Literary History is a book whose time has come. “Individual creativity,” Matthew L. Jockers writes, “is highly constrained, even determined, by factors outside of what we consider to be a writer’s conscious control.” Although Jockers’ book is a work of impressive creativity, it also fits squarely within a larger set of trends. The scents of ‘Digital Humanities’ (DH) and ‘Big Data’ are in the air, the funding-rich smells attracting predators from all corners, and Jockers’ book floats somewhere in the center of it all. As with many DH projects, Macroanalysis attempts the double goal of explaining a new method and exemplifying the type of insights that can be achieved via this method. Unlike many projects, Jockers succeeds masterfully at both. Macroanalysis introduces its readers to large scale quantitative methods for studying literary history, and through those methods explores the nature of creativity and influence in general and the place of Irish literature within its larger context in particular.

I’ve apparently gained a bit of a reputation for being overly critical, and it’s worth pointing out at the beginning of this review that this trend will continue for Macroanalysis. That said, I am most critical of the things I love the most, and readers who focus on any nits I might pick without reading the book themselves should keep in mind that the overall work is staggering in its quality, and if it does fall short in some small areas, it is offset by the many areas it pushes impressively forward.

Macroanalysis arrives on bookshelves eight years after Franco Moretti’s Graphs, Maps, and Trees (2005), and thirteen years after Moretti’s “Conjectures on World Literature” went to press in early 2000, where he coined the phrase “distant reading.” Moretti’s distant reading is a way of seeing literature en masse, of looking at text at the widest angle and reporting what structures and forms only become visible at this scale. Moretti’s early work paved the way, but as might be expected with a monograph published the same year as the initial release of Google Books, lack of available data made it stronger in theory than in computational power.

From Moretti’s Graphs, Maps, and Trees

In 2010, Moretti and Jockers, the author of Macroanalysis, co-founded the Stanford Lit Lab for the quantitative and digital research of literature. The two have collaborated extensively, and Jockers acknowledges Moretti’s influence on his monograph. That said, in his book, Jockers distances himself slightly from Moretti’s notion of distant reading, and it is not the first time he has done so. His choice of “analysis” over “reading” is an attempt to show that what his algorithms are doing at this large scale is very different from our normal interpretive process of reading; it is simply gathering and aggregating data, the output of which can eventually be read and interpreted instead of or in addition to the texts themselves. The term macroanalysis was inspired by the difference between macro- and microeconomics, and Jockers does a good job justifying the comparison. Given that Jockers came up with the comparison in 2005, one does wonder if he would have decided on different terminology after our recent financial meltdown and the ensuing large-scale distrust of macroeconomic methods. The quantitative study of history, cliometrics, also had its origins in economics and suffered its own fall from grace decades ago; quantitative history still hasn’t recovered.

Part I: Foundation

I don’t know whether the allusion was intended, but lovers of science fiction and quantitative cultural studies will enjoy the title of Part I: “Foundation.” It shares a name with a series of books by Isaac Asimov, centering around the ability to combine statistics and human-centric research to understand and predict people’s behaviors. Punny titles aside, the section provides the structural base of the monograph.

The story of Foundation in a nutshell. Via c0ders.

Much of the introductory chapters are provocative statements about the newness of the study at hand, and they are not unwarranted. Still, I can imagine that the regular detractors of technological optimism might argue their usual arguments in response to Jockers’ pronouncements of a ‘revolution.’ The second chapter, on Evidence, raises some particularly important (and timely) points that are sure to raise some hackles. “Close reading is not only impractical as a means of evidence gathering in the digital library, but big data render it totally inappropriate as a method of studying literary history.” Jockers hammers home this point again and again, that now that anecdotal evidence based on ‘representative’ texts is no longer the best means of understanding literature, there’s no reason it should still be considered the gold standard of evidentiary support.

Not coming from a background of literary history or criticism, I do wonder a bit about these notions of representativeness (a point also often brought up by Ted Underwood, Ben Schmidt, and Jockers himself). This is probably something lit-researchers worked out in the 70s, but it strikes me that the questions being asked of a few ‘exemplary, representative texts’ are very different than the ones that ought to be asked of whole corpora of texts. Further, ‘representative’ of what? As this book appears to be aimed not only at traditional literary scholars, it would have been beneficial for Jockers to untangle these myriad difficulties.

One point worth noting is that, although Jockers calls his book Macroanalysis, his approach calls for a mixed method, the combination of the macro/micro, distant/close. The book is very careful and precise in its claims that macroanalysis augments and opens new questions, rather than replaces. It is a combination of both approaches, one informing the other, that leads to new insights. “Today’s student of literature must be adept at reading and gathering evidence from individual texts and equally adept at accessing and mining digital-text repositories.” The balance struck here is impressive: to ignore macroanalysis as a superior source of evidence for many types of large questions would be criminal, but its adoption alone does not make for good research (further, either without the other would be poorly done). For example, macroanalysis can augment close reading approaches by contextualizing a text within its broad historical and cultural moment, showing a researcher precisely where their object of research fits in the larger picture.

Historians would do well to heed this advice, though they are not the target audience. Indeed, historians play a perplexing role in Jockers’ narrative; not because his description is untrue, but because it ought not be true. In describing the digital humanities, Jockers calls it an “ambiguous and amorphous amalgamation of literary formalists, new media theorists, tool builders, coders, and linguists.” What place historians? Jockers places their role earlier, tracing the wide-angle view to the Annales historians and their focus on longue durée history. If historians’ influence ends there, we are surely in a sad state; that light, along with those of cliometrics and quantitative history, shone brightest in the 1970s before a rapid decline. Unsworth recently attributed the decline to the fallout following Time on the Cross (Fogel & Engerman, 1974), putting quantitative methods in history “out of business for decades.” The ghost of cliometrics still haunts historians to such an extent that the best research in that area, to this day, comes more from information scientists and applied mathematicians than from historians. Digital humanities may yet exorcise that ghost, but it has not happened yet, as evidenced in part by the glaring void in Jockers’ introductory remarks.

It is with this framing in mind that Jockers embarks on his largely computational and empirical study of influence and landscape in British and American literature.


In Defense of Collaboration

Being a very round-about review of the new work of fiction by Robin Sloan, Mr. Penumbra’s 24-Hour Bookstore.

Ship’s Logs and Collaborative DH

Ben Schmidt has stolen the limelight of the recent digital humanities blogosphere, writing a phenomenal series of not one, not two, not three, not four, not five, not six, but seven posts about ship logs and digital history. They’re a whale of a read, and whale worth it too (okay, okay, I’m sorry, I had to), but the point for the purpose of this post is his conclusion:

The central conclusion is this: To do humanistic readings of digital data, we cannot rely on either traditional humanistic competency or technical expertise from the sciences. This presents a challenge for the execution of research projects on digital sources: research-center driven models for digital humanistic resource, which are not uncommon, presume that traditional humanists can bring their interpretive skills to bear on sources presented by others.

– Ben Schmidt

He goes on to add “A historian whose access is mediated by an archivist tends to know how best to interpret her sources; one plugging at databases through dimly-understood methods has lost his claim to expertise.”  Ben makes many great points, and he himself, with this series of posts, exemplifies the power of humanistic competency and technical expertise combined in one wrinkled protein sponge. It’s a powerful mix, and one just beginning to open a whole new world of inquiry.

Yes, I know this is not how brains work. It’s still explanatory. via.

This conclusion inspired a twitter discussion where Ben and Ted Underwood questioned whether there was a limit to the division-of-labor/collaboration model in the digital humanities.  Which of course I disagreed with. Ben suggested that humanists “prize source familiarity more. You can’t teach Hitler studies without speaking German.” The humanist needs to actually speak German; they can’t just sit there with a team of translators and expect to do good humanistic work.

This opens up an interesting question: how do we classify all this past work involving collaboration between humanists and computer scientists, quals and quants, epistêmê and technê?  Is it not actually digital humanities? Will it eventually be judged bad digital humanities, that noisy pre-paradigmatic stuff that came before the grand unification of training and pervasive dual-competencies? My guess is that, if there are limits to collaboration, they are limits which can be overcome with careful coordination and literacy.

I’m not suggesting collaboration is king, nor that it will always produce faster or better results. We can’t throw nine women and nine men in a room and hope to produce a baby in a month’s time, with the extra help. However, I imagine that there are very few, if any, situations where some conclusion can’t be reached by two people with complementary competencies that can be produced by one person with both. Scholarship works on trust. Academics are producing knowledge every day that relies on their trusting the competencies of the secondary sources they cite, so that they do not need methodological or content expertise in the entire hypothetical lattice extending from their conclusions down to the most basic elements of their arguments.

And I predict that as computationally-driven humanities matures and utilizes increasingly-complex datasets and algorithms, our reliance on these networks of trust (and our need to methodologically formalize them) will only grow. This shift occurred many years ago in the natural sciences, as scientists learned to rely on physical tools and mathematical systems that they did not fully understand, as they began working in ever-growing teams where no one person could reconstruct the whole. Our historical narratives also began to shift, moving away from the idea that the most important ideas in history sprang forth fully developed from the foreheads of “Great Men,” as we realized that an entire infrastructure was required to support them.

How we used to think science worked. via.

What we need in the digital humanities is not combined expertise (although that would probably make things go faster, at the outset), but multiple literacies and an infrastructure to support collaboration; a system in place we can trust to validate methodologies and software and content and concepts. By multiple literacies, I mean the ability for scholars to speak the language of the experts they collaborate with. Computer scientists who can speak literary studies, humanists who can speak math, dedicated translators who can bridge whatever gaps might exist, and enough trust between all the collaborators that each doesn’t need to reinvent the wheel for themselves. Ben rightly points out that humanists value source expertise, that you can’t teach Hitler without speaking German; true, but the subject, scope, and methodologies of traditional humanists have constrained them from needing to directly rely on collaborators to do their research. This will not last.

The Large Hadron Collider is arguably the most complex experiment the world has ever seen. Not one person understands all, most, or even a large chunk of it. Physics and chemistry could have stuck with experiments and theories that could reside completely and comfortably in one mind, for there was certainly a time when this was the case, but in order to grow (to scale), a translational trust infrastructure needed to be put in place. If you take it for granted that humanities research (that is, research involving humans and their interactions with each other and the world, taking into account the situated nature of the researcher) can scale, then in order for it to do so, we as individuals must embrace a reliance on things we do not completely understand. The key will be figuring out how to balance blind trust with educated choice, and that key lies in literacies, translations, and trust-granting systems in the academy or social structure, as well as solidified standard practices. These exist in other social systems and scholarly worlds (like the natural sciences), and I think they can exist for us as well, and to some extent already do.

Timely Code Cracking

Coincidentally enough, the same day Ben tweeted about needing to know German to study Hitler in the humanities, Wired posted an article reviewing some recent(-ish) research involving a collaboration between a linguist, a computer scientist, and a historian to solve a 250-year-old cipher. The team decoded a German text describing an 18th century secret society, and it all started when one linguist (Christiane Schaefer) was given photocopies of this manuscript about 15 years ago. She toyed with the encoded text for some time, but never was able to make anything substantive of it.

After hearing a talk by machine translation expert and computer scientist Kevin Knight, who treats translations as ciphers, Schaefer was inspired to bring the code to Knight. At the time, neither knew what language the original was written in, nor really anything else about it. In short order, Knight utilized algorithmic analysis and some educated guesswork to recognize textual patterns suggesting the text to be German. “Knight didn’t speak a word of German, but he didn’t need to. As long as he could learn some basic rules about the language—which letters appeared in what frequency—the machine would do the rest.”

Copiale cipher. via.

Within weeks, Knight’s analysis combined with a series of exchanges between him and Schaefer and a colleague of hers led to the deciphering of the text, revealing its original purpose. “Schaefer stared at the screen. She had spent a dozen years with the cipher. Knight had broken the whole thing open in just a few weeks.” They soon enlisted the help of a historian of secret societies to help further understand and contextualize the results they’d discovered, connecting the text to a group called the Oculists and connecting them with the Freemasons.

If this isn’t a daring example of digital humanities at its finest, I don’t know what is. Sure, if one researcher had the competencies of all four, the text wouldn’t have sat dormant for a dozen years, and likely a few assumptions still exist in the dataset that might be wrong or improved upon. But this is certainly an example of a fruitful collaboration. Ben’s point still stands – a humanist bungling her way through a database without a firm grasp of the process of data creation or algorithmic manipulation has lost her claim to expertise – but there are ways around these issues; indeed, there must be, if we want to start asking more complex questions of more complex data.

Mr. Penumbra’s 24-Hour Bookstore

You might have forgotten, but this post is actually a review of a new piece of fiction by Robin Sloan. The book, Mr. Penumbra’s 24-Hour Bookstore, is a love letter. That’s not to say the book includes love (which I suppose it does, to some degree), but that the thing itself is a love letter, directed at the digital humanities. Possibly without the author’s intent.

This is a book about collaboration. It’s about data visualization, and secret societies, and the history of the book. It’s about copyright law and typefaces and book scanning. It’s about the strain between old and new ways of knowing and learning. In short, this book is about the digital humanities. Why is this book review connected with a defense of collaboration in the digital humanities? I’ll attempt to explain the connection without spoiling too much of the book, which everyone interested enough to read this far should absolutely read.

The book begins just before the main character, an out-of-work graphic designer named Clay, gets hired at a mysterious and cavernous used bookstore run by the equally mysterious Mr. Penumbra. Strange things happen there. Crazy people with no business being up during Clay’s night shift run into the store, intent on retrieving one particular book, leaving with it only to return some time later seeking another one. The books are illegible. The author doesn’t say as much, but the reader suspects some sort of code is involved.

Intent on discovering what’s going on, Clay enlists the help of a Google employee, a programming wiz, to visualize the goings on in the bookstore. Kat, the Googler, is “the kind of girl you can impress with a prototype,” and the chemistry between them as they try to solve the puzzle is fantastic in the nerdiest of ways. Without getting into too many details, they and a group of friends wind up solving a puzzle using data analysis in mere weeks that most people take years to discover in their own analog ways. Some of those people who did spend years trying to solve the aforementioned puzzle are quite excited by this new technique; some, predictably, are not. For their part, the rag-tag group of friends who digitally solved it don’t quite understand what it is they’d solved, not in the way the others have. If this sounds familiar, you’ve probably heard of culturomics.

Mr. Penumbra’s 24-Hour Bookstore. via.

A group of interdisciplinary people, working with Google, who figure out in weeks what should have taken years (and generally does). A few of the old school researchers taking their side, going along with them against the herd, an establishment that finds their work Wrong in so many ways. Essentially, if you read this book, you’ll have read a metaphorical, fictional argument that aligns quite closely with what I’ve argued in the blog post above.

So go out and buy the book. The physical book, mind you, not the digital version, and make sure to purchase the hardcover. It was clearly published with great care and forethought; the materiality of the book, its physical attributes and features, were designed cleverly to augment the book itself in ways that are not revealed until you have finished it. While the historical details in the novel are fictional, the historical among you will recognize many connections to actual people and events, and those digitally well-versed will find similarly striking connections. Also, I want you to buy the book so I have other people to talk about it with, because I think the author was wrong about his main premise. We can start a book-club. I’d like to thank Paige Morgan for letting me know Sloan had turned his wonderful short story into a novel. And re-read this post after you’ve finished reading the book – it’ll make a lot more sense.


Each of these three sections was aimed at one point: collaboration in the digital humanities is possible and, for certain projects as we go forward, will become essential. That last section won’t make much sense in support of this argument until you actually read the novel, so go out and do that. It’s okay, I’ll wait.

To Ben and Ted’s credit, they weren’t saying collaboration was futile. They were arguing for increasingly well-rounded competencies, which I think we can all get behind. But I also think we need to start establishing some standard practices and to create a medium wherein we can develop methodologies that can be peer-reviewed and approved, so that individual scholars can have an easier time doing serious and theoretically compelling computational work without having to relearn the entire infrastructure supporting it. Supporting more complex ways of knowing in the field of humanities will require us as individuals becoming more comfortable with not knowing everything.


Analyzing submissions to Digital Humanities 2013

Digital Humanities 2013 is on its way; submissions are closed, peers will be reviewing them shortly, and (most importantly for this post) the people behind the conference are experimenting with a new method of matching submissions to reviewers. It’s a bidding process; reviewers take a look at the many submissions and state their reviewing preferences or, when necessary, conflicts of interest. It’s unclear the extent to which these preferences will be accommodated, as this is an experiment on their part. Bethany Nowviskie describes it here. As a potential reviewer, I just went through the process of listing my preferences, and managed to do some data scraping while I was there. How could I not? All 348 submission titles were available to me, as well as their authors, topic selections, and keywords, and given that my submission for this year is all about quantitatively analyzing DH, it was an opportunity I could not pass up. Given that these data are sensitive, and those who submitted did so under the assumption that rejected submissions would remain private, I’m opting not to release the data or any non-aggregated information. I’m also doing my best not to actually read the data in the interest of the privacy of my peers; I suppose you’ll all just have to trust me on that one, though.

So what are people submitting? According to the topics authors assigned to their 348 submissions, 65 submitted articles related to “literary studies,” trailed closely by 64 submissions which pertained to “data mining/ text mining.” Work on archives and visualizations is also up near the top, and only about half as many authors submitted historical studies (37) as those who submitted literary ones (65). This confirms my long suspicion that our current wave of DH (that is, what’s trending and exciting) focuses quite a bit more on literature than history. This makes me sad. You can see the breakdown in Figure 1 below, and further analysis can be found after.

Figure 1: Number of documents with each topic authors assigned to submissions for DH2013.

The majority of authors attached fewer than five topics to their submissions; a small handful included over 15.  Figure 2 shows the number of topics assigned to each document.

Figure 2: The number of topics attached to each document, in order of rank.

I was curious how strongly each topic coupled with other topics, and how topics tended to cluster together in general, so I extracted a topic co-occurrence network. That is, whenever two topics appear on the same document, they are connected by an edge (see Networks Demystified Pt. 1 for a brief introduction to this sort of network); the more times two topics co-occur, the stronger the weight of the edge between them.
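For readers who want to see the mechanics, here is a minimal sketch of counting topic co-occurrences in base R. The four "submissions" and their topics are made up for illustration (the real DH2013 data are not released), and the variable names are my own; it is one simple way to do the counting, not necessarily how the network in this post was built.

```r
# Toy data: each element is the set of topics on one hypothetical submission.
docs <- list(
  c("literary studies", "text analysis", "data mining"),
  c("text analysis", "data mining"),
  c("visualization", "data mining"),
  c("literary studies", "text analysis")
)

topics <- sort(unique(unlist(docs)))

# Document-by-topic incidence matrix: 1 if the topic is attached to the document.
incidence <- t(sapply(docs, function(d) as.integer(topics %in% d)))
colnames(incidence) <- topics

# Topic-by-topic co-occurrence counts. The diagonal would hold each topic's
# own frequency, so zero it out to keep only edges between distinct topics.
cooc <- crossprod(incidence)
diag(cooc) <- 0

cooc
```

Each non-zero cell of `cooc` is an edge in the network, and the count in the cell is that edge's weight.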

Topping off the list at 34 co-occurrences were “Data Mining/ Text Mining” and “Text Analysis,” not terrifically surprising as the latter generally requires the former, followed by “Data Mining/ Text Mining” and “Content Analysis” at 23 co-occurrences, “Literary Studies” and “Text Analysis” at 22 co-occurrences, “Content Analysis” and “Text Analysis” at 20 co-occurrences, and “Data Mining/ Text Mining” and “Literary Studies” at 19 co-occurrences. Basically what I’m saying here is that Literary Studies, Mining, and Analysis seem to go hand-in-hand.

Knowing my readers, about half of you are already angry with me counting co-occurrences, and rightly so. That measurement is heavily biased by the sheer total number of times a topic is used; if “literary studies” is attached to 65 submissions, it’s much more likely that it will co-occur with any particular topic than will topics (like “teaching and pedagogy”) which simply appear less frequently. The highest frequency topics will co-occur with one another simply by an accident of magnitude.

To account for this, I measured the neighborhood overlap of each node on the topic network. This involves first finding the number of other topics a pair of two topics shares. For example, “teaching and pedagogy” and “digital humanities – pedagogy and curriculum” each co-occur with several other of the same topics, including “programming,” “interdisciplinary collaboration,” and “project design, organization, management.” I summed up the number of topical co-occurrences between each pair of topics, and then divided that total by the number of co-occurrences each node in the pair had individually. In short, I looked at which pairs of topics tended to share similar other topics, making sure to take into account that some topics which are used very frequently might need some normalization. There are better normalization algorithms out there, but I opt to use this one for its simplicity for pedagogical reasons. The method does a great job leveling the playing field between pairs of infrequently-used topics compared to pairs of frequently-used topics, but doesn’t fare so well when looking at a pair where one topic is popular and the other is not. The algorithm is well-described in Figure 3, where the darker the edge, the higher the neighborhood overlap.

Figure 3: The neighborhood overlap between two nodes is how many neighbors (or connections) that pair of nodes shares. As such, A and B share very few connections, so their overlap is low, whereas D and E have quite a high overlap. Via Jaroslav Kuchar.
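As a rough illustration of the idea in Figure 3, the sketch below computes the common textbook form of neighborhood overlap: the neighbors two nodes share, divided by all the neighbors either one has (not counting the pair itself). The edge list is hypothetical, the function names are mine, and this unweighted version is an assumption; the exact normalization used for the topic network may differ.

```r
# Hypothetical undirected edge list, loosely echoing the A-B and D-E pairs in Figure 3.
edges <- rbind(
  c("A", "B"), c("A", "C"), c("B", "F"),
  c("D", "E"), c("D", "G"), c("E", "G"), c("D", "H"), c("E", "H")
)

# All nodes directly connected to `node`.
neighbors <- function(node) {
  unique(c(edges[edges[, 1] == node, 2], edges[edges[, 2] == node, 1]))
}

# Shared neighbors divided by the combined neighbors of the pair,
# leaving the two nodes themselves out of both counts.
neighborhood_overlap <- function(a, b) {
  na <- setdiff(neighbors(a), b)
  nb <- setdiff(neighbors(b), a)
  total <- length(union(na, nb))
  if (total == 0) return(0)
  length(intersect(na, nb)) / total
}

neighborhood_overlap("A", "B")  # A and B share no neighbors
neighborhood_overlap("D", "E")  # D and E share all of their neighbors
```

In this toy graph the A-B pair scores 0 and the D-E pair scores 1, the two extremes of the measure.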

Neighborhood overlap paints a slightly different picture of the network. The pair of topics with the largest overlap was “Internet / World Wide Web” and “Visualization,” with 90% of their neighbors overlapping. Unsurprisingly, the next-strongest pair was “Teaching and Pedagogy” and “Digital Humanities – Pedagogy and Curriculum.” The data might be used to suggest multiple topics that might be merged into one, and this pair seems to be a pretty good candidate. “Visualization” also closely overlaps “Data Mining/ Text Mining”, which itself (as we saw before) overlaps with “Cultural Studies” and “Literary Studies.” What we see from this close clustering both in overlap and in connection strength is the traces of a fairly coherent subfield out of DH, that of quantitative literary studies. We see a similarly tight-knit cluster between topics concerning archives, databases, analysis, the web, visualizations, and interface design, which suggests another genre in the DH community: the (relatively) recent boom of user interfaces as workbenches for humanists exploring their archives. Figure 4 represents the pairs of topics which overlap to the highest degree; topics without high degrees of pair correspondence don’t appear on the network graph.

Figure 4: Network of topical neighborhood overlap. Edges between topics are weighted according to how structurally similar the two topics are. Topics that are structurally isolated are not represented in this network visualization.

The topics authors chose for each submission were from a controlled vocabulary. Authors also had the opportunity to attach their own keywords to submissions, which unsurprisingly yielded a much more diverse (and often redundant) network of co-occurrences. The resulting network revealed a few surprises: for example, “topic modeling” appears to be much more closely coupled with “visualization” than with “text analysis” or “text mining.” Of course some pairs are not terribly surprising, as with the close connection between “Interdisciplinary” and “Collaboration.” The graph also shows that the organizers have done a pretty good job putting the curated topic list together, as a significant chunk of the high thresholding keywords are also available in the topic list, with a few notable exceptions. “Scholarly Communication,” for example, is a frequently used keyword but not available as a topic – perhaps next year, this sort of analysis can be used to help augment the curated topic list. The keyword network appears in Figure 5. I’ve opted not to include a truly high resolution image to dissuade readers from trying to infer individual documents from the keyword associations.

Figure 5: Which keywords are used together on documents submitted to DH2013? Nodes are colored by cluster, and edges are weighted by number of co-occurrences.

There’s quite a bit of rich data here to be explored, and anyone who does have access to the bidding can easily see that the entire point of my group’s submission is exploring the landscape of DH, so there’s definitely more to come on the subject from this blog. I especially look forward to seeing what decisions wind up being made in the peer review process, and whether or how that skews the scholarly landscape at the conference.

On a more reflexive note, looking at the data makes it pretty clear that DH isn’t as fractured as some occasionally suggest (New Media vs. Archives vs. Analysis, etc.). Every document is related to a few others, and they are all of them together connected in a rich family, a network, of Digital Humanities. There are no islands or isolates. While there might be no “The” Digital Humanities, no unifying factor connecting all research, there are Wittgensteinian family resemblances  connecting all of these submissions together, in a cohesive enough whole to suggest that yes, we can reasonably continue to call our confederation a single community. Certainly, there are many sub-communities, but there still exists an internal cohesiveness that allows us to differentiate ourselves from, say, geology or philosophy of mind, which themselves have their own internal cohesiveness.


Making pretty things with R and ggplot2

This isn’t going to be a long tutorial. I’ve just had three people asking how I made the pretty graphs on my last post about counting citations, and I’m almost ashamed to admit how easy it was. Somebody with no experience coding can (I hope) follow these steps and make themselves a pretty picture with the data provided, and understand how it was created.

library(ggplot2)
sci = read.csv("scicites.csv")
qplot(Cited, data = sci, geom="density", fill=YearRange, log="x", xlab="Number of Citations", ylab="Density", main="Density of citations per 8 years", alpha=I(.5))

That’s the whole program. Oh, also this table, saved as a csv:

[table id=3 /]


And that was everything I used to produce this graph:

Density graph made with R.

Quick Walkthrough


The first thing you need to make this yourself is the programming language R (an awesome language for statistical analysis) installed on your machine, which you can get here. Download it and install it; it’s okay, I’ll wait. Now, R by itself is not fun to code in, so my favorite program to use when writing R code is called RStudio, so go install that too. Now you’re going to have to install the visualization package, which is called ggplot2. You do this from within RStudio itself, so open up the newly installed program. If you’re running Windows Vista or 7, don’t open it up the usual way; right click on the icon and click ‘Run as administrator’ – you need to do this so it’ll actually let you install the package. Once you’ve opened up RStudio, at the bottom of the program there’s a section of your screen labeled ‘Console’, with a blinking text cursor. In the console, type install.packages("ggplot2") and hit enter. Congratulations, ggplot2 is now installed.

Now download this R file (‘Save as’) that I showed you before and open it in RStudio (‘File -> Open File’). It should look a lot like that code at the beginning of the post. Now go ahead and download the csv shown above as well, and be sure to put it in the same directory¹ where you put the R code. Once you’ve done that, in RStudio click ‘Tools -> Set Working Directory -> To Source File Location’, which will help R figure out where the csv is that you just downloaded.

Before I go on explaining what each line of the code does, run it and see what happens! Near the top of your code on the right side, there should be a list of buttons, on the left one that says ‘Run’ and on the right one that says ‘Source’. Click the button that says ‘Source‘. Voila, a pretty picture!


Now to go through the code itself, we’ll start with line 1. library(ggplot2) just means that we’re going to be using ggplot2 to make the visualization, and lets R know to look for it when it’s about to put out the graphics.

Line 2 is fairly short as well, sci=read.csv("scicites.csv"), and it creates a new variable called sci which contains the entire csv file you downloaded earlier. read.csv("scicites.csv") is a command that tells R to read the csv file in the parentheses, and setting the variable sci as equal to that read file just saves it.

Line 3 is where the magic happens.

qplot(Cited, data = sci, geom="density", fill=YearRange, log="x", xlab="Number of Citations", ylab="Density", main="Density of citations per 8 years", alpha=I(.5))

The entire line is surrounded by the parenthetical command qplot() which is just our way of telling R “hey, plot this bit in here!” The first thing inside the parentheses is Cited, which you might recall was one of the columns in the CSV file. This is telling qplot() what column of data it’s going to be plotting, in this case, the number of citations that papers have received. Then, we tell qplot() where that data is coming from with the command data = sci, which sets what table the data column is coming from. After that geom="density" appears. geom is short for ‘Geometric Object’ and it sets what the graph will look like. In this case we’re making a density graph, so we give it "density", but we could just as easily have used something like "histogram" or "line".

The next bit is fill=YearRange, which you might recall was another column in the csv. This is a way of breaking the data we’re using into categories; in this case, the data are categorized by which year range they fall into. fill is one way of categorizing the data by filling in the density blobs with automatically assigned colors; another way would be to replace fill with color. Try it and see what happens. After the next comma is log="x", which puts the x-axis on a log scale, making the graph a bit easier to read. Take a look at what the graph looks like if you delete that part of it.

Now we have a big chunk of code devoted to labels. xlab="Number of Citations", ylab="Density", main="Density of citations per 8 years". As can probably be surmised, xlab corresponds to the label on the x-axis, ylab corresponds to the label on the y-axis, and main corresponds to the title of the graph. The very last part, before the closing parenthesis, is alpha=I(.5). alpha sets the transparency of the basic graph elements, which is why the colored density blobs all look a little bit transparent. I set their transparency to .5 so they’d each still be visible behind the others. You can set the value between 0 and 1, with the former being completely transparent and the latter being completely opaque.
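A note for anyone following along with a recent ggplot2: qplot() was deprecated in version 3.4.0, so it may print a warning. Below is a rough sketch of what I believe is the equivalent plot written with ggplot(). The small data frame is a made-up stand-in for scicites.csv (only the column names Cited and YearRange come from this post; the values and the "slice A"/"slice B" labels are invented), so treat it as an illustration rather than a reproduction of the graph above.

```r
library(ggplot2)

# Made-up stand-in for scicites.csv; only the column names come from the post.
sci <- data.frame(
  Cited = c(1, 2, 4, 8, 16, 3, 6, 12, 24, 48),
  YearRange = rep(c("slice A", "slice B"), each = 5)
)

p <- ggplot(sci, aes(x = Cited, fill = YearRange)) +
  geom_density(alpha = 0.5) +   # plays the role of geom="density" and alpha=I(.5)
  scale_x_log10() +             # plays the role of log="x"
  labs(x = "Number of Citations", y = "Density",
       title = "Density of citations per 8 years")

print(p)
```

To use the real data, swap the data.frame() call for sci <- read.csv("scicites.csv") as in line 2 of the original script.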

There you have it, easy-peasy. Play around with the csv, try adding your own data, and take a look at this chapter from “ggplot2: Elegant Graphics for Data Analysis” to see what other options are available to you.


  1. thanks Andy for the correction!

How many citations does a paper have to get before it’s significantly above baseline impact for the field?

[Note: This blog post was originally hidden because it’s not aimed at my usual audience. I decided to open it up because, hey, I guess it’s okay for all you humanists and data scientists to know that one of the other hats I wear is that of an informetrician. Another reason I kept it hidden is because I’m pretty scared of how people use citation impact ratings to evaluate research for things like funding and tenure, often at the expense of other methods that ought be used when human livelihoods are at stake. So please don’t do that.]

It depends on the field, and field is defined pretty loosely. This post is in response to a twitter conversation between mrgunn, myself, and some others. mrgunn thinks citation data ought to be freely available, and I agree with him, although I believe data is difficult enough to gather and maintain that a service charge for access is fair, if a clever free alternative is lacking. I’d love to make a clever free alternative (CiteSeerX already is getting there), but the best data still comes from expensive sources like ISI’s Web of Science or Scopus.

At any rate, the question is an empirical one, and one that lots of scientometricians have answered in a number of ways. I’m going to perform my own SSA (Super-Stupid Analysis) here, and I won’t bother taking statistical regression models or Bayesian inferences into account, because you can get a pretty good sense of “impact” (if you take citations to be a good proxy for impact, which is debatable – I won’t even get into using citations as a proxy for quality) using some fairly simple statistics. For the mathy and interested, a forthcoming paper by Evans, Hopkins, and Kaube treats the subject more seriously in Universality of Performance Indicators based on Citation and Reference Counts.

I decided to use the field of Scientometrics, because it’s fairly self-contained (and I love being meta), and I drew my data from ISI’s Web of Science. I retrieved all articles published in the journal Scientometrics up until 2009, which is a nicely representative sample of the field, and then counted the number of citations to each article in a given year. Keep in mind that if you’re wondering how much your Scientometrics paper stood out above its peers in citations with this chart, you have to use ISI’s citation count to your paper, otherwise you’re comparing apples to something else that isn’t apples.

Figure 1. Histogram of citations to papers, with the height of each bar representing the number of papers cited x times. The colors break down the bars by year.
Figure 2. Same as Figure 1, but with the x axis on a log scale.

According to Figure 1 and Figure 2 (Fig. 2 is the same as Fig. 1 but with the x axis on a log scale to make the data a bit easier to read), it’s immediately clear that citations aren’t normally distributed. This tells us right away that some basic statistics simply won’t tell us much with regards to this data. For example, if we take the average number of citations per paper, by adding up each paper’s citation count and dividing it by the total number of papers, we get 7.8 citations per paper. However, because the data are so skewed to one side, over 70% of the papers in the set fall below that average (that is, 70% of papers are cited fewer than 7 times). In this case, a slightly better measurement would be the median, which is 4. That is, about half the papers have fewer than four citations. About a fifth of the papers have no citations at all.
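To see why the mean misleads on skewed counts, here is a tiny sketch in R. The ten citation counts are invented for illustration and are not drawn from the Scientometrics data.

```r
# Made-up, heavily skewed citation counts (not the real data).
cites <- c(0, 0, 1, 1, 2, 3, 4, 5, 9, 75)

mean_cites   <- mean(cites)               # pulled upward by the single 75
median_cites <- median(cites)             # the middle of the distribution
share_below  <- mean(cites < mean_cites)  # fraction of papers below the mean

c(mean = mean_cites, median = median_cites, share_below_mean = share_below)
```

In this toy set one highly cited paper drags the mean up to 10 while the median sits at 2.5, and nine of the ten papers fall below the mean.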

If we look at the colors of Figure 1, which breaks down each bar by year, we can see that the data aren’t really evenly distributed by years, either. Figure 3 breaks this down a bit better.

Figure 3. Number of papers published per year in the journal Scientometrics, colored by the number of citations each received.

In Figure 3, you can see the number of papers published in a given year, and the colors represent how many citations each paper got that year, with the red end of the spectrum showing papers cited very little, and the violet end of the spectrum showing highly cited papers. Immediately we see that the most recent papers don’t have many highly cited articles, so the first thing we should do is normalize by year. That is, an article published this year shouldn’t be held to the same standards as an article that’s had twenty years to slowly accrue citations.

To make these data a bit easier to deal with, I’ve sliced the set into 8-year chunks. There are smarter ways to do this, but like I said, we’re keeping the analysis simple for the sake of presentation. Figure 4 is the same as Figure 3, but separated out into the appropriate time slices.

Figure 4. Same as Figure 3, but separated into 8-year time slices.

Now, to get back to the original question, mrgunn asked how many citations a paper needs to be above the fold. Intuitively, we’d probably call a paper highly impactful if it’s in the blue or violet sections of its time slice (sorry for those of you who are colorblind, I just mean the small top-most area). There’s another way to look at these data that’ll make it a bit easier to eyeball how many more citations a paper’s received than its peers: a density graph. Figure 5 shows just that.

Figure 5. Each color blob represents a time slice, with the height at any given point representing the proportion of papers in that chunk of time which have x citations. The x axis is on a log scale.

Looking at Figure 5, it’s easy to see that a paper published before 2008 with fewer than half a dozen citations is clearly below the norm. If the paper were published after 2008, it could be above the norm even if it had only a small handful of citations. A hundred citations is clearly “highly impactful” regardless of the year the paper was published. To get a better sense of papers that are above the baseline, we can take a look at the actual numbers.

The table below (excuse the crappy formatting, I’ve never tried to embed a big table in WP before) shows the percent of papers which have x citations or fewer in a given time slice. That is, 24% of papers published before 1984 have no citations to them, 31% of papers published before 1984 have 0 or 1 citations to them, 40% of papers published before 1984 have 0, 1, or 2 citations to them, and so forth. That means if you published a paper in Scientometrics in 1999 and ISI’s Web of Science says you’ve received 15 citations, it means your paper has received more citations than 80% of the other papers published between 1992 and 2000.

[table id=2 /]
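For readers who want to build a similar table from their own data, here is a minimal sketch in Python. It is not the code behind the table above: the paper records are invented, `slice_index` and `percent_at_or_below` are hypothetical helper names, and the slice boundaries are an assumption (the post only says the slices are 8 years wide).

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical (publication year, citation count) records, invented for illustration.
PAPERS = [
    (1980, 0), (1981, 3), (1982, 12), (1983, 1),
    (1995, 0), (1996, 2), (1998, 15), (1999, 40),
]

# Assumed slice boundaries; each slice runs from one boundary up to the next.
BOUNDARIES = [1984, 1992, 2000, 2008]


def slice_index(year, boundaries=BOUNDARIES):
    """Return which time slice a year falls in (0 = before the first boundary)."""
    return bisect_right(boundaries, year)


def percent_at_or_below(papers, citations, boundaries=BOUNDARIES):
    """For each time slice, the percent of papers with `citations` or fewer."""
    by_slice = defaultdict(list)
    for year, cites in papers:
        by_slice[slice_index(year, boundaries)].append(cites)
    return {
        idx: 100.0 * sum(c <= citations for c in cites_list) / len(cites_list)
        for idx, cites_list in by_slice.items()
    }


if __name__ == "__main__":
    # One row per citation threshold, one column per time slice.
    for x in (0, 1, 2, 15):
        print(x, percent_at_or_below(PAPERS, x))
```

Each printed row plays the role of one row of the table: the share of papers in a slice that have received x citations or fewer. Comparing a paper only against its own slice is what keeps a recent article from being judged against one that has had decades to accrue citations.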


The conversation also brought up the point of whether this should be a clear binary at the ends of the spectrum (paper A is low impact because it received only a handful of citations, paper B is high impact because it received 150, but we can’t really tell anything in between), or whether we could get a more nuanced view of the spectrum. A combined qualitative/quantitative analysis would be required for a really good answer to that question, but looking at the numbers in the table above, we can see pretty quickly that while 1 citation is pretty different from 2 citations, 38 citations is pretty much the same as 39. That is, the “jitter” of precision probably increases exponentially the more citations you’ve received, such that with very few citations the “impact” precision is quite high, and that precision gets exponentially lower the more citations you’ve received.
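One rough way to put a number on that intuition is to compare citation counts on a log scale, where equal ratios rather than equal differences count as equal steps. The sketch below is only an illustration of that idea, not a measure used in the analysis above, and `log_gap` is a hypothetical helper name.

```python
import math


def log_gap(a, b):
    """Distance between two positive citation counts on a log2 scale."""
    return abs(math.log2(b) - math.log2(a))


print(log_gap(1, 2))    # going from 1 to 2 citations is a full doubling
print(log_gap(38, 39))  # going from 38 to 39 is a small fraction of a doubling
```

On this scale the step from 1 to 2 citations is many times larger than the step from 38 to 39, which is one way of saying that a single extra citation carries less and less information as the count grows. Note that the function is undefined for papers with zero citations, so it would need a different treatment for them.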

All this being said, I do agree with mrgunn that a free and easy to use resource for this sort of analysis would be good. However, because citations often don’t equate to quality, I’d be afraid this tool would just make it easier and more likely for people to make sweeping and inaccurate quality measurements for the purpose of individual evaluations.


On Simplicity

You can build complex arguments on a very simple foundation
Ted Underwood

Celestial Navigation

The world is full of very complex algorithms honed to solve even more complex problems. When you use your phone as a GPS, it’s not a simple matter of triangulating signals from towers or satellites. Because your GPS receiver has to know the precise time (in nanoseconds) at the satellites it’s receiving signals from, and because the satellites are moving at various speeds and orbiting at an altitude where the force of gravity is significantly different, calculating times gets quite complicated due to the effects of relativity. The algorithms that allow the GPS to work have to take relativity into account, often on the fly, and without those complex algorithms the GPS would not be nearly so precise.

Precision and complexity go hand-in-hand, and often the relationship between the two is non-linear. That is, a little more precision often requires a lot more complexity. Ever-higher levels of precision get exponentially more difficult to achieve. The traditional humanities is a great example of this; a Wikipedia article can sum up most of what one needs to know regarding, say, World War I, but it takes many scholars many lifetimes to learn everything. And the more that’s already been figured out, the more work we need to do to find the smallest next piece to understand.

This level of precision is often important, insightful, and necessary to make strides in a field. Whereas before an earth-centered view of the universe was good enough to aid in navigation and predict the zodiac and the seasons, a heliocentric model was required for more precise predictions of the movements of the planets and stars. However, these complex models are not always the best ones for a given situation, and sometimes simplicity and a bit of cleverness can go much further than whatever convoluted equation yields the most precise possible results.

Sticking with the example of astronomy, many marine navigation schools still teach the geocentric model; not because they don’t realize the earth moves, but because navigation is simply easier when you pretend the earth is fixed and everything moves around it. They don’t need to tell you exactly when the next eclipse will be, they just need to figure out where they are. Similarly, your phone can usually pinpoint you within a few blocks by triangulating itself between cellphone towers, without ever worrying about satellites or Einstein.

Geocentric celestial navigation chart from a class assignment.

Whether you need to spend the extra time figuring out relativistic physics or heliocentric astronomical models depends largely on your purpose. If you’re just trying to find your way from Miami to New York City, and for some absurd reason you can only rely on technology you’ve created yourself for navigation, the simpler solution is probably the best way to go.

Simplicity and Macroanalysis

If I’ve written over-long on navigation, it’s because I believe it to be a particularly useful metaphor for large-scale computational humanities. Franco Moretti calls it distant reading, Matthew Jockers calls it macroanalysis, and historians call it… well, I don’t think we’ve come up with a name for it yet. I’d like to think we large-scale computational historians share a lot in spirit with big history, though I rarely see that connection touched on. My advisor likes shifting the focus from what we’re looking at to what we’re looking through, calling tools that help us see the whole of something macroscopes, as opposed to the microscopes which help us reach ever-greater precision.

Whatever you choose to call it, the important point is the shifting focus from precision to contextualization. Rather than exploring a particular subject with ever-increasing detail and care, it’s important to sometimes take a step back and look at how everything fits together. It’s a tough job because ‘everything’ is really quite a lot of things, and it’s easy to get mired in the details. It’s easy to say “well, we shouldn’t look at the data this way because it’s an oversimplification, and doesn’t capture the nuance of the text,” but capturing the nuance of the text isn’t the point. I have to admit, I sent an email to that effect to Matthew Jockers regarding his recent DH2012 presentation, suggesting that time and similarity were a poor proxy for influence. But that’s not the point, and I was wrong in doing so, because the data still support his argument of authors clustering stylistically and thematically by gender, regardless of whether he calls the edges ‘influence’ or ‘similarity.’

I wrote a post a few months back on halting conditions, figuring out that point when adding more and more detailed data stops actually adding to the insight and instead just contributes to the complexity of the problem. I wrote

Herein lies the problem of humanities big data. We’re trying to measure the length of a coastline by sitting on the beach with a ruler, rather than flying over with a helicopter and a camera. And humanists know that, like the sandy coastline shifting with the tides, our data are constantly changing with each new context or interpretation. Cartographers are aware of this problem, too, but they’re still able to make fairly accurate maps.

And this is the crux of the matter. If we’re trying to contextualize our data, if we’re trying to open our arms to collect everything that’s available, we need to keep it simple. We need a map, a simple way to navigate the deluge that is human history and creativity. This map will not be hand-drawn with rulers and yard-sticks, it will be taken via satellite, where only the broadest of strokes are clear. Academia, and especially the humanities, fetishizes the particular at the expense of the general. General knowledge is overview knowledge, is elementary knowledge. Generalist is a dirty word lobbed at popular authors who wouldn’t know a primary source if it fell on their head from the top shelf, covering them in dust and the smell of old paper.

Generality is not a vice. Simplicity can, at times, be a virtue. Sometimes you just want to know where the hell you are.

For these maps, a reasonable approximation is often good enough for most situations. Simple triangulation is good enough to get you from Florida to New York, and simply counting the number of dissertations published at ProQuest in a given year for a particular discipline is good enough to approximate the size of one discipline compared to another. Both lack nuance and are sure to run you into some trouble at the small scale, but often that scale is not necessary for the task at hand.

Two situations clearly shout for reasonable approximations; general views and contextualization. In the image below Matthew Jockers showed that formal properties of novels tend to split around the genders of their authors; that is, men wrote differently and about different things than women.

Network graph of 19th century novels, with nodes (novels) colored according to the gender of their authors.

Of course this macroanalysis lacks a lot of nuance, and one can argue for years which set of measurements might yield the best proxy for novel similarity, but as a base approximation the split is so striking that there is little doubt it is indicative of something interesting actually going on. Jockers has successfully separated signal from noise. This is a great example of how a simple approximation is good enough to provide a general overview, a map offering one useful view of the literary landscape.

Beyond general overviews and contextualizations, simple models and quantifications can lead to surprisingly concrete and particular results. Take Strato, a clever observer who died around 2,300 years ago. There’s a lot going on after a rainstorm. The sun glistens off the moist grass, little insects crawl their way out of the ground, water pours from the corners of the roof. Each one of these events is itself incredibly complex and can be described in a multitude of ways; with water pouring from a roof, for example, you can describe the thickness of the stream, or the impression the splash makes on the ground below, or the various murky colors it might take depending on where it’s dripping from. By isolating one property of the pouring rainwater, the fact that it tends to separate into droplets as it gets closer to the ground, Strato figured out that the water moved faster the further it had fallen. That is, falling bodies accelerate. Exactly measuring that acceleration, and quite how it worked, would elude humankind for over a thousand years, but a very simple observation that tended to hold true in many situations was good enough to discover a profound property of physics.

A great example of using simple observations to look at specific historical developments is Ben Schmidt’s Poor man’s sentiment analysis. By looking at words that occur frequently after the word ‘free’ in millions of books, Ben is able to show the decreasing centrality of ‘freedom of movement’ after its initial importance in the 1840s, or the drop in the use of ‘free men’ after the fall of slavery. Interesting changes are also apparent in the language of markets and labor which both fit well with our understanding of the history of the concepts, and offer new pathways to study, especially around inflection points.

Ben Schmidt looks at the words that appear most frequently following the word ‘free’ in the google ngrams corpus.
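The core counting step in this kind of analysis is simple enough to sketch in a few lines of Python. This is not Schmidt’s code (his work draws on the Google ngrams counts rather than raw text); `words_following` is a hypothetical helper and the sample sentence is made up to stand in for a corpus.

```python
import re
from collections import Counter


def words_following(text, target):
    """Count the words that appear immediately after `target` in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target)


# A made-up snippet standing in for a corpus of books.
SAMPLE = "Free trade and free labor were debated by free men; free trade won."

print(words_following(SAMPLE, "free").most_common())
```

Running the same count separately on texts from each year or decade, and then comparing the resulting frequencies, is what would turn this into a view of change over time.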

Toward Simplicity

Complex algorithms and representations are alluring. Ben Fry says of network graphs

Even though a graph of a few hundred nodes quickly becomes unreadable, it is often satisfying for the creator because the resulting figure is elegant and complex and may be subjectively beautiful, and the notion that the creator’s data is “complex” fits just fine with the creator’s own interpretation of it.

And the same is often true of algorithms; the more complex the algorithm, the more we feel it somehow ‘fits’ the data, because we know our data are so complex to begin with. Oftentimes, however, the simplest solution is good enough (and often really good) for the broadest number of applications.

If there is any take-home message of this post, as a follow-up to my previous one on Halting Conditions, it’s that diminishing returns doesn’t just apply to the amount of data you’ve collected, it also applies to the analysis you plan on running them through. More data aren’t always better, and newer, more precise, more complex algorithms aren’t always better.

Spend your time coming up with clever, simpler solutions so you have more time to interpret your results soundly.


The Myth of Text Analytics and Unobtrusive Measurement

Text analytics are often used in the social sciences as a way of unobtrusively observing people and their interactions. Humanists tend to approach the supporting algorithms with skepticism, and with good reason. This post is about the difficulties of using words or counts as a proxy for some secondary or deeper meaning. Although I offer no solutions here, readers of the blog will know I am hopeful of the promise of these sorts of measurements if used appropriately, and right now, we’re still too close to the cutting edge to know exactly what that means. There are, however, copious examples of text analytics used well in the humanities (most recently, for example, Joanna Guldi’s publication on the history of walking).

The Confusion

Klout is a web service which ranks your social influence based on your internet activity. I don’t know how Klout’s algorithm works (and I doubt they’d be terribly forthcoming if I asked), but one of the products of that algorithm is a list of topics about which you are influential. For instance, Klout believes me to be quite influential with regards to Money (really? I don’t even have any of that.) and Journalism (uhmm.. no.), somewhat influential in Juggling (spot on.), Pizza (I guess I am from New York…), Scholarship (Sure!), and iPads (I’ve never touched an iPad.), and vaguely influential on the topic of Cars (nope) and Mining (do they mean text mining?).

By Ildar Sagdejev (Specious) (Own work) [GFDL or CC-BY-SA-3.0-2.5-2.0-1.0], via Wikimedia Commons
My pizza expertise is clear.
Thankfully careers don’t ride on this measurement (we have other metrics for that), but the danger is still fairly clear: the confusion of vocabulary and syntax for semantics and pragmatics. There are clear layers between the written word and its intended meaning, and those layers often depend on context and prior knowledge. Further, regardless of the intended meaning of the author, how her words are interpreted in the larger world can vary wildly. She may talk about money and pizza until she is blue in the face, but if the whole world disagrees with her, that is no measurement of expertise nor influence (even if angry pizza-lovers frequently shout at her about her pizza opinions).

We see very simple examples of this in sentiment analysis, a way to extract the attitude of the writer toward whatever it is he’s written. An old friend who recently dipped his fingers in sentiment analysis wrote this:

According to his algorithm, that sentence was a positive one. Unless I seriously misunderstand my social cues (which I suppose wouldn’t be too unlikely), I very much doubt the intended positivity of the author. However, most decent algorithms would pick up that this was a tweet from somebody who was positive about Sarah Jessica Parker.
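To see how a scorer that looks only at vocabulary can go wrong, consider the minimal sketch below. It is not the algorithm mentioned above, whose details are not given here; the word lists, the `naive_sentiment` function, and the example sentence are all invented for illustration.

```python
import re

# Tiny, made-up word lists; real sentiment lexicons are far larger.
POSITIVE = {"love", "great", "wonderful", "best"}
NEGATIVE = {"hate", "awful", "terrible", "worst"}


def naive_sentiment(sentence):
    """Score a sentence as (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


# A hypothetical sarcastic sentence: the vocabulary is upbeat, the meaning is not.
print(naive_sentiment("Oh great, another wonderful three-hour delay."))
```

The score comes out positive because the counter sees only the words, which is exactly the confusion of vocabulary for semantics and pragmatics described above.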

Unobtrusive Measurements

This particular approach to understanding humans belongs to the larger methodological class of unobtrusive measurements. Generally speaking, this topic is discussed in the context of the social sciences and is contrasted with more ‘obtrusive’ measurements along the lines of interviews or sticking people in labs. Historians generally don’t need to talk about unobtrusive measurements because, hey, the only way we could be obtrusive to our subjects would require exhuming bodies. It’s the idea that you can cleverly infer things about people from a distance, without them knowing that they are being studied.

Notice the disconnect between what I just said, and the word itself. ‘Unobtrusive’ against “without them knowing that they are being studied.” These are clearly not the same thing, and that distinction between definition and word is fairly important – and not merely in the context of this discussion. One classic example (Doob and Gross, 1968) asks how somebody’s social status determines whether someone might take aggressive action against them. They specifically measured a driver’s likelihood to honk his horn in frustration based on the perceived social status of the driver in front of them. Using a new luxury car and an old rusty station wagon, the researchers would stop at traffic lights that had turned green and would wait to see whether the car behind them honked. In the end, significantly more people honked at the low status car. More succinctly: status affects decisions of aggression. Honking and the perceived worth of the car were used as proxies for aggression and perceptions of status, much like vocabulary is used as a proxy for meaning.

In no world would this be considered unobtrusive from the subject’s point of view. The experimenters intruded on their world, and their actions and lives changed because of it. All it says is that the subjects won’t change their behavior based on the knowledge that they are being studied. However, when an unobtrusive experiment becomes large enough, even one as innocuous as counting words, even that advantage no longer holds. Take, for example, citation analysis and the h-index. Citation analysis was initially construed as an unobtrusive measurement; we can say things about scholars and scholarly communication by looking at their citation patterns rather than interviewing them directly. However, now that entire nations (like Australia or the UK) use quantitative analysis to distribute funding to scholarship, the measurements are no longer unobtrusive. Scholars know how the new scholarly economy works, and have no problem changing their practices to get tenure, funding, etc.

The Measurement and The Effect: Untested Proxies

A paper was recently published (O’Boyle Jr. and Aguinis, 2012) on the non-normality of individual performance. The idea is that we assume that people’s performance (for example, that of students in a classroom) is normally distributed along a bell curve. A few kids get really good grades, a few kids get really bad grades, but most are ‘C’ students. The authors challenge this view, suggesting performance takes on more of a power-law distribution, where very few people perform very well, and the majority perform very poorly, with 80% of people performing worse than the statistical average. If that’s hard to imagine, it’s because people are trained to think of averages on a bell curve, where 50% are greater than average and 50% are worse than average. Instead, imagine one person gets a score of 100, and another five people get scores of 10. The average is (100 + (10 * 5)) / 6 = 25, which means five out of the six people performed worse than average.
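The six-person example can be checked in a few lines of Python. The scores are the ones given in the paragraph above; the variable names are just for illustration.

```python
from statistics import mean, median

scores = [100, 10, 10, 10, 10, 10]  # one high performer and five others

average = mean(scores)
below_average = sum(s < average for s in scores)

print("mean:", average)
print("median:", median(scores))
print("people below the mean:", below_average, "of", len(scores))
```

The median sits with the five typical scores while the mean is pulled upward by the single high one, which is why "worse than average" can describe most of a skewed group.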

It’s an interesting hypothesis, and (in my opinion) probably a correct one, but their paper does not do a great job showing that. The reason is (you guessed it) they use scores as a proxy for performance. For example, they look at the number of published papers individuals have in top-tier journals, and show that some authors are very productive whereas most are not. However, it’s a fairly widely-known phenomenon that in science, famous names are more likely to be published than obscure ones (there are many anecdotes about anonymous papers being rejected until the original, famous author is revealed, at which point the paper is magically accepted). The number of accepted papers may be as much a proxy for fame as it is for performance, so the results do not support their hypothesis. The authors then look at awards given to actors and writers; however, those awards suffer the same issues: the more well-known an actor, the more likely they’ll be used in good movies, the more likely they’ll be visible to award-givers, etc. Again, awards are not a proxy for the quality of a performance. The paper then goes on to measure elected officials based on votes in elections. I don’t think I need to go on about how votes might not map one-to-one on the performance and prowess of an elected official.

I blogged a review of the most recent culturomics paper, which used google ngrams to look at the frequency of recurring natural disasters (earthquakes, floods, etc.) vs. the frequency of recurring social events (war, unemployment, etc.). The paper concludes that, because of differences in the frequency of word-use for words like ‘war’ or ‘earthquake’, the phenomena themselves are subject to different laws. The authors use word frequency as a proxy for the frequency of the events themselves, much in the same way that Klout seems to measure influence based on word-usage and counting. The problem, of course, is that the processes which govern what people decide to write down do not enjoy a one-to-one relationship to what people experience. Using words as proxies for events is just as problematic as using them for proxies of expertise, influence, or performance. The underlying processes are simply far more complicated than these algorithms give them credit for.

It should be noted, however, that the counts are not meaningless; they just don’t necessarily work as proxies for what these ngram scholars are trying to measure. Further, although the underlying processes are quite complex, the effect size of social or political pressure on word-use may be negligible to the point that their hypothesis is actually correct. The point isn’t that one cannot use one measurement as a proxy for something else; rather, the effectiveness of that proxy is assumed rather than actually explored or tested in any way. We need to do a better job, especially as humanists, of figuring out exactly how certain measurements map onto effects we seek.

A beautiful case study that exemplifies this point was written by famous statistician Andrew Gelman, and it aims to use unobtrusive and indirect measurements to find alien attacks and zombie outbreaks. He uses Google Trends to show that the number of zombies in the world is growing at a frightening rate.

Zombies will soon take over!



Halting Conditions

Occasionally, in computer science, the term “halting condition” is thrown around as the point at which the program should stop running.

Say I’ve got a robot that watches my roommate and me play Scrabble, and I want it to count how many Scrabble pieces we use, and tell us who won and what the highest scoring word was. Unfortunately, let’s say, I’m also Superman, so our Scrabble games frequently end early when I hear cries for help and run off to the nearest phone booth. Our robot has to decide what conditions mean the game is over so it can give us the winner report; in this case, it is either when one player runs out of pieces, or when nobody plays a piece for a significant amount of time, because games often end early. Those are our halting conditions.

Scrabble Robot from
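The robot’s two halting conditions can be written down as a short Python sketch. The function name `should_halt`, its parameters, and the ten-minute idle cutoff are all assumptions made for illustration; the post only says “a significant amount of time.”

```python
def should_halt(pieces_left, seconds_since_last_move, idle_limit=600):
    """Return True if the game should be treated as over.

    pieces_left: dict mapping each player's name to the pieces they still hold.
    seconds_since_last_move: time since anyone last played a piece.
    idle_limit: assumed cutoff, in seconds, for "a significant amount of time".
    """
    someone_is_out = any(count == 0 for count in pieces_left.values())
    table_is_idle = seconds_since_last_move >= idle_limit
    return someone_is_out or table_is_idle


# The game ends normally when a player runs out of pieces...
print(should_halt({"me": 0, "roommate": 3}, seconds_since_last_move=5))
# ...or early, when nobody has played for a long while.
print(should_halt({"me": 4, "roommate": 3}, seconds_since_last_move=900))
```

Either condition alone is enough to stop, and the cutoff is a judgment call rather than something the game itself dictates, which is the part that carries over to data collection.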

When it comes to data collection, humanists have no halting conditions. We don’t even have decent halting heuristics. Lisa Rhody just blogged a fantastically important piece about the difficulties of data collection in the humanities, and her points are worth stressing. “You need to know,” Rhody writes, “when it’s time to cut the rope and release what might be done.” She points out that humanists need to be discerning in what data we do collect, and we need to be comfortable with analyzing and releasing imperfect data. “The decision not to be perfect is the right choice, but it isn’t an easy one.”

Research Design

Many (but not all!) of the natural sciences have it easy. You design an experiment, you get the data you planned to get, then you analyze and release it. The halting conditions, when to stop collecting and cleaning data, are usually fairly easily pre-determined and stuck to. Psychology and the social sciences are usually similar; they often either use data that already exists, or else collect it themselves under pre-specified conditions.

The humanities, well… we’re used to a tradition that involves very deep and particular reading. The tiniest stones of our studied objects do not go unturned. The idea that a first pass, an incomplete pass, can lead to anything at all, let alone analysis and release, is almost anathema to the traditional humanistic mindset.

Herein lies the problem of humanities big data. We’re trying to measure the length of a coastline by sitting on the beach with a ruler, rather than flying over with a helicopter and a camera. And humanists know that, like the sandy coastline shifting with the tides, our data are constantly changing with each new context or interpretation. Cartographers are aware of this problem, too, but they’re still able to make fairly accurate maps.

Measuring the Coast

While I won’t suggest that humanists should take a more natural-scientific approach to research, beginning with a specific hypothesis and pre-specified data that could either confirm or deny it, we should look to them for inspiration on how to plan research. Thinking about what sort of specific analyses you’d like to perform with the data at the end can reasonably constrain what you try to collect from the beginning. Think about what bits of data are redundant, or would yield diminishing returns on your time and money investment of data collection.

Being Comfortable With Imperfection

In her blog post, Lisa wrote about her experience at MITH. She had a four month fellowship to research 4,500 poems; she could easily have spent the whole time collecting increasingly minute data about each poem. In the end, she settled on only collecting the gender of the poet and whether the poem pertained to a work of art, opting not to include information like when each poem was published, what work of art it referred to, etc. She would then go in later and use other large-scale analytic tools (like text analysis), augmenting those results with the tags she entered about each poem.

A lot of valuable, rich information was lost in this data collection, but the important thing is that Lisa was still able to go in with a specific question, and collect only that which she needed most to explore it. The data may not have been perfect, and they may not have described everything, but they were sufficient and useful.

Her story reminded me a lot of my undergraduate years. I spent all of them collecting data on early modern letters for my old advisor. Letters, of course, generally have various locations and dates attached to them, and this presented us with no end of problems. Sometimes the places mentioned were cities, or houses, or states; granularities differed. Over the course of two hundred years, cities would change names, move, or wink out of or into existence entirely. Sometimes they would be subsumed into new or different empires. Computers, unfortunately, need fairly regularized data to perform comparative analyses, so we had to make a lot of editorial decisions when entering locations that would make answering our questions easier, but would lose some of the nuance otherwise available.

Similarly, my colleague Jeana Jorgensen recently spent several months painstakingly hand-collecting data about the usage of body parts in fairy tales for her dissertation. Of particular interest in her case was the overtly interpretive layer she added to the collection; for example, did a reference somehow embody the “grotesque?” By allowing herself the freedom to use interpretive frameworks, she embraced the subjective nature of data collection, and was able to analyze her data accordingly.

Of course, by allowing this sort of humanistic nuance, the amount of data one could collect for any single sentence is effectively infinite, and so Jeana had to constrain herself to only collecting for that which she could eventually use. It nevertheless took her months of daily collection, but if she tried to make her data perfect or complete, it would have taken her over a lifetime. She still managed to produce really interesting and thoughtful results for her dissertation.

Perfect or complete data is impossible in the humanities. The best we can do is not as much as we can, but as much as we need. There is a point of diminishing return for data collection; that point at which you can’t measure the coastline fast enough before the tides change it. We as humanists have to become comfortable with incompleteness and imperfection, and trust that in aggregate those data can still tell us something, even if they can’t reveal everything.

We can still see the landscape, even though not every piece is in place.

The trick and art is knowing the right halting conditions. How much is too much? What data will actually be useful? These are not easy questions, and their answers differ for every project. The important thing to remember is to just do it. Too many projects get hung up because they just haven’t quite collected enough yet, or because if they just spent a few more months cleaning, their data would be so much better. There will never be a point when your data are perfect. Do your analysis now, release it, and be comfortable with the fact that you’ve fairly accurately mapped the coastline, even if you haven’t quite worked out the jitters of the tides.