From Trees to Webs: Uprooting Knowledge through Visualization

[update: here are some of the pretty pictures I will be showing off in The Hague]

The blog’s been quiet lately; my attention has been occupied by various journal submissions and a new book in the works, but I figured my readers would be interested in one of those forthcoming publications. This is an article [preprint] I’m presenting at the Universal Decimal Classification Seminar in The Hague this October, on the history of how we’ve illustrated the interconnections of knowledge and scholarly domains. It’s basically two stories: one of how we shifted from understanding the world hierarchically to understanding it as a flat web of interconnected parts, and the other of how the thing itself and knowledge of that thing became separated.

Porphyrian Tree: tree of Aristotle's categories from the 6th century. [via]
Porphyrian Tree: tree of Aristotle’s categories originally dating from the 6th century. [via some random website about trees]
A few caveats worth noting: first, because I didn’t want to deal with the copyright issues, there are no actual illustrations in the paper. For the presentation, I’m going to compile a powerpoint with all the necessary attributions and post it alongside this paper so you can all see the relevant pretty pictures. For your viewing pleasure, though, I’ve included some of the illustrations in this blog post.

An interpretation of the classification of knowledge from Hobbes' Leviathan. [via e-ducation]
An interpretation of the classification of knowledge from Hobbes’ Leviathan. [via e-ducation]
Second, because the this is a presentation directed at information scientists, the paper is organized linearly and with a sense of inevitability; or, as my fellow historians would say, it’s very whiggish. I regret not having the space to explore the nuances of the historical narrative, but it would distract from the point and context of this presentation. I plan on writing a more thorough article to submit to a history journal at a later date, hopefully fitting more squarely in the historiographic rhetorical tradition.

H.G. Wells' idea of how students should be taught. [via H.G. Wells, 1938. World Brain. Doubleday, Doran & Co., Inc]
H.G. Wells’ idea of how students should be taught. [via H.G. Wells, 1938. World Brain. Doubleday, Doran & Co., Inc]
In the meantime, if you’re interested in reading the pre-print draft, here it is! All comments are welcome, as like I said, I’d like to make this into a fuller scholarly article beyond the published conference proceedings. I was excited to put this up now, but I’ll probably have a new version with full citation information within the week, if you’re looking to enter this into Zotero/Mendeley/etc. Also, hey! I think this is the first post on the Irregular that has absolutely nothing to do with data analysis.

Recent map of science by Kevin Boyack, Dick Klavans, W. Bradford Paley, and Katy Börner. [via SEED magazine]
Recent map of science by Kevin Boyack, Dick Klavans, W. Bradford Paley, and Katy Börner. [via SEED magazine]

The Historian’s Macroscope

Whelp, it appears the cat’s out of the bag. Shawn Graham, Ian Milligan, and I have signed our ICP contract and will shortly begin the process of writing The Historian’s Macroscope, a book introducing the process and rationale of digital history to a broad audience. The book will be a further experiment in live-writing: as we have drafts of the text, they will go online immediately for comments and feedback. The publishers have graciously agreed to allow us to keep the live-written portion online after the book goes on sale, and though what remains online will not be the final copy-edited and typeset version, we (both authors and publishers) feel this is a good compromise to prevent the cannibalization of book sales while still keeping much of the content open and available for those who cannot afford the book or are looking for a taste before they purchase it. Thankfully, this plan also fits well with my various pledges to help make a more open scholarly world.

Microscope / Telescope / Macroscope [via The Macroscope by Joël de Rosnay]
Microscope / Telescope / Macroscope [via The Macroscope by Joël de Rosnay]
We’re announcing the project several months earlier than we’d initially intended. In light of the American Historical Association’s recent statement endorsing the six year embargo of dissertations on the unsupported claim that it will help career development, we wanted to share our own story to offset the AHA’s narrative. Shawn, Ian, and I have already worked together on a successful open access chapter in The Programming Historian, and have all worked separately releasing public material on our respective blogs. It was largely because of our open material that we were approached to write this book, and indeed much of the material we’ve already posted online will be integrated into the final publication. It would be an understatement to say our publisher’s liaison Alice jumped at this opportunity to experiment with a semi-open publication.

The disadvantage to announcing so early is that we don’t have any content to tease you with. Stay-tuned, though. By September, we hope to have some preliminary content up, and we’d love to read your thoughts and comments; especially from those not already aligned with the DH world.

An experiment in communal editing: Finding the history & philosophy of science.

After my last post about co-citation analysis, the author of one of the papers I was responding to, K. Brad Wray, generously commented and suggested I write up and publish the results and send them off to Erkenntnis, which is the same journal he published his results. That sounded like a great idea, so I am.

Because so many good ideas have come from comments on this blog, I’d like to try opening my first draft to communal commenting. For those who aren’t familiar with google docs (anyone? Bueller?), you can comment by selecting test and either hitting ctrl-alt-m, or going to the insert-> menu and clicking ‘Comment’.

The paper is about the relationship between history of science and philosophy of science, and draws both from the blog post and from this page with additional visualizations. There is also an appendix (pdf, sorry) with details of data collection and some more interesting results for the HPS buffs. If you like history of science, philosophy of science, or citation analysis, I’d love to see your comments! If you have any general comments that don’t refer to a specific part of the text, just post them in the blog comments below.

This is a bit longer form than the usual blog, so who knows if it will inspire much interaction, but it’s worth a shot. Anyone who is signed in so I can see their name will get credit in the acknowledgements.

Finding the History and Philosophy of Science (earlier draft)  ← draft 1, thanks for your comments.

Finding the History and Philosophy of Science (current draft) ← comment here!


Networks Demystified 4: Co-Citation Analysis

This installment of Networks Demystified is the first one that’s actually applied. A few days ago, a discussion arose over twitter involving citation networks, and this post fills the dual purpose of continuing that discussion, and teaching a bit about basic citation analysis. If you’re looking for the very basics of networks, see part 1 and part 2. Part 3 is a warning for anyone who feels the urge to say “power law.” To recap: nodes are the dots/points in the network, edges are the lines/arrows/connections.

Understanding Sociology, Philosophy, and Literary Theory using One Easy Method™!

The growing availability of humanities and social science (HSS) citation data in databases like ISI’s Web of Science (warning: GIANT paywall. Good luck getting access if your university doesn’t subscribe.) has led to a groundswell of recent blog activity in the area, mostly by the humanists and social scientists themselves. Which is a good thing, because citation analyses of HSS will happen whether we’re involving in doing them or not, so if humanists start becoming familiar with the methods, at least we can begin getting humanistically informed citation analyses of our own data.

ISI Web of Science paywall
The size of ISI’s Web of Science paywall. You shall not pass. [via]
This is a sort of weird post. It’s about history and philosophy of science, by way of social history, by way of literary theory, by way of philosophy, by way of sociology. About this time last year, Dan Wang asked the question Is There a Canon in Economic Sociology (pdf)? Wang was searching for a set of core texts for economic sociology, using a set of 52 syllabi regarding the subject. It’s a reasonable first pass at the question, counting how often each article appears in the syllabi (plus some more complex measurements) as well as how often individual authors appear. Those numbers are used to support the hypothesis that there is a strongly present canon, both of authors and individual articles, in economic sociology. This is an example of an extremely simple bimodal network analysis where there are two varieties of node: syllabi or articles. Each syllabi cites multiple articles, and several of those articles are cited by multiple syllabi. The top part of Figure 1 is what this would look like in a basic network representation.

Figure 1: basic bimodal network (top) and the resulting co-citation network (bottom). [Via Mark Newman, PNAS]
Figure 1: basic bimodal network (top) and the resulting co-citation network (bottom). [via Mark Newman]
Wang was also curious how instructors felt these articles fit together, so he used a common method called co-citation analysis to answer the question. The idea is that if two articles are cited in the same syllabus, they are probably related, so they get an edge drawn between them. He further restricted his analysis so that articles had to appear together in the same class session, rather than the the same syllabus, to be considered related to each other. What results is a new network (Figure 1, below) of article similarity based on how frequently they appear together (how frequently they are cited by the same source). In Figure 1, you can see that because article H and article F are both cited in syllabus class session 3, they get an edge drawn between them.

A further restriction was then placed on the network, what’s called a threshold. Two articles would only get an edge drawn between them if they were cited by at least 2 different class sessions (threshold = 2). The resulting economic sociology syllabus co-citation network looked like Figure 2, pulled from the original article. From this picture, one can begin to develop a clear sense of the demarcations of subjects and areas within economic sociology, thus splitting the canon into its constituent parts.

Figure 2: Co-citation network in economic sociology. [via]
Figure 2: Co-citation network in economic sociology. Edge thickness represents how often articles appear together in syllabi, and node size is based on a measure of centrality. [via]
In short order, Kieran Healy blogged a reply to this study, providing his own interpretations of the graph and what the various clusters represented. Remember Healy’s name, as it’s important later in the story. Two days after Healy’s blog post, Neal Caren took inspiration and created a co-citation analysis of sociology more broadly–not just economic sociology–using data he downloaded from ISI’s Web of Science (remember the giant paywall from before?). Instead of using syllabi, Caren looked at articles found in American Journal of Sociology, American Sociological Review, Social Forces and Social Problems since 2008. Web of Science gave him a list of every citation from every article in those journals, and he performed the same sort of co-citation analysis as Dan Wang did with syllabi, but at a much larger scale.

Because the dataset Caren used was so much larger, he had to enforce much stricter thresholds to keep the visualization manageable. Whereas Wang’s graph showed all articles, and connected them if they appeared together in more than 2 class sessions, Caren’s graph only connected articles which were cited together more than 4 times (threshold = 4). Further, a cited article wouldn’t even appear on the network visualization unless the article itself had been cited 8 or more times, thus reducing the amount of articles appearing on the visualization overall. The final network had 397 nodes (articles) and 1,597 edges (connections between articles). He also used a popular community detection algorithm to color the different article nodes based on which other articles they were most related to. Figure 3 shows the resulting network, and clicking on it will lead to an interactive version.

Figure 3: Neal Caren's sociology co-citation analysis. Click the picture to see the interactive version. [via]
Figure 3: Neal Caren’s sociology co-citation analysis. Click the picture to see the interactive version. [via]
Caren adds a bit of contextual description in his blog post, explaining what the various clusters represent and why this visualization is a valid and useful one for the field of sociology. Notably, at the end of the post, he shares his raw data, a python script for analyzing it, and all the code for visualizing the network and making it interactive and pretty.

Jump forward a year. Kieran Healy, the one who wrote the original post inspiring Neal Caren’s, decides to try his own hand at a citation analysis using some of the code and methods that Neal Caren had posted about. Healy’s blog post, created just a few days ago, looks at the field of philosophy through the now familiar co-citation analysis. Healy’s analysis covers 20 years of four major philosophy journals, consisting of 2,200 articles. These articles together make over 34,000 citations, although many of the cited articles are duplicates of articles that had already been cited. Healy writes:

The more often any single paper is cited, the more important it’s likely to be. But the more often any two papers are cited together, the more likely they are to be part of some research question or ongoing problem or conversation topic within the discipline.

With a dataset this large, the resulting co-citation network wound up having over a million edges, or connections between co-cited articles. Healy decides to only focus on the 500 most highly-cited items in the journals (not the best practice for a co-citation analysis, but I’ll address that in a later post), resulting in only articles that had been cited more than 10 times within the four journal dataset to be present in the network. Figure 4 shows the resulting network, which like Figure 3, can be clicked on to reach the interactive version.

Figure 4: Kieran Healy's co-citation analysis of four philosophy journals. Click for interactivity. [via]
Figure 4: Kieran Healy’s co-citation analysis of four philosophy journals. Click for interactivity. [via]
The post goes on to provide a fairly thorough and interesting analysis of the various communities formed by article clusters, thus giving a description of the general philosophy landscape as it currently stands. The next day, Healy posted a follow-up delving further into citations of philosopher David Lewis, and citation frequencies by gender. Going through the most highly cited 500 or so philosophy articles by hand, Healy finds that 3.6% of the articles are written by women; 6.3% are written by David Lewis; the overwhelming majority are written by white men. It’s not lost on me that the overwhelming majority of people doing these citation analyses are also white men – someone please help change that? Healy posted a second follow-up a few days later, worth reading, on his reasoning behind which journals he used and why he looked at citations in general. He concludes “The 1990s were not the 1950s. And yet essentially none of the women from this cohort are cited in the conversation with anything close to the same frequency, despite working in comparable areas, publishing in comparable venues, and even in many cases having jobs at comparable departments.”

Merely short days after Healy’s articles, Jonathan Goodwin became inspired, using the same code Healy and Caren used to perform a co-citation analysis of Literary Theory Journals. He began by concluding that these co-citation analysis were much more useful (better) than his previous attempts at direct citation analysis. About four decades of bibliometric research backs up Goodwin’s claim. Figure 5 shows Goodwin’s Literary Theory co-citation network, drawn from five journals and clickable for the interactive version, where he adds a bit of code so that the user can determine herself what threshold she wants to cut off co-citation weights. Goodwin describes the code to create the effect on his github account. In a follow-up post, directly inspired by Healy’s, Goodwin looks at citations to women in literary theory. His results? When a feminist theory journal is included, 8 of the top 30 authors are women (27%); when that journal is not included, only 2 of the top 30 authors are women (7%).

Figure 5: Goodwin's literary theory co-citation network. [via]
Figure 5: Goodwin’s literary theory co-citation network. [via]

At the Speed of Blog

Just after these blog posts were published, a quick twitter exchange between Jonathan Goodwin, John Theibault, and myself (part of it readable here) spurred Goodwin, in the space of 20 minutes, to download, prepare, and visualize the co-citation data of four social history journals over 40 years. He used ISI Web of Science data, Neal Caren’s code, a bit of his own, and a few other bits of open script which he generously cites and links to. All of this is to highlight not only the phenomenal speed of research when unencumbered by the traditional research process, but also the ease with which these sorts of analysis can be accomplished. Most of this is done using some (fairly simple) programming, but there are just as easy solutions if you don’t know how to or don’t care to code–one specifically which I’ll mention later, the Sci2 Tool. From data to visualization can take a matter of minutes; a first pass at interpretation won’t take much longer. These are fast analyses, pretty useful for getting a general overview of some discipline, and can provide quite a bit of material for deeper analysis.

The social history dataset is now sitting on Goodwin’s blog just waiting to be interpreted by the right expert. If you or anyone you know is familiar with social history, take a stab at figuring out what the analysis reveals, and then let us all know in a blog post of your own. I’ll be posting a little more about it as well soon, though I’m no expert of the discipline. Also, if you’re interested in citation analysis in the humanities, and you’ll be at DH2013 in Nebraska, I’ll be chairing a session all about citations in the humanities featuring an impressive lineup of scholars. Come join us and bring questions, July 17th at 10:30am.

Discovering History and Philosophy of Science

Before I wrap up, it’s worth mentioning that in one of Kieran Healy’s blog posts, he thanks Brad Wray for pointing out some corrections in the dataset. Brad Wray is one of the few people to have published a recent philosophy citation analysis in a philosophy journal. Wray is a top-notch philosopher, but his citation analysis (Philosophy of Science: What are the Key Journals in the Field?, Erkenntnis, May 2010 72:3, paywalled) falls a bit short of the mark, and as this is an instructional piece on co-citation analysis, it’s worth taking some time here to explore why.

Wray’s article’s thesis is that “there is little evidence that there is such a field as the history and philosophy of science (HPS). Rather, philosophy of science is most properly conceived of as a sub-field of philosophy.” He arrives at this conclusion via a citation analysis of three well-respected monographs, A Companion to the Philosophy of ScienceThe Routledge Companion to Philosophy of Science, and The Philosophy of Science edited by David Papineau, in total comprising 149 articles. Wray then counts how many times major journals are cited within each article, and shows that in most cases, the most frequently cited journals across the board are strict philosophy of science journals.

The data used to support Wray’s thesis–that there is no such field as history & philosophy of science (HPS)–is this coarse-level journal citation data. No history of science journal is listed in the top 10-15 journals cited by the three monographs, and HPS journals appear, but very infrequently. Of the evidence, Wray writes “if there were such a field as history and philosophy of science, one would expect scholars in that field to be citing publications in the leading history of science journal. But, it appears that philosophy of science is largely independent of the history of science.”

It is curious that Wray would suggest that total citations from strict philosophy of science companions can be used as evidence of whether a related but distinct field, HPS, actually exists. Low citations from philosophy of science to history of science is that evidence. Instead, a more nuanced approach to this problem would be similar to the approach above: co-citation analysis. Perhaps HPS can be found by analyzing citations from journals which are ostensibly HPS, rather than analyzing three focused philosophy of science monographs. If a cluster of articles should appear in a co-citation analysis, this would be strong evidence that such a discipline currently exists among citing articles. If such a cluster does not appear, this would not be evidence of the non-existence of HPS (absence of evidence ≠ evidence of absence), but that the dataset or the analysis type is not suited to finding whatever HPS might be. A more thorough analysis would be required to actually disprove the existence of HPS, although one imagines it would be difficult explaining that disproof to the people who think that’s what they are.

With this in mind, I decided to perform the same sort of co-citation analysis as Dan Wang, Kieran Healy, Neal Caren, and Jonathan Goodwin, and see what could be found. I drew from 15 journals classified in ISI’s Web of Science as “History & Philosophy of Science” (British Journal for the Philosophy of Science, Journal of Philosophy, Synthese, Philosophy of Science, Studies in History and Philosophy of Science, Annals of Science, Archive for History of Exact Sciences, British Journal for the History of Science, Historical Studies in the Natural Sciences, History and Philosophy of the Life Sciences, History of Science, Isis, Journal for the History of Astronomoy, Osiris, Social Studies of Science, Studies in History and Philosophy of Modern Physics, and Technology and Culture). In all I collected 12,510 articles dating from 1956, with over 300,000 citations between them. For the purpose of not wanting to overheat my laptop, I decided to restrict my analysis to looking only at those articles within the dataset; that is, if any article from any of the 15 journals cited any other article from one of the 15 journals, it was included in the analysis.

I also changed my unit of analysis from the article to the author. I didn’t want to see how often two articles were cited by some third article–I wanted to see how often two authors were cited together within some article. The resulting co-citation analysis gives author-author pairs rather than article-article pairs, like the examples above. In all, there were 7,449 authors in the dataset, and 10,775 connections between author pairs; I did not threshold edges, so the some authors in the network were cited together only once, and some as many as 60 times. To perform the analysis I used the Science of Science (Sci2) Tool, no programming required, (full advertisement disclosure: I’m on the development team), and some co-authors and I have written up how to do a similar analysis in the documentation tutorials.

The resulting author co-citation network, in Figure 6, reveals two fairly distinct clusters of authors. You can click the image to enlarge, but I’ve zoomed in on the two communities, one primarily history of science, the other primarily philosophy of science. At first glance, Wray’s hypothesis appears to be corroborated by the visualization; there’s not much in the way of a central cluster between the two. That said, a closer look at the middle, Figure 7, highlights a group of people whom either have considered themselves within HPS, or others have considered HPS.

Figure 6: Author co-citation network of 15 history & philosophy of science journals. Two authors are connected if they are cited together in some article, and connected more strongly if they are cited together frequently. Click to enlarge. [via me!]
Figure 6: Author co-citation network of 15 history & philosophy of science journals. Two authors are connected if they are cited together in some article, and connected more strongly if they are cited together frequently. Click to enlarge. [via me!] 
Figure 7: Author co-citation analysis of history and philosophy of science journals, zoomed in on the area between history and philosophy, with authors highlighted who might be considered HPS. Click to enlarge.
Figure 7: Author co-citation analysis of history and philosophy of science journals, zoomed in on the area between history and philosophy, with authors highlighted who might be considered HPS. Click to enlarge.

Figures 6 & 7 don’t prove anything, but they do suggest that within citation patterns, history of science and philosophy of science are clearly more cohesive than some combined HPS might be. Figure 7 suggests there might be more to the story, and what is needed in the next step to try to pin down HPS–if indeed it exists as some sort of cohesive unit–is to find articles that specifically self-identify as HPS, and through their citation and language patterns, try to see what they have in common with and what separates them from the larger community. A more thorough set of analytics, visualizations, and tables, which I’ll explain further at some point, can be found here (apologies for the pdf, this was originally made in preparation for another project).

The reason I bring up this example is not to disparage Wray, whose work did a good job of finding the key journals in philosophy of science, but to argue that we as humanists need to make sure the methods we borrow match the questions we ask. Co-citation analysis happens to be a pretty good method for exploring the question Wray asked in his thesis, but there are many more situations where it wouldn’t be particularly useful. The recent influx of blog posts on the subject, and the upcoming DH2013 session, is exciting, because it means humanists are beginning to take citation analysis seriously and are exploring the various situations in which its methods are appropriate. I look forward to seeing what comes out of the Social History data analysis, as well as future directions this research will take.

Liveblogged Review of Macroanalysis by Matthew L. Jockers, Part 2

I just got Matthew L. Jocker’s Macroanalysis in the mail, and I’m excited enough about it to liveblog my review. Here’s the review of part II (Analysis), chapter 5 (metadata). Read Part 1, Part 3, …

Part II: Analysis

Part II of Macroanalysis moves from framing the discussion to presenting a series of case studies around a theme, starting fairly simply in claims and types of analyses and moving into the complex. This section takes up 130 of the 200 pages; in a discipline (or whatever DH is) which has coasted too long on claims that the proof of its utility will be in the pudding (eventually), it’s refreshing to see a book that is at least 65% pudding. That said, with so much substance – particularly with so much new substance – Jockers opens his arguments up for specific critiques.

Aiming for more pudding-based scholarly capital in DH. via brenthor.
Aiming for more pudding-based scholarly capital in DH. via brenthor.

Quantitative arguments must by their nature be particularly explicit, without the circuitous language humanists might use to sidestep critiques. Elijah Meeks and others have been arguing for some time now that the requirement to solidify an argument in such a way will ultimately be a benefit to the humanities, allowing faster iteration and improvement on theories. In that spirit, for this section, I offer my critiques of Jockers’ mathematical arguments not because I think they are poor quality, but because I think they are particularly good, and further fine-tuning can only improve them. The review will now proceed one chapter at a time.


Jockers begins his analysis exploring what he calls the “lowest hanging fruit of literary history.” Low hanging fruit can be pretty amazing, as Ted Underwood says, and Jockers wields some fairly simple data in impressive ways. The aim of this chapter is to show that powerful insights can be achieved using long-existing collections of library metadata, using a collection of nearly 800 Irish American works over 250 years as a sample dataset for analysis. Jockers introduces and offsets his results against the work of Charles Fanning, whom he describes as the expert in Irish American fiction in aggregate. A pre-DH scholar, Fanning was limited to looking through only the books he had time to read; an impressive many, according to Jockers, but perhaps not enough. He profiles 300 works, fewer than half of those represented in Jockers’ database.

The first claim made in this chapter is one that argues against a primary assumption of Fanning’s. Fanning expends considerable effort explaining why there was a dearth of Irish American literature between 1900-1930; Jockers’ data show this dearth barely existed. Instead, the data suggest, it was only eastern Irish men who had stopped writing. The vacuum did not exist west of the Mississippi, among men or women. Five charts are shown as evidence, one of books published over time, and the other four breaking publication down by gender and location.

Jockers is careful many times to make the point that, with so few data, the results are suggestive rather than conclusive. This, to my mind, is too understated. For the majority of dates in question, the database holds fewer than 6 books per year. When breaking down by gender and location, that number is twice cut in half. Though the explanations of the effects in the graphs are plausible, the likelihood of noise outweighing signal at this granularity is a bit too high to be able to distinguish a just-so story from a credible explanation. Had the data been aggregated in five- or ten-year intervals (as they are in a later figure 5.6), rather than simply averaged across them, the results may have been more credible. The argument may be brought up that, when aggregating across larger intervals, the question of where to break up the data becomes important; however, cutting the data into yearly chunks from January to December is no more arbitrary than cutting them into decades.

There are at least two confounding factors one needs to take into account when doing a temporal analysis like this. The first is that what actually happened in history may be causally contingent, which is to say, there’s no particularly useful causal explanation or historical narrative for a trend. It’s just accidental; the right authors were in the right place at the right time, and all happened to publish books in the same year. Generally speaking, if only around five books are published a year, though sometimes that number is zero and sometimes than number is ten, any trends that we see (say, five years with only a book or two) may credibly be considered due to chance alone, rather than some underlying effect of gender or culture bias.

The second confound is the representativeness of the data sample to some underlying ground truth. Datasets are not necessarily representative of anything, however as defined by Jockers, his dataset ought to be representative of all Irish American literature within a 250 year timespan. That’s his gold standard. The dataset obviously does not represent all books published under this criteria, so the question is how well do his publication numbers match up with the actual numbers he’s interested in. Jockers is in a bit of luck here, because what he’s interested in is whether or not there was a resounding silence among Irish authors; thus, no matter what number his charts show, if they’re more than one or two, it’s enough to disprove Fanning’s hypothesized silence. Any dearth in his data may be accidental; any large publications numbers are not.

This example chart compares a potential "real" underlying publication rate against several simulated potential sample datasets Jockers might have, created by multiplying the "real" dataset by some random number between 0 and 1.
This example chart compares a potential “real” underlying publication rate against several simulated potential sample datasets Jockers might have, created by multiplying the “real” dataset by some random number between 0 and 1.

I created the above graphic to better explain the second confounding factor of problematic samples. The thick black line, we can pretend, is the actual number of books published by Irish American authors between 1900 and 1925. As mentioned, Jockers would only know about a subset of those books, so each of the four dotted lines represents a possible dataset that he could be looking at in his database instead of the real, underlying data. I created these four different dotted lines by just multiplying the underlying real data by a random number between 0 and 1 1. From this chart it should be clear that it would not be possible for him to report an influx of books when there was a dearth (for example, in 1910, no potential sample dataset would show more than two books published). However, if Jockers wanted to make any other claims besides whether or not there was a dearth (as he tentatively does later on), his available data may be entirely misleading. For example, looking at the red line, Run 4, would suggest that ever-more books were being published between 1910 and 1918, when in fact that number should have decreased rapidly after about 1912.

The correction included in Macroanalysis for this potential difficulty was to use 5-year moving averages for the numbers rather than just showing the raw counts. I would suggest that, because the actual numbers are so small and a change of a small handful of books would look like a huge shift on the graph, this method of aggregation is insufficient to represent the uncertainty of the data. Though his charts show moving averages, they still shows small changes year-by-year, which creates a false sense of precision. Jockers’ chart 5.6, which aggregates by decade and does not show these little changes, does a much better job reflecting the uncertainty. Had the data showed hundreds of books per year, the earlier visualizations would have been more justifiable, as small changes would have amounted to less emphasized shifts in the graph.

It’s worth spending extra time on choices of visual representation, because we have not collectively arrived at a good visual language for humanities data, uncertain as they often are. Nor do we have a set of standard practices in place, as quantitative scientists often do, to represent our data. That lack of standard practice is clear in Macroanalysis; the graphs all have subtitles but no titles, which makes immediate reading difficult. Similarly, axis labels (“count” or “5-year average”) are unclear, and should more accurately reflect the data (“books published per year”), putting the aggregation-level in either an axis subtitle or the legend. Some graphs have no axis labels at all (e.g., 5.12-5.17). Their meanings are clear enough to those who read the text, or those familiar with ngram-style analyses, but should be more clear at-a-glance.

Questions of visual representation and certainty aside, Jockers still provides several powerful observations and insights in this chapter. Figure 5.6, which shows Irish American fiction per capita, reveals that westerners published at a much higher relative rate than easterners, which is a trend worth explaining (and Jockers does) that would not have been visible without this sort of quantitative analysis. The chapter goes on to list many other credible assessments and claims in light of the available data, as well as a litany of potential further questions that might be explored with this sort of analysis.  He also makes the important point that, without quantitative analysis, “cherry-picking of evidence in support of a broad hypothesis seems inevitable in the close-reading scholarly traditions.” Jockers does not go so far as to point out the extension of that rule in data analysis; with so many visible correlations in a quantitative study, one could also cherry-pick those which support one’s hypothesis. That said, cherry-picking no longer seems inevitable. Jockers makes the point that Fanning’s dearth thesis was false because his study was anecdotal, an issue Jockers’ dataset did not suffer from. Quantitative evidence, he claims, is not in competition with evidence from close reading; both together will result in a “more accurate picture of our subject.”

The second half of the chapter moves from publication counting to word analysis. Jockers shows, for example, that eastern authors are less likely to use words in book titles that identify their work as ‘Irish’ than western authors, suggesting lower prejudicial pressures west of the Mississippi may be the cause. He then complexifies the analysis further, looking at “lexical diversity” across titles in any given year – that is, a year is more lexically diverse if the titles of books published that year are more unique and dissimilar from one another. Fanning suggests the years of the famine were marked by a lack of imagination in Irish literature; Jockers’ data supports this claim by showing those years had a lower lexical diversity among book titles. Without getting too much into the math, as this review of a single chapter has already gone on too long, it’s worth pointing out that both the number of titles and the average length of titles in a given year can affect the lexical diversity metric. Jockers points this out in a footnote, but there should have been a graph comparing number of titles per year, length per year, and lexical diversity, to let the readers decide whether the first two variables accounted for the third, or whether to trust the graph as evidence for Fanning’s lack-of-imagination thesis.

One of the particularly fantastic qualities about this sort of research is that readers can follow along at home, exploring on their own if they get some idea from what was brought up in the text. For example, Jockers shows that the word ‘century’ in British novel titles is popular leading up to and shortly after the turn of the nineteenth century. Oddly, in the larger corpus of literature (and it seems English language books in general), we can use to see that, rather than losing steam around 1830, use of ‘century’ in most novel titles actually increases until about 1860, before dipping briefly. Moving past titles (and fiction in general) to full text search, google ngrams shows us a small dip around 1810 followed by continued growth of the word ‘century’ in the full text of published books. These different patterns are interesting particularly because they suggest there was something unique about the British novelists’ use of the word ‘century’ that is worth explaining. Oppose this with Jockers’ chart of the word ‘castle’ in British book titles, whose trends actually correspond quite well to the bookworm trend until the end of the chart, around 1830. [edit: Ben Schmidt points out in the comments that bookworm searches full text, not just metadata as I assumed, so this comparison is much less credible.]

Use of the word 'castle' in the metadata of books provided by Compare with figure 5.14. via bookworm.
Use of the word ‘castle’ in the metadata of books provided by Compare with figure 5.14. via bookworm.

Jockers closes the chapter suggesting that factors including gender, geography, and time help determine what authors write about. That this idea is trivial makes it no less powerful within the context of this book: the chapter is framed by the hypothesis that certain factors influence Irish American literature, and then uses quantitative, empirical evidence to support those claims. It was oddly satisfying reading such a straight-forward approach in the humanities. It’s possible, I suppose, to quibble over whether geography determines what’s written about or whether the sort of person who would write about certain things is also the sort of person more likely to go west, but there can be little doubt over the causal direction of the influence of gender. The idea also fits well with the current complex systems approach to understanding the world, which mathematically suggests that environmental and situational constraints (like gender and location) will steer the unfolding of events in one direction or another. It is not a reductionist environmental determinism so much as a set of probabilities, where certain environments or situations make certain outcomes more likely.

Stay tuned for Part the Third!


  1. If this were a more serious study, I’d have multiplied by a more credible pseudo-random value keeping the dataset a bit closer to the source, but this example works fine for explanatory value

Liveblogged Review of Macroanalysis by Matthew L. Jockers, Part 1

I just got Matthew L. Jocker’s Macroanalysis in the mail, and I’m excited enough about it to liveblog my review. Here’s my review of part I (Foundation), all chapters. Read Part 2, Part 3, …

Macroanalysis: Digital Methods & Literary History is a book whose time has come. “Individual creativity,” Matthew L. Jockers writes, “is highly constrained, even determined, by factors outside of what we consider to be a writer’s conscious control.” Although Jockers’ book is a work of impressive creativity, it also fits squarely within a larger set of trends. The scents of ‘Digital Humanities’ (DH) and ‘Big Data’ are in the air, the funding-rich smells attracting predators from all corners, and Jockers’ book floats somewhere in the center of it all. As with many DH projects, Macroanalysis attempts the double goal of explaining a new method and exemplifying the type of insights that can be achieved via this method. Unlike many projects, Jockers succeeds masterfully at both. Macroanalysis introduces its readers to large scale quantitative methods for studying literary history, and through those methods explores the nature of creativity and influence in general and the place of Irish literature within its larger context in particular.

I’ve apparently gained a bit of a reputation for being overly critical, and it’s worth pointing out at the beginning of this review that this trend will continue for Macroanalysis. That said, I am most critical of the things I love the most, and readers who focus on any nits I might pick without reading the book themselves should keep in mind that the overall work is staggering in its quality, and if it does fall short in some small areas, it is offset by the many areas it pushes impressively forward.

Macroanalysis arrives on bookshelves eight years after Franco Moretti’s Graphs, Maps, and Trees (2005), and thirteen years after Moretti’s “Conjectures on World Literature” went to press in early 2000, where he coined the phrase “distant reading.” Moretti’s distant reading is a way of seeing literature en masse, of looking at text at the widest angle and reporting what structures and forms only become visible at this scale. Moretti’s early work paved the way, but as might be expected with monograph published the same year as the initial release of Google Books, lack of available data made it stronger in theory than in computational power.

From Moretti's Graphs, Maps, and Trees
From Moretti’s Graphs, Maps, and Trees

In 2010, Moretti and Jockers, the author of Macroanalysis, co-founded the Stanford Lit Lab for the quantitative and digital research of literature. The two have collaborated extensively,  and Jockers acknowledge’s Moretti’s influence on his monograph. That said, in his book, Jockers distances himself slightly from Moretti’s notion of distant reading, and it is not the first time he has done so. His choice of “analysis” over “reading” is an attempt to show that what his algorithms are doing at this large scale is very different from our normal interpretive process of reading; it is simply gathering and aggregating data, the output of which can eventually be read and interpreted instead of or in addition to the texts themselves. The term macroanalysis was inspired by the difference between macro- and microeconomics, and Jockers does a good job justifying the comparison. Given that Jockers came up with the comparison in 2005, one does wonder if he would have decided on different terminology after our recent financial meltdown and the ensuing large-scale distrust of macroeconomic methods. The quantitative study of history, cliometrics, also had its origins in economics and suffered its own fall from grace decades ago; quantitative history still hasn’t recovered.

Part I: Foundation

I don’t know whether the allusion was intended, but lovers of science fiction and quantitative cultural studies will enjoy the title of Part I: “Foundation.” It shares a name with a series of books by Isaac Asimov, centering around the ability to combine statistics and human-centric research to understand and predict people’s behaviors. Punny titles aside, the section provides the structural base of the monograph.

The story of Foundation in a nutshell. Via c0ders.
The story of Foundation in a nutshell. Via c0ders.

Much of the introductory chapters are provocative statements about the newness of the study at hand, and they are not unwarranted. Still, I can imagine that the regular detractors of technological optimism might argue their usual arguments in response to Jockers’ pronouncements of a ‘revolution.’ The second chapter, on Evidence, raises some particularly important (and timely) points that are sure to raise some hackles. “Close reading is not only impractical as a means of evidence gathering in the digital library, but big data render it totally inappropriate as a method of studying literary history.” Jockers hammers home this point again and again, that now that anecdotal evidence based on ‘representative’ texts is no longer the best means of understanding literature, there’s no reason it should still be considered the gold standard of evidentiary support.

Not coming from a background of literary history or criticism, I do wonder a bit about these notions of representativeness (a point also often brought up by Ted Underwood, Ben Schmidt, and Jockers himself). This is probably something lit-researchers worked out in the 70s, but it strikes me that the questions being asked of a few ‘exemplary, representative texts’ are very different than the ones that ought to be asked of whole corpora of texts. Further, ‘representative’ of what? As this book appears to be aimed not only at traditional literary scholars, it would have been beneficial for Jockers to untangle these myriad difficulties.

One point worth noting is that, although Jockers calls his book Macroanalysis, his approach calls for a mixed method, the combination of the macro/micro, distant/close. The book is very careful and precise in its claims that macroanalysis augments and opens new questions, rather than replaces. It is a combination of both approaches, one informing the other, that leads to new insights. “Today’s student of literature must be adept at reading and gathering evidence from individual texts and equally adept at accessing and mining digital-text repositories.” The balance struck here is impressive: to ignore macroanalysis as a superior source of evidence for many types of large questions would be criminal, but its adoption alone does not make for good research (further, either without the other would be poorly done). For example, macroanalysis can augment close reading approaches by contextualizing a text within its broad historical and cultural moment, showing a researcher precisely where their object of research fits in the larger picture.

Historians would do well to heed this advice, though they are not the target audience. Indeed, historians play a perplexing role in Jockers’ narrative; not because his description is untrue, but because it ought not be true. In describing the digital humanities, Jockers calls it an “ambiguous and amorphous amalgamation of literary formalists, new media theorists, tool builders, coders, and linguists.” What place historians? Jockers places their role earlier, tracing the wide-angle view to the Annales historians and their focus on longue durée history. If historian’s influence ends there, we are surely in a sad state; that light, along with those of cliometrics and quantitative history, shone brightest in the 1970s before a rapid decline. Unsworth recently attributed the decline to the fallout following Time on the cross (Fogel & Engerman, 1974), putting quantitative methods in history “out of business for decades.” The ghost of cliometrics still haunts historians to such an extent that the best research in that area, to this day, comes more from information scientists and applied mathematicians than from historians. Digital humanities may yet exorcise that ghost, but it has not happened yet, as evidenced in part by the glaring void in Jockers’ introductory remarks.

It is with this framing in mind that Jockers embarks on his largely computational and empirical study of influence and landscape in British and American literature.

CfP: “Historical Network Research” at Sunbelt, May 21-26, Germany

Marten Düring, an altogether wonderful researcher who is responsible for this brilliant bibliography of networks in history, has issues a call for papers to participate in this year’s Sunbelt Conference, which is one of the premier social network analysis conferences in the world.

Historical network. via.
Historical network. via Marten Düring.


Call for papers “Historical Network Research” at the XXXIII. Sunbelt Conference, May 21-26 – University of Hamburg, Germany


The concepts and methods of social network analysis in historical research are recently being used not only as a mere metaphor but are increasingly applied in practice. In the last decades several studies in the social sciences proved that formal methods derived from social network analysis can be fruitfully applied to selected bodies of historical data as well. These studies however tend to be strongly influenced by concerns, standards of data processing, and, above all, epistemological paradigms that have their roots in the social sciences. Among historians, the term network has been used in a metaphorical sense alone for a long time. It was only recently that this has changed.
We invite papers which successfully integrate social network analysis methods and historical research methods and reflect on the added value of their methodologies. Topics could cover (but are not limited to) network analyses of correspondences, social movements, kinship or economic systems in any historical period.
Submission will be closing on December 31 at 11:59:59 EST. Please limit your abstract to 250 words. Please submit your abstract here:
and select “Historical Network Research” as session title in the drop down box on the submission site. Please put a note in the “additional notes” box on the abstract submission form that states Marten During and Martin Stark as the session organizers.
For further information on the venue and conference registration see:, for any questions regarding the panel, please get in touch with the session organizers.

Session organizers:
Marten During, Radboud University Nijmegen,
Martin Stark, University of Hamburg,

Check for a detailed bibliography, conferences, screencasts and other resources.

On the importance of a single historical author

I have a dirty admission to make: I think yesterday happened. Actually. Objectively. Stuff happened. I hear that’s still a controversial statement in some corners of the humanities, but I can’t say for sure; I generally avoid those corners. And I think descriptions of the historical evidence can vary in degrees of accuracy, separating logically coherent but historically implausible conspiracy theories from more likely narratives.

At the same time, what we all think of as the past is a construct. A bunch of people – historians, cosmologists, evolutionary biologists, your grandmother who loves to tell stories, you – have all worked together to construct and reconstruct the past. Lots of pasts, actually, because no two people can ever wholly agree; everybody sees the evidence through the lens of their own historical baggage.

I’d like to preface this post with the cautious claim that I am an outsider explaining something I know less about than I should. The hats I wear are information/data scientist and historian of science, and through some accident of the past, historians and historians of science have followed largely separate cultural paths. Which is to say, neither the historian of science in me nor the information scientist has a legitimate claim to the narrative of general history and the general historical process, but I’m attempting to write one anyway. I welcome any corrections or admonishments.

The Narrativist Individual

I use in this post (and in life in general) the vocabulary definitions of Aviezer Tucker, who is doing groundbreaking work on simple stuff like defining “history” and asking questions about what we can know about the past. 1 “History,” Tucker defines, is simply stuff that happened: the past itself. “Historians” are anybody who inquires about the past, from cosmologists to historical linguists. A “historiography” is a knowledge of the past, or more concretely, something a historian has written about the past. “Historiographic research” is what we historians do when we try to find out about the past, and a “historiographic narrative” is generally the result of a lot of that research strung together. 2

Narratives are important. In the 1970s, a bunch of historians began realizing that historians create narratives when they collect their historiographic research 3; that is, people tell stories about the past, using the same sorts of literary and rhetorical devices used in many other places. History itself is a giant jumble of events and causal connections, and representing it as it actually happened would be completely unintelligible and philosophically impossible, without recreating the universe from scratch. Historians look at evidence of the past and then impose an order, a pattern, in reconstructing the events to create their own unique historiographic narratives.

The narratives historians write are inescapably linked to their own historical baggage. Historians are biased and imperfect, and they all read history through the filter of themselves. Historiographic reconstructions, then, are as much windows into the historians themselves as they are windows into the past. The narrativist turn in historiography did a lot to situate the historian herself as a primary figure in her narrative, and it became widely accepted that instead of getting closer to some ground truth of history, historians were in the business of building consistent and legible narratives, their own readings of the past, so long as they were also consistent with the evidence. Those narratives became king, both epistemologically and in practice; historical knowledge is narrative knowledge.

Because narrative knowledge is a knowledge derived from lived experience – the historian sees the past in his own unique light – this emphasized the importance of the individual in historiographic research. Because historians neither could (nor by and large were) attempting to reach an objective ground-truth about the past, any claim to knowledge rested in the lone historian and how he read the past and how he presented his narrative. What resulted was a (fairly justified, given their conceptualization of historiographic knowledge) fetishization of the individual, the autonomous historian.

When multiple authors write a historiographic narrative, something almost ineffable is lost: the individual perspective which drives the narrative argument, part of the essential claim-to-knowledge. In a recent discussion with Ben Schmidt about autonomous humanities work vs. collaboration (the original post; my post; Ben’s reply), Ben pointed out “all the Stanley Fishes out there have reason to be discomfited that DHers revel so much in doing away with not only the printed monograph, traditional peer review, and close reading, but also the very institution of autonomous, individual scholarship. Erasmus could have read the biblical translations out there or hired a translator, but he went out and learned Greek [emphasis added].” I think a large part of that drive for autonomy (beyond the same institutional that’s-how-we’ve-always-done-it inertia that lone natural scientists felt at the turn of the last century) is the situatedness-as-a-way-of-knowing that imbues historiographic research, and humanistic research in general.

I’m inclined to believe that historians need to move away from an almost purely narrative epistemology; keeping in sight that all historiographic knowledge is individually constructed, remaining aware that our overarching cultural knowledge of the past is socio-technically constructed, but not letting that cripple our efforts at coordinating research, at reaching for some larger consilience with the other historical research programs out there, like paleontology and cosmology and geology. Computational methodologies will pave the way for collaborative research both because they allow it, and because they require it.

Collaboratively Constructing Paris

This is a map of Paris.

Map of Paris with dots representing photos taken and posted on Flickr. Red dots are pictures taken by tourists, blue are by locals, and yellow are unknown. via Eric Fischer.

On top of this map of Paris are red, blue, and yellow dots. The red dots are the locations of pictures taken and posted to Flickr by tourists to Paris; blue dots are where locals took pictures; yellow dots are unknown. The resulting image maps and differentiates touristic and local space by popularity, at least among Flickr users. It is a representation that would have been staggeringly difficult for an outsider to create without this sort of data-analytic approach, and yet someone intimately familiar with the city could look at this map and not be surprised. Perhaps they could even recreate it themselves.

What is this knowledge of Paris? It’s surely not a subjective representation of the city, not unless we stretch the limits of the word beyond the internally experienced and toward the collective. Neither is it an objective map 4 of the city, external to the whims of the people milling about within. The map represents an aggregate of individual experiences, a kind of hazy middle ground within the usual boundaries we draw between subjective and objective truth. This is an epistemological and ontological problem I’ve been wondering about for some time, without being able to come up with a good word for it until a conversation with a colleague last year.

“This is my problem,” I told Charles van den Heuvel, explaining my difficulties in placing these maps and other similar projects on the -jectivity scale. “They’re not quite intersubjective, not in the way the word is usually used,” I said, and Charles just looked at me like I was missing something excruciatingly obvious. “What is it when a group of people believe or do or think common things in aggregate?”—Charles asked—”isn’t that just called culture?” I initially disagreed, but mostly because it was so obvious that I couldn’t believe I’d just passed it over entirely.

In 1976, the infamous Stanley Milgram and co-author Denise Jodelet 5 responded to Durkheim’s emphasis on “the objectivity of social facts” by suggesting “that we understand things from the actor’s point of view.” To drive this point home, Milgram decides to map Paris. People “have a map of the city [they live in] in their minds,” Milgram suggests, and their individual memories and attitudes flavor those internal representations.

This is a good example to use, for Milgram, because cities themselves are socially constructed entities; what is a city without its people who live in and build it? Milgram goes on to suggest that people’s internal representations of cities are similarly socially constructed, that “such representations are themselves the products of social interaction with the physical environment.” In the ensuing study, Milgram asks 218 subjects to draw a non-tourist map of Paris as it seems to them, including whatever features they feel relevant. “Through selection, emphasis and distortion, the maps became projections of life styles.”

Milgram then compares all the maps together, seeking what unifies them: first and foremost, the city limits and the Seine. The river is distorted in a very particular way in nearly all maps, bypassing two districts entirely and suggesting they are of little importance to those who drew the maps. The center of the city, Notre Dame and the Île de la Cité, also remains constant. Milgram opposes this to a city like New York, the subject of a later similar study, whose center shifts slowly northward as the years roll by. Many who drew maps of either New York or Paris included elements they were not intimately familiar with, but they knew were socially popular, or were frequent spots of those in their social circles. Milgram concludes “the social representations of the city are more than disembodied maps; they are mechanisms whereby the bricks, streets, and physical geography of a place are endowed with social meaning.”

It’s worth posting a large chunk of Milgram’s earlier article on the matter:

A city is a social fact. We would all agree to that. But we need to add an important corollary: the perception of a city is also a social fact, and as such needs to be studied in its collective as well as its individual aspect. It is not only what exists but what is highlighted by the community that acquires salience in the mind of the person. A city is as much a collective representation as it is an assemblage of streets, squares, and buildings. We discern the major ingredients of that representation by studying not only the mental map in a specific individual, but by seeing what is shared among individuals.

Collaboratively Constructing History

Which brings us back to the past. 6 Can collaborating historians create legitimate narratives if they are not well-founded in personal experience? What sort of historical knowledge is collective historical knowledge? To this question, I turn to blogger Alice Bell, who wrote a delightfully short post discussing the social construction of science.  She writes about scientific knowledge, simply, “Saying science is a social construction does not amount to saying science is make believe.” Alice compares knowledge not to a city, like Paris, but to a building like St. Paul’s Cathedral or a scientific compound like CERN; socially constructed, but physically there. Real. Scientific ideas are part of a similar complex.

The social construction of historiographic narratives is painfully clear even without co-authorships, in our endless circles of acknowledgements and references. Still, there seems to be a good deal of push-back against explicit collaboration, where the entire academic edifice no longer lies solely in the mind of one historian (if indeed it ever had). In some cases, this push-back is against the epistemological infrastructure that requires the person in personal narrative. In others, it is because without full knowledge of each of the moving parts in a work of scholarship, that work is more prone to failure due to theories or methodologies not adequately aligning.

Building historiography together. via the Smithsonian.

I fear this is a dangerous viewpoint, one that will likely harm both our historiographic research and our cultural relevancy, as other areas of academia become more comfortable with large-scale collaboration. Single authorship for its own sake is as dangerous as collaboration for its own sake, but it has the advantage of being a tradition. We must become comfortable with the hazy middle ground between an unattainable absolute objectivity and an unscalable personal subjectivity, willing to learn how to construct our knowledge as Parisians construct their city. The individual experiences of Parisians are without a doubt interesting and poignant, but it is the combined experiences of the locals and the tourists that makes the city what it is. Moving beyond the small and individual isn’t just getting past the rut of microhistories that historiography is still trying to escape—it is also getting past the rut of individually driven narratives and toward unified collective historiographies. We have to work together.




  1. Tucker, Aviezer. 2004. Our Knowledge of the Past: A Philosophy of Historiography. Cambridge University Press.
  2. Tucker, Aviezer, ed. 2009. A Companion to the Philosophy of History and Historiography.
  3. Kuukkanen, Jouni-Matti. 2012. “The Missing Narrativist Turn in the Histiography of Science.” History and Theory 51 (3): 340–363. doi:10.1111/j.1468-2303.2012.00632.x.
  4. of anything besides the geolocations of Flickr pictures, in and of itself not particularly interesting
  5. Milgram, Stanley. 1976. “Pyschological Maps of Paris.” In Environmental Psychology: People and Their Physical Settings, ed. Proshansky, Ittelson, and Rivlin, 104–124. New York.
    Milgram, Stanley. 1982. “Cities as Social Representations.” In Social Representations, ed. R. Farr and S. Moscovici, 289–309.
  6. As opposed to bringing us Back to the Future, which would probably be more fun.

Predicting victors in an attention and feedback economy

This post is about computer models and how they relate to historical research, even though it might not seem like it at first. Or at second. Or third. But I encourage anyone who likes history and models to stick with it, because it gets to a distinction of model use that isn’t made frequently enough.

Music in a vacuum

Imagine yourself uninfluenced by the tastes of others: your friends, their friends, and everyone else. It’s an effort in absurdity, but try it, if only to pin down how their interests affect yours. Start with something simple, like music. If you want to find music you liked, you might devise a program that downloads random songs from the internet and plays them back without revealing their genre or other relevant metadata, so you can select from that group to get an unbiased sample of songs you like. It’s a good first step, given that you generally find music by word-of-mouth, seeing your friends’ playlists, listening to what your local radio host thinks is good, and so forth. The music that hits your radar is determined by your social and technological environment, so the best way to break free from this stifling musical determinism is complete randomization.

So you listen to the songs for a while and rank them as best you can by quality, the best songs (Stairway to Heaven, Shine On You Crazy Diamond, I Need A Dollar) at the very top and the worst (Ice Ice Baby, Can’t Touch This, that Korean song that’s been all over the internet recently) down at the bottom of the list. You realize that your list may not be a necessarily objective measurement of quality, but it definitely represents a hierarchy of quality to you, which is real enough, and you’re sure if your best friends from primary school tried the same exercise they’d come up with a fairly comparable order.

Friends don’t let friends share music. via.

Of course, the fact that your best friends would come up with a similar list (but school buddies today or a hundred years ago wouldn’t) reveals another social aspect of musical tastes; there is no ground truth of objectively good or bad music. Musical tastes are (largely) socially constructed 1, which isn’t to say that there isn’t any real difference between good and bad music, it’s just that the evaluative criteria (what aspects of the music are important and definitions of ‘good’ and ‘bad’) are continuously being defined and redefined by your social environment. Alice Bell wrote the best short explanation I’ve read in a while on how something can be both real and socially constructed.

There you have it: other people influence what songs we listen to out of the set of good music that’s been recorded, and other people influence our criteria for defining good and bad music to begin with. This little thought experiment goes a surprisingly long way in explaining why computational models are pretty bad at predicting Nobel laureates, best-selling authors, box office winners, pop stars, and so forth. Each category is ostensibly a mark of quality, but is really more like a game of musical chairs masquerading as a meritocracy. 2

Sure, you (usually) need to pass a certain threshold of quality to enter the game, but once you’re there, whether or not you win is anybody’s guess. Winning is a game of chance with your generally equally-qualified peers competing for the same limited resource: membership in the elite. Merton (1968) compared this phenomenon to the French Academy’s “Forty-First Chair,” because while the Academy was limited to only forty members (‘chairs’), there were many more who were also worthy of a seat but didn’t get one when the music stopped: Descartes, Diderot, Pascal, Proust, and others. It was almost literally a game of musical chairs between great thinkers, much in the same way it is today in so many other elite groups.

Musical Chair. via.

Merton’s same 1968 paper described the mechanism that tends to pick the winners and losers, which he called the ‘Matthew Effect,’ but is also known as ‘Preferential Attachment,’ ‘Rich-Get-Richer,’ and all sorts of other names besides. The idea is that you need money to make money, and the more you’ve got the more you’ll get. In the music world, this manifests when a garage band gets a lucky break on some local radio station, which leads to their being heard by a big record label company who releases the band nationally, where they’re heard by even more people who tell their friends, who in turn tell their friends, and so on and so on until the record company gets rich, the band hits the top 40 charts, and the musicians find themselves desperate for a fix and asking for only blue skittles in their show riders. Okay, maybe they don’t all turn out that way, but if it sounds like a slippery slope it’s because it is one. In complex systems science, this is an example of a positive feedback loop, where what happens in the future is reliant upon and tends to compound what happens just before it. If you get a little fame, you’re more likely to get more, and with that you’re more likely to get even more, and so on until Lady Gaga and Mick Jagger.

Rishidev Chaudhuri does a great job explaining this with bunnies, showing that if 10% of rabbits reproduce a year, starting with a hundred, in a year there’d be 110, in two there’d be 121, in twenty-five there’d be a thousand, and in a hundred years there’d be over a million rabbits. Feedback systems (so-named because the past results feed back on themselves to the future) multiply rather than add, with effects increasing exponentially quickly. When books or articles are read, each new citation increases its chances of being read and cited again, until a few scholarly publications end up with thousands or hundreds of thousands of citations when most have only a handful.

This effect holds true in Nobel prize-winning science, box office hits, music stars, and many other areas where it is hard to discern between popularity and quality, and the former tends to compound while exponentially increasing the perception of the latter. It’s why a group of musicians who are every bit as skilled as Pink Floyd wind up never selling outside their own city if they don’t get a lucky break, and why two equally impressive books might have such disproportionate citations. Add to that the limited quantity of ‘elite seats’ (Merton’s 40 chairs) and you get a situation where only a fraction of the deserving get the rewards, and sometimes the most deserving go unnoticed entirely.

Different musical worlds

But I promised to talk  about computational models, contingency, and sensitivity to initial conditions, and I’ve covered none of that so far. And before I get to it, I’d like to talk about music a bit more, this time somewhat more empirically. Salganik, Dodds, and Watts (2006; 10.1126/science.1121066) recently performed a study on about 15,000 individuals that mapped pretty closely to the social aspects of musical taste I described above. They bring up some literature suggesting popularity doesn’t directly and deterministically map on to musical proficiency; instead, while quality does play a role, much of the deciding force behind who gets fame is a stochastic (random) process driven by social interactivity. Unfortunately, because history only happened once, there’s no reliable way to replay time to see if the same musicians would reach fame the second time around.

Remember Napster? via.

Luckily Salganik, Dodds, and Watts are pretty clever, so they figured out how to make history happen a few times. They designed a music streaming site for teens which, unbeknownst to the teens but knownst to us, was not actually the same website for everyone who visited. The site asked users to listen to previously unknown songs and rate them, and then gave them an option to download the music.  Some users who went to the site were only given these options, and the music was presented to them in no particular order; this was the control group. Other users, however, were presented with a different view. Besides the control group, there were eight other versions of the site that were each identical at the outset, but could change depending on the actions of its members. Users were randomly assigned to reside in one of these eight ‘worlds,’ which they would come back to every time they logged in, and each of these worlds presented a list of most downloaded songs within that world. That is, Betty listened to a song in world 3, rated it five stars, and downloaded it. Everyone in world 3 would now see that the song had been downloaded once, and if other users downloaded it within that world, the download count would iterate up as expected.

The ratings assigned to each song in the control world, where download counts were not visible, were taken to be the independent measure of quality of each song. As expected, in the eight social influence worlds the most popular songs were downloaded a lot more than the most popular songs in the control world, because of the positive feedback effect of people seeing highly downloaded songs and then listening to and downloading them as well, which in turn increased their popularity even more. It should also come as no surprise that the ‘best’ songs, according to their rating in the independent world, rarely did badly in their download/rating counts in the social worlds, and the ‘worst’ songs under the same criteria rarely did well in the social worlds, but the top songs differed from one social world to the next, with the hugely popular hits with orders of magnitude more downloads being completely different in each social world. Their study concludes

We conjecture, therefore, that experts fail to predict success not because they are incompetent judges or misinformed about the preferences of others, but because when individual decisions are subject to social influence, markets do not simply aggregate pre-existing individual preferences. In such a world, there are inherent limits on the predictability of outcomes, irrespective of how much skill or information one has.

Contingency and sensitivity to initial conditions

In the complex systems terminology, the above is an example of a system that is highly sensitive to initial conditions and contingent (chance) events. It’s similar to that popular chaos theory claim that a butterfly flapping its wings in China can cause a hurricane years later over Florida. It’s not that one inevitably leads to the other; rather, positive feedback loops make it so that very small changes can quickly become huge causal factors in the system as their effects exponentially increase. The nearly-arbitrary decision for a famous author to cite one paper on computational linguistics over another equally qualified might be the impetus the first paper needs to shoot into its own stardom. The first songs randomly picked and downloaded in each social world of the above music sharing site greatly influenced the eventual winners of the popularity contest disguised as a quality rank.

Some systems are fairly inevitable in their outcomes. If you drop a two-ton stone from five hundred feet, it’s pretty easy to predict where it’ll fall, regardless of butterflies flapping their wings in China or birds or branches or really anything else that might get in the way. The weight and density of the stone are overriding causal forces that pretty much cancel out the little jitters that push it one direction or another. Not so with a leaf; dropped from the same height, we can probably predict it won’t float into space, or fall somewhere a few thousand miles away, but barring that prediction is really hard because the system is so sensitive to contingent events and initial conditions.

There does exist, however, a set of systems right at the sweet spot between those two extremes; stochastic enough that predicting exactly how it will turn out is impossible, but ordered enough that useful predictions and explanations can still be made. Thankfully for us, a lot of human activity falls in this class.

Tracking Hurricane Ike with models. Notice how short-term predictions are pretty accurate. (Click image watch this model animated). via.

Nate Silver, the expert behind the political prediction blog fivethirtyeight, published a book a few weeks ago called The Signal and the Noise: why so many predictions fail – but some don’t. Silver has an excellent track record of accurately predicting what large groups of people will do, although I bring him up here to discuss what his new book has to say about the weather. Weather predictions, according to Silver, are “highly vulnerable to inaccuracies in our data.” We understand physics and meteorology well enough that, if we had a powerful enough computer and precise data on environmental conditions all over the world, we could predict the weather with astounding precision. And indeed we do; the National Hurricane Center has become 350% more accurate in the last 25 years alone, giving people two or three day warnings for fairly exact locations with regard to storms. However, our data aren’t perfect, and slightly inaccurate or imprecise measurements abound. These small imprecisions can have huge repercussions in weather prediction models, with a few false measurements sometimes being enough to predict a storm tens or hundreds of miles off course.

To account for this, meteorologists introduce stochasticity into the models themselves. They run the same models tens, hundreds, or thousands of times, but each time they change the data slightly, accounting for where their measurements might be wrong. Run the model once pretending the wind was measured at one particular speed in one particular direction; run the model again with the wind at a slightly different speed and direction. Do this enough times, and you wind up with a multitude of predictions guessing the storm will go in different directions. “These small changes, introduced intentionally in order to represent the inherent uncertainty in the quality of the observational data, turn the deterministic forecast into a probabilistic one.” The most extreme predictions show the furthest a hurricane is likely to travel, but if most runs of the model have the hurricane staying within some small path, it’s a good bet that this is the path the storm will travel.

Silver uses a similar technique when predicting American elections. Various polls show different results from different places, so his models take this into account by running many times and then revealing the spread of possible outcomes; those outcomes which reveal themselves most often might be considered the most likely, but Silver also is careful to use the rest of the outcomes to show the uncertainty in his models and the spread of other plausible occurrences.

Going back to the music sharing site, while the sensitivity of the system would prevent us from exactly predicting the most-popular hits, the musical evaluations of the control world still give us a powerful predictive capacity. We can use those rankings to predict the set of most likely candidates to become hits in each of the worlds, and if we’re careful, all or most of the most-downloaded songs will have appeared in our list of possible candidates.

The payoff: simulating history

Simulating the plague in 19th century Canada. via.

So what do hurricanes, elections, and musical hits have to do with computer models and the humanities, specifically history? The fact of the matter is that a lot of models are abject failures when it comes to their intended use: predicting winners and losers. The best we can do in moderately sensitive systems that have difficult-to-predict positive feedback loops and limited winner space (the French Academy, Nobel laureates, etc.) is to find a large set of possible winners. We might be able to reduce that set so it has fairly accurate recall and moderate precision (out of a thousand candidates to win 10 awards, we can pick 50, and 9 out of the 10 actual winners was in our list of 50). This might not be great betting odds, but it opens the door for a type of history research that’s generally been consigned to the distant and somewhat distasteful realm of speculation. It is closely related to the (too-often scorned) realm of counterfactual history (What if the Battle of Gettysburg had been won by the other side? What if Hitler had never been born?), and is in fact driven by the ability to ask counterfactual questions.

The type of historiography of which I speak is the question of evolution vs. revolution; is history driven by individual, world-changing events and Great People, or is the steady flow of history predetermined, marching inevitably in some direction with the players just replaceable cogs in the machine? The dichotomy is certainly a false one, but it’s one that has bubbled underneath a great many historiographic debates for some time now. The beauty of historical stochastic models 3 is exactly their propensity to yield likely and unlikely paths, like the examples above. A well-modeled historical simulation 4 can be run many times; if only one or a few runs of the model reveal what we take as the historical past, then it’s likely that set of events was more akin to the ‘revolutionary’ take on historical changes. If the simulation takes the same course every time, regardless of the little jitters in preconditions, contingent occurrences, and exogenous events, then that bit of historical narrative is likely much closer to what we take as ‘inevitable.’

Models have many uses, and though many human systems might not be terribly amenable to predictive modeling, it doesn’t mean there aren’t many other useful questions a model can help us answer. The balance between inevitability and contingency, evolution and revolution, is just one facet of history that computational models might help us explore.


  1. Music has a biological aspect as well. Most cultures with music tend towards discrete pitches, discernible (discrete) rhythm, ‘octave’-type systems with relatively few notes looping back around, and so forth. This suggests we’re hard-wired to appreciate music within a certain set of constraints, much in the same way we’re hard-wired to see only certain wavelengths of light or to like the taste of certain foods over others (Peretz 2006; doi:10.1016/j.cognition.2005.11.004). These tendencies can certainly be overcome, but to suggest the pre-defined structure of our wet thought-machine plays no role in our musical preferences is about as far-fetched as suggesting it plays the only role.
  2. I must thank Miriam Posner for this wonderful turn of phrase.
  3. presuming the historical data and model specifications are even accurate, which is a whole different can of worms to be opened in a later post
  4. Seriously, see the last note, this is really hard to do. Maybe impossible. But this argument is just assuming it isn’t, for now.

More heavy-handed culturomics

A few days ago, Gao, Hu, Mao, and Perc posted a preprint of their forthcoming article comparing social and natural phenomena. The authors, apparently all engineers and physicists, use the google ngrams data to come to the conclusion that “social and natural phenomena are governed by fundamentally different processes.” The take-home message is that words describing natural phenomena increase in frequency at regular, predictable rates, whereas the use of certain socially-oriented words change in unpredictable ways. Unfortunately, the paper doesn’t necessarily differentiate between words and what they describe.

Specifically, the authors invoke random fractal theory (sort of a descendant of chaos theory) to find regular patterns in 1-grams. A 1-gram is just a single word, and this study looks at how the frequency of certain words grow or shrink over time. A “hurst parameter” is found for 24 words, a dozen pertaining to nature (earthquake, fire, etc.), and another dozen “social” words (war, unemployment, etc.). The hurst parameter (H) is a number which, essentially, reveals whether or not a time series of data is correlated with itself. That is, given a set of observations over the last hundred years, autocorrelated data means the observation for this year will very likely follow a predictable trend from the past.

If H is between 0.5 and 1, that means the dataset has “long-term positive correlation,” which is roughly equivalent to saying that data quite some time in the past will still positively and noticeably effect data today. If H is under 0.5, data are negatively correlated with their past, suggesting that a high value in the past implies a low value in the future, and if H = 0.5, the data likely describe Brownian motion (they are random). H can exceed 1 as well, a point which I’ll get to momentarily.

The authors first looked at the frequency of 12 words describing natural phenomena between 1770 and 2007. In each case, H was between 0.5 and 1, suggesting a long-term positive trend in the use of the terms. That is, the use of the term “earthquake” does not fluctuate terribly wildly from year to year; looking at how frequently it was used in the past can reasonably predict how frequently it will be used in the future. The data have a long “memory.”

Natural 1-grams from Gao et al. (2012)

The paper then analyzed 12 words describing social phenomena, with very different results. According to the authors, “social phenomena, apart from rare exceptions, cannot be classified solely as processes with persistent-long range correlations.” For example, the use of the word “war” bursts around World War I and World War II; these are unpredictable moments in the discussion of social phenomena. The way “war” was used in the past was not a good predictor of how “war” would be used around 1915 and 1940, for obvious reasons.

Social 1-grams from Gao et al. (2012)

You may notice that, for many of the social terms, H is actually greater than 1, “which indicates that social phenomena are most likely to be either nonstationary, on-off intermittent, or Levy walk-like process.” Basically, the H parameter alone is not sufficient to describe what’s going on with the data. Nonstationary processes are, essentially, unpredictable. A stationary process can be random, but at least certain statistical properties of that randomness remain persistent. Nonstationary processes don’t have those persistent statistical properties. The authors point out that not all social phenomena will have H >1, citing famine, because it might relate to natural phenomena. They also point out that “the more the social phenomena can be considered recent (unemployment, recession, democracy), the higher their Hurst parameter is likely to be.”

In sum, they found that “The prevalence of long-term memory in natural phenomena [compels them] to conjecture that the long-range correlations in the usage frequency of the corresponding terms is predominantly driven by occurrences in nature of those phenomena,” whereas “it is clear that all these processes [describing social phenomena] are fundamentally different from those describing natural phenomena.” That the social phenomena follow different laws is not unexpected, they say, because they themselves are more complex; they rely on political, economic, and social forces, as well as natural phenomena.

While this paper is exceptionally interesting, and shows a very clever use of fairly basic data (24 one-dimensional variables, just looking at word use per year), it lacks the same sort of nuance also lacking in the original culturomics paper. Namely, in this case, it lacks the awareness that social and natural phenomena are not directly coupled with the words used to describe them, nor the frequency with which those words are used. The paper suggests that natural and social phenomena are governed by different scaling laws when, realistically, it is the way they are discussed, and how those discussions are published which are governed by the varying scaling laws. Further, although they used words exemplifying the difference between “nature” and “society,” the two are not always so easily disentangled, either in language or the underlying phenomena.

Perhaps the sort of words used to describe social events change differently than the sort used to describe natural events. Perhaps, because natural phenomena are often immediately felt across vast distances, whereas news of social phenomena can take some time to diffuse, how rapidly some words are discussed may take very different forms. Discussions and word-usage are always embedded in a larger network. Also needing to be taken into account is who is discussing social vs. natural phenomena, and which is more likely to get published and preserved to eventually be scanned by Google Books.

Without a doubt the authors have noticed a very interesting trend, but rather than matching the phenomena directly to word, as they did, we should be using this sort of study to look at how language changes, how people change, and ultimately what relationship people have with the things they discuss and publish. At this point, the engineers and physicists still have a greater comfort with the statistical tools needed to fully utilize the google books corpus, but there are some humanists out there already doing absolutely fantastic quantitative work with similar data.

This paper, while impressive, is further proof that the quantitative study of culture should not be left to those with (apparently) little background in the subject. While it is not unlikely that different factors do, in fact, determine the course of natural disasters versus that of human interaction, this paper does not convincingly tease those apart. It may very well be that the language use is indicative of differences in underlying factors in the phenomena described, however no study is cited suggesting this to be the case. Claims like “social and natural phenomena are governed by fundamentally different processes,” given the above language data, could easily have been avoided, I think, with a short discussion between the authors and a humanist.