Culturomics 2: The Search for More Money

“God willing, we’ll all meet again in Spaceballs 2: The Search for More Money.” -Mel Brooks, Spaceballs, 1987

A long time ago in a galaxy far, far away (2012 CE, Indiana), I wrote a few blog posts explaining that, when writing history, it might be good to talk to historians (1,2,3). They were popular posts for the Irregular, and inspired by Mel Brooks’ recent interest in making Spaceballs 2,  I figured it was time for a sequel of my own. You know, for all the money this blog pulls in. 1


Two teams recently published very similar articles, attempting cultural comparison via a study of historical figures in different-language editions of Wikipedia. The first, by Gloor et al., is for a conference next week in Japan, and frames itself as cultural anthropology through the study of leadership networks. The second, by Eom et al. and just published in PLoS ONE, explores cross-cultural influence through historical figures who span different language editions of Wikipedia.

Before reading the reviews, keep in mind I’m not commenting on method or scientific contribution—just historical soundness. This often doesn’t align with the original authors’ intents, which is fine. My argument isn’t that these pieces fail at their goals (science is, after all, iterative), but that they would be markedly improved by adhering to the same standards of historical rigor as they adhere to in their home disciplines, which they could accomplish easily by collaborating with a historian.

The road goes both ways. If historians don’t want physicists and statisticians bulldozing through history, we ought to be open to collaborating with those who don’t have a firm grasp on modern historiography, but who nevertheless have passion, interest, and complementary skills. If the point is understanding people better, by whatever means relevant, we need to do it together.

Cultural Anthropology

“Cultural Anthropology Through the Lens of Wikipedia – A Comparison of Historical Leadership Networks in the English, Chinese, Japanese and German Wikipedia” by Gloor et al. analyzes “the historical networks of the World’s leaders since the beginning of written history, comparing them in the four different Wikipedias.”

Their method is simple (simple isn’t bad!): take each “people page” in Wikipedia, and create a network of people based on who else is linked within that page. For example, if Wikipedia’s article on Mozart links to Beethoven, a connection is drawn between them. Connections are only drawn between people whose lives overlap; for example, the Mozart (1756-1791) Wikipedia page also links to Chopin (1810-1849), but because they did not live concurrently, no connection is drawn.

Figure 1 from
Figure 1 from Gloor et al

A separate network is created for four different language editions of Wikipedia (English, Chinese, Japanese, German), because biographies in each edition are rarely exact translations, and often different people will be prominent within the same biography across all four languages. PageRank was calculated for all the people in the resulting networks, to get a sense of who the most central figures are according to the Wikipedia link structure.

“Who are the most important people of all times?” the authors ask, to which their data provides them an answer. 2 In China and Japan, they show, only warriors and politicians make the cut, whereas religious leaders, artists, and scientists made more of a mark on Germany and the English-speaking world. Historians and biographers wind up central too, given how often their names appear on the pages of famous contemporaries on whom they wrote.

Diversity is also a marked difference: 80% of the “top 50” people for the English Wikipedia were themselves non-English, whereas only 4% of the top people from the Chinese Wikipedia are not Chinese. The authors conclude that “probing the historical perspective of many different language-specific Wikipedias gives an X-ray view deep into the historical foundations of cultural understanding of different countries.”

Figure 3
Figure 3 from Gloor et al

Small quibbles aside (e.g. their data include the year 0 BC, which doesn’t exist), the big issue here is the ease with which they claim these are the “most important” actors in history, and that these datasets provides an “X-ray” into the language cultures that produced them. This betrays the same naïve assumptions that plague much of culturomics research: that you can uncritically analyze convenient datasets as a proxy for analyzing larger cultural trends.

You can in fact analyze convenient datasets as a proxy for larger cultural trends, you just need some cultural awareness and a critical perspective.

In this case, several layers of assumptions are open for questioning, including:

  • Is the PageRank algorithm a good proxy for historical importance? (The answer turns out to be yes in some situations, but probably not this one.)
  • Is the link structure in Wikipedia a good proxy for historical dependency? (No, although it’s probably a decent proxy for current cultural popularity of historical figures, which would have been a better framing for this article. Better yet, these data can be used to explore the many well-known and unknown biases that pervade Wikipedia.)
  • Can differences across language editions of Wikipedia be explained by any factors besides cultural differences? (Yes. For example, editors of the German-language Wikipedia may be less likely to write a German biography if one already exists in English, given that ≈64% of Germany speaks English.)

These and other questions, unexplored in the article, make it difficult to take at face value that this study can reveal important historical actors or compare cultural norms of importance. Which is a shame, because simple datasets and approaches like this one can produce culturally and scientifically valid results that wind up being incredibly important. And the scholars working on the project are top-notch, it’s just that they don’t have all the necessary domain expertise to explore their data and questions.

Cultural Interactions

The great thing about PLoS is the quality control on its publications: there isn’t much. As long as primary research is presented, the methods are sound, the data are open, and the experiment is well-documented, you’re in.

It’s a great model: all reasonable work by reasonable people is published, and history decides whether an article is worthy of merit. Contrast this against the current model, where (let’s face it) everything gets published eventually anyway, it’s just a question of how many journal submissions and rounds of peer review you’re willing to sit through. Research sits for years waiting to be published, subject to the whims of random reviewers and editors who may hold long grudges, when it could be out there the minute it’s done, open to critique and improvement, and available to anyone to draw inspiration or to learn from someone’s mistakes.

“Interactions of Cultures and Top People of Wikipedia from Ranking of 24 Language Editions” by Eom et al. is a perfect example of this model. Do I consider it a paragon of cultural research? Obviously not, if I’m reviewing it here. Am I happy the authors published it, respectful of their attempt, and willing to use it to push forward our mutual goal of soundly-researched cultural understanding? Absolutely.

Eom et al.’s piece, similar to that of Gloor et al. above, uses links between Wikipedia people pages to rank historical figures and to make cultural comparisons. The article explores 24 different language editions of Wikipedia, and goes one step further, using the data to explore intercultural influence. Importantly, given that this is a journal-length article and not a paper from a conference proceeding like Gloor et al.’s, extra space and thought was clearly put into the cultural biases of Wikipedia across languages. That said, neither of the articles reviewed here include any authors who identify themselves as historians or cultural experts.

This study collected data a bit differently from the last. Instead of a network connecting only those people whose lives overlapped, this network connected all pages within a single-language edition of Wikipedia, based only on links between articles. 3 They then ranked pages using a number of metrics, including but not limited to PageRank, and only then automatically extracted people to find who was the most prominent in each dataset.

In short, every Wikipedia article is linked in a network and ranked, after which all articles are culled except those about people. The authors explain: “On the basis of this data set we analyze spatial, temporal, and gender skewness in Wikipedia by analyzing birth place, birth date, and gender of the top ranked historical figures in Wikipedia.” By birth place, they mean the country currently occupying the location where a historical figure was born, such that Aristophanes, born in Byzantium 2,300 years ago, is considered Turkish for the purpose of this dataset. The authors note this can lead to cultural misattributions ≈3.5% of the time (e.g. Kant is categorized as Russian, having been born in a city now in Russian territory). They do not, however, call attention to the mutability of culture over time.

Table 2 from Eom et al.
Table 2 from Eom et al.

It is unsurprising, though comforting, to note that the fairly different approach to measuring prominence yields many of the same top-10 results as Gloor’s piece: Shakespeare, Napoleon, Bush, Jesus, etc.

Analysis of the dataset resulted in several worthy conclusions:

  • Many of the “top” figures across all language editions hail from Western Europe or the U.S.
  • Language editions bias local heroes (half of top figures in Wikipedia English are from the U.S. and U.K.; half of those in Wikipedia Hindi are from India) and regional heroes (Among Wikipedia Korean, many top figures are Chinese).
  • Top figures are distributed throughout time in a pattern you’d expect given global population growth, excepting periods representing foundations of modern cultures (religions, politics, and so forth).
  • The farther you go back in time, the less likely a top figure from a certain edition of Wikipedia is to have been born in that language’s region. That is, modern prominent figures in Wikipedia English are from the U.S. or the U.K., but the earlier you go, the less likely top figures are born in English-speaking regions. (I’d question this a bit, given cultural movement and mutability, but it’s still a result worth noting).
  • Women are consistently underrepresented in every measure and edition. More recent top people are more likely to be women than those from earlier years.
Figure 4 from Eom et al.
Figure 4 from Eom et al.

The article goes on to describe methods and results for tracking cultural influence, but this blog post is already tediously long, so I’ll leave that section out of this review.

There are many methodological limitations to their approach, but the authors are quick to notice and point them out. They mention that Linnaeus ranks so highly because “he laid the foundations for the modern biological naming scheme so that plenty of articles about animals, insects and plants point to the Wikipedia article about him.” This research was clearly approached with a critical eye toward methodology.

Eom et al. do not fare as well historically as methodologically; opportunities to frame claims more carefully, or to ask different sorts of questions, are overlooked. I mentioned earlier that the research assumes historical cultural consistency, but cultural currents intersect languages and geography at odd angles.

The fact that Wikipedia English draws significantly from other locations the earlier you look should come as no surprise. But, it’s unlikely English Wikipedians are simply looking to more historically diverse subjects; rather, the locus of some cultural current (Christianity, mathematics, political philosophy) has likely moved from one geographic region to another. This should be easy to test with their dataset by looking at geographic clustering and spread in any given year. It’d be nice to see them move in that direction next.

I do appreciate that they tried to validate their method by comparing their “top people” to lists other historians have put together. Unfortunately, the only non-Wikipedia-based comparison they make is to a book written by an astrophysicist and white separatist with no historical training: “To assess the alignment of our ranking with previous work by historians, we compare it with [Michael H.] Hart’s list of the top 100 people who, according to him, most influenced human history.”

Top People

Both articles claim that an algorithm analyzing Wikipedia networks can compare cultures and discover the most important historical actors, though neither define what they mean by “important.” The claim rests on the notion that Wikipedia’s grand scale and scope smooths out enough authorial bias that analyses of Wikipedia can inductively lead to discoveries about Culture and History.

And critically approached, that notion is more plausible than historians might admit. These two reviewed articles, however, don’t bring that critique to the table. 4 In truth, the dataset and analysis lets us look through a remarkably clear mirror into the cultures that created Wikipedia, the heroes they make, and the roots to which they feel most connected.

Usefully for historians, there is likely much overlap between history and the picture Wikipedia paints of it, but the nature of that overlap needs to be understood before we can use Wikipedia to aid our understanding of the past. Without that understanding, boldly inductive claims about History and Culture risk reinforcing the same systemic biases which we’ve slowly been trying to fix. I’m absolutely certain the authors don’t believe that only 5% of history’s most important figures were women, but the framing of the articles do nothing to dispel readers of this notion.

Eom et al. themselves admit “[i]t is very difficult to describe history in an objective way,” which I imagine is a sentiment we can all get behind. They may find an easier path forward in the company of some historians.


  1. net income: -$120/year.
  2. If you’re curious, the 10 most important people in the English-speaking world, in order, are George W. Bush, ol’ Willy Shakespeare, Sidney Lee, Jesus, Charles II, Aristotle, Napoleon, Muhammad, Charlemagne, and Plutarch.
  3. Download their data here.
  4. Actually the Eom et al. article does raise useful critiques, but mentioning them without addressing them doesn’t really help matters.

[Review] The Book of Trees, Manuel Lima

The first line on the first page of The Book of Trees is “This is the book I wish had been available when I was researching my previous book, Visual Complexity: Mapping Patterns of Information.” It’s funny, because this is also the book I wish had been available when I was researching my own project, Knowledge Uprooted. It took Alberto Cairo, reading over a draft of my article, to point out that Manuel Lima was working on a book-length version of a very similar project. If the book had come out a year ago, my own research might look very different. Lima’s book is beautifully designed, well-researched, and a delightful resource for anyone interested in visualizations of knowledge.

Tree of Consanguinity, ca. 1450-1510. Page 52.
Tree of Consanguinity, ca. 1450-1510. Page 52.

Lima’s book is a history of hierarchical visualizations, most frequently as trees, and often representing branches of knowledge. He roots his narrative in trees themselves, describing how their symbolism has touched religions and cultures for millennia. The narrative weaves through Ancient Greece and Medieval Europe, makes a few stops outside of the West and winds its way to the present day. Subsequent chapters are divided into types of tree visualizations: figurative, vertical, horizontal, multidirectional, radial, hyperbolic, rectangular, voronoi, circular, sunbursts, and icicles. Each chapter presents a chronological set of beautiful examples embodying that type.

Biblical Genealogy, ca. 1060. Page 112.
Biblical Genealogy, ca. 1060. Page 112.

Of course, any project with such a wide scope is bound to gloss over or inaccurately portray some of its historical content. I’d quibble, for example, with Lima’s suggestion that the use of these visual diagrams could be understood in the context of ars memorativa, a method for improving memory and understanding in the Middle Ages. Instead, I’d argue that the tradition stemmed from a more innate Aristotelian connection between thinking and seeing. Lima also argues that the scala naturae, depictions of entities on a natural order rising to God, is an obvious reflection on contemporary feudal stratification. The story is a bit more complex than that, with feudal stratification itself being concomitant to the medieval worldview of a natural order. In discussing Ramon Llull, Lima oddly writes “the notion of a unified trunk of science has remained to this day,” a claim which Lima himself shows isn’t exactly true in his earlier book, Visual Complexity. But this isn’t a book written by or for historians, and that’s okay—it’s accurate enough to get a good sense of the progression of trees.

The Blog Tree, 2012. Page 77.
The Blog Tree, 2012. Page 77.

Where the book shines is in its clear, well-cited, contextualized illustrations, which comprise the majority of its contents. Over a hundred illustrations pack the book, each with at least a paragraph of description and, in many cases, translation. This is a book for people passionate about visualizations, and interested in their history. There is not yet a book-length treatment for historians interested in this subject, though Murdoch’s Album of Science (1984) comes close. For those who want to delve even deeper into this history, I’ve compiled a 100+ reference bibliography that is freely available here.

Liveblogged Review of Macroanalysis by Matthew L. Jockers, Part 2

I just got Matthew L. Jocker’s Macroanalysis in the mail, and I’m excited enough about it to liveblog my review. Here’s the review of part II (Analysis), chapter 5 (metadata). Read Part 1, Part 3, …

Part II: Analysis

Part II of Macroanalysis moves from framing the discussion to presenting a series of case studies around a theme, starting fairly simply in claims and types of analyses and moving into the complex. This section takes up 130 of the 200 pages; in a discipline (or whatever DH is) which has coasted too long on claims that the proof of its utility will be in the pudding (eventually), it’s refreshing to see a book that is at least 65% pudding. That said, with so much substance – particularly with so much new substance – Jockers opens his arguments up for specific critiques.

Aiming for more pudding-based scholarly capital in DH. via brenthor.
Aiming for more pudding-based scholarly capital in DH. via brenthor.

Quantitative arguments must by their nature be particularly explicit, without the circuitous language humanists might use to sidestep critiques. Elijah Meeks and others have been arguing for some time now that the requirement to solidify an argument in such a way will ultimately be a benefit to the humanities, allowing faster iteration and improvement on theories. In that spirit, for this section, I offer my critiques of Jockers’ mathematical arguments not because I think they are poor quality, but because I think they are particularly good, and further fine-tuning can only improve them. The review will now proceed one chapter at a time.


Jockers begins his analysis exploring what he calls the “lowest hanging fruit of literary history.” Low hanging fruit can be pretty amazing, as Ted Underwood says, and Jockers wields some fairly simple data in impressive ways. The aim of this chapter is to show that powerful insights can be achieved using long-existing collections of library metadata, using a collection of nearly 800 Irish American works over 250 years as a sample dataset for analysis. Jockers introduces and offsets his results against the work of Charles Fanning, whom he describes as the expert in Irish American fiction in aggregate. A pre-DH scholar, Fanning was limited to looking through only the books he had time to read; an impressive many, according to Jockers, but perhaps not enough. He profiles 300 works, fewer than half of those represented in Jockers’ database.

The first claim made in this chapter is one that argues against a primary assumption of Fanning’s. Fanning expends considerable effort explaining why there was a dearth of Irish American literature between 1900-1930; Jockers’ data show this dearth barely existed. Instead, the data suggest, it was only eastern Irish men who had stopped writing. The vacuum did not exist west of the Mississippi, among men or women. Five charts are shown as evidence, one of books published over time, and the other four breaking publication down by gender and location.

Jockers is careful many times to make the point that, with so few data, the results are suggestive rather than conclusive. This, to my mind, is too understated. For the majority of dates in question, the database holds fewer than 6 books per year. When breaking down by gender and location, that number is twice cut in half. Though the explanations of the effects in the graphs are plausible, the likelihood of noise outweighing signal at this granularity is a bit too high to be able to distinguish a just-so story from a credible explanation. Had the data been aggregated in five- or ten-year intervals (as they are in a later figure 5.6), rather than simply averaged across them, the results may have been more credible. The argument may be brought up that, when aggregating across larger intervals, the question of where to break up the data becomes important; however, cutting the data into yearly chunks from January to December is no more arbitrary than cutting them into decades.

There are at least two confounding factors one needs to take into account when doing a temporal analysis like this. The first is that what actually happened in history may be causally contingent, which is to say, there’s no particularly useful causal explanation or historical narrative for a trend. It’s just accidental; the right authors were in the right place at the right time, and all happened to publish books in the same year. Generally speaking, if only around five books are published a year, though sometimes that number is zero and sometimes than number is ten, any trends that we see (say, five years with only a book or two) may credibly be considered due to chance alone, rather than some underlying effect of gender or culture bias.

The second confound is the representativeness of the data sample to some underlying ground truth. Datasets are not necessarily representative of anything, however as defined by Jockers, his dataset ought to be representative of all Irish American literature within a 250 year timespan. That’s his gold standard. The dataset obviously does not represent all books published under this criteria, so the question is how well do his publication numbers match up with the actual numbers he’s interested in. Jockers is in a bit of luck here, because what he’s interested in is whether or not there was a resounding silence among Irish authors; thus, no matter what number his charts show, if they’re more than one or two, it’s enough to disprove Fanning’s hypothesized silence. Any dearth in his data may be accidental; any large publications numbers are not.

This example chart compares a potential "real" underlying publication rate against several simulated potential sample datasets Jockers might have, created by multiplying the "real" dataset by some random number between 0 and 1.
This example chart compares a potential “real” underlying publication rate against several simulated potential sample datasets Jockers might have, created by multiplying the “real” dataset by some random number between 0 and 1.

I created the above graphic to better explain the second confounding factor of problematic samples. The thick black line, we can pretend, is the actual number of books published by Irish American authors between 1900 and 1925. As mentioned, Jockers would only know about a subset of those books, so each of the four dotted lines represents a possible dataset that he could be looking at in his database instead of the real, underlying data. I created these four different dotted lines by just multiplying the underlying real data by a random number between 0 and 1 1. From this chart it should be clear that it would not be possible for him to report an influx of books when there was a dearth (for example, in 1910, no potential sample dataset would show more than two books published). However, if Jockers wanted to make any other claims besides whether or not there was a dearth (as he tentatively does later on), his available data may be entirely misleading. For example, looking at the red line, Run 4, would suggest that ever-more books were being published between 1910 and 1918, when in fact that number should have decreased rapidly after about 1912.

The correction included in Macroanalysis for this potential difficulty was to use 5-year moving averages for the numbers rather than just showing the raw counts. I would suggest that, because the actual numbers are so small and a change of a small handful of books would look like a huge shift on the graph, this method of aggregation is insufficient to represent the uncertainty of the data. Though his charts show moving averages, they still shows small changes year-by-year, which creates a false sense of precision. Jockers’ chart 5.6, which aggregates by decade and does not show these little changes, does a much better job reflecting the uncertainty. Had the data showed hundreds of books per year, the earlier visualizations would have been more justifiable, as small changes would have amounted to less emphasized shifts in the graph.

It’s worth spending extra time on choices of visual representation, because we have not collectively arrived at a good visual language for humanities data, uncertain as they often are. Nor do we have a set of standard practices in place, as quantitative scientists often do, to represent our data. That lack of standard practice is clear in Macroanalysis; the graphs all have subtitles but no titles, which makes immediate reading difficult. Similarly, axis labels (“count” or “5-year average”) are unclear, and should more accurately reflect the data (“books published per year”), putting the aggregation-level in either an axis subtitle or the legend. Some graphs have no axis labels at all (e.g., 5.12-5.17). Their meanings are clear enough to those who read the text, or those familiar with ngram-style analyses, but should be more clear at-a-glance.

Questions of visual representation and certainty aside, Jockers still provides several powerful observations and insights in this chapter. Figure 5.6, which shows Irish American fiction per capita, reveals that westerners published at a much higher relative rate than easterners, which is a trend worth explaining (and Jockers does) that would not have been visible without this sort of quantitative analysis. The chapter goes on to list many other credible assessments and claims in light of the available data, as well as a litany of potential further questions that might be explored with this sort of analysis.  He also makes the important point that, without quantitative analysis, “cherry-picking of evidence in support of a broad hypothesis seems inevitable in the close-reading scholarly traditions.” Jockers does not go so far as to point out the extension of that rule in data analysis; with so many visible correlations in a quantitative study, one could also cherry-pick those which support one’s hypothesis. That said, cherry-picking no longer seems inevitable. Jockers makes the point that Fanning’s dearth thesis was false because his study was anecdotal, an issue Jockers’ dataset did not suffer from. Quantitative evidence, he claims, is not in competition with evidence from close reading; both together will result in a “more accurate picture of our subject.”

The second half of the chapter moves from publication counting to word analysis. Jockers shows, for example, that eastern authors are less likely to use words in book titles that identify their work as ‘Irish’ than western authors, suggesting lower prejudicial pressures west of the Mississippi may be the cause. He then complexifies the analysis further, looking at “lexical diversity” across titles in any given year – that is, a year is more lexically diverse if the titles of books published that year are more unique and dissimilar from one another. Fanning suggests the years of the famine were marked by a lack of imagination in Irish literature; Jockers’ data supports this claim by showing those years had a lower lexical diversity among book titles. Without getting too much into the math, as this review of a single chapter has already gone on too long, it’s worth pointing out that both the number of titles and the average length of titles in a given year can affect the lexical diversity metric. Jockers points this out in a footnote, but there should have been a graph comparing number of titles per year, length per year, and lexical diversity, to let the readers decide whether the first two variables accounted for the third, or whether to trust the graph as evidence for Fanning’s lack-of-imagination thesis.

One of the particularly fantastic qualities about this sort of research is that readers can follow along at home, exploring on their own if they get some idea from what was brought up in the text. For example, Jockers shows that the word ‘century’ in British novel titles is popular leading up to and shortly after the turn of the nineteenth century. Oddly, in the larger corpus of literature (and it seems English language books in general), we can use to see that, rather than losing steam around 1830, use of ‘century’ in most novel titles actually increases until about 1860, before dipping briefly. Moving past titles (and fiction in general) to full text search, google ngrams shows us a small dip around 1810 followed by continued growth of the word ‘century’ in the full text of published books. These different patterns are interesting particularly because they suggest there was something unique about the British novelists’ use of the word ‘century’ that is worth explaining. Oppose this with Jockers’ chart of the word ‘castle’ in British book titles, whose trends actually correspond quite well to the bookworm trend until the end of the chart, around 1830. [edit: Ben Schmidt points out in the comments that bookworm searches full text, not just metadata as I assumed, so this comparison is much less credible.]

Use of the word 'castle' in the metadata of books provided by Compare with figure 5.14. via bookworm.
Use of the word ‘castle’ in the metadata of books provided by Compare with figure 5.14. via bookworm.

Jockers closes the chapter suggesting that factors including gender, geography, and time help determine what authors write about. That this idea is trivial makes it no less powerful within the context of this book: the chapter is framed by the hypothesis that certain factors influence Irish American literature, and then uses quantitative, empirical evidence to support those claims. It was oddly satisfying reading such a straight-forward approach in the humanities. It’s possible, I suppose, to quibble over whether geography determines what’s written about or whether the sort of person who would write about certain things is also the sort of person more likely to go west, but there can be little doubt over the causal direction of the influence of gender. The idea also fits well with the current complex systems approach to understanding the world, which mathematically suggests that environmental and situational constraints (like gender and location) will steer the unfolding of events in one direction or another. It is not a reductionist environmental determinism so much as a set of probabilities, where certain environments or situations make certain outcomes more likely.

Stay tuned for Part the Third!


  1. If this were a more serious study, I’d have multiplied by a more credible pseudo-random value keeping the dataset a bit closer to the source, but this example works fine for explanatory value

Liveblogged Review of Macroanalysis by Matthew L. Jockers, Part 1

I just got Matthew L. Jocker’s Macroanalysis in the mail, and I’m excited enough about it to liveblog my review. Here’s my review of part I (Foundation), all chapters. Read Part 2, Part 3, …

Macroanalysis: Digital Methods & Literary History is a book whose time has come. “Individual creativity,” Matthew L. Jockers writes, “is highly constrained, even determined, by factors outside of what we consider to be a writer’s conscious control.” Although Jockers’ book is a work of impressive creativity, it also fits squarely within a larger set of trends. The scents of ‘Digital Humanities’ (DH) and ‘Big Data’ are in the air, the funding-rich smells attracting predators from all corners, and Jockers’ book floats somewhere in the center of it all. As with many DH projects, Macroanalysis attempts the double goal of explaining a new method and exemplifying the type of insights that can be achieved via this method. Unlike many projects, Jockers succeeds masterfully at both. Macroanalysis introduces its readers to large scale quantitative methods for studying literary history, and through those methods explores the nature of creativity and influence in general and the place of Irish literature within its larger context in particular.

I’ve apparently gained a bit of a reputation for being overly critical, and it’s worth pointing out at the beginning of this review that this trend will continue for Macroanalysis. That said, I am most critical of the things I love the most, and readers who focus on any nits I might pick without reading the book themselves should keep in mind that the overall work is staggering in its quality, and if it does fall short in some small areas, it is offset by the many areas it pushes impressively forward.

Macroanalysis arrives on bookshelves eight years after Franco Moretti’s Graphs, Maps, and Trees (2005), and thirteen years after Moretti’s “Conjectures on World Literature” went to press in early 2000, where he coined the phrase “distant reading.” Moretti’s distant reading is a way of seeing literature en masse, of looking at text at the widest angle and reporting what structures and forms only become visible at this scale. Moretti’s early work paved the way, but as might be expected with monograph published the same year as the initial release of Google Books, lack of available data made it stronger in theory than in computational power.

From Moretti's Graphs, Maps, and Trees
From Moretti’s Graphs, Maps, and Trees

In 2010, Moretti and Jockers, the author of Macroanalysis, co-founded the Stanford Lit Lab for the quantitative and digital research of literature. The two have collaborated extensively,  and Jockers acknowledge’s Moretti’s influence on his monograph. That said, in his book, Jockers distances himself slightly from Moretti’s notion of distant reading, and it is not the first time he has done so. His choice of “analysis” over “reading” is an attempt to show that what his algorithms are doing at this large scale is very different from our normal interpretive process of reading; it is simply gathering and aggregating data, the output of which can eventually be read and interpreted instead of or in addition to the texts themselves. The term macroanalysis was inspired by the difference between macro- and microeconomics, and Jockers does a good job justifying the comparison. Given that Jockers came up with the comparison in 2005, one does wonder if he would have decided on different terminology after our recent financial meltdown and the ensuing large-scale distrust of macroeconomic methods. The quantitative study of history, cliometrics, also had its origins in economics and suffered its own fall from grace decades ago; quantitative history still hasn’t recovered.

Part I: Foundation

I don’t know whether the allusion was intended, but lovers of science fiction and quantitative cultural studies will enjoy the title of Part I: “Foundation.” It shares a name with a series of books by Isaac Asimov, centering around the ability to combine statistics and human-centric research to understand and predict people’s behaviors. Punny titles aside, the section provides the structural base of the monograph.

The story of Foundation in a nutshell. Via c0ders.
The story of Foundation in a nutshell. Via c0ders.

Much of the introductory chapters are provocative statements about the newness of the study at hand, and they are not unwarranted. Still, I can imagine that the regular detractors of technological optimism might argue their usual arguments in response to Jockers’ pronouncements of a ‘revolution.’ The second chapter, on Evidence, raises some particularly important (and timely) points that are sure to raise some hackles. “Close reading is not only impractical as a means of evidence gathering in the digital library, but big data render it totally inappropriate as a method of studying literary history.” Jockers hammers home this point again and again, that now that anecdotal evidence based on ‘representative’ texts is no longer the best means of understanding literature, there’s no reason it should still be considered the gold standard of evidentiary support.

Not coming from a background of literary history or criticism, I do wonder a bit about these notions of representativeness (a point also often brought up by Ted Underwood, Ben Schmidt, and Jockers himself). This is probably something lit-researchers worked out in the 70s, but it strikes me that the questions being asked of a few ‘exemplary, representative texts’ are very different than the ones that ought to be asked of whole corpora of texts. Further, ‘representative’ of what? As this book appears to be aimed not only at traditional literary scholars, it would have been beneficial for Jockers to untangle these myriad difficulties.

One point worth noting is that, although Jockers calls his book Macroanalysis, his approach calls for a mixed method, the combination of the macro/micro, distant/close. The book is very careful and precise in its claims that macroanalysis augments and opens new questions, rather than replaces. It is a combination of both approaches, one informing the other, that leads to new insights. “Today’s student of literature must be adept at reading and gathering evidence from individual texts and equally adept at accessing and mining digital-text repositories.” The balance struck here is impressive: to ignore macroanalysis as a superior source of evidence for many types of large questions would be criminal, but its adoption alone does not make for good research (further, either without the other would be poorly done). For example, macroanalysis can augment close reading approaches by contextualizing a text within its broad historical and cultural moment, showing a researcher precisely where their object of research fits in the larger picture.

Historians would do well to heed this advice, though they are not the target audience. Indeed, historians play a perplexing role in Jockers’ narrative; not because his description is untrue, but because it ought not be true. In describing the digital humanities, Jockers calls it an “ambiguous and amorphous amalgamation of literary formalists, new media theorists, tool builders, coders, and linguists.” What place historians? Jockers places their role earlier, tracing the wide-angle view to the Annales historians and their focus on longue durée history. If historian’s influence ends there, we are surely in a sad state; that light, along with those of cliometrics and quantitative history, shone brightest in the 1970s before a rapid decline. Unsworth recently attributed the decline to the fallout following Time on the cross (Fogel & Engerman, 1974), putting quantitative methods in history “out of business for decades.” The ghost of cliometrics still haunts historians to such an extent that the best research in that area, to this day, comes more from information scientists and applied mathematicians than from historians. Digital humanities may yet exorcise that ghost, but it has not happened yet, as evidenced in part by the glaring void in Jockers’ introductory remarks.

It is with this framing in mind that Jockers embarks on his largely computational and empirical study of influence and landscape in British and American literature.

Google Maps for the Ancient World

The title of this post comes from an oft-quoted passage of my previous one describing ORBIS, a scholarly argument cleverly disguised as a web tool. Of ORBIS I wrote

…given any two cities in the ancient world, it returns the fastest, cheapest, or shortest route between them, given the month, the mode of transportation, and various other options. It’s Google Maps for the ancient world, complete with the “Avoid Highways” feature.

In writing that review, I neglected to mention the many fantastic resources out there that already map the ancient world, including the Digital Atlas of Roman and Medieval Civilization and PLEIADES, a gazetteer and graph of ancient places. The most impressive full-featured online GIS application I’ve seen is called Antiquity À-la-carte, shown below. The classicists have once again proven themselves to be at the bleeding edge of technology. When they keep developing cool toys like these, I sometimes regret being an early modernist. Sometimes.

Antiquity À-la-carte, a GIS application for the ancient western world.

The cool toy I speak of now is of course no toy, but a serious scholarly endeavor which will doubtless set the bar for future online historical maps. In many ways, the Digital Map of the Roman Empire offers less than the sites I listed above. It doesn’t allow you to turn on or off particular layers, and it certainly doesn’t include all of the information the others have. In this case, however, less is more. It’s a really easy to use map of the ancient world, online. That’s it. It doesn’t tell you how to get from point A to point B, it doesn’t allow you to see the location of shipwrecks or the borders of countries at different time periods; it’s just a base map, depicting the Greek and Roman world in its entirety, asking the world to do with it what it will.

Johan Ahlfeldt wrote about his creation:

The aim of my work with Pelagios has been to create a static (non-layered) map of the ancient places in the Pleiades dataset with the capacity to serve as a background layer to online mapping applications of the Ancient World. Because it is based on ancient settlements and uses ancient placenames, our map presents a visualisation more tailored to archaeological and historical research, for which modern mapping interfaces, such as Google Maps, are hardly appropriate; it even includes non-settlement data such as the Roman roads network, some aqueducts and defence walls (limes, city walls). Thus, for example, the tiles can be used as a background layer to display the occurrence of find-spots, archaeological sites, etc., thereby creating new opportunities to put data of these kinds in their historical context.

The ancient base map.

As I wrote last year, accurate base maps are extremely important for contextualizing research. With this underneath, for example, ORBIS could provide a much richer experience of the ancient world. What’s more, the PELAGIOS group has opened up the map with a CC-BY license, allowing anyone to build on it so long as they include proper scholarly attribution. It can be used with Openlayers, Google, and Bing maps, so anybody who already has these systems in place can easily swap out the map tiles with these historical ones. Johan’s post includes examples of all of these implemented.

My posts are usually long and rambling, but I’ll keep this one short and to the point, much like the tool I’m reviewing. This is the first easily mashable base map of the ancient world, and for that it is awesome. Go explore!

ORBIS: The next step in Digital Humanities

Every once in a while, a new project comes around bearing a message loud and clear: this is a sign of things to come. ORBIS, the Stanford Geospatial Network Model of the Roman World, is one such project.

ORBIS was created by Walter Scheidel, Elijah Meeks, and a host of others. At the very beginning, I should point out I am not a classicist. The below review is of the nature rather than the content of ORBIS as a scholarly product.

Roman Travel Network

ORBIS is many things but, most simply, it is an interface allowing researchers to experience the geography of the Roman world from an ancient perspective. The executive summary: given any two cities in the ancient world, it returns the fastest, cheapest, or shortest route between them, given the month, the mode of transportation, and various other options. It’s Google Maps for the ancient world, complete with the “Avoid Highways” feature.

I was among the lucky few to see an early version of the tool, and after sending back an informal review, Elijah Meeks invited me to review the site publicly via my blog. The first section explains what I feel is the most important contribution of ORBIS to the Digital Humanities; it is a reflexive tool that allows the humanist to engage with the process as well as the product. I then highlight some of the cool features, and finally list some rough edges and desiderata for future iterations or similar projects.

Tool As Argument

Beyond being an exceptionally well-made and useful tool, it is not the tool itself which makes ORBIS stand out. Walter Scheidel and Elijah Meeks could have posted the automated map portion of the site by itself, and it would have garnered deserving praise, but they went well beyond that goal; they made a reflexive tool.

ORBIS is among the first digital scholarly tools for the humanities (that I have encountered) that really lives up to the name “digital scholarly tool for the humanities.” Beyond being a simple tool, ORBIS is an explicit and transparent argument, a way of presenting research that also happens to allow, by its very existence, further research to be done. It is a map that allows the user to engage in the process of map-making, and a presentation of a process that allows the user to make and explore in ways the initial creators could not have foreseen. Of course, as with any project there are a few rough edges and desired features, which I’ll get into further down below.

Elevation data to help model the difficulty in getting from one place to another.

Along with the map, the Makers of this project (by which I mean authors, developers, data gatherers, …) present a fairly interactive documentary of the map-making process, including historical accounts, data sources, algorithmic explanations, visual aids, downloadable data, and a forthcoming API. They built an explicit model of the ancient world, taking into account roads and rivers, oceans and coastlines, weather and geographic features, various modes of transportation for civilian and military purposes, and put it all together so any researcher can sit down and figure out how long it would have taken, or how expensive it would have been, to travel between 751 locations in the ancient Roman world. Rather than asking us to trust that their data are accurate, the makers revealed their model – their underlying argument – for critique and extension.

Exploring the Ancient World

The ORBIS model includes 751 sites covering about 4 million square miles of ancient space, including over 50,000 miles of road or desert tracks, nearly 20,000 miles of navigable rivers and canals, and almost 1,000 sea routes between sea ports. As I mentioned earlier, the model works like Google Maps; given two locations, it tells you the cheapest, shortest, or fastest route between them. These calculations take into account the time-of-year and usual weather, elevation changes between sites, fourteen modes of travel (ox cart, foot, army on march, camel caravan, etc.), river travel (including extra difficulty moving upstream), etc.

The ORBIS Interface

Another exciting feature on ORBIS is the distance cartogram. This visualization reveals the impact of travel speed and transport prices on overall connectivity; it allows the researcher to see how far other cities were with respect to a certain core city (for instance Constantinople) from the perspective of cost and travel time rather than mere geographical distance. This feature brings the researcher closer to the actual ancient Roman experience. A larger insight is revealed when taking a “distant reading” approach to the cartogram: “Distance cartograms show that due to massive cost differences between aquatic and terrestrial modes of transport, peripheries were far more remote from the center in terms of price than in terms of time.”

Constantinople Cartogram


ORBIS is a big step forward in designing digital scholarly objects for the digital humanities. It is a tool that is both useful and reflexive, offering engagement with both process and product. It also exemplifies an increasingly popular mode of scholarly communication: the published online object. Because the mode is still (even after decades of online DH projects) not quite solidified, ORBIS lacks a few of the basic features of common scholarly communication, and by straddling both the new and the old, ORBIS doesn’t quite live up to the best qualities of either digital or analog publication.

First of all, although their team sent a preliminary version of the site out to many people, it never went through any formal review process. Readers of this blog will know that I am no advocate of traditional publication systems or the antiquated marriage of publication and peer-review, but at this point it is worth noting that ORBIS (to my knowledge) has only been reviewed informally, by sympathetic reviewers like myself. Perhaps this means that adoption of the tool should be approached with greater caution until it is more formally reviewed by a post-publication periodical like the Journal of Digital Humanities.

That being said, the site does try remain true to humanistic and traditional publication roots. A paper version is in the works, and it is written such that we researchers can engage in the process of the tool. Unfortunately, it perhaps stays a bit too true to the paper model. The site is designed to read top-to-bottom, left-to-right, and none of the internal references to other sections include links to aid in navigation. Further, if the intent is to simultaneously allow exploration of the tool and its creation, the design does not realize this goal. The map appears at “the end” of the site, all the way on the right, and because of the layout, it is impossible to view it alongside the text describing it without opening a new window. There is quite a bit of white space to the right of the text on my wide-screen monitor – perhaps a smaller version of the tool can be embedded in that space.

One of the strengths of the project is the explicit nature of its creation. Data can be downloaded, and the sources, provenance, algorithms, and technologies are clearly stated. The model as an argument is, in short, visible and comprehensible even to those with little prior knowledge on these technologies. What this does is bridge the gap between code and humanistic inquiry, adding levels of model explication and tool-use between them. ORBIS is by far not the first project to make the creation of a tool explicit, but usually that explication is simply a public posting of the code and some limited comments or descriptions of how that code works. Unfortunately, although ORBIS does include a better bridge to explicate its argument, it does not offer the code. It’s a bit like David Copperfield explaining how he made the Statue of Liberty disappear; the explanation would certainly be helpful, but if he really wanted other people to be able to create similar illusions, he’d offer up the materials as well. (Alright, the metaphor doesn’t completely work, but stick with it.) The digital humanities seems finally to be getting into code sharing, and this is a good thing. The cost for sharing code is essentially free (although there’s a much greater price for sharing good code – all the extra time spent marking it up and making it pretty), and the benefits should go without saying: More things like ORBIS, much faster. Better tools built collectively and suiting all our individual needs.

The last, most important, and most difficult of my desires deals with uncertainty. There’s been a lot of talk about data uncertainty in the humanities lately, not least of which stemming from Stanford, the home university of ORBIS. It’s a difficult problem to solve, but presented as it is, the ORBIS project lends itself to the varieties of critiques common in the work of Johanna Drucker and others. How do you know that these were the shortest routes? What about missing information? What about the fact that every bit of travel was its own experience, with different human and environmental factors playing in, perhaps delays for sick relatives or mutineering seamen? These questions are swept under the table when ORBIS presents one route and one set of numbers per query: here, this is the fastest route, these are the cities, this is how much it would cost. The visualization and end-products create an illusion of certainty in the data, although in the text, the makers are quick to point out that a researcher should not take it as certain. One solution, and this extends to all data-driven DH projects, is to model uncertainty in the data from the ground up. How much more certain is one route than another? How certain are you of the weather in one location compared to the weather elsewhere? This sort of information flows naturally into models of Bayesian data analysis, and would allow ORBIS to deliver a list of credible routes, revealing which parts of those routes are more or less certain, and including other information like the probability of a ship being lost at sea on a particular route. Of course, data uncertainty is only part of the problem, and this would only be a partial solution.

This isn’t the place to detail exactly how uncertainty should be modeled in the data, and exactly what ought to be done with it, but the fact is there is already rich knowledge in the model and in the data available dealing with the uncertainty of travel, but that information disappears as soon as it is presented in the map interface. If ORBIS represents the next step in humanities tool production, it doesn’t quite (yet) live up to the promise of humanities data analysis, impressive as their analysis is. There is still not yet a clear enough representation of uncertainty and interpretation to reach that goal. To be fair, I’ve yet to see a single project living up to that promise at anything close to large-scale; the tools just haven’t been developed yet. Perhaps that promise is impossible at large scale, although I certainly hope that is not the case.

The View From Here

Despite my long list of rough edges and desiderata, I still stand by my statement that this tool is an exemplar of a shift in digital humanities projects. The tool itself is profoundly impressive and will prove useful for a variety of research, but what stands out from the humanities standpoint is the explicit nature of the ORBIS underbelly. It blurs the line between tool and argument. There are other profoundly impressive and useful tools out there (topic modeling comes to mind). However, with topic modeling, the assumptions are still obscure to the unfamiliar, despite my own best efforts and the even better efforts of others. This is because the software topic modeling is packaged with, the software we use to run the analyses, does not simultaneously engage in the process of its own creation in the way that ORBIS does. Going forward, I predict the most used (or at least the most useful) digital tools for humanists will include that engagement, rather than existing as black boxes out of which results spring forth, fully armed and ready to battle as Athena from Zeus’s forehead. ORBIS is by no means the first to attempt such a feat but, I think, it is as-yet the most successful.


Early Modern Letters Online

Early modern history! Science! Letters! Data! Four of my favoritest things have been combined in this brand new beta release of Early Modern Letters Online from Oxford University.



EMLO (what an adorable acronym, I kind of what to tickle it) is Oxford’s answer to a metadata database (metadatabase?) of, you guessed it, early modern letters. This is pretty much a gold standard metadata project. It’s still in beta, so there are some interface kinks and desirable features not-yet-implemented, but it has all the right ingredients for a great project:

  • Information is free and open; I’m even told it will be downloadable at some point.
  • Developed by a combination of historians (via Cultures of Knowledge) and librarians (via the Bodleian Library) working in tandem.
  • The interface is fast, easy, and includes faceted browsing.
  • Has a fantastic interface for adding your own data.
  • Actually includes citation guidelines thank you so much.
  • Visualizations for at-a-glance understanding of data.
  • Links to full transcripts, abstracts, and hard-copies where available.
  • Lots of other fantastic things.

Sorry if I go on about how fantastic this catalog is – like I said, I love letters so much. The index itself includes roughly 12,000 people, 4,000 locations, 60,000 letters, 9,000 images, and 26,000 additional comments. It is without a doubt the largest public letters database currently available. Between the data being compiled by this group, along with that of the CKCC in the Netherlands, the Electronic Enlightenment Project at Oxford, Stanford’s Mapping the Republic of Letters project, and R.A. Hatch‘s research collection, there will without a doubt soon be hundreds of thousands of letters which can be tracked, read, and analyzed with absolute ease. The mind boggles.

Bodleian Card Catalogue Summaries

Without a doubt, the coolest and most unique feature this project brings to the table is the digitization of Bodleian Card Catalogue, a fifty-two drawer index-card cabinet filled with summaries of nearly 50,000 letters held in the library, all compiled by the Bodleian staff many years ago. In lieu of full transcriptions, digitizations, or translations, these summary cards are an amazing resource by themselves. Many of the letters in the EMLO collection include these summaries as full-text abstracts.

One of the Bodleian summaries showing Heinsius looking far and wide for primary sources, much like we’re doing right now…

The collection also includes the correspondences of John Aubrey (1,037 letters), Comenius (526), Hartlib (4,589 many including transcripts), Edward Lhwyd (2,139 many including transcripts), Martin Lister (1,141), John Selden (355), and John Wallis (2,002). The advanced search allows you to look for only letters with full transcripts or abstracts available. As someone who’s worked with a lot of letters catalogs of varying qualities, it is refreshing to see this one being upfront about unknown/uncertain values. It would, however, be nice if they included the editor’s best guess of dates and locations, or perhaps inferred locations/dates from the other information available. (For example, if birth and death dates are known, it is likely a letter was not written by someone before or after those dates.)


In the interest of full disclosure, I should note that, much like with the CKCC letters interface, I spent some time working with the Cultures of Knowledge team on visualizations for EMLO. Their group was absolutely fantastic to work with, with impressive resources and outstanding expertise. The result of the collaboration was the integration of visualizations in metadata summaries, the first of which is a simple bar chart showing the numbers of letters written, received, and mentioned in per year of any given individual in the catalog. Besides being useful for getting an at-a-glance idea of the data, these charts actually proved really useful for data cleaning.

Sir Robert Crane (1604-1643)

In the above screenshot from previous versions of the data, Robert Crane is shown to have been addressed letters in the mid 1650s, several years after his reported death. While these could also have been spotted automatically, there are many instances where a few letters are dated very close to a birth or death date, and they often turn out to miss-reported. Visualizations can be great tools for data cleaning as a form of sanity test. This is the new, corrected version of Robert Crane’s page. They are using d3.js, a fantastic javascript library for building visualizations.

Because I can’t do anything with letters without looking at them as a network, I decided to put together some visualizations using Sci2 and Gephi. In both cases, the Sci2 tool was used for data preparation and analysis, and the final network was visualized in GUESS and Gephi, respectively. The first graph shows network in detail with edges, and names visible for the most “central” correspondents. The second visualization is without edges, with each correspondent clustered according to their place in the overall network, with the most prominent figures in each cluster visible.

Built with Sci2/Guess
Built with Sci2/Gephi

The graphs show us that this is not a fully connected network. There are many islands of one or two letters or a small handful of letters. These can be indicative of a prestige bias in the data. That is, the collection contains many letters from the most prestigious correspondents, and increasingly fewer as the prestige of the correspondent decreases. Put in another way, there are many letters from a few, and few letters from many. This is a characteristic shared with power law and other “long tail” distributions. The jumbled community structure at the center of the second graph is especially interesting, and it would be worth comparing these communities against institutions and informal societies at the time. Knowledge of large-scale patterns in a network can help determine what sort of analyses are best for the data at hand. More on this in particular will be coming in the next few weeks.

It’s also worth pointing out these visualizations as another tool for data-checking. You may notice, on the bottom left-hand corner of the first network visualization, two separate Edward Lhwyds with virtually the same networks of correspondence. This meant there were two distinct entities in their database referring to the same individual – a problem which has since been corrected.

More Letters!

Notice that the EMLO site makes it very clear that they are open to contributions. There are many letters datasets out there, some digitized, some still languishing idly on dead trees, and until they are all combined, we will be limited in the scope of the research possible. We can always use more. If you are in any way responsible for an early-modern letters collection, meta-data or full-text, please help by opening that collection up and making it integrable with the other sets out there. It will do the scholarly world a great service, and get us that much closer to understanding the processes underlying scholarly communication in general. The folks at Oxford are providing a great example, and I look forward to watching this project as it grows and improves.

Alchemy, Text Analysis, and Networks! Oh my!

“Newton wrote and transcribed about a million words on the subject of alchemy.” —


Beside bringing us things like calculus, universal gravitation, and perhaps the inspiration for certain Pink Floyd albums, Isaac Newton spent many years researching what was then known as “chymistry,” a multifaceted precursor to, among other things, what we now call chemistry, pharmacology, and alchemy.

Pink Floyd and the Occult: Discuss.

Researchers at Indiana University, notably William R. Newman, John A. Walsh, Dot Porter, and Wallace Hooper, have spent the last several years developing The Chymistry of Isaac Newton, an absolutely wonderful history of science resource which, as of this past month, has digitized all 59 of Newton’s alchemical manuscripts assembled by John Keynes in 1936. Among the sites features are heavily annotated transcriptions, manuscript images, often scholarly synopses, and examples of alchemical experiments. That you can try at home. That’s right, you can do alchemy with this website. They also managed to introduce alchemical symbols into unicode (U+1F700 – U+1F77F), which is just indescribably cool.

Alchemical experiments at home!

What I really want to highlight, though, is a brand new feature introduced by Wallace Hooper: automated Latent Semantic Analysis (LSA) of the entire corpus. For those who are not familiar with it, LSA is somewhat similar LDA, the algorithm driving the increasingly popular Topic Models used in Digital Humanities. They both have their strengths and weaknesses, but essentially what they do is show how documents and terms relate to one another.

Newton Project LSA

In this case, the entire corpus of Newton’s alchemical texts is fed into the LSA implementation (try it for yourself), and then based on the user’s preferences, the algorithm spits out a network of terms, documents, or both together. That is, if the user chooses document-document correlations, a list is produced of the documents that are most similar to one another based on similar word use within them. That list includes weights – how similar are they to one another? – and those weights can be used to create a network of document similarity.

Similar Documents using LSA

One of the really cool features of this new service is that it can export the network either as CSV for the technical among us, or as an nwb file to be loaded into the Network Workbench or the Sci² Tool. From there, you can analyze or visualize the alchemical networks, or you can export the files into a network format of your choice.

Network of how Newton’s alchemical documents relate to one-another visualized using NWB.

It’s great to see more sophisticated textual analyses being automated and actually used. Amber Welch recently posted on Moving Beyond the Word Cloud using the wonderful TAPoR, and Michael Widner just posted a thought-provoking article on using Voyeur Tools for the process of paper revision. With tools this easy to use, it won’t be long now before the first thing a humanist does when approaching a text (or a million texts) is to glance at all the high-level semantic features and various document visualizations before digging in for the close read.