On the importance of a single historical author

I have a dirty admission to make: I think yesterday happened. Actually. Objectively. Stuff happened. I hear that’s still a controversial statement in some corners of the humanities, but I can’t say for sure; I generally avoid those corners. And I think descriptions of the historical evidence can vary in degrees of accuracy, separating logically coherent but historically implausible conspiracy theories from more likely narratives.

At the same time, what we all think of as the past is a construct. A bunch of people – historians, cosmologists, evolutionary biologists, your grandmother who loves to tell stories, you – have all worked together to construct and reconstruct the past. Lots of pasts, actually, because no two people can ever wholly agree; everybody sees the evidence through the lens of their own historical baggage.

I’d like to preface this post with the cautious claim that I am an outsider explaining something I know less about than I should. The hats I wear are information/data scientist and historian of science, and through some accident of the past, historians and historians of science have followed largely separate cultural paths. Which is to say, neither the historian of science in me nor the information scientist has a legitimate claim to the narrative of general history and the general historical process, but I’m attempting to write one anyway. I welcome any corrections or admonishments.

The Narrativist Individual

I use in this post (and in life in general) the vocabulary definitions of Aviezer Tucker, who is doing groundbreaking work on simple stuff like defining “history” and asking questions about what we can know about the past. 1 “History,” Tucker defines, is simply stuff that happened: the past itself. “Historians” are anybody who inquires about the past, from cosmologists to historical linguists. A “historiography” is a knowledge of the past, or more concretely, something a historian has written about the past. “Historiographic research” is what we historians do when we try to find out about the past, and a “historiographic narrative” is generally the result of a lot of that research strung together. 2

Narratives are important. In the 1970s, a bunch of historians began realizing that historians create narratives when they collect their historiographic research 3; that is, people tell stories about the past, using the same sorts of literary and rhetorical devices used in many other places. History itself is a giant jumble of events and causal connections, and representing it as it actually happened would be completely unintelligible and philosophically impossible, without recreating the universe from scratch. Historians look at evidence of the past and then impose an order, a pattern, in reconstructing the events to create their own unique historiographic narratives.

The narratives historians write are inescapably linked to their own historical baggage. Historians are biased and imperfect, and they all read history through the filter of themselves. Historiographic reconstructions, then, are as much windows into the historians themselves as they are windows into the past. The narrativist turn in historiography did a lot to situate the historian herself as a primary figure in her narrative, and it became widely accepted that instead of getting closer to some ground truth of history, historians were in the business of building consistent and legible narratives, their own readings of the past, so long as they were also consistent with the evidence. Those narratives became king, both epistemologically and in practice; historical knowledge is narrative knowledge.

Because narrative knowledge is a knowledge derived from lived experience – the historian sees the past in his own unique light – this emphasized the importance of the individual in historiographic research. Because historians neither could (nor by and large were) attempting to reach an objective ground-truth about the past, any claim to knowledge rested in the lone historian and how he read the past and how he presented his narrative. What resulted was a (fairly justified, given their conceptualization of historiographic knowledge) fetishization of the individual, the autonomous historian.

When multiple authors write a historiographic narrative, something almost ineffable is lost: the individual perspective which drives the narrative argument, part of the essential claim-to-knowledge. In a recent discussion with Ben Schmidt about autonomous humanities work vs. collaboration (the original post; my post; Ben’s reply), Ben pointed out “all the Stanley Fishes out there have reason to be discomfited that DHers revel so much in doing away with not only the printed monograph, traditional peer review, and close reading, but also the very institution of autonomous, individual scholarship. Erasmus could have read the biblical translations out there or hired a translator, but he went out and learned Greek [emphasis added].” I think a large part of that drive for autonomy (beyond the same institutional that’s-how-we’ve-always-done-it inertia that lone natural scientists felt at the turn of the last century) is the situatedness-as-a-way-of-knowing that imbues historiographic research, and humanistic research in general.

I’m inclined to believe that historians need to move away from an almost purely narrative epistemology; keeping in sight that all historiographic knowledge is individually constructed, remaining aware that our overarching cultural knowledge of the past is socio-technically constructed, but not letting that cripple our efforts at coordinating research, at reaching for some larger consilience with the other historical research programs out there, like paleontology and cosmology and geology. Computational methodologies will pave the way for collaborative research both because they allow it, and because they require it.

Collaboratively Constructing Paris

This is a map of Paris.

Map of Paris with dots representing photos taken and posted on Flickr. Red dots are pictures taken by tourists, blue are by locals, and yellow are unknown. via Eric Fischer.

On top of this map of Paris are red, blue, and yellow dots. The red dots are the locations of pictures taken and posted to Flickr by tourists to Paris; blue dots are where locals took pictures; yellow dots are unknown. The resulting image maps and differentiates touristic and local space by popularity, at least among Flickr users. It is a representation that would have been staggeringly difficult for an outsider to create without this sort of data-analytic approach, and yet someone intimately familiar with the city could look at this map and not be surprised. Perhaps they could even recreate it themselves.

What is this knowledge of Paris? It’s surely not a subjective representation of the city, not unless we stretch the limits of the word beyond the internally experienced and toward the collective. Neither is it an objective map 4 of the city, external to the whims of the people milling about within. The map represents an aggregate of individual experiences, a kind of hazy middle ground within the usual boundaries we draw between subjective and objective truth. This is an epistemological and ontological problem I’ve been wondering about for some time, without being able to come up with a good word for it until a conversation with a colleague last year.

“This is my problem,” I told Charles van den Heuvel, explaining my difficulties in placing these maps and other similar projects on the -jectivity scale. “They’re not quite intersubjective, not in the way the word is usually used,” I said, and Charles just looked at me like I was missing something excruciatingly obvious. “What is it when a group of people believe or do or think common things in aggregate?”—Charles asked—”isn’t that just called culture?” I initially disagreed, but mostly because it was so obvious that I couldn’t believe I’d just passed it over entirely.

In 1976, the infamous Stanley Milgram and co-author Denise Jodelet 5 responded to Durkheim’s emphasis on “the objectivity of social facts” by suggesting “that we understand things from the actor’s point of view.” To drive this point home, Milgram decides to map Paris. People “have a map of the city [they live in] in their minds,” Milgram suggests, and their individual memories and attitudes flavor those internal representations.

This is a good example to use, for Milgram, because cities themselves are socially constructed entities; what is a city without its people who live in and build it? Milgram goes on to suggest that people’s internal representations of cities are similarly socially constructed, that “such representations are themselves the products of social interaction with the physical environment.” In the ensuing study, Milgram asks 218 subjects to draw a non-tourist map of Paris as it seems to them, including whatever features they feel relevant. “Through selection, emphasis and distortion, the maps became projections of life styles.”

Milgram then compares all the maps together, seeking what unifies them: first and foremost, the city limits and the Seine. The river is distorted in a very particular way in nearly all maps, bypassing two districts entirely and suggesting they are of little importance to those who drew the maps. The center of the city, Notre Dame and the Île de la Cité, also remains constant. Milgram opposes this to a city like New York, the subject of a later similar study, whose center shifts slowly northward as the years roll by. Many who drew maps of either New York or Paris included elements they were not intimately familiar with, but they knew were socially popular, or were frequent spots of those in their social circles. Milgram concludes “the social representations of the city are more than disembodied maps; they are mechanisms whereby the bricks, streets, and physical geography of a place are endowed with social meaning.”

It’s worth posting a large chunk of Milgram’s earlier article on the matter:

A city is a social fact. We would all agree to that. But we need to add an important corollary: the perception of a city is also a social fact, and as such needs to be studied in its collective as well as its individual aspect. It is not only what exists but what is highlighted by the community that acquires salience in the mind of the person. A city is as much a collective representation as it is an assemblage of streets, squares, and buildings. We discern the major ingredients of that representation by studying not only the mental map in a specific individual, but by seeing what is shared among individuals.

Collaboratively Constructing History

Which brings us back to the past. 6 Can collaborating historians create legitimate narratives if they are not well-founded in personal experience? What sort of historical knowledge is collective historical knowledge? To this question, I turn to blogger Alice Bell, who wrote a delightfully short post discussing the social construction of science.  She writes about scientific knowledge, simply, “Saying science is a social construction does not amount to saying science is make believe.” Alice compares knowledge not to a city, like Paris, but to a building like St. Paul’s Cathedral or a scientific compound like CERN; socially constructed, but physically there. Real. Scientific ideas are part of a similar complex.

The social construction of historiographic narratives is painfully clear even without co-authorships, in our endless circles of acknowledgements and references. Still, there seems to be a good deal of push-back against explicit collaboration, where the entire academic edifice no longer lies solely in the mind of one historian (if indeed it ever had). In some cases, this push-back is against the epistemological infrastructure that requires the person in personal narrative. In others, it is because without full knowledge of each of the moving parts in a work of scholarship, that work is more prone to failure due to theories or methodologies not adequately aligning.

Building historiography together. via the Smithsonian.

I fear this is a dangerous viewpoint, one that will likely harm both our historiographic research and our cultural relevancy, as other areas of academia become more comfortable with large-scale collaboration. Single authorship for its own sake is as dangerous as collaboration for its own sake, but it has the advantage of being a tradition. We must become comfortable with the hazy middle ground between an unattainable absolute objectivity and an unscalable personal subjectivity, willing to learn how to construct our knowledge as Parisians construct their city. The individual experiences of Parisians are without a doubt interesting and poignant, but it is the combined experiences of the locals and the tourists that makes the city what it is. Moving beyond the small and individual isn’t just getting past the rut of microhistories that historiography is still trying to escape—it is also getting past the rut of individually driven narratives and toward unified collective historiographies. We have to work together.




  1. Tucker, Aviezer. 2004. Our Knowledge of the Past: A Philosophy of Historiography. Cambridge University Press.
  2. Tucker, Aviezer, ed. 2009. A Companion to the Philosophy of History and Historiography.
  3. Kuukkanen, Jouni-Matti. 2012. “The Missing Narrativist Turn in the Histiography of Science.” History and Theory 51 (3): 340–363. doi:10.1111/j.1468-2303.2012.00632.x.
  4. of anything besides the geolocations of Flickr pictures, in and of itself not particularly interesting
  5. Milgram, Stanley. 1976. “Pyschological Maps of Paris.” In Environmental Psychology: People and Their Physical Settings, ed. Proshansky, Ittelson, and Rivlin, 104–124. New York.
    Milgram, Stanley. 1982. “Cities as Social Representations.” In Social Representations, ed. R. Farr and S. Moscovici, 289–309.
  6. As opposed to bringing us Back to the Future, which would probably be more fun.

In Defense of Collaboration

Being a very round-about review of the new work of fiction by Robin Sloan, Mr. Penumbra’s 24-Hour Bookstore.

Ship’s Logs and Collaborative DH

Ben Schmidt has stolen the limelight of the recent digital humanities blogosphere, writing a phenomenal series of not one, not two, not three, not four, not five, not six, but seven posts about ship logs and digital history. They’re a whale of a read, and whale worth it too (okay, okay, I’m sorry, I had to), but the point for the purpose of this post is his conclusion:

The central conclusion is this: To do humanistic readings of digital data, we cannot rely on either traditional humanistic competency or technical expertise from the sciences. This presents a challenge for the execution of research projects on digital sources: research-center driven models for digital humanistic resource, which are not uncommon, presume that traditional humanists can bring their interpretive skills to bear on sources presented by others.

– Ben Schmidt

He goes on to add “A historian whose access is mediated by an archivist tends to know how best to interpret her sources; one plugging at databases through dimly-understood methods has lost his claim to expertise.”  Ben makes many great points, and he himself, with this series of posts, exemplifies the power of humanistic competency and technical expertise combined in one wrinkled protein sponge. It’s a powerful mix, and one just beginning to open a whole new world of inquiry.

Yes, I know this is not how brains work. It’s still explanatory. via.

This conclusion inspired a twitter discussion where Ben and Ted Underwood questioned whether there was a limit to the division-of-labor/collaboration model in the digital humanities.  Which of course I disagreed with. Ben suggested that humanists “prize source familiarity more. You can’t teach Hitler studies without speaking German.” The humanist needs to actually speak German; they can’t just sit there with a team of translators and expect to do good humanistic work.

This opens up an interesting question: how do we classify all this past work involving collaboration between humanists and computer scientists, quals and quants, epistêmê and technê?  Is it not actually digital humanities? Will it eventually be judged bad digital humanities, that noisy pre-paradigmatic stuff that came before the grand unification of training and pervasive dual-competencies? My guess is that, if there are limits to collaboration, they are limits which can be overcome with careful coordination and literacy.

I’m not suggesting collaboration is king, nor that it will always produce faster or better results. We can’t throw nine women and nine men in a room and hope to produce a baby in a month’s time, with the extra help. However, I imagine that there are very few, if any, situations where some conclusion can’t be reached by two people with complementary competencies that can be produced by one person with both. Scholarship works on trust. Academics are producing knowledge every day that relies on their trusting the competencies of the secondary sources they cite, so that they do not need methodological or content expertise in the entire hypothetical lattice extending from their conclusions down to the most basic elements of their arguments.

And I predict that as computationally-driven humanities matures and utilizes increasingly-complex datasets and algorithms, our reliance on these networks of trust (and our need to methodologically formalize them) will only grow. This shift occurred many years ago in the natural sciences, as scientists learned to rely on physical tools and mathematical systems that they did not fully understand, as they began working in ever-growing teams where no one person could reconstruct the whole. Our historical narratives also began to shift, moving away from the idea that the most important ideas in history sprung forth fully developed from the foreheads of “Great Men,” as we realized that an entire infrastructure was required to support them.

How we used to think science worked. via.

What we need in the digital humanities is not combined expertise (although that would probably make things go faster, at the outset), but multiple literacies and an infrastructure to support collaboration; a system in place we can trust to validate methodologies and software and content and concepts. By multiple literacies, I mean the ability for scholars to speak the language of the experts they collaborate with. Computer scientists who can speak literary studies, humanists who can speak math, dedicated translators who can bridge whatever gaps might exist, and enough trust between all the collaborators that each doesn’t need to reinvent the wheel for themselves. Ben rightly points out that humanists value source expertise, that you can’t teach Hitler without speaking German; true, but the subject, scope, and methodologies of traditional humanists have constrained them from needing to directly rely on collaborators to do their research. This will not last.

The Large Hadron Collider is arguably the most complex experiment the world has ever seen. Not one person understands all, most, or even a large chunk of it. Physics and chemistry could have stuck with experiments and theories that could reside completely and comfortably in one mind, for there was certainly a time when this was the case, but in order to grow (to scale), a translational trust infrastructure needed to be put in place. If you take it for granted that humanities research (that is, research involving humans and their interactions with each other and the world, taking into account the situated nature of the researcher) can scale, then in order for it to do so, we as individuals must embrace a reliance on things we do not completely understand. The key will be figuring out how to balance blind trust with educated choice, and that key lies in literacies, translations, and trust-granting systems in the academy or social structure, as well as solidified standard practices. These exist in other social systems and scholarly worlds (like the natural sciences), and I think they can exist for us as well, and to some extent already do.

Timely Code Cracking

Coincidentally enough, the same day Ben tweeted about needing to know German to study Hitler in the humanities, Wired posted an article reviewing some recent(-ish) research involving a collaboration between a linguist, a computer scientist, and a historian to solve a 250-year-old cipher. The team decoded a German text describing an 18th century secret society, and it all started when one linguist (Christiane Schaefer) was given photocopies of this manuscript about 15 years ago. She toyed with the encoded text for some time, but never was able to make anything substantive of it.

After hearing a talk by machine translation expert and computer scientist Kevin Knight, who treats translations as ciphers, Schaefer was inspired to bring the code to Knight. At the time, neither knew what language the original was written in, nor really anything else about it. In short order, Knight utilized algorithmic analysis and some educated guesswork to recognize textual patterns suggesting the text to be German. “Knight didn’t speak a word of German, but he didn’t need to. As long as he could learn some basic rules about the language—which letters appeared in what frequency—the machine would do the rest.”

Copiale cipher. via.

Within weeks, Knight’s analysis combined with a series of exchanges between him and Schaefer and a colleague of hers led to the deciphering of the text, revealing its original purpose. “Schaefer stared at the screen. She had spent a dozen years with the cipher. Knight had broken the whole thing open in just a few weeks.” They soon enlisted the help of a historian of secret societies to help further understand and contextualize the results they’d discovered, connecting the text to a group called the Oculists and connecting them with the Freemasons.

If this isn’t a daring example of digital humanities at its finest, I don’t know what is. Sure, if one researcher had the competencies of all four, the text wouldn’t have sat dormant for a dozen years, and likely a few assumptions still exist in the dataset that might be wrong or improved upon. But this is certainly an example of a fruitful collaboration. Ben’s point still stands – a humanist bungling her way through a database without a firm grasp of the process of data creation or algorithmic manipulation has lost her claim to expertise – but there are ways around these issues; indeed, there must be, if we want to start asking more complex questions of more complex data.

Mr. Penumbra’s 24-Hour Bookstore

You might have forgotten, but this post is actually a review of a new piece of fiction by Robin Sloan. The book, Mr. Penumbra’s 24-Hour Bookstore, is a love letter. That’s not to say the book includes love (which I suppose it does, to some degree), but that the thing itself is a love letter, directed at the digital humanities. Possibly without the author’s intent.

This is a book about collaboration. It’s about data visualization, and secret societies, and the history of the book. It’s about copyright law and typefaces and book scanning. It’s about the strain between old and new ways of knowing and learning. In short, this book is about the digital humanities. Why is this book review connected with a defense of collaboration in the digital humanities? I’ll attempt to explain the connection without spoiling too much of the book, which everyone interested enough to read this far should absolutely read.

The book begins just before the main character, an out-of-work graphic designer named Clay, gets hired at a mysterious and cavernous used bookstore run by the equally mysterious Mr. Penumbra. Strange things happen there. Crazy people with no business being up during Clay’s night shift run into the store, intent on retrieving one particular book, leaving with it only to return some time later seeking another one. The books are illegible. The author doesn’t say as much, but the reader suspects some sort of code is involved.

Intent on discovering what’s going on, Clay enlists the help of a Google employee, a programming wiz, to visualize the goings on in the bookstore. Kat, the Googler, is “the kind of girl you can impress with a prototype,” and the chemistry between them as they try to solve the puzzle fantastic in the nerdiest of ways. Without getting into too many details, they and a group of friends wind up solving a puzzle using data analysis in mere weeks that most people take years to discover in their own analog ways. Some of those people who did spend years trying to solve the aforementioned puzzle are quite excited by this new technique; some, predictably, are not. For their part, the rag-tag group of friends who digitally solved it don’t quite understand what it is they’d solved, not in the way the others have. If this sounds familiar, you’ve probably heard of culturomics.

Mr. Penumbra’s 24-Hour Bookstore. via.

A group of interdisciplinary people, working with Google, who figure out in weeks what should have taken years (and generally does). A few of the old school researchers taking their side, going along with them against the herd, an establishment that finds their work Wrong in so many ways. Essentially, if you read this book, you’ll have read a metaphorical, fictional argument that aligns quite closely with what I’ve argued in the blog post above.

So go out and buy the book. The physical book, mind you, not the digital version, and make sure to purchase the hardcover. It was clearly published with great care and forethought; the materiality of the book, its physical attributes and features, were designed cleverly to augment the book itself in ways that are not revealed until you have finished it. While the historical details in the novel are fictional, the historical among you will recognize many connections to actual people and events, and those digitally well-versed will find similarly striking connections. Also, I want you to buy the book so I have other people to talk to about it with, because I think the author was wrong about his main premise. We can start a book-club. I’d like to thank Paige Morgan for letting me know Sloan had turned his wonderful short story into a novel. And re-read this post after you’ve finished reading the book – it’ll make a lot more sense.


Each of these three sections were toward one point: collaboration in the digital humanities is possible and, for certain projects as we go forward, will become essential. That last section won’t make much sense in support of this argument until you actually read the novel, so go out and do that. It’s okay, I’ll wait.

To Ben and Ted’s credit, they weren’t saying collaboration was futile. They were arguing for increasingly well-rounded competencies, which I think we can all get behind. But I also think we need to start establishing some standard practices and to create a medium wherein we can develop methodologies that can be peer-reviewed and approved, so that individual scholars can have an easier time doing serious and theoretically compelling computational work without having to relearn the entire infrastructure supporting it. Supporting more complex ways of knowing in the field of humanities will require us as individuals becoming more comfortable with not knowing everything.

method personal research

Analyzing submissions to Digital Humanities 2013

Digital Humanities 2013 is on its way; submissions are closed, peers will be reviewing them shortly, and (most importantly for this post) the people behind the conference are experimenting with a new method of matching submissions to reviewers. It’s a bidding process; reviewers take a look at the many submissions and state their reviewing preferences or, when necessary, conflicts of interest. It’s unclear the extent to which these preferences will be accommodated, as this is an experiment on their part. Bethany Nowviskie describes it here. As a potential reviewer, I just went through the process of listing my preferences, and managed to do some data scraping while I was there. How could I not? All 348 submission titles were available to me, as well as their authors, topic selections, and keywords, and given that my submission for this year is all about quantitatively analyzing DH, it was an opportunity I could not pass up. Given that these data are sensitive, and those who submitted did so under the assumption that rejected submissions would remain private, I’m opting not to release the data or any non-aggregated information. I’m also doing my best not to actually read the data in the interest of the privacy of my peers; I suppose you’ll all just have to trust me on that one, though.

So what are people submitting? According to the topics authors assigned to their 348 submissions, 65 submitted articles related to “literary studies,” trailed closely by 64 submissions which pertained to “data mining/ text mining.” Work on archives and visualizations are also up near the top, and only about half as many authors submitted historical studies (37) as those who submitted literary ones (65). This confirms my long suspicion that our current wave of DH (that is, what’s trending and exciting) focuses quite a bit more on literature than history. This makes me sad.  You can see the breakdown in Figure 1 below, and further analysis can be found after.

Figure 1: Number of documents with each topic authors assigned to submissions for DH2013 (click to enlarge).

The majority of authors attached fewer than five topics to their submissions; a small handful included over 15.  Figure 2 shows the number of topics assigned to each document.

Figure 2: The number of topics attached to each document, in order of rank.

I was curious how strongly each topic coupled with other topics, and how topics tended to cluster together in general, so I extracted a topic co-occurrence network. That is, whenever two topics appear on the same document, they are connected by an edge (see Networks Demystified Pt. 1 for a brief introduction to this sort of network); the more times two topics co-occur, the stronger the weight of the edge between them.

Topping off the list at 34 co-occurrences were “Data Mining/ Text Mining” and “Text Analysis,” not terrifically surprising as the the latter generally requires the former, followed by “Data Mining/ Text Mining” and “Content Analysis” at 23 co-occurrences, “Literary Studies” and “Text Analysis” at 22 co-occurrences, “Content Analysis” and “Text Analysis” at 20 co-occurrences, and “Data Mining/ Text Mining” and “Literary Studies” at 19 co-occurrences. Basically what I’m saying here is that Literary Studies, Mining, and Analysis seem to go hand-in-hand.

Knowing my readers, about half of you are already angry with me counting co-occurrences, and rightly so. That measurement is heavily biased by the sheer total number of times a topic is used; if “literary studies” is attached to 65 submissions, it’s much more likely that it will co-occur with any particular topic than topics (like “teaching and pedagogy”) which simply appear more infrequently. The highest frequency topics will co-occur with one another simply by an accident of magnitude.

To account for this, I measured the neighborhood overlap of each node on the topic network. This involves first finding the number of other topics  a pair of two topics shares. For example, “teaching and pedagogy” and “digital humanities – pedagogy and curriculum” each co-occur with several other of the same topics, including “programming,” “interdisciplinary collaboration,” and “project design, organization, management.” I summed up the number topical co-occurrences between each pair of topics, and then divided that total by the number of co-occurrences each node in the pair had individually. In short, I looked at which pairs of topics tended to share similar other topics, making sure to take into account that some topics which are used very frequently might need some normalization. There are better normalization algorithms out there, but I opt to use this one for its simplicity for pedagogical reasons. The method does a great job leveling the playing field between pairs of infrequently-used topics compared to pairs of frequently-used topics, but doesn’t fair so well when looking at a pair where one topic is popular and the other is not. The algorithm is well-described in Figure 3, where the darker the edge, the higher the neighborhood overlap.

Figure 3: The neighborhood overlap between two nodes is how many neighbors (or connections) that pair of nodes shares. As such, A and B share very few connections, so their overlap is low, whereas D and E have quite a high overlap. Via Jaroslav Kuchar.

Neighborhood overlap paints a slightly different picture of the network. The pair of topics with the largest overlap was “Internet / World Wide Web” and “Visualization,” with 90% of their neighbors overlapping. Unsurprisingly, the next-strongest pair was “Teaching and Pedagogy” and “Digital Humanities – Pedagogy and Curriculum.” The data might be used to suggest multiple topics that might be merged into one, and this pair seems to be a pretty good candidate. “Visualization” also closely overlaps “Data Mining/ Text Mining”, which itself (as we saw before) overlaps with “Cultural Studies” and “Literary Studies.” What we see from this close clustering both in overlap and in connection strength is the traces of a fairly coherent subfield out of DH, that of quantitative literary studies. We see a similarly tight-knit cluster between topics concerning archives, databases, analysis, the web, visualizations, and interface design, which suggests another genre in the DH community: the (relatively) recent boom of user interfaces as workbenches for humanists exploring their archives. Figure 4 represents the pairs of topics which overlap to the highest degree; topics without high degrees of pair correspondence don’t appear on the network graph.

Figure 4: Network of topical neighborhood overlap. Edges between topics are weighted according to how structurally similar the two topics are. Topics that are structurally isolated are not represented in this network visualization.

The topics authors chose for each submission were from a controlled vocabulary. Authors also had the opportunity to attach their own keywords to submissions, which unsurprisingly yielded a much more diverse (and often redundant) network of co-occurrences. The resulting network revealed a few surprises: for example, “topic modeling” appears to be much more closely coupled with “visualization” than with “text analysis” or “text mining.” Of course some pairs are not terribly surprising, as with the close connection between “Interdisciplinary” and “Collaboration.” The graph also shows that the organizers have done a pretty good job putting the curated topic list together, as a significant chunk of the high thresholding keywords are also available in the topic list, with a few notable exceptions. “Scholarly Communication,” for example, is a frequently used keyword but not available as a topic – perhaps next year, this sort of analysis can be used to help augment the curated topic list. The keyword network appears in Figure 5. I’ve opted not to include a truly high resolution image to dissuade readers from trying to infer individual documents from the keyword associations.

Figure 5: Which keywords are used together on documents submitted to DH2013? Nodes are colored by cluster, and edges are weighted by number of co-occurrences. Click to enlarge.

There’s quite a bit of rich data here to be explored, and anyone who does have access to the bidding can easily see that the entire point of my group’s submission is exploring the landscape of DH, so there’s definitely more to come on the subject from this blog. I especially look forward to seeing what decisions wind up being made in the peer review process, and whether or how that skews the scholarly landscape at the conference.

On a more reflexive note, looking at the data makes it pretty clear that DH isn’t as fractured as some occasionally suggest (New Media vs. Archives vs. Analysis, etc.). Every document is related to a few others, and they are all of them together connected in a rich family, a network, of Digital Humanities. There are no islands or isolates. While there might be no “The” Digital Humanities, no unifying factor connecting all research, there are Wittgensteinian family resemblances  connecting all of these submissions together, in a cohesive enough whole to suggest that yes, we can reasonably continue to call our confederation a single community. Certainly, there are many sub-communities, but there still exists an internal cohesiveness that allows us to differentiate ourselves from, say, geology or philosophy of mind, which themselves have their own internal cohesiveness.


Hire me.

I just added a new static page to the Irregular asking you to hire me for short consulting jobs. It’s not that I need the money (although it certainly wouldn’t hurt), but rather I enjoy being part of new DH projects and love traveling to new places. I’ve done a bunch of consulting work in the past; my favorites were helping out at Early Modern Letters Online at Oxford University, and leading a team to design and implement a virtual research environment for the CKCC in the Netherlands. I’ve also helped plan and worked on DH startup grants, Digging Into Data projects, and a handful of other projects you can find on my CV. In an effort to get more experience in various types of projects before I finish my Ph.D., as well as to travel more, I’m offering my experience to help plan digital humanities projects, create criteria for hiring DH faculty, design curricula, etc. If you want more information or know anyone who would like assistance in anything from computational methodologies to event planning, follow this link!

DH gun for hire. via.

Currated syllabi

Those who follow me on twitter already know about my new curated course syllabi page, but here’s an announcement for the RSS crowd (I’m looking at you, Elijah). Basically, there are a bunch of DH courses out there (and great lists of syllabi at CUNY, the DH Working Group on Zotero, and the DH Education Group on Zotero), but I know of no resource that specifically focuses on the computational / algorithmic / quantitative humanities syllabi, so I decided to put one together. You can find it here, as well as in the pages menu up top, at this point including 19 courses taught over 24 semesters, and I welcome suggestions of more.

via xkcd

Google Maps for the Ancient World

The title of this post comes from an oft-quoted passage of my previous one describing ORBIS, a scholarly argument cleverly disguised as a web tool. Of ORBIS I wrote

…given any two cities in the ancient world, it returns the fastest, cheapest, or shortest route between them, given the month, the mode of transportation, and various other options. It’s Google Maps for the ancient world, complete with the “Avoid Highways” feature.

In writing that review, I neglected to mention the many fantastic resources out there that already map the ancient world, including the Digital Atlas of Roman and Medieval Civilization and PLEIADES, a gazetteer and graph of ancient places. The most impressive full-featured online GIS application I’ve seen is called Antiquity À-la-carte, shown below. The classicists have once again proven themselves to be at the bleeding edge of technology. When they keep developing cool toys like these, I sometimes regret being an early modernist. Sometimes.

Antiquity À-la-carte, a GIS application for the ancient western world.

The cool toy I speak of now is of course no toy, but a serious scholarly endeavor which will doubtless set the bar for future online historical maps. In many ways, the Digital Map of the Roman Empire offers less than the sites I listed above. It doesn’t allow you to turn on or off particular layers, and it certainly doesn’t include all of the information the others have. In this case, however, less is more. It’s a really easy to use map of the ancient world, online. That’s it. It doesn’t tell you how to get from point A to point B, it doesn’t allow you to see the location of shipwrecks or the borders of countries at different time periods; it’s just a base map, depicting the Greek and Roman world in its entirety, asking the world to do with it what it will.

Johan Ahlfeldt wrote about his creation:

The aim of my work with Pelagios has been to create a static (non-layered) map of the ancient places in the Pleiades dataset with the capacity to serve as a background layer to online mapping applications of the Ancient World. Because it is based on ancient settlements and uses ancient placenames, our map presents a visualisation more tailored to archaeological and historical research, for which modern mapping interfaces, such as Google Maps, are hardly appropriate; it even includes non-settlement data such as the Roman roads network, some aqueducts and defence walls (limes, city walls). Thus, for example, the tiles can be used as a background layer to display the occurrence of find-spots, archaeological sites, etc., thereby creating new opportunities to put data of these kinds in their historical context.

The ancient base map.

As I wrote last year, accurate base maps are extremely important for contextualizing research. With this underneath, for example, ORBIS could provide a much richer experience of the ancient world. What’s more, the PELAGIOS group has opened up the map with a CC-BY license, allowing anyone to build on it so long as they include proper scholarly attribution. It can be used with Openlayers, Google, and Bing maps, so anybody who already has these systems in place can easily swap out the map tiles with these historical ones. Johan’s post includes examples of all of these implemented.

My posts are usually long and rambling, but I’ll keep this one short and to the point, much like the tool I’m reviewing. This is the first easily mashable base map of the ancient world, and for that it is awesome. Go explore!


On Simplicity

You can build complex arguments on a very simple foundation
Ted Underwood

Celestial Navigation

The world is full of very complex algorithms honed to solve even more complex problems. When you use your phone as a GPS, it’s not a simple matter of triangulating signals from towers or satellites. Because your GPS receiver has to know the precise time (in nanoseconds) at the satellites it’s receiving signals from, and because the satellites are moving at various speeds and orbiting at an altitude where the force of gravity is significantly different, calculating times gets quite complicated due to the effects of relativity. The algorithms that allow the GPS to work have to take relativity into account, often on the fly, and without those complex algorithms the GPS would not be nearly so precise.

Precision and complexity go hand-in-hand, and often the relationship between the two is non-linear. That is, a little more precision often requires lot more complexity. Ever-higher levels of precision get exponentially more difficult to achieve. The traditional humanities is a great example of this; a Wikipedia article can sum up most of what one needs to know regarding, say, World War I, but it takes many scholars many lifetimes to learn  everything. And the more that’s already been figured out, the more work we need to do to find the smallest next piece to understand.

This level of precision is often important, insightful, and necessary to make strides in a field. Whereas before an earth-centered view of the universe was good enough to aid in navigation and predict the zodiac and the seasons, a heliocentric model was required for more precise predictions of the movements of the planets and stars. However, these complex models are not always the best ones for a given situation, and sometimes simplicity and a bit of cleverness can go much further than whatever convoluted equation yields the most precise possible results.

Sticking with the example of astronomy, many marine navigation schools still teach the geocentric model; not because they don’t realize the earth moves, but because navigation is simply easier when you pretend the earth is fixed and everything moves around it. They don’t need to tell you exactly when the next eclipse will be, they just need to figure out where they are. Similarly, your phone can usually pinpoint you within a few blocks by triangulating itself between cellphone towers, without ever worrying about satellites or Einstein.

Geocentric celestial navigation chart from a class assignment.

Whether you need to spend the extra time figuring out relativistic physics or heliocentric astronomical models depends largely on your purpose. If you’re just trying to find your way from Miami to New York City, and for some absurd reason you can only rely on technology you’ve created yourself for navigation, the simpler solution is probably the best way to go.

Simplicity and Macroanalysis

If I’ve written over-long on navigation, it’s because I believe it to be a particularly useful metaphor for large-scale computational humanities. Franco Moretti calls it distant reading, Matthew Jockers calls it macroanalysis, and historians call it… well, I don’t think we’ve come up with a name for it yet. I’d like to think we large-scale computational historians share a lot in spirit with big history, though I rarely see that connection touched on. My advisor likes shifting the focus from what we’re looking at to what we’re looking through, calling tools that help us see the whole of something macroscopes, as opposed to the microscopes which help us reach ever-greater precision.

Whatever you choose to call it, the important point is the shifting focus from precision to contextualization. Rather than exploring a particular subject with ever-increasing detail and care, it’s important to sometimes take a step back and look at how everything fits together. It’s a tough job because ‘everything’ is really quite a lot of things, and it’s easy to get mired in the details. It’s easy to say “well, we shouldn’t look at the data this way because it’s an oversimplification, and doesn’t capture the nuance of the text,” but capturing the nuance of the text isn’t the point. I have to admit, I sent an email to that effect to Matthew Jockers regarding his recent DH2012 presentation,  suggesting that time and similarity were a poor proxy for influence. But that’s not the point, and I was wrong in doing so, because the data still support his argument of authors clustering stylistically and thematically by gender, regardless of whether he calls the edges ‘influence’ or ‘similarity.’

I wrote a post a few months back on halting conditions, figuring out that point when adding more and more detailed data stops actually adding to the insight and instead just contributes to the complexity of the problem. I wrote

Herein lies the problem of humanities big data. We’re trying to measure the length of a coastline by sitting on the beach with a ruler, rather flying over with a helicopter and a camera. And humanists know that, like the sandy coastline shifting with the tides, our data are constantly changing with each new context or interpretation. Cartographers are aware of this problem, too, but they’re still able to make fairly accurate maps.

And this is the crux of the matter. If we’re trying to contextualize our data, if we’re trying to open our arms to collect everything that’s available, we need to keep it simple. We need a map, a simple way to navigate the deluge that is human history and creativity. This map will not be hand-drawn with rulers and yard-sticks, it will be taken via satellite, where only the broadest of strokes are clear. Academia, and especially the humanities, fetishizes the particular at the expense of the general. General knowledge is overview knowledge, is elementary knowledge. Generalist is a dirty word lobbed at popular authors who wouldn’t know a primary source if it fell on their head from the top shelf, covering them in dust and the smell of old paper.

Generality is not a vice. Simplicity can, at times, be a virtue. Sometimes you just want to know where the hell you are.

For these maps, a reasonable approximation is often good enough for most situations. Simple triangulation is good enough to get you from Florida to New York, and simply counting the number of dissertations published at ProQuest in a given year for a particular discipline is good enough to approximate the size of one discipline compared to another. Both lack nuance and are sure to run you into some trouble at the small scale, but often that scale is not necessary for the task at hand.

Two situations clearly shout for reasonable approximations; general views and contextualization. In the image below Matthew Jockers showed that formal properties of novels tend to split around the genders of their authors; that is, men wrote differently and about different things than women.

Network graph of 19th century novels, with nodes (novels) colored according to the gender of their authors.

Of course this macroanalysis lacks a lot of nuance, and one can argue for years which set of measurements might yield the best proxy for novel similarity, but as a base approximation the split is so striking that there is little doubt the apparent split is indicative of something interesting actually going on. Jockers has successfully separated signal from noise. This is a great example of how a simple approximation is good enoughto provide a general overview, a map offering one useful view of the literary landscape.

Beyond general overviews and contextualizations, simple models and quantifications can lead to surprisingly concrete and particular results. Take Strato, a clever observer who died around 2,300 years ago. There’s a lot going on after a rainstorm. The sun glistens off the moist grass, little insects crawl their way out of the ground, water pours from the corners of the roof. Each one of these events are themselves incredibly complex and can be described in a multitude of ways; with water pouring from a roof, for example, you can describe the thickness of the stream, or the impression the splash makes on the ground below, or the various murky colors it might take depending on where it’s dripping from. By isolating one property of the pouring rainwater, the fact that it tends to separate into droplets as it gets closer to the ground, Strato figured out that the water moved faster the further it had fallen. That is, falling bodies accelerate. Exactly measuring that acceleration, and quite how it worked, would elude humankind for over a thousand years, but a very simple observation that tended to hold true in many situations was good enough to discover a profound property of physics.

A great example of using simple observations to look at specific historical developments is Ben Schmidt’s Poor man’s sentiment analysis. By looking at words that occur frequently after the word ‘free’ in millions of books,  Ben is able to show the decreasing centrality of ‘freedom of movement’ after its initial importance in the 1840s, or the drop in the use of ‘free men’ after the fall of slavery. Interesting changes are also apparent in the language of markets and labor which both fit well with our understanding of the history of the concepts, and offer new pathways to study, especially around inflection points.

Ben Schmidt looks at the words that appear most frequently following the word ‘free’ in the google ngrams corpus.

 Toward Simplicity

Complex algorithms and representations are alluring. Ben Fry says of network graphs

Even though a graph of a few hundred nodes quickly becomes unreadable, it is often satisfying for the creator because the resulting figure is elegant and complex and may be subjectively beautiful, and the notion that the creator’s data is “complex” fits just fine with the creator’s own interpretation of it.

And the same is often true of algorithms; the more complex the algorithm, the more we feel it somehow ‘fits’ the data, because we know our data are so complex to begin with. Oftentimes, however, the simplest solution is good enough (and often really good) for the broadest number of applications.

If there is any take-home message of this post, as a follow-up to my previous one on Halting Conditions, it’s that diminishing returns doesn’t just apply to the amount of data you’ve collected, it also applies to the analysis you plan on running them through. More data aren’t always better, and newer, more precise, more complex algorithms aren’t always better.

Spend your time coming up with clever, simpler solutions so you have more time to interpret your results soundly.


Topic Modeling for Humanists: A Guided Tour

It’s that time again! Somebody else posted a really clear and enlightening description of topic modeling on the internet. This time it was Allen Riddell, and it’s so good that it inspired me to write this post about topic modeling that includes no actual new information, but combines a lot of old information in a way that will hopefully be useful. If there’s anything I’ve missed, by all means let me know and I’ll update accordingly.

Introducing Topic Modeling

Topic models represent a class of computer programs that automagically extracts topics from texts. What a topic actually is will be revealed shortly, but the crux of the matter is that if I feed the computer, say, the last few speeches of President Barack Obama, it’ll come back telling me that the president mainly talks about the economy, jobs, the Middle East, the upcoming election, and so forth. It’s a fairly clever and exceptionally versatile little algorithm that can be customized to all sorts of applications, and a tool that many digital humanists would do well to have in their toolbox.

From the outset it’s worth clarifying some vocabulary, and mentioning what topic models can and cannot do. “LDA” and “Topic Model” are often thrown around synonymously, but LDA is actually a special case of topic modeling in general produced by David Blei and friends  in 2002. It was not the first topic modeling tool, but is by far the most popular, and has enjoyed copious extensions and revisions in the years since. The myriad variations of topic modeling have resulted in an alphabet soup of names that might be confusing or overwhelming to the uninitiated; ignore them for now. They all pretty much work the same way.

When you run your text through a standard topic modeling tool, what comes out the other end first is several lists of words. Each of these lists is supposed to be a “topic.” Using the example from before of presidential addresses, the list might look like:

  1. Job Jobs Loss Unemployment Growth
  2. Economy Sector Economics Stock Banks
  3. Afghanistan War  Troops Middle-East Taliban Terror
  4. Election Romney Upcoming President
  5. … etc.

The computer gets a bunch of texts and spits out several lists of words, and we are meant to think those lists represent the relevant “topics” of a corpus. The algorithm is constrained by the words used in the text; if Freudian psychoanalysis is your thing, and you feed the algorithm a transcription of your dream of bear-fights and big caves, the algorithm will tell you nothing about your father and your mother; it’ll only tell you things about bears and caves. It’s all text and no subtext. Ultimately, LDA is an attempt to inject semantic meaning into vocabulary; it’s a bridge, and often a helpful one. Many dangers face those who use this bridge without fully understanding it, which is exactly what the rest of this post will help you avoid.

Network generated by Elijah Meeks to show how digital humanities documents relate to one another via the topics they share.

Learning About Topic Modeling

The pathways to topic modeling are many and more, and those with different backgrounds and different expertise will start at different places. This guide is for those who’ve started out in traditional humanities disciplines and have little background in programming or statistics, although the path becomes more strenuous as we get closer Blei’s original paper on LDA (as that is our goal.) I will try to point to relevant training assistance where appropriate. A lot of the following posts repeat information, but there are often little gems in each which make them all worth reading.

No Experience Necessary

The following posts, read in order, should be completely understandable to pretty much everyone.

The Fable

Perhaps the most interesting place to start is the stylized account of topic modeling by Matt Jockers, who weaves a tale of authors sitting around the LDA buffet, taking from it topics with which to write their novels. According to Jockers, the story begins in a quaint town, . . .

somewhere in New England perhaps. The town is a writer’s retreat, a place they come in the summer months to seek inspiration. Melville is there, Hemingway, Joyce, and Jane Austen just fresh from across the pond. In this mythical town there is spot popular among the inhabitants; it is a little place called the “LDA Buffet.” Sooner or later all the writers go there to find themes for their novels. . .

The blog post is a fun read, and gets at the general idea behind the process of a topic model without delving into any of the math involved. Start here if you are a humanist who’s never had the chance to interact with topic models.

A Short Overview

Clay Templeton over at MITH wrote a short, less-stylized overview of topic modeling which does a good job discussing the trio of issues currently of importance: the process of the model, the software itself, and applications in the humanities.

In this post I map out a basic genealogy of topic modeling in the humanities, from the highly cited paper that first articulated Latent Dirichlet Allocation (LDA) to recent work at MITH.

Templeton’s piece is concise, to the point, and offers good examples of topic models used for applications you’ll actually care about. It won’t tell you any more about the process of topic modeling than Jockers’ article did, but it’ll get you further into the world of topic modeling as it is applied in the humanities.

An Example: The American Political Science Review

Now that you know the basics of what a topic model actually is, perhaps the best thing is to look at an actual example to ground these abstract concepts. David Blei’s team shoved all of the journal articles from The American Political Science Review into a topic model, resulting in a list of 20 topics that represent the content of that journal. Click around on the page; when you click one of the topics, it sends you to a page listing many of the words in that topic, and many of the documents associated with it. When you click on one of the document titles, you’ll get a list of topics related to that document, as well as a list of other documents that share similar topics.

This page is indicative of the sort of output topic modeling will yield on a corpus. It is a simple and powerful tool, but notice that none of the automated topics have labels associated with them. The model requires us to make meaning out of them, they require interpretation, and without fully understanding the underlying algorithm, one cannot hope to properly interpret the results.

First Foray into Formal Description

Written by yours truly, this next description of topic modeling begins to get into the formal process the computer goes through to create the topic model, rather than simply the conceptual process behind it. The blog post begins with a discussion of the predecessors to LDA in an attempt to show a simplified version of how LDA works, and then uses those examples to show what LDA does differently. There’s no math or programming, but the post does attempt to bring up relevant vocabulary and define them in terms familiar to those without programming experiencing.

With this matrix, LSA uses singular value decomposition to figure out how each word is related to every other word. Basically, the more often words are used together within a document, the more related they are to one another. It’s worth noting that a “document” is defined somewhat flexibly. For example, we can call every paragraph in a book its own “document,” and run LSA over the individual paragraphs.

Only the first half of this post is relevant to our topic modeling guided tour. The second half, a section on topic modeling and network analysis, discusses various extended uses that are best left for later.

Computational Process

Ted Underwood provides the next step in understanding what the computer goes through when topic modeling a text.

. . . it’s a long step up from those posts to the computer-science articles that explain “Latent Dirichlet Allocation” mathematically. My goal in this post is to provide a bridge between those two levels of difficulty.

Computer scientists make LDA seem complicated because they care about proving that their algorithms work. And the proof is indeed brain-squashingly hard. But the practice of topic modeling makes good sense on its own, without proof, and does not require you to spend even a second thinking about “Dirichlet distributions.” When the math is approached in a practical way, I think humanists will find it easy, intuitive, and empowering. This post focuses on LDA as shorthand for a broader family of “probabilistic” techniques. I’m going to ask how they work, what they’re for, and what their limits are.

His is the first post that talks in any detail about the iterative process going into algorithms like LDA, as well as some of the assumptions those algorithms make. He also shows the first formula appearing in this guided tour, although those uncomfortable with formulas need not fret. The formula is not essential to understanding the post, but for those curious, later posts will explicate it. And really, Underwood does a great job of explaining a bit about it there.

Be sure to read to the very end of the post. It discusses some of the important limitations of topic modeling, and trepidations that humanists would be wise to heed.  He also recommends reading Blei’s recent article on Probabilistic Topic Models, which will be coming up shortly in this tour.

Computational Process From Another Angle

It may not matter whether you read this or the last article by Underwood first; they’re both first passes to what the computer goes through to generate topics, and they explain the process in slightly different ways. The highlight of Edwin Chen’s blog post is his section on “Learning,” followed a section expanding that concept.

And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability).

This post both explains the meaning of these statistical notations, and tries to actually step the reader through the process using a metaphor as an example, a bit like Jockers’ post from earlier but more closely resembling what the computer is going through. It’s also worth reading through the comments on this post if there are parts that are difficult to understand.

This ends the list of articles and posts that require pretty much no prior knowledge. Reading all of these should give you a great overview of topic modeling, but you should by no means stop here. The following section requires a very little bit of familiarity with statistical notation, most of which can be found at this Wikipedia article on Bayesian Statistics.

Some Experience Required

Not much experience! You can even probably ignore most of the formulae in these posts and still get quite a great deal out of them. Still, you’ll get the most out of the following articles if you can read signs related to probability and summation, both of which are fairly easy to look up on Wikipedia. The dirty little secret of most papers that include statistics is that you don’t actually need to understand all of the formulae to get the gist of the article. If you want to  fully understand everything below, however, I’d highly suggest taking an introductory course or reading a textbook on Bayesian statistics. I second Allen Riddell in suggesting Hoff’s A First Course in Bayesian Statistical Methods (2009), Kruschke’s Doing Bayesian Data Analysis (2010), or Lee’s Bayesian Statistics: An Introduction (2004). My own favorite is Kruschke’s; there are puppies on the cover.

Return to Blei

David Blei co-wrote the original LDA article, and his descriptions are always informative. He recently published a great introduction to probabilistic topic models for those not terribly familiar with it, and although it has a few formulae, it is the fullest computational description of the algorithm, gives a brief overview of Bayesian statistics, and provides a great framework with which to read the following posts in this series. Of particular interest are the sections on “LDA and Probabilistic Models” and “Posterior Computation for LDA.”

LDA and other topic models are part of the larger field of probabilistic modeling. In generative probabilistic modeling, we treat our data as arising from a generative process that includes hidden variables. This generative process defines a joint probability distribution over both the observed and hidden random variables. We perform data analysis by using that joint distribution to compute the conditional distribution of the hidden variables given the observed variables. This conditional distribution is also called the posterior distribution.

Really, read this first. Even if you don’t understand all of it, it will make the following reads easier to understand.

Back to Basics

The post that inspired this one, by Allen Riddell, explains the mixture of unigrams model rather than the LDA model, which allows Riddell to back up and explain some important concepts. The intended audience of the post is those with an introductory background in Bayesian statistics but it offers a lot even to those who do not have that. Of particular interest is the concrete example he uses, articles from German Studies journals, and how he actually walks you through the updating procedure of the algorithm as it infers topic and document distributions.

The second move swaps the position of our ignorance. Now we guess which documents are associated with which topics, making the assumption that we know both the makeup of each topic distribution and the overall prevalence of topics in the corpus. If we continue with our example from the previous paragraph, in which we had guessed that “literary” was more strongly associated with topic two than topic one, we would likely guess that the seventh article, with ten occurrences of the word “literary”, is probably associated with topic two rather than topic one (of course we will consider all the words, not just “literary”). This would change our topic assignment vector to z=(1,1,1,1,1,1,2,1,1,1,2,2,2,2,2,2,2,2,2,2). We take each article in turn and guess a new topic assignment (in many cases it will keep its existing assignment).

The last section, discussing the choice of number of topics, is not essential reading but is really useful for those who want to delve further.

Some Necessary Concepts in Text Mining

Both a case study and a helpful description, David Mimno’s recent article on Computational Historiography from ACM Transactions on Computational Logic goes through a hundred years of Classics journals to learn something about the field (very similar Riddell’s article on German Studies). While the article should be read as a good example of topic modeling in the wild, of specific interest to this guide is his “Methods” section, which includes an important discussion about preparing text for this sort of analysis.

In order for computational methods to be applied to text collections, it is first necessary to represent text in a way that is understandable to the computer. The fundamental unit of text is the word, which we here define as a sequence of (unicode) letter characters. It is important to distinguish two uses of word: a word type is a distinct sequence of characters, equivalent to a dictionary headword or lemma; while a word token is a specific instance of a word type in a document. For example, the string “dog cat dog” contains three tokens, but only two types (dog and cat).

What follows is a description of the primitive objects of a text analysis, and how to deal with variations in words, spelling, various languages, and so forth. Mimno also discusses smoothed distributions and word distance, both important concepts when dealing with these sorts of analyses.

Further Reading

By now, those who managed to get through all of this can probably understand most of the original LDA paper by Blei, Ng, and Jordan (most of it will be review!), but there’s a lot more out there than that original article. Mimno has a wonderful bibliography of topic modeling articles, and they’re tagged by topic to make finding the right one for a particular application that much easier.

Applications: How To Actually Do This Yourself

David Blei’s website on topic modeling has a list of available software, as does a section of Mimno’s Bibliography. Unfortunately, almost everything in those lists requires some knowledge of programming, and as yet I know of no really simple implementation of topic modeling. There are a few implementations for humanists that are supposed to be released soon, but to my knowledge, at the time of this writing the simplest tool to run your text through is called MALLET.

MALLET is a tool that does require a bit of comfort with the command-line, though it’s really just the same four commands or so over and over again. It’s a fairly simply software to run once you’ve gotten the hang of it, but that first part of the learning curve could be a bit more like a learning cliff.

On their website, MALLET has a link called “Tutorial” – don’t click it. Instead, after downloading and installing the software, follow the directions on the “Importing Data” page. Then, follow the directions on the “Topic Modeling” page. If you’re a Windows user, Shawn Graham, Ian Milligan, and I wrote a tutorial on how to get it running when you run into a problem (and if this is your first time, you will), and it also includes directions for Macs. Unfortunately, a more detailed tutorial is beyond the scope of this tour, but between these links you’ve got a good chance of getting your first topic model up and running.

Examples in the DH World

There are a lot of examples of topic modeling out there, and here are some that I feel are representative of the various uses it can be put to. I’ve already mentioned David Mimno’s computational historiography of classics journals, as well as Allen Riddell’s similar study of German Studies publications. Both papers are good examples of using topic modeling as a meta-analysis of a discipline. Turning the gaze towards our collective navels, Matt Jockers used LDA to find what’s hot in the Digital Humanities, and Elijah Meeks has a great process piece looking at topics in definitions of digital humanities and humanities computing.

Lisa Rhody has an interesting exploratory topical analysis of poetry, and Rob Nelson as well discusses (briefly) making an argument via topic modeling applied to poetry, which he expands in this New York Times blog post. Continuing in the literary vein, Ted Underwood talks a bit about the relationship of words to topics, as well as a curious find linking topic models and family relations.

One of the great and oft-cited examples of topic modeling in the humanities is Rob Nelson’s Mining the Dispatch, which looks at the changing discussion during the American Civil War through an analysis of primary texts. Just as Nelson looks at changing topics in the news over time, so too does Newman and Block  in an analysis of eighteenth century newspapers, as well as Yang, Torget, and Mihalcea in a more general look at topic modeling and newspapers. In another application using primary texts, Cameron Blevins uses MALLET to run an in-depth analysis of an eighteenth century diary.

Future Directions

This is not actually another section of the post. This is your conscience telling you to go try topic modeling for yourself.


The Myth of Text Analytics and Unobtrusive Measurement

Text analytics are often used in the social sciences as a way of unobtrusively observing people and their interactions. Humanists tend to approach the supporting algorithms with skepticism, and with good reason. This post is about the difficulties of using words or counts as a proxy for some secondary or deeper meaning. Although I offer no solutions here, readers of the blog will know I am hopeful of the promise of these sorts of measurements if used appropriately, and right now, we’re still too close to the cutting edge to know exactly what that means. There are, however, copious examples of text analytics used well in the humanities (most recently, for example, Joanna Guldi’s  publication on the history of walking).

The Confusion

Klout is a web service which ranks your social influence based on your internet activity. I don’t know how Klout’s algorithm works (and I doubt they’d be terribly forthcoming if I asked), but one of the products of that algorithm is a list of topics about which you are influential. For instance, Klout believes me to be quite influential with regards to Money (really? I don’t even have any of that.) and Journalism (uhmm.. no.), somewhat influential in Juggling (spot on.), Pizza (I guess I am from New York…), Scholarship (Sure!), and iPads (I’ve never touched an iPad.), and vaguely influential on the topic of Cars (nope) and Mining (do they mean text mining?).

By Ildar Sagdejev (Specious) (Own work) [GFDL ( or CC-BY-SA-3.0-2.5-2.0-1.0 (], via Wikimedia Commons
My pizza expertise is clear.
Thankfully careers don’t ride on this measurement (we have other metrics for that), but the danger is still fairly clear: the confusion of vocabulary and syntax for semantics and pragmatics. There are clear layers between the written word and its intended meaning, and those layers often depend on context and prior knowledge. Further, regardless of the intended meaning of the author, how her words are interpreted in the larger world can vary wildly. She may talk about money and pizza until she is blue in the face, but if the whole world disagrees with her, that is no measurement of expertise nor influence (even if angry pizza-lovers frequently shout at her about her pizza opinions).

We see very simple examples of this in sentiment  analysis, a way to extract the attitude of the writer toward whatever it was he’s written. An old friend who recently dipped his fingers in sentiment analysis wrote this:

According to his algorithm, that sentence was a positive one. Unless I seriously misunderstand my social cues (which I suppose wouldn’t be too unlikely), I very much doubt the intended positivity of the author. However, most decent algorithms would pick up that this was a tweet from somebody who was positive about Sarah Jessica Parker.

Unobtrusive Measurements

This particular approach to understanding humans belongs to the larger methodological class of unobtrusive measurements. Generally speaking, this topic is discussed in the context of the social sciences and is contrasted with more ‘obtrusive’ measurements along the lines of interviews or sticking people in labs. Historians generally don’t need to talk about unobtrusive measurements because, hey, the only way we could be obtrusive to our subjects would require exhuming bodies. It’s the idea that you can cleverly infer things about people from a distance, without them knowing that they are being studied.

Notice the disconnect between what I just said, and the word itself. ‘Unobtrusive’ against “without them knowing that they are being studied.” These are clearly not the same thing, and that distinction between definition and word is fairly important – and not merely in the context of this discussion. One classic example (Doob and Gross, 1968) asks how somebody’s social status determines whether someone might take aggressive action against them. They specifically measures a driver’s likelihood to honk his horn in frustration based on the perceived social status of the driver in front of them. Using a new luxury car and an old rusty station wagon, the researchers would stop at traffic lights that had turned green and would wait to see whether the car behind them honked. In the end, significantly more people honked at the low status car. More succinctly: status affects decisions of aggression.  Honking and the perceived worth of the car were used as proxies for aggression and perceptions of status, much like vocabulary is used as a proxy for meaning.

In no world would this be considered unobtrusive from the subject’s point of view. The experimenters intruded on their world, and their actions and lives changed because of it. All it says is that the subjects won’t change their behavior based on the knowledge that they are being studied. However, when an unobtrusive experiment becomes large enough, even one as innocuous as counting words, even that advantage no longer holds. Take, for example, citation analysis and the h-index. Citation analysis was initially construed as an unobtrusive measurement; we can say things about scholars and scholarly communication by looking at their citation patterns rather than interviewing them directly. However, now that entire nations (like Australia or the UK) use quantitative analysis to distribute funding to scholarship, the measurements are no longer unobtrusive. Scholars know how the new scholarly economy works, and have no problem changing their practices to get tenure, funding, etc.

The Measurement and The Effect: Untested Proxies

A paper was recently published (O’Boyle Jr. and Aguinis, 2012) on the non-normality of individual performance. The idea is that we assume that people’s performance (for example students in a classroom) are normally distributed along a bell curve. A few kids get really good grades, a few kids get really bad grades, but most are ‘C’ students. The authors challenge this view, suggesting performance takes on more of a power-law distribution, where very few people perform very well, and the majority perform very poorly, with 80% of people performing worse than the statistical average. If that’s hard to imagine, it’s because people are trained to think of averages on a bell curve, where 50% are greater than average and 50% are worse than average. Instead, imagine one person gets a score of 100, and another five people get scores of 10. The average is (100 + (10 * 5)) / 6 = 25, which means five out of the six people performed worse than average.

It’s an interesting hypothesis, and (in my opinion) probably a correct one, but their paper does not do a great job showing that. The reason is (you guessed it) they use scores as a proxy for performance.  For example, they look at the number of published papers individuals have in top-tier journals, and show that some authors are very productive whereas most are not. However, it’s a fairly widely-known phenomena that in science, famous names are more likely to be published than obscure ones (there are many anecdotes about anonymous papers being rejected until the original, famous author is revealed, at which point the paper is magically accepted). The number of accepted papers may be as much a proxy for fame as it is for performance, so the results do not support their hypothesis. The authors then look at awards given to actors and writers, however those awards suffer the same issues: the more well-known an actor, the the more likely they’ll be used in good movies, the more likely they’ll be visible to award-givers, etc. Again, awards are not a proxy for the quality of a performance. The paper then goes on to measure elected officials based on votes in elections. I don’t think I need to go on about how votes might not map one-to-one on the performance and prowess of an elected official.

I blogged a review of the most recent culturomics paper, which used google ngrams to look at the frequency of recurring natural disasters (earthquakes, floods, etc.) vs. the frequency of recurring social events (war, unemployment, etc.). The paper concludes that, because of differences in the frequency of word-use for words like ‘war’ or ‘earthquake’, the phenomena themselves are subject to different laws. The authors use word frequency as a proxy for the frequency of the events themselves, much in the same way that Klout seems to measure influence based on word-usage and counting. The problem, of course, is that the processes which govern what people decide to write down do not enjoy a one-to-one relationship to what people experience. Using words as proxies for events is just as problematic as using them for proxies of expertise, influence, or performance. The underlying processes are simply far more complicated than these algorithms give them credit for.

It should be noted, however, that the counts are not meaningless; they just don’t necessarily work as proxies for what these ngram scholars are trying to measure. Further, although the underlying processes are quite complex, the effect size of social or political pressure on word-use may be negligible to the point that their hypothesis is actually correct. The point isn’t that one cannot use one measurement as a proxy for something else; rather, the effectiveness of that proxy is assumed rather than actually explored or tested in any way. We need to do a better job, especially as humanists, of figuring out exactly how certain measurements map onto effects we seek.

A beautiful case study that exemplifies this point was written by famous statistician Andrew Gelman, and it aims to use unobtrusive and indirect measurements to find alien attacks and zombie outbreaks. He uses Google Trends to show that the number of zombies in the world are growing at a frightening rate.

Zombies will soon take over!



Halting Conditions

Occasionally, in computer science, the term “halting condition” is thrown around as the point at which the program should stop running.

Say I’ve got a robot that watches my roommate and I play scrabble, and I want it to count how many scrabble pieces we use, and tell us who won and what the highest scoring word was. Unfortunately, let’s say, I’m also Superman, so our scrabble games frequently end early when I hear cries for help and run off to the nearest phone booth. Our robot has to decide what conditions mean the game is over so it can give us the winner report; in this case, it is either when one player runs out of pieces, or when nobody plays a piece for a significant amount of time, because games often end early. Those are our halting conditions.

Scrabble Robot from

When it comes to data collection, humanists have no halting conditions. We don’t even have decent halting heuristics. Lisa Rhody just blogged a fantastically important piece about the difficulties of data collection in the humanities, and her points are worth stressing. “You need to know,” Rhody writes, “when it’s time to cut the rope and release what might be done.” She points out that humanists need to be discerning in what data we do collect, and we need to be comfortable with analyzing and releasing imperfect data. “The decision not to be perfect is the right choice, but it isn’t an easy one.”

Research Design

Many (but not all!) of the natural sciences have it easy. You design an experiment, you get the data you planned to get, then you analyze and release it. The halting conditions, when to stop collecting and cleaning data, are usually fairly easily pre-determined and stuck to. Psychology and the social sciences are usually similar; they often either use data that already exists, or else collect it themselves under pre-specified conditions.

The humanities, well… we’re used to a tradition that involves very deep and particular reading. The tiniest stones of our studied objects do not go unturned. The idea that a first pass, an incomplete pass, can lead to anything at all, let alone analysis and release, is almost anathema to the traditional humanistic mindset.

Herein lies the problem of humanities big data. We’re trying to measure the length of a coastline by sitting on the beach with a ruler, rather flying over with a helicopter and a camera. And humanists know that, like the sandy coastline shifting with the tides, our data are constantly changing with each new context or interpretation. Cartographers are aware of this problem, too, but they’re still able to make fairly accurate maps.

Measuring the Coast

While I won’t suggest that humanists should take a more natural-scientific approach to research, beginning with a specific hypothesis and pre-specified data that could either confirm or deny it, we should look to them for inspiration on how to plan research. Thinking about what sort of specific analyses you’d like to perform with the data at the end can reasonably constrain what you try to collect from the beginning. Think about what bits of data are redundant, or would yield diminishing returns on your time and money investment of data collection.

Being Comfortable With Imperfection

In her blog post, Lisa wrote about her experience at MITH. She had a four month fellowship to research 4,500 poems; she could easily have spent the whole time collecting increasingly minute data about each poem. In the end, she settled on only collecting the gender of the poet and whether the poem pertained to a work of art, opting not to include information like when each poem was published, what work of art it referred to, etc. She would then go in later and use other large-scale analytic tools (like text analysis), augmenting those results with the tags she entered about each poem.

A lot of valuable, rich information was lost in this data collection, but the important thing is that Lisa was still able to go in with a specific question, and collect only that which she needed most to explore it. The data may not have been perfect, and they may not have described everything, but they were sufficient and useful.

Her story reminded me a  lot of my undergraduate years. I spent all of them collecting data on early modern letters for my old advisor. Letters, of course, generally have various locations and dates attached to them, and this presented us with no end of problems. Sometimes the places mentioned were cities, or houses, or states; granularities differed. Over the course of two hundred years, cities would change names, move, or wink out of or into existence entirely. Sometimes they would subsumed into new or different empires. Computers, unfortunately, need fairly regularized data to perform comparative analyses, so we had to make a lot of editorial decisions when entering locations that would make answering our questions easier, but would lose some of the nuance otherwise available.

Similarly, my colleague Jeana Jorgensen recently spent several months painstakingly hand-collecting data about the usage of body parts in fairy tales for her dissertation. Of particular interest in her case was the overtly interpretive layer she added to the collection; for example, did a reference somehow embody the “grotesque?” By allowing herself the freedom to use interpretive frameworks, she embraced the subjective nature of data collection, and was able to analyze her data accordingly.

Of course, by allowing this sort of humanistic nuance, the amount of data one could collect for any single sentence is effectively infinite, and so Jeana had to constrain herself to only collecting for that which she could eventually use. It nevertheless took her months of daily collection, but if she tried to make her data perfect or complete, it would have taken her over a lifetime. She still managed to produce really interesting and thoughtful results for her dissertation.

Perfect or complete data is impossible in the humanities. The best we can do is not as much as we can, but as much as we need. There is a point of diminishing return for data collection; that point at which you can’t measure the coastline fast enough before the tides change it. We as humanists have to become comfortable with incompleteness and imperfection, and trust that in aggregate those data can still tell us something, even if they can’t reveal everything.

We can still see the landscape, even though not every piece is in place.

The trick and art is knowing the right halting conditions. How much is too much? What data will actually be useful? These are not easy questions, and their answers differ for every project. The important thing to remember is to just do it. Too many projects get hung up because they just haven’t quite collected enough yet, or if they just spend a few more months cleaning their data will be so much better. There will never be a point when your data are perfect. Do your analysis now, release it, and be comfortable with the fact that you’ve fairly accurately mapped the coastline, even if you haven’t quite worked out the jitters of the tides.