Not Enough Perspectives, Pt. 1

Right now DH is all texts, but not enough perspectives. –Andrew Piper

Summary: Digital Humanities suffers from a lack of perspectives in two ways: we need to focus more on the perspectives of those who interact with the cultural objects we study, and we need more outside academic perspectives. In Part 1, I cover Russian Formalism, questions of validity, and what perspective we bring to our studies. In Part 2, 1 I call for pulling inspiration from even more disciplines, and for the adoption and exploration of three new-to-DH concepts: Appreciability, Agreement, and Appropriateness. These three terms will help tease apart competing notions of validity.


Syuzhet

Let’s begin with the century-old Russian Formalism, because why not? 2 Syuzhet, in that context, is juxtaposed against fabula. Syuzhet is a story’s order, structure, or narrative framework, whereas fabula is the underlying fictional reality of the world. Fabula is the story the author wants to get across, and syuzhet is the way she decides to tell it.

It turns out elements of Russian Formalism are resurfacing across the digital humanities, enough so that there’s an upcoming Stanford workshop on DH & Russian Formalism, and I even co-authored a piece that draws on the work of Russian formalists. Syuzhet itself has taken on a new meaning in the context of digital humanities: it’s a piece of code that chews through books and spits out plot structures.
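Jockers’ implementation is an R package; purely to illustrate the general idea (this is not his code, and the toy lexicon and smoothing window below are invented), a minimal Python sketch of sentence-level sentiment smoothed into an arc might look like this:

```python
# Toy sketch of a syuzhet-style plot arc: score each sentence with a small
# sentiment lexicon, then smooth the series. The lexicon here is a stand-in
# for the real dictionaries (AFINN, Bing, NRC, etc.) such tools rely on.
import re

TOY_LEXICON = {"love": 2, "joy": 2, "hope": 1, "happy": 1,
               "death": -2, "grief": -2, "fear": -1, "loss": -1}

def sentence_sentiments(text):
    """Sum lexicon values over the words of each sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sum(TOY_LEXICON.get(w, 0) for w in re.findall(r"[a-z']+", s.lower()))
            for s in sentences]

def plot_arc(scores, window=3):
    """Smooth raw sentence scores with a simple moving average."""
    if len(scores) < window:
        return scores
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

novel = ("She was full of hope and joy. Then grief arrived. "
         "Death followed loss. But in the end there was love.")
raw = sentence_sentiments(novel)   # [3, -2, -3, 2]
print(plot_arc(raw))               # the smoothed "plot arc"
```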

You may have noticed a fascinating discussion developing recently around the statistical analysis of plot arcs in novels using sentiment analysis. Much of the buzz has revolved around Matt Jockers and Annie Swafford, and the discussion has bled into larger academia and inspired 246 (and counting) comments on reddit. Eileen Clancy has written a two-part broad link summary (I & II).

From Jockers’ first post describing his method of deriving plot structure from running sentiment analysis on novels.

The idea of deriving plot arcs from sentiment analysis has proven controversial on a number of fronts, and I encourage those interested to read through the links to learn more. The discussion I’ll point to here centers around “validity”, a word being used differently by different voices in the conversation. These include:

  • Do sentiment analysis algorithms agree with one another enough to be considered valid?
  • Do sentiment analysis results agree with humans performing the same task enough to be considered valid?
  • Is Jockers’ instantiation of aggregate sentiment analysis validly measuring anything besides random fluctuations?
  • Is aggregate sentiment analysis, by human or machine, a valid method for revealing plot arcs?
  • If aggregate sentiment analysis finds common but distinct patterns and they don’t seem to map onto plot arcs, can they still be valid measurements of anything at all?
  • Can a subjective concept, whether measured by people or machines, actually be considered invalid or valid?

The list goes on. I contributed to a Twitter discussion on the topic a few weeks back. Most recently, Andrew Piper wrote a blog post around validity in this discussion.

Hermeneutics of DH, from Piper’s blog.

In this particular iteration of the discussion, validity implies a connection between the algorithm’s results and some interpretive consensus among experts. Piper points out that consensus doesn’t yet exist, because:

We have the novel data, but not the reader data. Right now DH is all texts, but not enough perspectives.

And he’s right. So far, DH seems to focus its scaling up efforts on the written word, rather than the read word.

This doesn’t mean we’ve ignored studying large-scale reception. In fact, I’m about to argue that reception is built into our large corpora text analyses, even though it wasn’t by design. To do so, I’ll discuss the tension between studying what gets written and what gets read through distant reading.

The Great Unread

The Great Unread is a phrase popularized by Franco Moretti 3 to indicate the lost literary canon. In his own words:

[…] the “lost best-sellers” of Victorian Britain: idiosyncratic works, whose staggering short-term success (and long-term failure) requires an explanation in their own terms.

The phrase has since become synonymous with large text databases like Google Books or HathiTrust, and is used in concert with distant reading to set digital literary history apart from its analog counterpart. Distant reading The Great Unread, it’s argued,

significantly increase[s] the researcher’s ability to discuss aspects of influence and the development of intellectual movements across a broader swath of the literary landscape. –Tangherlini & Leonard

Which is awesome. As I understand it, literary history, like history in general, suffers from an exemplar problem. Researchers take a few famous (canonical) books, assume they’re a decent (albeit shining) example of their literary place and period, and then make claims about culture, art, and so forth based on those few available novels.

Matthew Lincoln raised this point the other day, as did Matthew Wilkens in his recent article on DH in the study of literature and culture. Essentially, both distant- and close-readers make part-to-whole generalized inferences, but the process of distant reading forces those generalizations to become formal and explicit. And hopefully, by looking at The Great Unread (the tens of thousands of books that never made it into the canon), claims about culture can better represent the nuanced literary world of the past.

Franco Moretti’s Distant Reading.

But this is weird. Without exemplars, what the heck are we studying? This isn’t a representation of what’s stood the test of time—that’s the canon we know and love. It’s also not a representation of what was popular back then (well, it sort of was, but more on that shortly), because we don’t know anything about circulation numbers. Most of these Google-scanned books surely never caught the public eye, and many of the now-canonical pieces of literature may not have been popular at the time.

It turns out we kinda suck at figuring out readership statistics, or even at figuring out what was popular at any given time, unless we know what we’re looking for. A folklorist friend of mine has called this the Sophus Bauditz problem. An expert in 19th century Danish culture, my friend one day stumbled across a set of nicely-bound books written by Sophus Bauditz. They were in his era of expertise, but he’d never heard of these books. “Must have been some small print run”, he thought to himself, before doing some research and discovering copies of these books he’d never heard of were everywhere in private collections. They were popular books for the emerging middle class, and sold an order of magnitude more copies than most books of the era; they’d just never made it into the canon. In another century, 50 Shades of Grey will likely suffer the same fate.

Tsundoku

In this light, I find The Great Unread to be a weird term.  The Forgotten Read, maybe, to refer to those books which people actually did read but were never canonized, and The Great Tsundoku 4 for those books which were published, lasted to the present, and became digitized, but for which we have no idea whether anyone bothered to read them. The former would likely be more useful in understanding reception, cultural zeitgeist, etc.; the latter might find better use in understanding writing culture and perhaps authorial influence (by seeing whose styles the most other authors copy).

Tsundoku is Japanese for the ever-increasing pile of unread books that have been purchased and added to the queue. Illustrated by Reddit user Wemedge’s 12-year-old daughter.

Even in today’s data-rich world, we can still only grasp at circulation and readership numbers. Library circulation provides some clues, as does the number, size, and sales of print editions. It’s not perfect, of course, though it might be useful in separating zeitgeist from actual readership numbers.

Mathematician Jordan Ellenberg recently coined the tongue-in-cheek Hawking Index to measure just that, so named because Stephen Hawking’s books are frequently purchased but rarely read. In his Wall Street Journal article, Ellenberg looked at popular books sold on Amazon Kindle to see where people tended to socially highlight their favorite passages. Highlights from Kahneman’s “Thinking, Fast and Slow”, Hawking’s “A Brief History of Time”, and Piketty’s “Capital in the Twenty-First Century” all tended to cluster in the first few pages of the books, suggesting people simply stopped reading once they got a few chapters in.
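The index itself is back-of-the-envelope arithmetic: roughly, the average position of a book’s most popular highlighted passages expressed as a share of the book’s length. A hedged sketch, with page numbers invented for illustration rather than drawn from Ellenberg’s data:

```python
# Rough sketch of the tongue-in-cheek Hawking Index: the mean page of a
# book's most-highlighted passages as a percentage of its total length.
# The numbers below are invented, not Ellenberg's.
def hawking_index(top_highlight_pages, total_pages):
    """Mean page of the top highlights, as a % of the book's length."""
    return 100 * sum(top_highlight_pages) / len(top_highlight_pages) / total_pages

# A long book whose popular highlights all sit in the opening chapters
# scores low: readers apparently stopped early.
print(round(hawking_index([9, 14, 26, 32, 41], 700), 1))  # ~3.5
```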

Kindle and other ebooks certainly complicate matters. It’s been claimed that one reason behind 50 Shades of Grey’s success was the fact that people could purchase and read it discreetly, digitally, without worrying about embarrassment. Digital sales outnumbered print sales for some time into its popularity. As Dan Cohen and Jennifer Howard pointed out, it’s remarkably difficult to understand the ebook market, and the market is quite different among different constituencies. Ebook sales accounted for 23% of the book market this year, yet 50% of romance books are sold digitally.

And let’s not even get into readership statistics for novels that are out of copyright, or sold used, or illegally obtained: they’re pretty much impossible to count. Consider It’s a Wonderful Life (yes, the 1946 Christmas movie). A clerical accident pushed the movie into the public domain (sort of) in 1974. It had never really been popular before then, but once TV stations could air it without paying royalties and VHS companies could legally produce and sell copies, the movie shot to popularity. Importantly, it shot to popularity in a way that was impossible to see on official license reports, but which Google ngrams reveals quite clearly.

Google ngram count of It’s a Wonderful Life, showing its rise to popularity after the 1974 copyright lapse.

This ngram visualization does reveal one good use for The Great Tsundoku, and that’s to use what authors are writing about as a finger on the pulse of what people care to write about. This can also be used to track things like linguistic influence. It’s likely no coincidence, for example, that American searches for the word “folks” doubled during the first months of President Obama’s bid for the White House in 2007. 5

American searches for the word “folks” during Obama’s first presidential bid.

Matthew Jockers has picked up on this capability of The Great Tsundoku for literary history in his analyses of 19th century literature. He compares books by various similar features, and uses that in a discussion of literary influence. Obviously the causal chain is a bit muddled in these cases, culture being ouroboric as it is, and containing a great many more influencing factors than published books alone, but it’s a good set of first steps.

But this brings us back to the question of The Great Tsundoku vs. The Forgotten Read, or, what are we learning about when we distant read giant messy corpora like Google Books? This is by no means a novel question. Ted Underwood, Matt Jockers, Ben Schmidt, and I had an ongoing discussion on corpus representativeness a few years back, and it’s been continuously pointed to by corpus linguists 6 and literary historians for some time.

Surely there’s some appreciable difference when analyzing what’s often read versus what’s written?

Surprise! It’s not so simple. Ted Underwood points out:

we could certainly measure “what was printed,” by including one record for every volume in a consortium of libraries like HathiTrust. If we do that, a frequently-reprinted work like Robinson Crusoe will carry about a hundred times more weight than a novel printed only once.

He continues:

if we’re troubled by the difference between “what was written” and “what was read,” we can simply create two different collections — one limited to first editions, the other including reprints and duplicate copies. Neither collection is going to be a perfect mirror of print culture. Counting the volumes of a novel preserved in libraries is not the same thing as counting the number of its readers. But comparing these collections should nevertheless tell us whether the issue of popularity makes much difference for a given research question.

While his claim skirts the sorts of issues raised by Ellenberg’s Hawking Index, it does present a very reasonable natural experiment: if you ask the same question of three databases (1. The entire messy, reprint-ridden corpus; 2. Single editions of The Forgotten Read, those books which were popular whether canonized or not; 3. The entire Great Tsundoku, everything that was printed at least once, regardless of whether it was read), what will you find?
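The experiment’s logic is easy to express in code: run one measurement against each corpus definition and compare the resulting trends. The sketch below is hypothetical (placeholder corpora, a placeholder word), not Underwood’s data or code:

```python
# Ask the same question of differently defined corpora and compare trends.
# Corpus contents here are invented placeholders.
from collections import Counter, defaultdict

def yearly_relative_freq(corpus, word):
    """corpus: iterable of (year, text). Returns {year: hits per 1000 tokens}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for year, text in corpus:
        tokens = text.lower().split()
        hits[year] += Counter(tokens)[word]
        totals[year] += len(tokens)
    return {y: 1000 * hits[y] / totals[y] for y in sorted(totals)}

corpora = {
    "everything_scanned": [(1800, "the castle stood grim"),
                           (1800, "the castle stood grim"),   # a reprint
                           (1800, "a quiet village tale"),
                           (1810, "a quiet village tale")],
    "first_editions_only": [(1800, "the castle stood grim"),
                            (1800, "a quiet village tale"),
                            (1810, "a quiet village tale")],
}

for name, corpus in corpora.items():
    print(name, yearly_relative_freq(corpus, "castle"))
```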

Underwood performed 2/3rds of this experiment, comparing The Forgotten Read against the entire HathiTrust corpus on an analysis of the emergence of literary diction. He found that the trend results across both were remarkably similar.

Underwood’s analysis of all HathiTrust prose (47,549 volumes, left), vs. The Forgotten Read (773 volumes, right).

Clearly they’re not precisely the same, but the fact that their trends are so similar suggests that the HathiTrust corpus at least shares some traits with The Forgotten Read. The jury is out on the extent of those shared traits, or on whether it shares as much with The Great Tsundoku.

The cause of the similarities between historically popular books and books that made it into HathiTrust should be apparent: 7 historically popular books were more frequently reprinted and thus, eventually, more editions made it into the HathiTrust corpus. Also, as Allen Riddell showed, it’s likely that less than 60% of published prose from that period has been scanned, and novels with multiple editions are more likely to appear in the HathiTrust corpus.

This wasn’t actually what I was expecting. I figured the HathiTrust corpus would track more closely to what’s written than to what’s read—and we need more experiments to confirm that’s not the case. But as it stands now, we may actually expect these corpora to reflect The Forgotten Read, a continuously evolving measurement of readership and popularity. 8

Lastly, we can’t assume that greater popularity results in larger print runs in every case, or that those larger print runs would be preserved. Ephemera such as zines and comics, digital works produced in the 1980s, and brittle books printed on acidic paper in the 19th century all have their own increased likelihoods of vanishing. So too does work written by minorities, by the subjected, by the conquered.

The Great Unreads

There are, then, quite a few Great Unreads. The Great Tsundoku was coined with tongue planted firmly in cheek, but we do need a way of talking about the many varieties of Great Unreads, which include but aren’t limited to:

  • Everything ever written or published, along with size of print run, number of editions, etc. (Presumably Moretti’s The Great Unread.)
  • The set of writings which by historical accident ended up digitized.
  • The set of writings which by historical accident ended up digitized, cleaned up with duplicates removed, multiple editions connected and encoded, etc. (The Great Tsundoku.)
  • The set of writings which by historical accident ended up digitized, adjusted for disparities in literacy, class, document preservation, etc. (What we might see if history hadn’t stifled so many voices.)
  • The set of things read proportional to what everyone actually read. (The Forgotten Read.)
  • The set of things read proportional to what everyone actually read, adjusted for disparities in literacy, class, etc.
  • The set of writings adjusted proportionally by their influence, such that highly influential writings are over-represented, no matter how often they’re actually read. (This will look different over time; in today’s context this would be closest to The Canon. Historically it might track closer to a Zeitgeist.)
  • The set of writings which attained mass popularity but little readership and, perhaps, little influence. (Ellenberg’s Hawking-Index.)

And these are all confounded by hazy definitions of publication; slowly changing publication culture; geographic, cultural, or other differences which influence what is being written and read; and so forth.

The important point is that reading at scale is not clear-cut. This isn’t a neglected topic, but neither have we laid much groundwork for formal, shared notions of “corpus”, “collection”, “sample”, and so forth in the realm of large-scale cultural analysis. We need to, if we want to get into serious discussions of validity. Valid with respect to what?

This concludes Part 1. Part 2 will get into the finer questions of validity, surrounding syuzhet and similar projects, and will introduce three new terms (Appreciability, Agreement, and Appropriateness) to approach validity in a more humanities-centric fashion.

Notes:

  1. Coming in a few weeks because we just received our proofs for The Historian’s Macroscope and I need to divert attention there before finishing this.
  2. And anyway I don’t need to explain myself to you, okay? This post begins where it begins. Syuzhet.
  3. The phrase was originally coined by Margaret Cohen.
  4. (see illustration below)
  5. COCA and other corpus tools show the same trend.
  6. Heather Froehlich always has good commentary on this matter.
  7. Although I may be reading this as a just-so story, as Matthew Lincoln pointed out.
  8. This is a huge oversimplification. I’m avoiding getting into regional, class, racial, etc. differences, because popularity obviously isn’t universal. We can also argue endlessly about representativeness, e.g. whether the fact that men published more frequently than women should result in a corpus that includes more male-authored works than female-authored, or whether we ought to balance those scales.

Bridging Token and Type

There’s an oft-spoken and somewhat strawman tale of how the digital humanities is bridging C.P. Snow’s “Two Cultures” divide, between the sciences and the humanities. This story is sometimes true (it’s fun putting together Ocean’s Eleven-esque teams comprising every discipline needed to get the job done) and sometimes false (plenty of people on either side still view the other with skepticism), but as a historian of science, I don’t find the divide all that interesting. As Snow’s title suggests, this divide is first and foremost cultural. There’s another overlapping divide, a bit more epistemological, methodological, and ontological, which I’ll explore here. It’s the nomothetic (type) / idiographic (token) divide, and I’ll argue here that not only are its barriers falling, but also that the distinction itself is becoming less relevant.

Nomothetic (Greek for “establishing general laws”-ish) and Idiographic (Greek for “pertaining to the individual thing”-ish) approaches to knowledge have often split the sciences and the humanities. I’ll offload the hard work onto Wikipedia:

Nomothetic is based on what Kant described as a tendency to generalize, and is typical for the natural sciences. It describes the effort to derive laws that explain objective phenomena in general.

Idiographic is based on what Kant described as a tendency to specify, and is typical for the humanities. It describes the effort to understand the meaning of contingent, unique, and often subjective phenomena.

These words are long and annoying to keep retyping, and so in the longstanding humanistic tradition of using new words for words which already exist, henceforth I shall refer to nomothetic as type and idiographic as token. 1 I use these because a lot of my digital humanities readers will be familiar with their use in text mining. If you counted the number of unique words in a text, you’d be counting the number of types. If you counted the number of total words in a text, you’d be counting the number of tokens, because each token (word) is an individual instance of a type. You can think of a type as the platonic ideal of the word (notice the word typical?), floating out there in the ether, and every time it’s actually used, it’s one specific token of that general type.
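In text-mining terms the distinction is just unique word forms versus total word occurrences, which takes only a few lines to show:

```python
# Types vs. tokens: distinct word forms vs. total word occurrences.
text = "the cat sat on the mat"
tokens = text.split()   # every individual occurrence
types = set(tokens)     # the distinct word forms

print(len(tokens))  # 6 tokens
print(len(types))   # 5 types ("the" appears twice)
```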

The Token/Type Distinction

Usually the natural and social sciences look for general principles or causal laws, of which the phenomena they observe are specific instances. A social scientist might note that every time a student buys a $500 textbook, they actively seek a publisher to punch, but when they purchase $20 textbooks, no such punching occurs. This leads to the discovery of a new law linking student violence with textbook prices. It’s worth noting that these laws can and often are nuanced and carefully crafted, with an awareness that they are neither wholly deterministic nor ironclad.

The humanities (or at least history, which I’m more familiar with) are more interested in what happened than in what tends to happen. Without a doubt there are general theories involved, just as in the social sciences there are specific instances, but the intent is most-often to flesh out details and create a particular internally consistent narrative. They look for tokens where the social scientists look for types. Another way to look at it is that the humanist wants to know what makes a thing unique, and the social scientist wants to know what makes a thing comparable.

It’s been noted these are fundamentally different goals. Indeed, how can you, in the same research, articulate the subjective contingency of an event while simultaneously using it to formulate some general law, applicable in all such cases? Rather than answer that question, it’s worth taking time to survey some recent research.

A recent digital humanities panel at MLA elicited responses by Ted Underwood and Haun Saussy, of which this post is in part itself a response. One of the papers at the panel, by Long and So, explored the extent to which haiku-esque poetry preceded what is commonly considered the beginning of haiku in America by about 20 years. They do this by teaching the computer the form of the haiku, and having it algorithmically explore earlier poetry looking for similarities. Saussy comments on this work:

[…] macroanalysis leads us to reconceive one of our founding distinctions, that between the individual work and the generality to which it belongs, the nation, context, period or movement. We differentiate ourselves from our social-science colleagues in that we are primarily interested in individual cases, not general trends. But given enough data, the individual appears as a correlation among multiple generalities.

One of the significant difficulties faced by digital humanists, and a driving force behind critics like Johanna Drucker, is the fundamental opposition between the traditional humanistic value of stressing subjectivity, uniqueness, and contingency, and the formal computational necessity of filling a database with hard decisions. A database, after all, requires you to make a series of binary choices in well-defined categories: is it or isn’t it an example of haiku? Is the author a man or a woman? Is there an author or isn’t there an author?

Underwood addresses this difficulty in his response:

Though we aspire to subtlety, in practice it’s hard to move from individual instances to groups without constructing something like the sovereign in the frontispiece for Hobbes’ Leviathan – a homogenous collection of instances composing a giant body with clear edges.

But he goes on to suggest that the initial constraint of the digital media may not be as difficult to overcome as it appears. Computers may even offer us a way to move beyond the categories we humanists use, like genre or period.

Aren’t computers all about “binary logic”? If I tell my computer that this poem both is and is not a haiku, won’t it probably start to sputter and emit smoke?

Well, maybe not. And actually I think this is a point that should be obvious but just happens to fall in a cultural blind spot right now. The whole point of quantification is to get beyond binary categories — to grapple with questions of degree that aren’t well-represented as yes-or-no questions. Classification algorithms, for instance, are actually very good at shades of gray; they can express predictions as degrees of probability and assign the same text different degrees of membership in as many overlapping categories as you like.
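To make that point concrete: any probabilistic classifier can assign a poem a degree of membership rather than a hard yes-or-no. The sketch below uses scikit-learn’s logistic regression on invented toy features; it illustrates the principle, not Long and So’s actual model.

```python
# A classifier need not answer "haiku or not" with a binary: it can report
# degrees of membership. Features and labels here are invented toy data.
from sklearn.linear_model import LogisticRegression

# Toy features per poem: [line count, mean syllables per line, nature-word count]
X_train = [[3, 5.7, 2], [3, 5.3, 3], [14, 10.1, 0], [4, 8.0, 1]]
y_train = [1, 1, 0, 0]  # 1 = labeled haiku-like, 0 = not

model = LogisticRegression().fit(X_train, y_train)

candidate = [[3, 6.0, 1]]  # an earlier poem that resembles the form
prob = model.predict_proba(candidate)[0][1]
print(f"degree of haiku-ness: {prob:.2f}")  # a shade of gray, not a yes/no
```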

Here we begin to see how the questions asked of digital humanists (on the one side; computational social scientists are tackling these same problems) are forcing us to reconsider the divide between the general and the specific, as well as the meanings of categories and typologies we have traditionally taken for granted. However, this does not yet cut across the token/type divide: this has gotten us to the macro scale, but it does not address general principles or laws that might govern specific instances. Historical laws are a murky subject, prone to inducing fits of anti-deterministic rage. Complex Systems Science and the lessons we learn from Agent-Based Modeling, I think, offer us a way past that dilemma, but more on that later.

For now, let’s talk about influence. Or diffusion. Or intertextuality. 2 Matthew Jockers has been exploring these concepts, most recently in his book Macroanalysis. The undercurrent of his research (I think I’ve heard him call it his “dangerous idea”) is a thread of almost-determinism. It is the simple idea that an author’s environment influences her writing in profound and easy to measure ways. On its surface it seems fairly innocuous, but it’s tied into a decades-long argument about the role of choice, subjectivity, creativity, contingency, and determinism. One word that people have used to get around the debate is affordances, and it’s as good a word as any to invoke here. What Jockers has found is a set of environmental conditions which afford certain writing styles and subject matters to an author. It’s not that authors are predetermined to write certain things at certain times, but that a series of factors combine to make the conditions ripe for certain writing styles, genres, etc., and not for others. The history of science analog would be the idea that, had Einstein never existed, relativity and quantum physics would still have come about; perhaps not as quickly, and perhaps not from the same person or in the same form, but they were ideas whose time had come. The environment was primed for their eventual existence. 3

An example of shape affording certain actions by constraining possibilities and influencing people. [via]
It is here we see the digital humanities battling with the token/type distinction, and finding that distinction less relevant to its self-identification. It is no longer a question of whether one can impose or generalize laws on specific instances, because the axes of interest have changed. More and more, especially under the influence of new macroanalytic methodologies, we find that the specific and the general contextualize and augment each other.

The computational social sciences are converging on a similar shift. Jon Kleinberg likes to compare some old work by Stanley Milgram 4, where he had people draw maps of cities from memory, with digital city reconstruction projects which attempt to bridge the subjective and objective experiences of cities. The result in both cases is an attempt at something new: not quite objective, not quite subjective, and not quite intersubjective. It is a representation of collective individual experiences which in its whole has meaning, but also can be used to contextualize the specific. That these types of observations can often lead to shockingly accurate predictive “laws” isn’t really the point; they’re accidental results of an attempt to understand unique and contingent experiences at a grand scale. 5

Manhattan. Dots represent where people have taken pictures; blue dots are by locals, red by tourists, and yellow are uncertain. [via Eric Fischer]
It is no surprise that the token/type divide is woven into the subjective/objective divide. However, as Daston and Galison have pointed out, objectivity is not an ahistorical category. 6 It has a history, is only positively defined in relation to subjectivity, and neither were particularly useful concepts before the 19th century.

I would argue, as well, that the nomothetic and idiographic divide is one which is outliving its historical usefulness. Work from both the digital humanities and the computational social sciences is converging to a point where the objective and the subjective can peaceably coexist, where contingent experiences can be placed alongside general predictive principles without any cognitive dissonance, under a framework that allows both deterministic and creative elements. It is not that purely nomothetic or purely idiographic research will no longer exist, but that they no longer represent a binary category which can usefully differentiate research agendas. We still have Snow’s primary cultural distinctions, of course, and a bevy of disciplinary differences, but it will be interesting to see where this shift in axes takes us.

Notes:

  1. I am not the first to do this. Aviezer Tucker (2012) has a great chapter in The Oxford Handbook of Philosophy of Social Science, “Sciences of Historical Tokens and Theoretical Types: History and the Social Sciences” which introduces and historicizes the vocabulary nicely.
  2. Underwood’s post raises these points, as well.
  3. This has sometimes been referred to as environmental possibilism.
  4. Milgram, Stanley. 1976. “Psychological Maps of Paris.” In Environmental Psychology: People and Their Physical Settings, edited by Proshansky, Ittelson, and Rivlin, 104–124. New York.

    ———. 1982. “Cities as Social Representations.” In Social Representations, edited by R. Farr and S. Moscovici, 289–309.

  5. If you’re interested in more thoughts on this subject specifically, I wrote a bit about it in relation to single-authorship in the humanities here.
  6. Daston, Lorraine, and Peter Galison. 2007. Objectivity. New York, NY: Zone Books.

Liveblogged Review of Macroanalysis by Matthew L. Jockers, Part 2

I just got Matthew L. Jockers’ Macroanalysis in the mail, and I’m excited enough about it to liveblog my review. Here’s the review of part II (Analysis), chapter 5 (metadata). Read Part 1, Part 3, …

Part II: Analysis

Part II of Macroanalysis moves from framing the discussion to presenting a series of case studies around a theme, starting fairly simply in claims and types of analyses and moving into the complex. This section takes up 130 of the 200 pages; in a discipline (or whatever DH is) which has coasted too long on claims that the proof of its utility will be in the pudding (eventually), it’s refreshing to see a book that is at least 65% pudding. That said, with so much substance – particularly with so much new substance – Jockers opens his arguments up for specific critiques.

Aiming for more pudding-based scholarly capital in DH. via brenthor.

Quantitative arguments must by their nature be particularly explicit, without the circuitous language humanists might use to sidestep critiques. Elijah Meeks and others have been arguing for some time now that the requirement to solidify an argument in such a way will ultimately be a benefit to the humanities, allowing faster iteration and improvement on theories. In that spirit, for this section, I offer my critiques of Jockers’ mathematical arguments not because I think they are poor quality, but because I think they are particularly good, and further fine-tuning can only improve them. The review will now proceed one chapter at a time.

Metadata

Jockers begins his analysis by exploring what he calls the “lowest hanging fruit of literary history.” Low hanging fruit can be pretty amazing, as Ted Underwood says, and Jockers wields some fairly simple data in impressive ways. The aim of this chapter is to show that powerful insights can be achieved using long-existing collections of library metadata, using a collection of nearly 800 Irish American works over 250 years as a sample dataset for analysis. Jockers introduces and offsets his results against the work of Charles Fanning, whom he describes as the expert in Irish American fiction in aggregate. A pre-DH scholar, Fanning was limited to looking through only the books he had time to read; an impressive many, according to Jockers, but perhaps not enough. He profiles 300 works, fewer than half of those represented in Jockers’ database.

The first claim made in this chapter is one that argues against a primary assumption of Fanning’s. Fanning expends considerable effort explaining why there was a dearth of Irish American literature between 1900 and 1930; Jockers’ data show this dearth barely existed. Instead, the data suggest, it was only eastern Irish men who had stopped writing. The vacuum did not exist west of the Mississippi, among men or women. Five charts are shown as evidence: one of books published over time, and the other four breaking publication down by gender and location.

Jockers is careful many times to make the point that, with so few data, the results are suggestive rather than conclusive. This, to my mind, is too understated. For the majority of dates in question, the database holds fewer than 6 books per year. When breaking down by gender and location, that number is halved twice over. Though the explanations of the effects in the graphs are plausible, the likelihood of noise outweighing signal at this granularity is a bit too high to be able to distinguish a just-so story from a credible explanation. Had the data been aggregated in five- or ten-year intervals (as they are in a later figure, 5.6), rather than simply averaged across them, the results may have been more credible. The argument may be brought up that, when aggregating across larger intervals, the question of where to break up the data becomes important; however, cutting the data into yearly chunks from January to December is no more arbitrary than cutting them into decades.

There are at least two confounding factors one needs to take into account when doing a temporal analysis like this. The first is that what actually happened in history may be causally contingent, which is to say, there’s no particularly useful causal explanation or historical narrative for a trend. It’s just accidental; the right authors were in the right place at the right time, and all happened to publish books in the same year. Generally speaking, if only around five books are published a year, though sometimes that number is zero and sometimes that number is ten, any trends that we see (say, five years with only a book or two) may credibly be considered due to chance alone, rather than some underlying effect of gender or culture bias.

The second confound is the representativeness of the data sample with respect to some underlying ground truth. Datasets are not necessarily representative of anything; as defined by Jockers, however, his dataset ought to be representative of all Irish American literature within a 250-year timespan. That’s his gold standard. The dataset obviously does not represent all books published under these criteria, so the question is how well his publication numbers match up with the actual numbers he’s interested in. Jockers is in a bit of luck here, because what he’s interested in is whether or not there was a resounding silence among Irish authors; thus, no matter what number his charts show, if they’re more than one or two, it’s enough to disprove Fanning’s hypothesized silence. Any dearth in his data may be accidental; any large publication numbers are not.

This example chart compares a potential “real” underlying publication rate against several simulated potential sample datasets Jockers might have, created by multiplying the “real” dataset by some random number between 0 and 1.

I created the above graphic to better explain the second confounding factor of problematic samples. The thick black line, we can pretend, is the actual number of books published by Irish American authors between 1900 and 1925. As mentioned, Jockers would only know about a subset of those books, so each of the four dotted lines represents a possible dataset that he could be looking at in his database instead of the real, underlying data. I created these four different dotted lines by just multiplying the underlying real data by a random number between 0 and 1 1. From this chart it should be clear that it would not be possible for him to report an influx of books when there was a dearth (for example, in 1910, no potential sample dataset would show more than two books published). However, if Jockers wanted to make any other claims besides whether or not there was a dearth (as he tentatively does later on), his available data may be entirely misleading. For example, looking at the red line, Run 4, would suggest that ever-more books were being published between 1910 and 1918, when in fact that number should have decreased rapidly after about 1912.
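The simulation behind that chart is straightforward to reproduce. A sketch under the same assumptions (a made-up “real” series of yearly counts, with each year scaled by an independent random draw between 0 and 1):

```python
# Sketch of the sampled-dataset simulation: scale a hypothetical "real"
# series of yearly publication counts by random factors between 0 and 1
# to produce possible observed datasets. The "real" counts are invented.
import random

random.seed(42)
real_counts = [4, 6, 2, 7, 5, 3, 8, 6, 4, 5]  # pretend: books per year

def simulated_sample(counts):
    """One possible observed dataset: each year's count scaled by U(0, 1)."""
    return [round(c * random.random()) for c in counts]

for run in range(1, 5):
    print(f"run {run}:", simulated_sample(real_counts))
```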

The correction included in Macroanalysis for this potential difficulty was to use 5-year moving averages for the numbers rather than just showing the raw counts. I would suggest that, because the actual numbers are so small and a change of a small handful of books would look like a huge shift on the graph, this method of aggregation is insufficient to represent the uncertainty of the data. Though his charts show moving averages, they still show small changes year by year, which creates a false sense of precision. Jockers’ chart 5.6, which aggregates by decade and does not show these little changes, does a much better job reflecting the uncertainty. Had the data shown hundreds of books per year, the earlier visualizations would have been more justifiable, as small changes would have amounted to less emphasized shifts in the graph.

It’s worth spending extra time on choices of visual representation, because we have not collectively arrived at a good visual language for humanities data, uncertain as they often are. Nor do we have a set of standard practices in place, as quantitative scientists often do, to represent our data. That lack of standard practice is clear in Macroanalysis; the graphs all have subtitles but no titles, which makes immediate reading difficult. Similarly, axis labels (“count” or “5-year average”) are unclear, and should more accurately reflect the data (“books published per year”), putting the aggregation level in either an axis subtitle or the legend. Some graphs have no axis labels at all (e.g., 5.12-5.17). Their meanings are clear enough to those who read the text, or those familiar with ngram-style analyses, but should be clearer at a glance.

Questions of visual representation and certainty aside, Jockers still provides several powerful observations and insights in this chapter. Figure 5.6, which shows Irish American fiction per capita, reveals that westerners published at a much higher relative rate than easterners, which is a trend worth explaining (and Jockers does) that would not have been visible without this sort of quantitative analysis. The chapter goes on to list many other credible assessments and claims in light of the available data, as well as a litany of potential further questions that might be explored with this sort of analysis.  He also makes the important point that, without quantitative analysis, “cherry-picking of evidence in support of a broad hypothesis seems inevitable in the close-reading scholarly traditions.” Jockers does not go so far as to point out the extension of that rule in data analysis; with so many visible correlations in a quantitative study, one could also cherry-pick those which support one’s hypothesis. That said, cherry-picking no longer seems inevitable. Jockers makes the point that Fanning’s dearth thesis was false because his study was anecdotal, an issue Jockers’ dataset did not suffer from. Quantitative evidence, he claims, is not in competition with evidence from close reading; both together will result in a “more accurate picture of our subject.”

The second half of the chapter moves from publication counting to word analysis. Jockers shows, for example, that eastern authors are less likely to use words in book titles that identify their work as ‘Irish’ than western authors, suggesting lower prejudicial pressures west of the Mississippi may be the cause. He then complexifies the analysis further, looking at “lexical diversity” across titles in any given year – that is, a year is more lexically diverse if the titles of books published that year are more unique and dissimilar from one another. Fanning suggests the years of the famine were marked by a lack of imagination in Irish literature; Jockers’ data supports this claim by showing those years had a lower lexical diversity among book titles. Without getting too much into the math, as this review of a single chapter has already gone on too long, it’s worth pointing out that both the number of titles and the average length of titles in a given year can affect the lexical diversity metric. Jockers points this out in a footnote, but there should have been a graph comparing number of titles per year, length per year, and lexical diversity, to let the readers decide whether the first two variables accounted for the third, or whether to trust the graph as evidence for Fanning’s lack-of-imagination thesis.
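Lexical diversity over titles amounts to a type-token ratio computed across all the words in a year’s titles, which is exactly why the number and length of titles can move the metric on their own. A minimal sketch with invented titles:

```python
# Lexical diversity of a year's book titles: unique words across all titles
# divided by total words. Both the number of titles and their average length
# feed the denominator, so they can shift the metric independently of any
# real change in imaginativeness. Titles below are invented.
def title_lexical_diversity(titles):
    words = [w for title in titles for w in title.lower().split()]
    return len(set(words)) / len(words) if words else 0.0

year_a = ["The Emigrant's Lament", "Famine Tales", "The Emigrant's Return"]
year_b = ["Stories of the Famine", "Tales of the Famine"]

print(round(title_lexical_diversity(year_a), 2))
print(round(title_lexical_diversity(year_b), 2))
```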

One of the particularly fantastic qualities about this sort of research is that readers can follow along at home, exploring on their own if they get some idea from what was brought up in the text. For example, Jockers shows that the word ‘century’ in British novel titles is popular leading up to and shortly after the turn of the nineteenth century. Oddly, in the larger corpus of literature (and it seems English language books in general), we can use bookworm.culturomics.org to see that, rather than losing steam around 1830, use of ‘century’ in most novel titles actually increases until about 1860, before dipping briefly. Moving past titles (and fiction in general) to full text search, google ngrams shows us a small dip around 1810 followed by continued growth of the word ‘century’ in the full text of published books. These different patterns are interesting particularly because they suggest there was something unique about the British novelists’ use of the word ‘century’ that is worth explaining. Oppose this with Jockers’ chart of the word ‘castle’ in British book titles, whose trends actually correspond quite well to the bookworm trend until the end of the chart, around 1830. [edit: Ben Schmidt points out in the comments that bookworm searches full text, not just metadata as I assumed, so this comparison is much less credible.]

Use of the word ‘castle’ in the metadata of books provided by OpenLibrary.org. Compare with figure 5.14. via bookworm.

Jockers closes the chapter suggesting that factors including gender, geography, and time help determine what authors write about. That this idea is trivial makes it no less powerful within the context of this book: the chapter is framed by the hypothesis that certain factors influence Irish American literature, and then uses quantitative, empirical evidence to support those claims. It was oddly satisfying reading such a straight-forward approach in the humanities. It’s possible, I suppose, to quibble over whether geography determines what’s written about or whether the sort of person who would write about certain things is also the sort of person more likely to go west, but there can be little doubt over the causal direction of the influence of gender. The idea also fits well with the current complex systems approach to understanding the world, which mathematically suggests that environmental and situational constraints (like gender and location) will steer the unfolding of events in one direction or another. It is not a reductionist environmental determinism so much as a set of probabilities, where certain environments or situations make certain outcomes more likely.

Stay tuned for Part the Third!

Notes:

  1. If this were a more serious study, I’d have multiplied by a more credible pseudo-random value, keeping the dataset a bit closer to the source, but this example works fine for explanatory value.

Liveblogged Review of Macroanalysis by Matthew L. Jockers, Part 1

I just got Matthew L. Jockers’ Macroanalysis in the mail, and I’m excited enough about it to liveblog my review. Here’s my review of part I (Foundation), all chapters. Read Part 2, Part 3, …

Macroanalysis: Digital Methods & Literary History is a book whose time has come. “Individual creativity,” Matthew L. Jockers writes, “is highly constrained, even determined, by factors outside of what we consider to be a writer’s conscious control.” Although Jockers’ book is a work of impressive creativity, it also fits squarely within a larger set of trends. The scents of ‘Digital Humanities’ (DH) and ‘Big Data’ are in the air, the funding-rich smells attracting predators from all corners, and Jockers’ book floats somewhere in the center of it all. As with many DH projects, Macroanalysis attempts the double goal of explaining a new method and exemplifying the type of insights that can be achieved via this method. Unlike many projects, Jockers succeeds masterfully at both. Macroanalysis introduces its readers to large scale quantitative methods for studying literary history, and through those methods explores the nature of creativity and influence in general and the place of Irish literature within its larger context in particular.

I’ve apparently gained a bit of a reputation for being overly critical, and it’s worth pointing out at the beginning of this review that this trend will continue for Macroanalysis. That said, I am most critical of the things I love the most, and readers who focus on any nits I might pick without reading the book themselves should keep in mind that the overall work is staggering in its quality, and if it does fall short in some small areas, it is offset by the many areas it pushes impressively forward.

Macroanalysis arrives on bookshelves eight years after Franco Moretti’s Graphs, Maps, and Trees (2005), and thirteen years after Moretti’s “Conjectures on World Literature” went to press in early 2000, where he coined the phrase “distant reading.” Moretti’s distant reading is a way of seeing literature en masse, of looking at text at the widest angle and reporting what structures and forms only become visible at this scale. Moretti’s early work paved the way, but as might be expected with a monograph published the same year as the initial release of Google Books, a lack of available data made it stronger in theory than in computational power.

From Moretti’s Graphs, Maps, and Trees

In 2010, Moretti and Jockers, the author of Macroanalysis, co-founded the Stanford Lit Lab for the quantitative and digital research of literature. The two have collaborated extensively, and Jockers acknowledges Moretti’s influence on his monograph. That said, in his book, Jockers distances himself slightly from Moretti’s notion of distant reading, and it is not the first time he has done so. His choice of “analysis” over “reading” is an attempt to show that what his algorithms are doing at this large scale is very different from our normal interpretive process of reading; it is simply gathering and aggregating data, the output of which can eventually be read and interpreted instead of or in addition to the texts themselves. The term macroanalysis was inspired by the difference between macro- and microeconomics, and Jockers does a good job justifying the comparison. Given that Jockers came up with the comparison in 2005, one does wonder if he would have decided on different terminology after our recent financial meltdown and the ensuing large-scale distrust of macroeconomic methods. The quantitative study of history, cliometrics, also had its origins in economics and suffered its own fall from grace decades ago; quantitative history still hasn’t recovered.

Part I: Foundation

I don’t know whether the allusion was intended, but lovers of science fiction and quantitative cultural studies will enjoy the title of Part I: “Foundation.” It shares a name with a series of books by Isaac Asimov, centering around the ability to combine statistics and human-centric research to understand and predict people’s behaviors. Punny titles aside, the section provides the structural base of the monograph.

The story of Foundation in a nutshell. Via c0ders.

Much of the introductory chapters consists of provocative statements about the newness of the study at hand, and they are not unwarranted. Still, I can imagine that the regular detractors of technological optimism might argue their usual arguments in response to Jockers’ pronouncements of a ‘revolution.’ The second chapter, on Evidence, raises some particularly important (and timely) points that are sure to raise some hackles. “Close reading is not only impractical as a means of evidence gathering in the digital library, but big data render it totally inappropriate as a method of studying literary history.” Jockers hammers home this point again and again: now that anecdotal evidence based on ‘representative’ texts is no longer the best means of understanding literature, there’s no reason it should still be considered the gold standard of evidentiary support.

Not coming from a background of literary history or criticism, I do wonder a bit about these notions of representativeness (a point also often brought up by Ted Underwood, Ben Schmidt, and Jockers himself). This is probably something lit-researchers worked out in the 70s, but it strikes me that the questions being asked of a few ‘exemplary, representative texts’ are very different than the ones that ought to be asked of whole corpora of texts. Further, ‘representative’ of what? As this book appears to be aimed not only at traditional literary scholars, it would have been beneficial for Jockers to untangle these myriad difficulties.

One point worth noting is that, although Jockers calls his book Macroanalysis, his approach calls for a mixed method, a combination of the macro/micro, distant/close. The book is very careful and precise in its claims that macroanalysis augments close reading and opens new questions, rather than replacing it. It is a combination of both approaches, one informing the other, that leads to new insights. “Today’s student of literature must be adept at reading and gathering evidence from individual texts and equally adept at accessing and mining digital-text repositories.” The balance struck here is impressive: to ignore macroanalysis as a superior source of evidence for many types of large questions would be criminal, but its adoption alone does not make for good research (further, either without the other would be poorly done). For example, macroanalysis can augment close reading approaches by contextualizing a text within its broad historical and cultural moment, showing a researcher precisely where their object of research fits in the larger picture.

Historians would do well to heed this advice, though they are not the target audience. Indeed, historians play a perplexing role in Jockers’ narrative; not because his description is untrue, but because it ought not be true. In describing the digital humanities, Jockers calls it an “ambiguous and amorphous amalgamation of literary formalists, new media theorists, tool builders, coders, and linguists.” What place historians? Jockers places their role earlier, tracing the wide-angle view to the Annales historians and their focus on longue durée history. If historians’ influence ends there, we are surely in a sad state; that light, along with those of cliometrics and quantitative history, shone brightest in the 1970s before a rapid decline. Unsworth recently attributed the decline to the fallout following Time on the Cross (Fogel & Engerman, 1974), putting quantitative methods in history “out of business for decades.” The ghost of cliometrics still haunts historians to such an extent that the best research in that area, to this day, comes more from information scientists and applied mathematicians than from historians. Digital humanities may yet exorcise that ghost, but it has not happened yet, as evidenced in part by the glaring void in Jockers’ introductory remarks.

It is with this framing in mind that Jockers embarks on his largely computational and empirical study of influence and landscape in British and American literature.