Ghosts in the Machine

Musings on materiality and cost after a tour of The Shoah Foundation.

Forgetting The Holocaust

As the only historian in my immediate family, I’m responsible for our genealogy, saved in a massive GEDCOM file. Through the wonders of the web, I now manage quite the sprawling tree: over 100,000 people, hundreds of photos, thousands of census records & historical documents. The majority came from distant relations managing their own trees, with whom I share.

Such a massive well-kept dataset is catnip for a digital humanist. I can analyze my family! The obvious first step is basic stats, like the most common last name (Aber), average number of kids (2), average age at death (56), or most-frequently named location (New York). As an American Jew, I wasn’t shocked to see New York as the most-common place name in the list. But I was unprepared for the second-most-common named location: Auschwitz.

I’m lucky enough to write this because my great grandparents all left Europe before 1915. My grandparents don’t have tattoos on their arms or horror stories about concentration camps, though I’ve met survivors their age. I never felt so connected to The Holocaust, HaShoah, until I took time to explore the hundreds of branches of my family tree that simply stopped growing in the 1940s.

Aerial photo of Auschwitz-Birkenau. [via wikipedia]
One of every 16 Jews in the entire world was murdered in Auschwitz, about a million in all. Another 5 million were killed elsewhere. The global Jewish population before the Holocaust was 16.5 million, a number we’re only now approaching again, 70 years later. And yet, somehow, last month a school official and national parliamentary candidate in Canada admitted she “didn’t know what Auschwitz was”.

I grew up hearing “Never Forget” as a mantra to honor the 11 million victims of hate and murder at the hands of Nazis, and to ensure it never happens again. That a Canadian official has forgotten—that we have all forgotten many of the other genocides that haunt human history—suggests how easy it is to forget. And how much work it is to remember.

The material cost of remembering 50,000 Holocaust survivors & witnesses

Yad Vashem (“a place and a name”) represents the attempt to inscribe, preserve, and publicize the names of Jewish Holocaust victims who have no-one to remember them. Over four million names have been collected to date.

The USC Shoah Foundation, founded by Steven Spielberg in 1994 to remember Holocaust survivors and witnesses, is both smaller and larger than Yad Vashem. Smaller because the number of survivors and witnesses still alive in 1994 numbered far fewer than Yad Vashem’s 4.3 million; larger because the foundation conducted video interviews: 100,000 hours of testimony from 50,000 individuals, plus recent additions of witnesses and survivors of other genocides around the world. Where Yad Vashem remembers those killed, the Shoah Foundation remembers those who survived. What does it take to preserve the memories of 50,000 people?

I got a taste of the answer to that question at a workshop this week hosted by USC’s Digital Humanities Program, who were kind enough to give us a tour of the Shoah Foundation facilities. Sam Gustman, the foundation’s CTO and Associate Dean of USC’s Libraries, gave the tour.

Shoah Foundation Digitization Facility [via my camera]
Digital preservation is a complex process. In this case, it began by digitizing 235,000 analog Betacam SP videocassettes, on which the original interviews had been recorded, a process which took from 2008 to 2012. This had to be done quickly (automatically/robotically), given that cassette tapes are prone to become sticky, brittle, and unplayable within a few decades due to hydrolysis. They digitized about 30,000 hours per year. The process eventually produced 8 petabytes (link to more technical details) of lossless JPEG 2000 videos, roughly the equivalent of 2 million DVDs. Stacked on top of each other, those DVDs would reach three times higher than Burj Khalifa, the world’s tallest tower.
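
The DVD comparison is easy to sanity-check with back-of-the-envelope arithmetic. What follows is a rough sketch under assumptions that did not come from the tour: single-layer 4.7 GB discs, 1.2 mm of thickness per disc, decimal petabytes, and an 828 m Burj Khalifa.

```python
# Back-of-the-envelope check of the DVD comparison.
# Assumptions (not from the tour): single-layer 4.7 GB DVDs, 1.2 mm per disc,
# decimal units (1 PB = 1,000,000 GB), Burj Khalifa height of 828 m.
ARCHIVE_PB = 8
DVD_CAPACITY_GB = 4.7
DVD_THICKNESS_MM = 1.2
BURJ_KHALIFA_M = 828

archive_gb = ARCHIVE_PB * 1_000_000
dvd_count = archive_gb / DVD_CAPACITY_GB
stack_height_m = dvd_count * DVD_THICKNESS_MM / 1000
towers = stack_height_m / BURJ_KHALIFA_M

print(f"{dvd_count:,.0f} DVDs, {stack_height_m:,.0f} m tall, {towers:.1f}x the tower")
```

Under these assumptions the count lands nearer 1.7 million discs and about two and a half towers. Rounding up to 2 million discs gives a 2,400 m stack, which is where the "three times" figure comes from.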

From there, the team spent quite some time correcting errors that existed in the original tapes, and ones that were introduced in the process of digitization. They employed a small army of signal processing students, patented new technologies for automated error detection & processing/cleaning, and wound up cleaning video from about 12,000 tapes. According to our tour guide, cleaning is still happening.

Lest you feel safe knowing that digitization lengthens the preservation time, turns out you’re wrong. Film lasts longer than most electronic storage, but making film copies would have cost the foundation $140,000,000 and made access incredibly difficult. Digital copies would only cost tens of millions of dollars, even though hard-drives couldn’t be trusted to last more than a decade. Their solution was a RAID hard-drive system in an Oracle StorageTek SL8500 (of which they have two), and a nightly process of checking video files for even the slightest of errors. If an error is found, a backup is loaded to a new cartridge, and the old cartridge is destroyed. Their two StorageTeks each fit over 10,000 drive cartridges, have 55 petabytes worth of storage space, weigh about 4,000 lbs, and are about the size of a New York City apartment. If a drive isn’t backed up and replaced within three years, they throw it out and replace it anyway, just in case. And this setup apparently saved the Shoah Foundation $6 million.
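
The nightly audit is, in spirit, a fixity check: recompute a checksum for every file and compare it with the one recorded when the file was stored. The sketch below is only an illustration of that idea. It assumes SHA-256 checksums and a plain dictionary manifest; the tour did not say which algorithm or tooling the Foundation actually uses, and `sha256_of` and `audit` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the names of files that are missing or whose checksum changed."""
    damaged = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(name)
    return damaged
```

Anything `audit` returns would be a candidate for restoring from the backup copy onto fresh media.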

StorageTek SL8500 [via CERN]
Oh, and they have another facility a few states away, connected directly via high-bandwidth fiber optic cables, where everything just described is duplicated in case California falls into the ocean.

Not bad for something that costs libraries $15,000 per year, which is about the same as a library would pay for one damn chemistry journal.

So how much does it cost to remember 50,000 Holocaust witnesses and survivors for, say, 20 years? I mean, above and beyond the cost of building a cutting edge facility, developing new technologies of preservation, cooling and housing a freight container worth of hard drives, laying fiber optic cables below ground across several states, etc.? I don’t know. But I do know how much the Shoah Foundation would charge you to save 8 petabytes worth of videos for 20 years, if you were a USC Professor. They’d charge you $1,000/TB/20 years.

The Foundation’s videos take up 8,000 terabytes, which at $1,000 each would cost you $8 million per 20 years, or about $400,000 per year. Combine that with all the physical space it takes up, and never forgetting the Holocaust is sounding rather prohibitive. And what about after 20 years, when modern operating systems forget how to read JPEG 2000 or interface with StorageTek T10000C Tape Drives, and the Shoah Foundation needs to undertake another massive data conversion? I can see why that Canadian official didn’t manage it.
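
For anyone who wants to check the storage arithmetic, here it is spelled out. The rate and the archive size are the figures quoted on the tour; the only assumption is decimal units (1 PB = 1,000 TB).

```python
# Cost of storing the archive at the quoted internal rate.
RATE_USD_PER_TB_PER_TERM = 1_000  # $1,000 per TB per 20-year term, as quoted
ARCHIVE_TB = 8_000                # 8 petabytes, assuming 1 PB = 1,000 TB
TERM_YEARS = 20

total_usd = ARCHIVE_TB * RATE_USD_PER_TB_PER_TERM
per_year_usd = total_usd / TERM_YEARS

print(f"${total_usd:,} per {TERM_YEARS} years, ${per_year_usd:,.0f} per year")
```

That works out to $8 million per term, or $400,000 a year.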

The Reconcentration of Holocaust Survivors

While I appreciated the guided tour of the exhibit, and am thankful for the massive amounts of money, time, and effort scholars and donors are putting into remembering Holocaust survivors, I couldn’t help but be creeped out by the experience.

Our tour began by entering a high security facility. We signed our names on little pieces of paper and were herded through several layers of locked doors and small rooms. Not quite the way one expects to enter the project tasked with remembering and respecting the victims of genocide.

The Nazis’ assembly-line techniques for mass extermination led to starkly regular camps, like Auschwitz pictured above, laid out in efficient grids for the purpose of efficient control and killings. “Concentration camp”, by the way, refers to the concentration of people into small spaces, coming from “reconcentration camps” in Cuba. Now we’re concentrating 50,000 testimonies into a couple of closets with production line efficiency, reconcentrating the stories of people who dispersed across the world, so they’re all in one easy-to-access place.

Server farm [via wikipedia]
We’ve squeezed 100,000 hours of testimony into a server farm that consists of a series of boxes embedded in a series of larger boxes, all aligned to a grid; input, output, and eventual destruction of inferior entities handled by robots. Audits occur nightly.

The Shoah Foundation materials were collected, developed, and preserved with the utmost respect. The goal is just, the cause respectable, and the efforts incredibly important. And by reconcentrating survivors’ stories, they can now be accessed by the world. I don’t blame the Foundation for the parallels which are as much a construct of my mind as they are of the society in which this technology developed. Still, on Halloween, it’s hard to avoid reflecting on the material, monetary, and ultimately dehumanizing costs of processing ghosts into the machine.

Connecting the Dots

This is the incredibly belated transcript of my HASTAC 2015 keynote. Many thanks to the organizers for inviting me, and to my fellow participants for all the wonderful discussions. The video and slides are also online. You can find citations to some of the historical illustrations and many of my intellectual inspirations here. What I said and what I wrote probably don’t align perfectly.

When you’re done reading this, you should read Roopika Risam’s closing keynote, which connects surprisingly well with this, though we did not plan it.

If you take a second to expand and disentangle “HASTAC”, you get a name of an organization that doubles as a fairly strong claim about the world: that Humanities, Arts, Science, and Technology are separate things, that they probably aren’t currently in alliance with one another, and that they ought to form an alliance.

This intention is reinforced in the theme of this year’s conference: “The Art and Science of Digital Humanities.” Here again we get the four pillars: humanities, arts, science, and technology. In fact, bear with me as I read from the CFP:

We welcome sessions that address, exemplify, and interrogate the interdisciplinary nature of DH work. HASTAC 2015 challenges participants to consider how the interplay of science, technology, social sciences, humanities, and arts are producing new forms of knowledge, disrupting older forms, challenging or reifying power relationships, among other possibilities.

Here again is that implicit message: disciplines are isolated, and their interplay can somehow influence power structures. As with a lot of digital humanities and cultural studies, there’s also a hint of activism: that building intentional bridges is a beneficial activity, and we’re organizing the community of HASTAC around this goal.


This is what I’ll be commenting on today. First, what does disciplinary isolation mean? I put this historically, and argue that we must frame disciplinary isolation in a rhetorical context.

This brings me to my second point about ontology. It turns out the way we talk about isolation is deeply related to the way we think about knowledge, the way we illustrate it, and ultimately the shape of knowledge itself. That’s ontology.

My third point brings us back to HASTAC: that we represent an intentional community, and this intent is to build bridges which positively affect the academy and the world.

I’ll connect these three strands by arguing that we need a map to build bridges, and we need to carefully think about the ontology of knowledge to draw that map. And once we have a map, we can use it to design a better territory.

In short, this plenary is a call-to-action. It’s my vocal support for an intentionally improved academy, my exploration of its historical and rhetorical underpinnings, and my suggestions for effecting positive change in the future.

Matt Might’s Illustrated Guide to the Ph.D.
Let’s begin at the beginning. With isolation.

Stop me if you’ve heard this one before:

Within this circle is the sum of all human knowledge. It’s nice, it’s enclosed, it’s bounded. It’s a comforting thought, that everything we’ve ever learned or created sits comfortably inside these boundaries.

This blue dot is you, when you’re born. It’s a beautiful baby picture. You’ve got the whole world ahead of you, an entire universe to learn, just waiting. You’re at the center because you have yet to reach your proverbial hand out in any direction and begin to learn.

Matt Might’s Illustrated Guide to the Ph.D.

But time passes and you grow. You go to high school, you take your liberal arts and sciences, and you slowly expand your circle into the great known. Rounding out your knowledge, as it were.

Then college happens! Oh, those heady days of youth. We all remember it, when the shape of our knowledge started leaning tumorously to one side. The ill-effects of specialization and declaring a major, I suspect.

As you complete a master’s degree, your specialty pulls your knowledge inexorably towards the edge of the circle of the known. You’re not a jack of all trades anymore. You’re an expert.
Matt Might’s Illustrated Guide to the Ph.D.

Then your PhD advisor yells at you to focus and get even smaller. So you complete your qualifying exams and reach the edge of what’s known. What lies beyond the circle? Let’s zoom in and see!

Matt Might’s Illustrated Guide to the Ph.D.

You’ve reached the edge. The end of the line. The sum of all human knowledge stops here. If you want to go further, you’ll need to come up with something new. So you start writing your dissertation.

That’s your PhD. Right there, at the end of the little arrow.

You did it. Congratulations!

You now know more about less than anybody else in the world. You made a dent in the circle, you pushed human knowledge out just a tiny bit further, and all it cost you was your mental health, thirty years of your life, and the promise of a certain future. …Yay?

Matt Might’s Illustrated Guide to the Ph.D.
So here’s the new world that you helped build, the new circle of knowledge. With everyone in this room, I bet we’ve managed to make a lot of dents. Maybe we’ve even managed to increase the circle’s radius a bit!

Now, what I just walked us all through is Matt Might’s illustrated guide to the Ph.D. It made its rounds on the internet a few years back, it was pretty popular.

And, though I’m being snarky about it, it’s a pretty uplifting narrative. It provides that same dual feeling of insignificance and importance that you get when you stare at the Hubble Ultra Deep Field. You know the picture, right?

Hubble Ultra Deep Field

There are 10,000 galaxies on display here, each with a hundred billion stars. To think that we, humans, from our tiny vantage point on Earth, could see so far and so much because of the clever way we shape glass lenses? That’s really cool.

And saying that every pinprick of light we see is someone else’s PhD? Well, that’s a pretty fantastic metaphor. Makes getting the PhD seem worth it, right?

Dante and the Early Astronomers; M. A. Orr (Mrs. John Evershed), 1913

It kinda reminds me of the cosmological theories of some of our philosophical ancestors.

The cosmos (Greek for “Order”) consisted of concentric, perfectly layered spheres, with us at the very center.

The cosmos was bordered by celestial fire, the light from heaven, and stars were simply pin-pricks in a dark curtain which let the heavenly light shine through.


So, if we beat Matt Might’s PhD metaphor to death, each of our dissertations is poking holes in the cosmic curtain, letting the light of heaven shine through. And that’s a beautiful thought, right? Enough pinpricks, and we’ll all be bathed in light.

Expanding universe.

But I promised we’d talk about isolation, and even if we have to destroy this metaphor to get there, we’ll get there.

The universe is expanding. That circle of knowledge we’re pushing the boundaries of? It’s getting bigger too. And as it gets larger, things that were once close get further and further apart. You and I and Alpha Centauri were all neighbors for the big bang, but things have changed since then, and the star that was once our neighbor is now more than 4 light years away.

Atlas of Science, Katy Borner (2010).

In short, if we’re to take Matt Might’s PhD model as accurate, then the result of specialization is inexorable isolation. Let’s play this out.

Let’s say two thousand years ago, a white dude from Greece invented science. He wore a beard.

[Note for readers: the following narrative is intentionally awful. Read on and you’ll see why.]


He and his bearded friends created pretty much every discipline we’re familiar with at Western universities: biology, cosmology, linguistics, philosophy, administration, NCAA football, you name it.

Over time, as Ancient Greek beards finished their dissertations, the boundaries of science expanded in every direction. But the sum of human knowledge was still pretty small back then, so one beard could write many dissertations, and didn’t have to specialize in only one direction. Polymaths still roamed the earth.


Fast forward a thousand years or so. Human knowledge had expanded in the interim, and the first European universities branched into faculties: theology, law, medicine, arts.

Another few hundred years, and we’ve reached the first age of information overload. It’s barely possible to be a master of all things, and though we remember scholars and artists known for their amazing breadth, this breadth is becoming increasingly difficult to manage.

We begin to see the first published library catalogs, since the multitude of books required increasingly clever and systematic cataloging schemes. If you were to walk through Oxford in 1620, you’d see a set of newly-constructed doors with signs above them denoting their disciplinary uses: music, metaphysics, history, moral philosophy, and so on.

The encyclopedia of Diderot & D’alembert

Time goes on a bit further, the circle of knowledge expands, and specialization eventually leads to fracturing.

We’ve reached the age of these massive hierarchical disciplinary schemes, with learning branching in every direction. Our little circle has become unmanageable.

A few more centuries pass. Some German universities perfect the art of specialization, and they pass it along to everyone else, including the American university system.

Within another 50 years, C.P. Snow famously invoked the “Two Cultures” of humanities and sciences.

And suddenly here we are


On the edge of our circle, pushing outward, with every new dissertation expanding our radius, and increasing the distance to our neighbors.

Basically, the inevitable growth of knowledge results in an equally inevitable isolation. This is the culmination of super-specialization: a world where the gulf between disciplines is impossible to traverse, filled with language barriers, value differences, and intellectual incommensurabilities. You name it.


By this point, 99% of the room is probably horrified. Maybe it’s by the prospect of an increasingly isolated academy. More likely the horror’s at my racist, sexist, whiggish, Eurocentric account of the history of science, or at my absurdly reductivist and genealogical account of the growth of knowledge.

This was intentional, and I hope you’ll forgive me, because I did it to prove a point: the power of visual rhetoric in shaping our thoughts. We use the word “imagine” to describe every act of internal creation, whether or not it conforms to the root word of “image”. In classical and medieval philosophy, thought itself was a visual process, and complex concepts were often illustrated visually in order to help students understand and remember. Ars memoriae, it was called.

And in ars memoriae, concepts were not only given visual form, they were given order. This order wasn’t merely a clever memorization technique, it was a reflection on underlying truths about the relationship between concepts. In a sense, visual representations helped bridge human thought with divine structure.

This is our entrance into ontology. We’ve essentially been talking about interdisciplinarity for two thousand years, and always alongside a visual rhetoric about the shape, or ontology, of knowledge. Over the next 10 minutes, I’ll trace the interwoven histories of ontology, illustrations, and rhetoric of interdisciplinarity. This will help contextualize our current moment, and the intention behind meeting at a conference like this one. It should, I hope, also inform how we design our community going forward.

Let’s take a look at some alternatives to the Matt Might PhD model.

Diagrams of Knowledge

Countless cultural and religious traditions associate knowledge with trees; indeed, in the Bible, the fruit of one tree is knowledge itself.

During the Roman Empire and the Middle Ages, the sturdy metaphor of trees provided a sense of lineage and order to the world that matched perfectly with the neatly structured cosmos of the time. Common figures of speech we use today like “the root of the problem” or “branches of knowledge” betray the strength with which we connected these structures to one another. Visual representations of knowledge, obviously, were also tree-like.

See, it’s impossible to differentiate the visual from the essential here. The visualization wasn’t a metaphor, it was an instantiation of essence. There are three important concepts that link knowledge to trees, which at that time were inseparable.

One: putting knowledge on a tree implied a certain genealogy of ideas. What we discovered and explored first eventually branched into more precise subdisciplines, and the history of those branches is represented on the tree. This is much like any family tree you or I would put together with our parents and grandparents and so forth. The tree literally shows the historical evolution of concepts.

Two: putting knowledge on a tree implied a specific hierarchy that would by the Enlightenment become entwined with how we understood the universe. Philosophy separates into the theoretical and the practical; basic math into geometry and arithmetic. This branching hierarchy gave an importance to the root of the tree, be that root physics or God or philosophy or man, and that importance decreased as you reached the further limbs. It also implied an order of necessity: the branches of math could not exist without the branch of philosophy it stemmed from. This is why, even today, people still think of physics as the most important discipline.

Three: As these trees were represented, there was no difference between the concept of a branch of knowledge, the branch of knowledge itself, and the object of study of that branch of knowledge. The relationship of physics to chemistry isn’t just genealogical or foundational; it’s actually transcendent. The conceptual separation of genealogy, ontology, and transcendence would not come until much later.

It took some time for the use of the branching tree as a metaphor for knowledge to take hold, competing against other visual and metaphorical representations, but once it did, it ruled victorious for centuries. The trees spread and grew until they collapsed under their own weight by the late nineteenth century, leaving a vacuum to be filled by faceted classification systems and sprawling network visualizations. The loss of a single root as the source of knowledge signaled an epistemic shift in how knowledge is understood, the implications of which are still unfolding in present-day discussions of interdisciplinarity.

By visualizing knowledge itself as a tree, our ancestors reinforced both an epistemology and a phenomenology of knowledge, ensuring that we would think of concepts as part of hierarchies and genealogies for hundreds of years. As we slowly moved away from strictly tree-based representations of knowledge in the last century, we have also moved away from the sense that knowledge forms a strict hierarchy. Instead, we now believe it to be a diffuse system of occasionally interconnected parts.

Of course, the divisions of concepts and bodies of study have no natural kind. There are many axes against which we may compare biology to literature, but even the notion of an axis of comparison implies a commonality against which the two are related which may not actually exist. Still, we’ve found the division of knowledge into subjects, disciplines, and fields a useful practice since before Aristotle. The metaphors we use for these divisions influence our understanding of knowledge itself: structured or diffuse; overlapping or separate; rooted or free; fractals or divisions; these metaphors inform how we think about thinking, and they lend themselves to visual representations which construct and reinforce our notions of the order of knowledge.

Arbor Scientiae, late thirteenth century, Ramon Llull. [via]
Given all this, it should come as no surprise that medieval knowledge was shaped like a tree – God sat at the root, and the great branching of knowledge provided a transcendental order of things. Physics, ethics, and biology branched further and further until tiny subdisciplines sat at every leaf. One important aspect of these illustrations was unity – they were whole and complete, and even more, they were all connected. This mirrors pretty closely that circle from Matt Might.

Christophe de Savigny’s Tableaux: Accomplis de tous les arts liberaux, 1587

Speaking of that circle I had up earlier, many of these branching diagrams had a similar feature. Notice the circle encompassing this illustration, especially the one on the left here: it’s a chain. The chain locks the illustration down: it says, there are no more branches to grow.

This and similar illustrations were also notable for their placement. This was an index to a book, an early encyclopedia of sorts – you use the branches to help you navigate through descriptions of the branches of knowledge. How else should you organize a book of knowledge than by its natural structure?

Bacon’s Advancement of Learning

We start seeing some visual, rhetorical, and ontological changes by the time of Francis Bacon, who wrote “the distributions and partitions of knowledge are […] like branches of a tree that meet in a stem, which hath a dimension and quantity of entireness and continuance, before it come to discontinue and break itself into arms and boughs.”

The highly influential book broke the trends in three ways:

  1. It broke the “one root” model of knowledge.
  2. It shifted the system from closed to open, capable of growth and change.
  3. It detached natural knowledge from divine wisdom.

Bacon’s uprooting of knowledge, dividing it into history, poesy, and philosophy, each with its own root, was an intentional rhetorical strategy. He used it to argue that natural philosophy should be explored at the expense of poesy and history. Philosophy, what we now call science, was now a different kind of knowledge, worthier than the other two.

And doesn’t that feel a lot like today?

Bacon’s system also existed without an encompassing chain, embodying the idea that learning could be advanced; that the whole of knowledge could not be represented as an already-grown tree. There was no complete order of knowledge, because knowledge changes.

And, by being an imperfect, incomplete entity, without union, knowledge was notably separated from divine wisdom.

Kircher’s Philosophical tree representing all branches of knowledge, from Ars Magna Sciendi (1669), p. 251.

Of course, divinity and transcendence weren’t wholly exorcised from these ontological illustrations: Athanasius Kircher put God on the highest branch, feeding the tree’s growth. (Remember, from my earlier circle metaphor, the importance of poking holes in the fabric of the cosmos to let the light of heaven shine through?) Descartes as well continued to describe knowledge as a tree, whose roots were reliant on divine existence.

Chambers’ Cyclopædia

But even without the single trunk, without God, without unity, the metaphors were still ontologically essential, even into the 18th century. This early encyclopedia by Ephraim Chambers uses the tree as an index, and Chambers writes:

“the Origin and Derivation of the several Parts, and the relation in which [the disciplines] stand to their common Stock and to each other; will assist in restoring ’em to their proper Places”

Their proper places. This order is still truth with a capital T.

The encyclopedia of Diderot & D’alembert

It wasn’t until the mid-18th century, with Diderot and d’Alembert’s encyclopedia, that serious thinkers started actively disputing the idea that these trees were somehow indicative of the essence of knowledge. Even they couldn’t escape using trees, however, introducing their encyclopedia by saying “We have chosen a division which has appeared to us most nearly satisfactory for the encyclopedic arrangement of our knowledge and, at the same time, for its genealogical arrangement.”

Even if the tree wasn’t the essence of knowledge, it still represented possible truth about the genealogy of ideas. It took until a half century later, with the Encyclopedia Britannica, for the editors to do away with tree illustrations entirely and write that the world was “perpetually blended in almost every branch of human knowledge”. (Notice they still use the word branch.) By now, a philosophical trend that began with Bacon was taking form through the impossibility of organizing giant libraries and encyclopedias: that there was no unity of knowledge, no implicit order, and no viable hierarchy.

Banyan tree [via]
It took another century to find a visual metaphor to replace the branching tree. Herbert Spencer wrote that the branches of knowledge “now and again re-unite […], they severally send off and receive connecting growths; and the intercommunion is ever becoming more frequent, more intricate, more widely ramified.” Classification theorist S.R. Ranganathan compared knowledge to the Banyan tree from his home country of India, whose roots grow both from the bottom up and from the top down.

Otlet 1937

The 20th century saw a wealth of new shapes of knowledge. Paul Otlet conceived a sort of universal network, connected through individuals’ thought processes. H.G. Wells shaped knowledge very similarly to Matt Might’s illustrated PhD from earlier: starting with a child’s experience of learning and branching out. These were both interesting developments, as they rhetorically placed the ontology of knowledge in the realm of the psychological or the social: driven by people rather than some underlying objective reality about conceptual relationships.

Porter’s 1939 Map of Physics [via]
Around this time there was a flourishing of visual metaphors, to fill the vacuum left by the loss of the sturdy tree. There was, uncoincidentally, a flourishing of uses for these illustrations. Some, like this map, were educational and historical, teaching students how the history of physics split and recombined like water flowing through rivers and tributaries. Others, like the illustration to the right, showed how the conceptual relationships between knowledge domains differed from and overlapped with library classification schemes and literature finding aids.

Small & Garfield, 1985

By the 80s, we start seeing a slew of the illustrations we’re all familiar with: those sexy sexy network spaghetti-and-meatball graphs. We often use them to illustrate citation chains, and the relationship between academic disciplines. These graphs, so popular in the 21st century, go hand-in-hand with the ontological baggage we’re used to: that knowledge is complex, unrooted, interconnected, and co-constructed. This fits well with the current return to a concept we’d mostly left in the 19th century: that knowledge is a single, growing unit, that it’s consilient, that everyone is connected. It’s a return to the Republic of Letters from C.P. Snow’s split of the Two Cultures.

It also notably departs from genealogical, transcendental, and even conceptual discussions of knowledge. These networks, broadly construed, are social representations, and while those relationships may often align with conceptual ones, concepts are not what drive the connections.

Fürbringer's Illustration of Bird Evolution, 1888

Interestingly, there is precedent for these sorts of illustrations in the history of evolutionary biology. In the late 19th century, illustrators and scientists began asking what it would look like if you took a slice from the evolutionary tree – or, what does the tree of life look like when you’re looking at it from the top down?

What you get is a visual structure very similar to the network diagrams we’re now used to. And often, if you probe those making the modern visualizations, they will weave a story about the history of these networks that is reminiscent of branching evolutionary trees.

There’s another set of epistemological baggage that comes along with these spaghetti-and-meatball-graphs. Ben Fry, a well-known researcher in information visualization, wrote:

“There is a tendency when using [networks] to become smitten with one’s own data. Even though a graph of a few hundred nodes quickly becomes unreadable, it is often satisfying for the creator because the resulting figure is elegant and complex and may be subjectively beautiful, and the notion that the creator’s data is ‘complex’ fits just fine with the creator’s own interpretation of it. Graphs have a tendency of making a data set look sophisticated and important, without having solved the problem of enlightening the viewer.”

Actually, were any of you here at last night’s Pink Floyd light show in the planetarium? They’re a lot like that. [Yes, readers, HASTAC put on a Pink Floyd light show.]

And this is where we are now.


Which brings us back to the outline, and HASTAC. Cathy Davidson has often described HASTAC as a social network, which is (at least on the web) always an intentionally-designed medium. Its design grants certain affordances to users: is it easier to communicate individually or in groups? What types of communities, events, or content are prioritized? These are design decisions that affect how the HASTAC community functions and interacts.

And the design decisions going into HASTAC are informed by its intent, so what is that intent? In their groundbreaking 2004 manifesto in the Chronicle, Cathy Davidson and David Goldberg wrote:

“We believe that a new configuration in the humanities must be championed to ensure their centrality to all intellectual enterprises in the university and, more generally, to understanding the human condition and thereby improving it; and that those intellectual changes must be supported by new institutional structures and values.”

This was a HASTAC rallying cry: how can the humanities constructively inform the world? Notice especially how they called for “New Institutional Structures.”

Remember earlier, how I talked about the problem of isolation? While my story about it was problematic, it doesn’t make disciplinary superspecialization any less real a problem. For all its talk of interdisciplinarity, academia is averse to synthesis on many fronts, superspecialization being just one of them. A dissertation based on synthesis, for example, is much less likely to get through a committee than a thorough single intellectual contribution to one specific field.

The academy is also weirdly averse to writing for public audiences. Popular books won’t get you tenure. But every discipline is a popular audience to most other disciplines: you wouldn’t talk to a chemist about history the same way you’d talk to a historian. Synthetic and semi-public work is exactly the sort of work that will help with HASTAC’s goal of a truly integrated and informed academy for social good, but the cards are stacked against it. Cathy and David hit the nail on the head when they target institutional structures as a critical point for improvement.

This is where design comes in.

Richmond, 1954

Recall again the theme this year: The Art and Science of Digital Humanities. I propose we take the next few days to think about how we can use art and science to make HASTAC even better at living up to its intent. That is, knowing what we do about collaboration, about visual rhetoric, about the academy, how can we design an intentional community to meet its goals? Perusing the program, it looks like most of us will already be discussing exactly this, but it’s useful to put a frame around it.

When we talk about structure and the social web, there are many great examples we may learn from. One such example is that of Tara McPherson and her colleagues, in designing the web publishing platform Scalar. As opposed to WordPress, its cousin in functionality, Scalar was designed with feminist and humanist principles in mind, allowing for more expressive, non-hierarchical “pathways” through content.

When talking of institutional, social, and web-based structures, we can also take lessons from history. In Early Modern Europe, the great network of information exchange known as the Republic of Letters was a shining example of the influence of media structures on innovation. Scholars would often communicate through “hubs”, which were personified in people nicknamed things like “the mailbox of Europe”. And they helped distribute new research incredibly efficiently through their vast web of social ties. These hubs were essential to what’s been called the scientific revolution, and without their structural role, it’s unlikely you’d see references to a scientific revolution in 17th-century Europe.

Similarly, at that time, the Atlantic slave trade was wreaking untold havoc on the world. For all the ills it caused, we at least can take some lessons from it in the intentional design of a scholarly network. There existed a rich exchange of medical knowledge between Africans and indigenous Americans that bypassed Europe entirely, taking an entirely different sort of route through early modern social networks.

If we take the present day, we see certain affordances of social networks similarly used to subvert or reconfigure power structures, as with the many revolutions in North Africa and the Middle East, or the current activist events taking place around police brutality and racism in the US. Similar tactics that piggy-back on network properties are used by governments to spread propaganda, ad agencies to spread viral videos, and so forth.

The question, then, is how we can intentionally design a community, using principles we learn from historical action, as well as modern network science, in order to subvert institutional structures in the manner raised by Cathy and David?

Certainly we also ought to take into account the research going into collaboration, teamwork, and group science. We’ve learned, for example, that teams with diverse backgrounds often come up with more creative solutions to tricky problems. We’ve learned that many small, agile groups often outperform large groups with the same amount of people, and that informal discussion outside the work-space contributes in interesting ways to productivity. Many great lessons can be found in Michael Nielsen’s book, Reinventing Discovery.

We can use these historical and lab-based examples to inform the design of social networks. HASTAC already works towards this goal through its scholars program, but there are more steps that may be taken, such as strategically seeking out scholars from underrepresented parts of the network.

So this covers the science, but what about the art?

Well, I spent the entire middle half of this talk discussing how visual rhetoric is linked to ontological metaphors of knowledge. The tree metaphor of knowledge, for example, was so strongly held that it fooled Descartes into breaking his claims of mind-body dualism.

So here is where the artists in the room can also fruitfully contribute to the same goal: by literally designing a better infrastructure. Visually. Illustrations can be remarkably powerful drivers of reconceptualization, and we have the opportunity here to affect changes in the academy more broadly.

One of the great gifts of the social web, at least when it’s designed well, is its ability to let nodes on the farthest limbs of the network still wield remarkable influence over the whole structure. This is why viral videos, Kickstarter projects, and cats playing pianos can become popular without “industry backing”. And the decisions we make in creating illustrations, in fostering online interactions, in designing social interfaces, can profoundly affect the way those interactions reinforce, subvert, or sidestep power structures.

So this is my call to the room: let’s revisit the discussion about designing the community we want to live in.


Thanks very much.

What’s Counted Counts

tl;dr. Don’t rely on data to fix the world’s injustices. An unusually self-reflective and self-indulgent post.

[Edit: this question was prompted by a series of analyses and visualizations I’ve done in collaboration with Nickoal Eichmann, but I purposefully left her out of the majority of this post, as it was one of self-reflection about my own personal choices. A respected colleague pointed out in private that by doing so, I nullified my female collaborator’s contributions to the project, for which I apologize deeply. Nickoal’s input has been integral to all of this, and she and many others, including particularly Jeana Jorgensen and Heather Froehlich (who has written on this very subject), have played vital roles in my own learning about these issues. Recent provocations by Miriam Posner helped solidify a lot of these thoughts and inspired this post. What follows is a self-exploration, recapping what many people have already said, but hopefully still useful to some. Mistakes below shouldn’t reflect poorly on those who influenced or inspired me. The post from this point on is as it originally appeared.]

Someone asked yesterday why I cared enough 1 about gender equality in academia to make this chart (with Nickoal Eichmann).

Gender representation as authors at DH conferences over the last decade. (Women consistently represent around 33% of authors)

I didn’t know how to answer the question. Our culture gives some more and better opportunities than others, so in order to make things better for more people, we must reveal and work towards resolving points of inequality. “Why do I care?” Don’t most of us want to make things better, we just go about it in different ways, and have different ideas of what’s “better”?

But the question did make me consider why I’d started with gender equality, when there are clearly so many other equally important social issues to tackle, within and outside academia. The answer was immediately obvious: ease. I’d attempted to explore racial and ethnic diversity as well, but it was simply more fraught, complicated, and less amenable to my methods than gender, so I started with gender and figured I’d work my way into the weeds from there. 2

I’ll cut to the chase. My well-intentioned attempts at battling inequality suffer their own sort of bias: by focusing on measurements of inequality, I bias that which is easily measured. It’s not that gender isn’t complex (see Miriam Posner’s wonderful recent keynote on these and related issues), but at least it’s a little easier to measure than race & ethnicity, when all you have available to you is what you can look up on the internet.

[scroll down]

Saturday Morning Breakfast Cereal. [source]
While this problem is far from new, it takes special significance in a data-driven world. That which is countable counts, and damn the rest. At its heart, this problem is one of classification and categorization: those social divides which have the clearest seams are those most easily counted. And in a data-driven world, it’s inequality along these clear divides which gets noticed first, even when injustice elsewhere is far greater.

Sex is easy, compared to gender. At most 2% of people are born intersex according to most standards (but not accounting for dysmorphia & similar). And gender is relatively easy compared to race and ethnicity. Nationality is pretty easy because of bureaucratic requirements for passports and citizenship, and country of residence is even easier, unless you live somewhere like Palestine.

But even the Palestine issue isn’t completely problematic, because counting still works fine when one thing exists in multiple categories, or may be categorized differently in different systems. That’s okay.

Where math gets lost is where there are simply no good borders to draw around entities—or worse, there are borders, but those borders themselves are drawn by insensitive outgroups. We see this a lot in the history of colonialism. Have you ever been to the Pitt Rivers Museum in Oxford? It’s a 19th century museum that essentially shows what the 19th century British mind felt about the world: everything that looks like a flute is in the flute cabinet, everything that looks like a gun is in the gun cabinet, and everything that looks like a threatening foreign religious symbol is in the threatening foreign religious symbol cabinet. Counting such a system doesn’t reveal any injustice except that of the counters themselves.

Pitt Rivers Museum [source]
And I’ll be honest here: I want to help make the world a better place, but I’ve got to work to my strengths and know my limits. I’m a numbers guy. I’m at my best when counting stuff, and when there are no sensitive ways to classify, I avoid counting, because I don’t want to be That Colonizing White Dude who tries to fit everything into boxes of his own invention to make himself feel better about what he’s doing for the world. I probably still fall into that trap a lot anyway.

So why did I care enough to count gender at DH conferences? It was (relatively) easy. And it’s needed, as we saw at DH2015 and we’ve seen throughout the digital humanities – we have a gender issue, and a feminism issue, and they both need to be pointed out and addressed. But we also have lots of other issues that I’ll simply never be able to approach, and don’t know how to approach, and am in danger of ignoring entirely if I only rely on quantitative evidence of inequality.

useless by xkcd

Of course, only relying on non-quantitative evidence has its own pitfalls. People evolved and are socialized to spot patterns, to extrapolate from limited information, even when those extrapolations aren’t particularly meaningful or lead to Jesus in a slice of toast. I’m not advocating we avoid metrics entirely (for one, I’d be out of a job), but echoing Miriam Posner’s recent provocation, we need to engage with techniques, approaches, and perspectives that don’t rely on easy classification schemes. Especially, we need to listen when people notice injustice that isn’t easily classified or counted.

“Uh, yes, Scott, who are you writing this for? We already knew this!” most of you are likely asking if you’ve read this far. I’m writing to myself in early college, an engineering student obsessed with counting, who’s slowly learned the holes in a worldview that only relies on quantitative evidence. The one who spent years quantifying his health issues, only to discover the pursuit of a number eventually took precedence over the pursuit of his own health. 3

Hopefully this post helps balance all the bias implicit in my fighting for a better world from a data-driven perspective, by suggesting “data-driven” is only one of many valuable perspectives.


  1. Upon re-reading the original question, it was actually “Why did you do it? (or why are you interested?)”. Still, this post remains relevant.
  2. I’m light on details here because I don’t want this to be an overlong post, but you can read some more of the details on what Nickoal and I are doing, and the decisions we make, in this blog series.
  3. A blog post on mental & physical health in academia is forthcoming.

Acceptances to Digital Humanities 2015 (part 4)


Women are (nearly but not quite) as likely as men to be accepted by peer reviewers at DH conferences, but names foreign to the US are less likely than either men or women to be accepted to these conferences. Some topics are more likely to be written on by women (gender, culture, teaching DH, creative arts & art history, GLAM, institutions), and others more likely to be discussed by men (standards, archaeology, stylometry, programming/software).


You may know I’m writing a series on Digital Humanities conferences, of which this is the zillionth post. 1 This post has nothing to do with DH2015, but instead looks at DH2013, DH2014, and DH2015 all at once. I continue my recent trend of looking at diversity in Digital Humanities conferences, drawing especially on these two posts (1, 2) about topic, gender, and acceptance rates.

This post will be longer than usual, since Heather Froehlich rightly pointed out my methods in these posts aren’t as transparent as they ought to be, and I’d like to change that.

Brute Force Guessing

As someone who deals with algorithms and large datasets, I desperately seek out those moments when really stupid algorithms wind up aligning with a research goal, rather than getting in the way of it.

In the humanities, stupid algorithms are much more likely to get in the way of my research than help it along, and afford me the ability to make insensitive or reductivist decisions in the name of “scale”. For example, in looking for ethnic diversity of a discipline, I can think of two data-science-y approaches to solving this problem: analyzing last names for country of origin, or analyzing the color of recognized faces in pictures from recent conferences.

Obviously these are awful approaches, for a billion reasons that I need not enumerate, but including the facts that ethnicity and color are often not aligned, and last names (especially in the states) are rarely indicative of anything at all. But they’re easy solutions, so you see people doing them pretty often. I try to avoid that.

Sometimes, though, the stars align and the easy solution is the best one for the question. Let’s say we were looking to understand immediate reactions of racial bias; in that case, analyzing skin tone may get us something useful because we don’t actually care about the race of the person, what we care about is the immediate perceived race by other people, which is much more likely to align with skin tone. Simply: if a person looks black, they’re more likely to be treated as such by the world at large.

This is what I’m banking on for peer review data and bias. For the majority of my data on DH conferences, Nickoal Eichmann and I have been going in and hand-coding every single author with a gender that we glean from their website, pictures, etc. It’s quite slow and far from perfect (see my note), but it’s at least more sensitive than the brute force method, and it gets us a rough estimate of gender ratios in DH conferences. We hope to improve it quite soon with user-submitted genders.

But let’s say we want to discuss bias, rather than diversity. In that case, I actually prefer the brute force method, because instead of giving me a sense of the actual gender of an author, it can give me a sense of what the peer reviewers perceive an author’s gender to be. That is, if a peer reviewer sees the name “Mary” as the primary author of an article, how likely is the reviewer to think the article is written by a woman, and will this skew their review?

That’s my goal today, so instead of hand-coding like usual, I went to Lincoln Mullen’s fabulous package for inferring gender from first names in the programming language R. It does so by looking in the US Census and Social Security Database, looking at the percentage of men and women with a certain first name, and then gives you both the ratio of men-to-women with that name, and the most likely guess of the person’s gender.
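To make the idea concrete, here is a minimal Python sketch of name-based inference. It is not Lincoln Mullen’s R package and does not reproduce its interface; the `NAME_COUNTS` table, the `Guess` type, and the `infer_gender` function are hypothetical, and the counts are invented rather than taken from the Census or Social Security data.

```python
from typing import NamedTuple, Optional

# Hypothetical (female, male) counts per first name.
# These numbers are invented for illustration; they are NOT real census data.
NAME_COUNTS = {
    "mary": (9_900, 100),
    "scott": (50, 9_950),
    "jordan": (3_000, 7_000),
}


class Guess(NamedTuple):
    proportion_female: Optional[float]  # share of people with this name recorded as female
    gender: Optional[str]               # most likely guess, or None if the name is unknown


def infer_gender(first_name: str) -> Guess:
    """Guess a gender from a first name using the lookup table above."""
    counts = NAME_COUNTS.get(first_name.strip().lower())
    if counts is None:
        # The name is not in the reference data, so no guess can be made.
        return Guess(None, None)
    female, male = counts
    proportion_female = female / (female + male)
    return Guess(proportion_female, "female" if proportion_female >= 0.5 else "male")


print(infer_gender("Mary"))
print(infer_gender("Xiaoming"))
```

A name missing from the reference table comes back with no guess at all, which is the “could not be inferred” bucket that matters later in this post.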

Inferring Gender for Peer Review

I don’t have a palantír and my DH data access is not limitless. In fact, everything I have I’ve scraped from public or semi-public spaces, which means I have no knowledge of who reviewed what for ADHO conferences, the scores given to submissions, etc. What I do have is the titles and author names for every submission to an ADHO conference since 2013 (explanation), and the final program of those conferences. This means I can see which submissions don’t make it to the presentation stage; that’s not always a reflection of whether an article gets accepted, but it’s probably pretty close.

So here’s what I did: created a list of every first name that appears on every submission, fed the list into Lincoln Mullen’s gender inference machine, and then looked at how often authors guessed to be men made it through to the presentation stage, versus how often authors guessed to be women made it through. That is to say, if an article is co-authored by one man and three women, and it makes it through, I count it as one acceptance for men and three for women. It’s not the only way to do it, but it’s the way I did it.
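The counting scheme above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the post: the `toy_guess` function stands in for the real inference step, and the record layout (`first_names`, `on_program`) and the two sample submissions are made up.

```python
from collections import Counter


def toy_guess(first_name):
    """Hypothetical stand-in for the name-inference step (invented lookup)."""
    return {"mary": "women", "anna": "women", "jane": "women", "scott": "men"}.get(
        first_name.lower()
    )


def acceptance_rates(submissions, guess_gender):
    """Count every author on every submission as a separate authorship event."""
    submitted = Counter()
    accepted = Counter()
    for submission in submissions:
        for first_name in submission["first_names"]:
            label = guess_gender(first_name) or "uninferred"
            submitted[label] += 1
            # "Accepted" here means the submission appears on the final program.
            if submission["on_program"]:
                accepted[label] += 1
    return {label: accepted[label] / submitted[label] for label in submitted}


# Invented sample data: one accepted paper by one man and three women,
# and one paper by a single woman that did not reach the program.
SAMPLE = [
    {"first_names": ["Scott", "Mary", "Anna", "Jane"], "on_program": True},
    {"first_names": ["Mary"], "on_program": False},
]

print(acceptance_rates(SAMPLE, toy_guess))
```

The accepted paper contributes one acceptance for men and three for women, and the name “Mary” is counted twice because it appears on two submissions.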

I’m arguing this can be used as a proxy for gender bias in reviews and editorial decisions: that if first names that look like women’s names are more often rejected 2 than ones that look like men’s names, there’s likely bias in the review process.

Results: Bias in Peer Review?

Totaling all authors from 2013-2015, the inference machine told me 1,008 names looked like women’s names; 1,707 looked like men’s names; and 515 could not be inferred. “Could not be inferred” is code for “the name is foreign-sounding and there’s not enough data to guess”. Remember as well, this is counting every authorship as a separate event, so if Melissa Terras submits one paper in 2013 and one in 2014, the name “Melissa” appears in my list twice.

*drum roll*

Figure 1. Acceptance rates to DH2013-2015 by gender.

So we see that in 2013-2015, 70.3% of woman-authorship-events get accepted, 73.2% of man-authorship-events get accepted, and only 60.6% of uninferrable-authorship-events get accepted. I’ll discuss gender more soon, but this last bit was totally shocking to me. It took me a second to realize what it meant: that if your first name isn’t a standard name on the US Census or Social Security database, you’re much less likely to get accepted to a Digital Humanities conference. Let’s break it out by year.

Figure 2. Acceptance rates to DH2013-2015 by gender and year.

We see some interesting trends here, some surprising, some not. Least surprising is that the acceptance rates for non-US names are most equal this year, when the conference is being held so close to Asia (which the inference machine seems to have the most trouble with). My guess is that A) more non-US people who submit are actually able to attend, and B) reviewers this year are more likely to be from the same sorts of countries that the program is having difficulties with, so they’re less likely to be biased against non-US first names. There’s also potentially a language issue here: that non-US submissions are more likely to be rejected because they are either written in another language, or written in a way that native English speakers may find difficult to understand.

But the fact of the matter is, there’s a very clear bias against submissions by people with names non-standard to the US. The bias, oddly, is most pronounced in 2014, when the conference was held in Switzerland. I have no good guesses as to why.
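One rough way to ask whether gaps like these exceed statistical noise is a two-proportion z-test on the 2013-2015 totals and rates reported above. The sketch below is an addition for illustration, not part of the original analysis. It assumes authorship events are independent, which they are not (the same authors recur, and co-authors share a submission), so it likely overstates the certainty. The function name `two_proportion_z` is hypothetical.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Two-sided tail probability of a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# (acceptance rate, number of authorship events), as reported in this post.
WOMEN = (0.703, 1008)
MEN = (0.732, 1707)
UNINFERRED = (0.606, 515)

# Pool the two inferred groups to compare against the uninferrable names.
inferred_n = WOMEN[1] + MEN[1]
inferred_rate = (WOMEN[0] * WOMEN[1] + MEN[0] * MEN[1]) / inferred_n

z_gender, p_gender = two_proportion_z(*WOMEN, *MEN)
z_names, p_names = two_proportion_z(*UNINFERRED, inferred_rate, inferred_n)

print(f"women vs men: z = {z_gender:.2f}, p = {p_gender:.3f}")
print(f"uninferred vs inferred: z = {z_names:.2f}, p = {p_names:.2g}")
```

On these rounded figures, the gap for uninferrable names should clear the conventional 5% threshold comfortably, while the gap between women’s and men’s names should fall short of it.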

So now that we have the big effect out of the way, let’s get to the small one: gender disparity. Honestly, I had expected it to be worse; it is worse this year than the two previous, but that may just be statistical noise. It’s true that women do fare worse overall by 1-3%, which isn’t huge, but it’s big enough to mention. However.

Topics and Gender

However, it turns out that the entire gender bias effect we see is explained by the topical bias I already covered the other day. (Scroll down for the rest of the post.)

Figure 3. Topic by gender. Total size of horizontal grey bar equals the number of submissions to a topic. Horizontal black bar shows the percentage of that topic with women authors. Orange line shows the 38% mark, which is the expected number of submissions by women given the 38% submission ratio to DH conferences. Topics are ordered top-to-bottom by highest proportion of women. The smaller the grey bar, the more statistical noise / less trustworthy the result.

What’s shown here will be fascinating to many of us, and some of it more surprising than others. A full 67% of authors on the 25 DH submissions labeled “gender studies” are labeled as women by Mullen’s algorithm. And remember, many of those may be the same author; for example if “Scott Weingart” is listed as an author on multiple submissions, this chart counts those separately.

Other topics that are heavily skewed towards women: drama, poetry, art history, cultural studies, GLAM, and (importantly), institutional support and DH infrastructure. Remember how I said a large percentage of those responsible for running DH centers, committees, and organizations are women? This is apparently the topic they’re publishing in.

If we look instead at the bottom of the chart, those topics skewed towards men, we see stylometrics, programming & software, standards, image processing, network analysis, etc. Basically either the CS-heavy topics, or the topics from when we were still “humanities computing”, a more CS-heavy community. These topics, I imagine, inherit their gender ratio problems from the various disciplines we draw them from.

You may notice I left out pedagogical topics from my list above, which are heavily skewed towards women. I’m singling that out specially because, if you recall from my previous post, pedagogical topics are especially unlikely to be accepted to DH conferences. In fact, a lot of the topics women are submitting in aren’t getting accepted to DH conferences, you may recall.

It turns out that the gender bias in acceptance ratios is entirely accounted for by the topical bias. When you break out topics that are not gender-skewed (ontologies, UX design, etc.), the acceptance rates between men and women are the same – the bias disappears. What this means is the small gender bias is coming at the topical level, rather than at the gender level, and since women are writing more about those topics, they inherit the peer review bias.
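The mechanism described here, where an overall gap disappears once you look within topics, can be shown with a toy calculation. Every number in the sketch below is invented to demonstrate the arithmetic and is not the DH conference data. The topic labels are borrowed from this post only for readability, and `ROWS` and `rates` are hypothetical names.

```python
from collections import defaultdict

# (topic, inferred gender, submitted, accepted) -- INVENTED numbers, not real data.
# Within each topic both groups are accepted at the same rate, but women
# submit more to the topic with the lower acceptance rate.
ROWS = [
    ("pedagogy", "women", 60, 30),
    ("pedagogy", "men", 20, 10),
    ("stylometry", "women", 40, 32),
    ("stylometry", "men", 80, 64),
]


def rates(rows, key):
    """Acceptance rate per group, where `key` picks the grouping for each row."""
    submitted = defaultdict(int)
    accepted = defaultdict(int)
    for row in rows:
        group = key(row)
        submitted[group] += row[2]
        accepted[group] += row[3]
    return {group: accepted[group] / submitted[group] for group in submitted}


overall = rates(ROWS, key=lambda row: row[1])
by_topic = rates(ROWS, key=lambda row: (row[0], row[1]))

print(overall)   # aggregated over topics, women fare worse than men
print(by_topic)  # within each topic, the rates are identical
```

In this toy case the whole aggregate gap comes from which topics each group submits to, which is the shape of the explanation offered above.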

Does this mean there is no gender bias in DH conferences?

No. Of course not. I already showed yesterday that 46% of attendees to DH2015 are women, whereas only 35% of authors are. What it means is the bias against topics is gendered, but in a peculiar way that actually may be (relatively) easy to solve, and if we do solve it, it’d also likely go a long way in solving that attendee/authorship ratio too.

Get more women peer reviewing for DH conferences.

Although I don’t know who’s doing the peer reviews, I’d guess that the gender ratio of peer reviewers is about the same as the ratio of authors; 34% women, 66% men. If that is true, then it’s unsurprising that the topics women tend to write about are not getting accepted, because by definition these are the topics that men publishing at DH conferences find less interesting or relevant 3. If reviewers gravitate towards topics of their own interest, and if their interests are skewed by gender, it’d also likely skew results of peer review. If we are somehow able to improve the reviewer ratio, I suspect the bias in topic acceptance, and by extension gender acceptance, will significantly reduce.

Jacqueline Wernimont points out in a comment below that another way of improving the situation is to break the “gender lines” I’ve drawn here, and make sure to attend presentations on topics that are outside your usual scope if (like me) you gravitate more towards one side than another.

Obviously this is all still preliminary, and I plan to show the breakdown of acceptances by topic and gender in a later post so you don’t just have to trust me on it, but at the 2,000-word-mark this is getting long-winded, and I’d like feedback and thoughts before going on.


  1. rounding up to the nearest zillion
  2. more accurately, if they don’t make it to the final program
  3. see Jacqueline Wernimont’s comment below

Acceptances to Digital Humanities 2015 (part 3)


There’s a disparity between gender diversity in authorship and attendance at DH2015; attendees are diverse, authors aren’t. That said, the geography of attendance is actually pretty encouraging this year. A lot of this work draws on a project on the history of DH conferences I’m undertaking with the inimitable Nickoal Eichmann. She’s been integral to the research behind everything you read about conferences pre-2013.

Diversity at DH2015: Preliminary Numbers

For those just joining us, I’m analyzing this year’s international Digital Humanities conference being held in Sydney, Australia (part 1, part 2). This is the 10th post in a series of reflective entries on Digital Humanities conferences, throughout which I explore the landscape of Digital Humanities as it is represented by the ADHO conference. There are other Digital Humanities (a great place to start exploring them is Alex Gil’s arounddh), but since this is the biggest event, it’s also an integral reflection of our community to the public and non-DH academic world.

Figure 1. Map from Around DH in 80 Days.

If the DH conference is our public face, we all hope it does a good job of representing our constituent parts, big or small. It does not. The DH conference systematically underrepresents women and people from parts of the world that are not Europe or North America.

Until today, I wasn’t sure whether this was an issue of underrepresentation, an issue of lack of actual diversity among our constituents, or both. Today’s data have shown me it may be more underrepresentation than lack of diversity, although I can’t yet say anything with certainty without data from more conferences.

I come to this conclusion by comparing attendees to the conference to authors of presentations at the conference. My assumption is that if authorship and attendee diversity are equal, and both poor, then we have a diversity problem. If instead attendance is diverse but authorship is not, then we have a representation problem. It turns out, at least in this dataset, the latter is true. I’ve been able to reach the conclusion because the conference organizing committee (themselves a diverse, fantastic bunch) have published and made available the DH2015 attendance list.

Because this is an important subject, this post is more somber and more technically detailed than most others in this series.


The published Attendance List was nice enough to already attach country names to every attendee, so making an interactive map of attendees was a simple matter of cleaning the data (here it is as csv), aggregating it, and plugging it into CartoDB.
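The aggregation step can be sketched in Python as below. This is an illustration, not the script used for the map: the `country` column name, the `attendees_by_country` function, and the sample rows are assumptions, since the layout of the published csv isn’t described here.

```python
import csv
import io
from collections import Counter


def attendees_by_country(csv_file, column="country"):
    """Count attendees per country from an open csv file with a header row."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        country = (row.get(column) or "").strip()
        if country:  # skip rows where the country is missing
            counts[country] += 1
    return counts


# Invented sample rows standing in for the real attendance list.
SAMPLE_CSV = """name,country
Attendee A,Australia
Attendee B,Australia 
Attendee C,Japan
Attendee D,
"""

counts = attendees_by_country(io.StringIO(SAMPLE_CSV))
for country, n in counts.most_common():
    print(country, n)
```

The per-country totals are the kind of table a mapping tool can then join to country shapes.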

Despite a lack of South American and African attendees, this is still a pretty encouraging map for DH2015, especially compared to earlier years. The geographic diversity of attendees is actually mirrored in the conference submissions (analyzed here), which to my mind means the ADHO decision to hold the conference somewhere other than North America or Europe succeeded in its goal of diversifying the organization. From what I hear, they hope to continue this trend by moving to a three-year rotation, between North America, Europe, and elsewhere. At least from this analysis, that’s a successful strategy.

Figure 2. DH submissions broken down by UN macro-continental regions (details in an earlier post).

If we look at the locations of authors at ADHO conferences from 2004-2013, we see a very different profile than is apparent this year in Sydney. The figure below, made by my collaborator Nickoal Eichmann, shows all author locations from ADHO conferences in this 10-year range.

Figure 3. ADHO conference author locations, 2004-2013. Figure by Nickoal Eichmann.

Notice the difference in geographic profile from this year?

This also hides the sheer prominence of the Americas (really, just North America) at every single ADHO conference since 2004. The figure below shows the percentage of authors from different regions at DH2004-2013, with Europe highlighted in orange during the years the conference was held in Europe.

Figure 4. Geographic home of authors to ADHO conferences 2004-2013. Years when Europe hosted are highlighted in orange.

If you take a second to study this visualization, you’ll notice that with only one major exception in 2012, even when the conference was held in Europe, the majority of authors hailed from the Americas. That’s cray-cray, yo. Compare that to 2015 data from Figure 2; the Americas are still technically sending most of the authors, but the authorship pool is significantly more regionally diverse than the decade of 2004-2013.

Actually, even before the DH conference moved to Australia, we’ve been getting slightly more geographically diverse. Figure 5, below, shows a slight increase in diversity score from 2004-2013.

Figure 5. Regional diversity of authors at ADHO conferences, 2004-2013.
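The post does not say how the diversity score in Figure 5 was calculated. One common choice, shown here purely as an assumption, is Shannon entropy over the regional shares of authors. The regional counts below are made up for illustration.

```python
import math

def shannon_diversity(counts):
    """Shannon entropy (natural log) of a mapping of category -> count."""
    total = sum(counts.values())
    shares = [n / total for n in counts.values() if n > 0]
    return -sum(p * math.log(p) for p in shares)

# Made-up author counts per region, not data from the conference.
concentrated = {"Americas": 90, "Europe": 8, "Asia": 1, "Oceania": 1}
balanced = {"Americas": 25, "Europe": 25, "Asia": 25, "Oceania": 25}

print(shannon_diversity(concentrated))
print(shannon_diversity(balanced))
```

Under this measure, a year where authors are spread evenly across regions scores higher than one dominated by a single region, which is the sense of "more diverse" used above.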

In sum, we’re getting better! Also, our diversity of attendance tends to match our diversity of authorship, which means we’re not suffering an underrepresentation problem on top of a lack of diversity. The lack of diversity is obviously still a problem, but it’s improving, due in no small part to the efforts of ADHO to move the annual conference further afield.

Historical Gender

Gravy train’s over, folks. We’re getting better with geography, sure, but what about gender? Turns out our gender representation in DH sucks, it’s always sucked, and unless we forcibly intervene, it’s likely to continue to suck.

We’ve probably inherited our gender problem from computer science, which is weird, because such a large percentage of leadership in DH organizations, committees, and centers are women. What’s more, the issue isn’t that women aren’t doing DH, it’s that they’re not being well-represented at our international conference. Instead they’re going to other conferences which are focused on diversity, which as Jacqueline Wernimont points out, is less than ideal.

So what’s the data here? Let’s first look historically.

Figure 6. Gender ratio of authors to presentations at DH2004-DH2013. First authorship ratio is in red. In collaboration with Nickoal Eichmann.

Figure 6 shows percentage of women authors at DH2004-DH2013. The data were collected in collaboration with Nickoal Eichmann. 1

Notice the alarming tendency for DH conference authorship to hover between 30% and 35% women. Women fare slightly better as first authors; that is to say, if a woman authors an ADHO presentation, she’s more likely to be a first author than a second or third. This matches well with the fact that a lot of the governing body of DH organizations are women, and yet the ratio does not hold in authorship. I can’t really hazard a guess as to why that is.

Gender in 2015

Which brings us to 2015 in Sydney. I was encouraged to see the organizing committee publish an attendance list, and immediately set out to find the gender distribution of attendees. 2 Hurray! I tweeted. About 46% of attendees to DH2015 were women. That’s almost 50/50!

Armed with the same hope I’ve felt all week (what with two fantastic recent Supreme Court decisions, a Papal decree on global warming, and the dropping of Confederate flags all over the country), I set out to count gender among authors at DH2015.

Preliminary results show 34.6% 3 of authors at DH2015 are women. Status quo quo quo quo.

So how do we reconcile the fact that only 35% of authors at DH2015 are women, yet 46% of attendees are? I’m interpreting this to mean that we don’t have a diversity problem, but a representation problem; for some reason, though women comprise nearly half of active participants at DH conferences, they only comprise a third of what’s actually presented at them.
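To make the comparison concrete, here is a small sketch using the two percentages quoted above. The ratio is just one hypothetical way to summarize the gap, not a measure used in the original analysis.

```python
def representation_ratio(author_share, attendee_share):
    """Share of authors in a group divided by that group's share of attendees.

    A value of 1.0 means the group presents in proportion to its attendance;
    values below 1.0 mean it is underrepresented among authors.
    """
    return author_share / attendee_share

# Percentages of women reported in the post for DH2015.
women_authors = 0.346
women_attendees = 0.46

ratio = representation_ratio(women_authors, women_attendees)
print(round(ratio, 2))
```

On these figures the ratio comes out at roughly 0.75: women appear among authors at about three quarters of the rate their attendance would predict.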

This representation issue is further reflected by the topical analysis of DH2015, which shows that only 10% of presentations are tagged as cultural studies, and only 1% as gender studies. Previous years show a similar low number for both topics. (It’s worth noting that cultural studies tends to have a slightly lower-than-average acceptance rate, while gender studies has a slightly higher-than-average acceptance rate. Food for thought.)

Given this, how do we proceed? At an individual level, obviously, people are already trying to figure out paths forward, but what about at the ADHO level? Their efforts, and efforts of constituent members, have been successful at improving regional diversity at our flagship annual event. What sort of intervention can we create to similarly improve our gender representation problems? Hopefully comments below, or Twitter conversation, might help us collaboratively build a path forward, or offer suggestions to ADHO for future events. 4

Stay tuned for more DH2015 analyses, and in the meantime, keep on fighting the good fight. These are problems we can address as a community, and despite our many flaws, we can actually be pretty good at changing things for the better when we notice our faults.


  1. It’s worth noting we made a lot of simplifying assumptions that we very much shouldn’t have, as Miriam Posner so eloquently pointed out with regard to Getty’s Union List of Artist Names.

    We labeled authors as male, female, or unknown/other. We did not encode changes of author gender over time, even though we know of at least a few authors in the dataset for whom this would apply. We hope to remedy this issue in the near future by asking authors themselves to help us with identification, and we ourselves at least tried to be slightly more sensitive by labeling author gender by hand, rather than by using an algorithm to guess based on the author’s first name.

    This series of choices was problematic, but we felt it was worth it as a first pass as a vehicle to point out bias and lack of representation in DH, and we hope you all will help us improve our very rudimentary dataset soon.

  2. This is an even more problematic analysis than that of conference authorship. I used Lincoln Mullen’s fabulous gender guessing library in R, which guesses gender based on first names and statistics from US Social Security data, but obviously given the regional diversity of the conference, a lot of its guesses are likely off. As with the above data, we hope to improve this set as time goes on.
  3. Very preliminary, but probably not far off; again using Lincoln Mullen’s R library.
  4. Obviously I’m far from the first to come to this conclusion, and many ADHO committee members are already working on this problem (see GO::DH), but the more often we point out problems and try to come up with solutions, the better.

Acceptances to Digital Humanities 2015 (part 2)

Had enough yet? Too bad! Full-ahead into my analysis of DH2015, part of my 6,021-part series on DH conference submissions and acceptances. If you want more context, read the Acceptances to DH2015 part 1.


This post’s about the topical coverage of DH2015 in Australia. If you’re curious about how the landscape compares to previous years, see this post. You’ll see a lot of text, literature, and visualizations this year, as well as archives and digitisation projects. You won’t see a lot of presentations in other languages, or presentations focused on non-text sources. Gender studies is pretty much nonexistent. If you want to get accepted, submit pieces about visualization, text/data, literature, or archives. If you want to get rejected, submit pieces about pedagogy, games, knowledge representation, anthropology, or cultural studies.

Topical analysis

I’m sorry. This post is going to contain a lot of giant pictures, because I’m in the mountains of Australia and I’d much rather see beautiful vistas than create interactive visualizations in d3. Deal with it, dweebs. You’re just going to have to do a lot of scrolling down to see the next batch of text.

This year’s conference presents a mostly-unsurprising continuation of the status quo (see 2014’s and 2013’s topical landscapes). Figure 1, below, shows the top author-chosen topic words of DH2015, as a proportion of the total presentations at the conference. For example, an impressive quarter, 24%, of presentations at DH2015 are about “text analysis”. The authors were able to choose multiple topics for each presentation, which is why the percentages add up to way more than 100%.

Scroll down for the rest of the post.

Figure 1. Topical coverage of DH2015. Percent represents the % of presentations which authors have tagged with a certain topical keyword. Authors could tag multiple keywords per presentation.
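For anyone wondering why the percentages overshoot 100%, here is a minimal sketch of the calculation. The presentations and tags are hypothetical, chosen only to show how overlapping tags inflate the total.

```python
from collections import Counter

# Hypothetical presentations, each tagged with one or more topics.
presentations = [
    {"text analysis", "visualization"},
    {"text analysis", "literary studies"},
    {"archives"},
    {"visualization", "archives", "data mining"},
]

def topic_coverage(tagged):
    """Percent of presentations carrying each topic tag."""
    counts = Counter(topic for tags in tagged for topic in tags)
    return {topic: 100 * n / len(tagged) for topic, n in counts.items()}

coverage = topic_coverage(presentations)
print(coverage)
print(sum(coverage.values()))  # exceeds 100 because tags overlap
```

Each presentation counts once toward every topic it carries, so the denominator stays at the number of presentations while the numerators add up to the number of tags.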

Text analysis, visualization, literary studies, data mining, and archives take top billing. History’s a bit lower, but at least there’s more history than the abysmal showing at DH2013. Only a tenth of DH2015 presentations are about DH itself, which is maybe impressive given how much we talk about ourselves? (cf. this post)

As usual, gender studies representation is quite low (1%), as are foreign language presentations and presentations not centered around text. I won’t do a lot of interpretation this post, because it’d mostly be a repeat of earlier years. At any rate, acceptance rate is a bit more interesting than coverage this time around. Figure 2 shows acceptance rates of each topic, ordered by volume. Figure 3 shows the same, sorted by acceptance rate.

The topics that appear most frequently at the conference are on the far left, and the red line shows the percent of submitted articles that will be presented at DH2015. The horizontal black line is the overall acceptance rate to the conference, 72%, just to show which topics are above or below average.

Figure 2. Acceptance rates of topics to DH2015, sorted by volume.
Figure 3. Acceptance rates of topics to DH2015, sorted by acceptance rate.

Notice that all the most well-represented topics at DH2015 have a higher-than-average acceptance rate, possibly suggesting a bit of path-dependence on the part of peer reviewers or editors. Otherwise, it could mean that, since a majority of peer reviewers were also authors in the conference, and since (as I’ve shown) the majority of authors have a leaning toward text, lit, and visualization, it’s also what they’re likely to rate highly in peer review.
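The per-topic acceptance rates behind Figures 2 and 3 can be sketched roughly as follows. The submissions here are invented for illustration; only the method (accepted submissions with a tag divided by all submissions with that tag) is meant to carry over.

```python
from collections import defaultdict

# Hypothetical submissions: (set of topic tags, whether it was accepted).
submissions = [
    ({"visualization", "text analysis"}, True),
    ({"visualization"}, True),
    ({"pedagogy"}, False),
    ({"pedagogy", "text analysis"}, True),
    ({"pedagogy"}, False),
]

def acceptance_by_topic(subs):
    """Return {topic: (acceptance_rate, number_submitted)}."""
    submitted = defaultdict(int)
    accepted = defaultdict(int)
    for tags, ok in subs:
        for topic in tags:
            submitted[topic] += 1
            accepted[topic] += ok  # True counts as 1
    return {t: (accepted[t] / submitted[t], submitted[t]) for t in submitted}

overall = sum(ok for _, ok in submissions) / len(submissions)
rates = acceptance_by_topic(submissions)

# Sort by volume, as in Figure 2, and flag topics relative to the overall rate.
for topic, (rate, n) in sorted(rates.items(), key=lambda kv: -kv[1][1]):
    side = "above" if rate > overall else "at or below"
    print(f"{topic}: {rate:.0%} of {n} submissions ({side} overall {overall:.0%})")
```

Keeping the submission count next to each rate matters: a topic with three submissions can swing wildly from one year to the next, which is why low-volume topics shouldn't be over-read.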

The first dips we see under the average acceptance rate are “Interdisciplinary Studies” and “Historical Studies” (☹), but the dips aren’t all that low, and we ought not to read too much into them without comparing them to earlier conferences. More significant are the low rates for “Cultural Studies”, and even more than that are the two categories on Teaching, Pedagogy, and Curriculum. Both categories’ acceptance rates are about 20% under the average, and although they’re obviously correlated with one another, the acceptance rates are similar to 2014 and 2013. In short, DH peer reviewers or editors are less likely to accept submissions on pedagogy than on most other topics, even though they sometimes represent a decent chunk of submissions.

Other low points worth pointing out are “Anthropology” (huh, no ideas there), “Games and Meaningful Play” (that one came as a surprise), and “Other” (can’t help you here). Beyond that, the submission counts are too low to read any meaningful interpretations into the data. The Game Studies dip is curious, and isn’t reflected in earlier conferences, so it could just be noise for 2015. The low acceptance rates in Anthropology are consistent across 2013-2015, and it’d be worth looking more into that.

Topical Co-Occurrence, 2013-2015

Figure 4, below, shows how topics appear together on submissions to DH2013, DH2014, and DH2015. Technically this has nothing to do with acceptances, and little to do with this year specifically, but the visualization should provide a little context to the above analysis. Topics connect to one another if they appear on a submission together, and the line connecting them gets thicker the more connections two topics share.
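The edge weights for a network like this can be built by counting topic pairs, as in the sketch below. The tags are hypothetical and the real figure was presumably drawn with dedicated network software; this only illustrates the counting step.

```python
from collections import Counter
from itertools import combinations

# Hypothetical topic tags on four submissions.
submissions = [
    {"visualization", "archives", "digitisation"},
    {"visualization", "text analysis"},
    {"archives", "digitisation"},
    {"visualization", "archives"},
]

def cooccurrence_edges(tagged):
    """Count how often each unordered pair of topics shares a submission."""
    edges = Counter()
    for tags in tagged:
        # Sorting makes (a, b) and (b, a) land on the same key.
        for pair in combinations(sorted(tags), 2):
            edges[pair] += 1
    return edges

edges = cooccurrence_edges(submissions)
for (a, b), weight in edges.most_common():
    print(a, "--", b, weight)
```

Each count becomes the thickness of the line between two topics.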

Figure 4. Topical co-occurrence, 2013-2015.

Although the “Interdisciplinary Collaboration” topic has a low acceptance rate, it understandably ties the network together; other topics that play a similar role are “Visualization”, “Programming”, “Content Analysis”, “Archives”, and “Digitisation”. All unsurprising for a conference where people come together around method and material. In fact, this reinforces our “DH identity” along those lines, at least insofar as it is represented by the annual ADHO conference.

There’s a lot to unpack in this visualization, and I may go into more detail in the next post. For now, I’ve got a date with the Blue Mountains west of Sydney.

Acceptances to Digital Humanities 2015 (part 1)

[Update!] Melissa Terras pointed out I probably made a mistake on 2015 long paper -> short paper numbers. I checked, and she was right. I’ve updated the figures accordingly.


Part 1 is about sheer numbers of acceptances to DH2015 and comparisons with previous years. DH is still growing, but the conference locale likely prohibited a larger conference this year than last. Acceptance rates are higher this year than previous years. Long papers still reign supreme. Papers with more authors are more likely to be accepted.


It’s that time of the year again, when all the good little boys, girls, and other genders of DH gather around the scottbot irregular in pointless meta-analysis, or rather, quiet self-reflection. As most of you know, the 2015 Digital Humanities conference occurs next week in Sydney, Australia. They’ve just released the final program, full of pretty exciting work, which means I can compare it to my analysis of submissions to DH2015 (1, 2, & 3) to see how DH is changing, how work gets accepted or rejected, etc. This is part of my series on analyzing DH conferences.

Part 1 will focus on basic counts, just looking at percentages of acceptance and rejection by the type of presentation, and comparing it with previous years. Later posts will cover topical, gender, geography, and kangaroos. NOTE: When I say “acceptances”, I really mean “presentations that appear on the final program.” More presentations were likely accepted and withdrawn due to the expense of traveling to Australia, so take these numbers with appropriate levels of skepticism. 1


Around 270 papers, posters, and workshops are featured in this year’s conference program, down from last year’s ≈350 but up from DH2013’s ≈240. Although this is the first conference since 2010 with fewer presentations than the previous year’s, I suspect this is due largely to geographic and monetary barriers, and we’ll see a massive uptick next year in Poland and the following in (probably) North America. Whether or not the trend will continue to increase in 2018’s Antarctic locale, or 2019’s special Lunar venue, has yet to be seen. 2

Annual presentations at DH conferences, compared to growth of DHSI in Victoria.

As you can see from the chart above, even given this year’s dip, both DH2015 and the annual DHSI event in Victoria reveal DH is still on the rise. It’s also worth noting that last year’s DHSI was likely the first where more people attended it than the international ADHO conference.

Acceptance Rates

A full 72% of submissions to DH2015 will be presented in Sydney next week. That’s significantly more inclusive than previous years: 59% of submitted manuscripts made it to DH2014 in Lausanne, and 64% to DH2013.

At first blush, the loss of exclusivity may seem a bad sign of a conference desperate for attendees, but to my mind the exact opposite is true: this is a great step forward. Conference peer review & acceptance decisions aren’t particularly objective, so using acceptance as a proxy for quality or relevance is a bit of a misdirection. And if we can’t aim for consistent quality or relevance in the peer review process, we ought to aim at least for inclusivity, or higher acceptance rates, and let the participants themselves decide what they want to attend.


Acceptance rates broken down by form (panel, poster, short paper, long paper) aren’t surprising, but are worth noting.

  • 73% of submitted long papers were accepted, but only 45% of submitted long papers were accepted as long papers. The other 28% were accepted as posters or short papers.
  • 61% of submitted short papers were accepted, but only 51% as short papers; the other 10% became posters.
  • 85% of posters were accepted, all of them as posters.
  • 85% of panels were accepted, but one of them was accepted as a long paper.
  • A few papers/panels were converted into workshops.
How submitted articles eventually were rejected or accepted. (e.g. 45% of submitted long papers were accepted as long papers, 14% as short papers, 15% as posters, and 27% were rejected.)

Weirdly, short papers tend to have a lower acceptance rate than long papers over the last three years. I think that’s because a rejected long paper is usually far enough along that it’s more likely to be secondarily accepted as a poster, but even that doesn’t account for the entire differential in the acceptance rate. Anyone have any thoughts on this?

Looking over time, we see an increasingly large slice of the DH conference pie is taken up by long papers. My guess is this is just a natural growth as authors learn the difference between long and short papers, a distinction which was only introduced relatively recently.

The previous paragraph is simply wrong given the updated data (tip of the hat to Melissa Terras for pointing it out); the ratio of long papers to short papers is still in flux. My “guess” from earlier was just that, a post-hoc explanation attached to an incorrect analysis. Matthew Lincoln has a great description about why we should be wary of these just-so stories. Go read it.

A breakdown of presentation forms at the last three DH conferences.

The breakdown of acceptance rates for each conference isn’t very informative, due in part to the fact I only have the last three years. In another few years this will probably become interesting, but for those who just can’t get enough o’ them sweet sweet numbers, here they are, special for you:

Breakdown of conference acceptances 2013-2015. The right-most column shows the percent of, for example, long papers that were not only accepted, but accepted AS long papers. Yellow rows are total acceptance rates per year.


DH is still pretty single-author-heavy. It’s getting better; over the last 10 years we’ve seen an upward trend in number of authors per paper (more details in a future blog post), but the last three years have remained pretty stagnant. This year, 35% of presentations & posters will be by a single author, 25% by two authors, 13% by 3 authors, and so on down the line. The numbers are unremarkably consistent with 2013 and 2014.

Percent of accepted presentations with a certain number of co-authors in a given year. (e.g. 35% of presentations in 2015 were single-authored.)

We do however see an interesting trend in acceptance rates by number of authors. The more authors on your presentation, the more likely your presentation is to be accepted. This is true of 2013, 2014, and 2015. Single-authored works are 54% likely to be accepted, while works authored by two authors are 67% likely to be accepted. If your submission has more than 7 authors, you’re incredibly unlikely to get rejected.

Acceptance rates by number of authors, 2013-2015. The more authors, the more likely a submission will be accepted.

Obviously this is pure description and correlation; I’m not saying multi-authored works are higher quality or anything else. Sometimes, works with more authors simply have more recognizable names, and thus are more likely to be accepted. That said, it is interesting that large projects seem to be favored in the peer review process for DH conferences.

Stay tuned for parts 2, π, 16, and 4, which will cover such wonderful subjects as topicality, gender, and other things that seem neat.


  1. The appropriate level of skepticism here is 19.27
  2. I hear Elon Musk is keynoting in 2019.

Not Enough Perspectives, Pt. 1

Right now DH is all texts, but not enough perspectives. –Andrew Piper

Summary: Digital Humanities suffers from a lack of perspectives in two ways: we need to focus more on the perspectives of those who interact with the cultural objects we study, and we need more outside academic perspectives. In Part 1, I cover Russian Formalism, questions of validity, and what perspective we bring to our studies. In Part 2, 1 I call for pulling inspiration from even more disciplines, and for the adoption and exploration of three new-to-DH concepts: Appreciability, Agreement, and Appropriateness. These three terms will help tease apart competing notions of validity.


Let’s begin with the century-old Russian Formalism, because why not? 2 Syuzhet, in that context, is juxtaposed against fabula. Syuzhet is a story’s order, structure, or narrative framework, whereas fabula is the underlying fictional reality of the world. Fabula is the story the author wants to get across, and syuzhet is the way she decides to tell it.

It turns out elements of Russian Formalism are resurfacing across the digital humanities, enough so that there’s an upcoming Stanford workshop on DH & Russian Formalism, and even I co-authored a piece that draws on work of Russian formalists. Syuzhet itself has a new meaning in the context of digital humanities: it’s a piece of code that chews books and spits out plot structures.

You may have noticed a fascinating discussion developing recently on statistical analysis of plot arcs in novels using sentiment analysis. A lot of buzz especially has revolved around Matt Jockers and Annie Swafford, and the discussion has bled into larger academia and inspired 246 (and counting) comments on reddit. Eileen Clancy has written a two-part broad link summary (I & II).

From Jockers’ first post describing his method of deriving plot structure from running sentiment analysis on novels.

The idea of deriving plot arcs from sentiment analysis has proven controversial on a number of fronts, and I encourage those interested to read through the links to learn more. The discussion I’ll point to here centers around “validity”, a word being used differently by different voices in the conversation. These include:

  • Do sentiment analysis algorithms agree with one another enough to be considered valid?
  • Do sentiment analysis results agree with humans performing the same task enough to be considered valid?
  • Is Jockers’ instantiation of aggregate sentiment analysis validly measuring anything besides random fluctuations?
  • Is aggregate sentiment analysis, by human or machine, a valid method for revealing plot arcs?
  • If aggregate sentiment analysis finds common but distinct patterns and they don’t seem to map onto plot arcs, can they still be valid measurements of anything at all?
  • Can a subjective concept, whether measured by people or machines, actually be considered invalid or valid?

The list goes on. I contributed to a Twitter discussion on the topic a few weeks back. Most recently, Andrew Piper wrote a blog post around validity in this discussion.

Hermeneutics of DH, from Piper’s blog.

In this particular iteration of the discussion, validity implies a connection between the algorithm’s results and some interpretive consensus among experts. Piper points out that consensus doesn’t yet exist, because:

We have the novel data, but not the reader data. Right now DH is all texts, but not enough perspectives.

And he’s right. So far, DH seems to focus its scaling-up efforts on the written word, rather than the read word.

This doesn’t mean we’ve ignored studying large-scale reception. In fact, I’m about to argue that reception is built into our large corpora text analyses, even though it wasn’t by design. To do so, I’ll discuss the tension between studying what gets written and what gets read through distant reading.

The Great Unread

The Great Unread is a phrase popularized by Franco Moretti 3 to indicate the lost literary canon. In his own words:

[…] the “lost best-sellers” of Victorian Britain: idiosyncratic works, whose staggering short-term success (and long-term failure) requires an explanation in their own terms.

The phrase has since become synonymous with large text databases like Google Books or HathiTrust, and is used in concert with distant reading to set digital literary history apart from its analog counterpart. Distant reading The Great Unread, it’s argued,

significantly increase[s] the researcher’s ability to discuss aspects of influence and the development of intellectual movements across a broader swath of the literary landscape. –Tangherlini & Leonard

Which is awesome. As I understand it, literary history, like history in general, suffers from an exemplar problem. Researchers take a few famous (canonical) books, assume they’re a decent (albeit shining) example of their literary place and period, and then make claims about culture, art, and so forth based on those novels which are available.

Matthew Lincoln raised this point the other day, as did Matthew Wilkens in his recent article on DH in the study of literature and culture. Essentially, both distant- and close-readers make part-to-whole generalized inferences, but the process of distant reading forces those generalizations to become formal and explicit. And hopefully, by looking at The Great Unread (the tens of thousands of books that never made it into the canon), claims about culture can better represent the nuanced literary world of the past.

Franco Moretti’s Distant Reading.

But this is weird. Without exemplars, what the heck are we studying? This isn’t a representation of what’s stood the test of time—that’s the canon we know and love. It’s also not a representation of what was popular back then (well, it sort of was, but more on that shortly), because we don’t know anything about circulation numbers. Most of these Google-scanned books surely never caught the public eye, and many of the now-canonical pieces of literature may not have been popular at the time.

It turns out we kinda suck at figuring out readership statistics, or even at figuring out what was popular at any given time, unless we know what we’re looking for. A folklorist friend of mine has called this the Sophus Bauditz problem. An expert in 19th century Danish culture, my friend one day stumbled across a set of nicely-bound books written by Sophus Bauditz. They were in his era of expertise, but he’d never heard of these books. “Must have been some small print run”, he thought to himself, before doing some research and discovering copies of these books he’d never heard of were everywhere in private collections. They were popular books for the emerging middle class, and sold an order of magnitude more copies than most books of the era; they’d just never made it into the canon. In another century, 50 Shades of Grey will likely suffer the same fate.


In this light, I find The Great Unread to be a weird term. The Forgotten Read, maybe, to refer to those books which people actually did read but were never canonized, and The Great Tsundoku 4 for those books which were published, lasted to the present, and became digitized, but for which we have no idea whether anyone bothered to read them. The former would likely be more useful in understanding reception, cultural zeitgeist, etc.; the latter might find better use in understanding writing culture and perhaps authorial influence (by seeing whose styles the most other authors copy).

Tsundoku is Japanese for the ever-increasing pile of unread books that have been purchased and added to the queue. Illustrated by Reddit user Wemedge’s 12-year-old daughter.

In the present data-rich world we live in, we can still only grasp at circulation and readership numbers. Library circulation provides some clues, as do the number, size, and sales of print editions. It’s not perfect, of course, though it might be useful in separating zeitgeist from actual readership numbers.

Mathematician Jordan Ellenberg recently coined the tongue-in-cheek Hawking Index, because Stephen Hawking’s books are frequently purchased but rarely read, to measure just that. In his Wall Street Journal article, Ellenberg looked at popular books sold on Amazon Kindle to see where people tended to socially highlight their favorite passages. Highlights from Kahneman’s “Thinking, Fast and Slow”, Hawking’s “A Brief History of Time”, and Piketty’s “Capital in the Twenty-First Century” all tended to cluster in the first few pages of the books, suggesting people simply stopped reading once they got a few chapters in.
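As I understand Ellenberg’s description, the index averages the page numbers of a book’s most popular highlights and divides by the book’s length. The sketch below follows that reading; the page numbers are made up and the function name is my own.

```python
def hawking_index(highlight_pages, total_pages):
    """Average page of the most popular highlights as a percent of book length."""
    if not highlight_pages or total_pages <= 0:
        raise ValueError("need at least one highlight and a positive page count")
    average_page = sum(highlight_pages) / len(highlight_pages)
    return 100 * average_page / total_pages

# Made-up page numbers for the five most-highlighted passages of a 400-page book.
front_loaded = hawking_index([3, 8, 12, 15, 22], total_pages=400)
spread_out = hawking_index([40, 120, 210, 300, 380], total_pages=400)
print(front_loaded, spread_out)
```

A low score means the highlights cluster near the front, which is the pattern read as "people stopped early". It is a playful proxy rather than a measurement of readership.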

Kindle and other ebooks certainly complicate matters. It’s been claimed that one reason behind 50 Shades of Grey’s success was the fact that people could purchase and read it discreetly, digitally, without worry about embarrassment. Digital sales outnumbered print sales for some time into its popularity. As Dan Cohen and Jennifer Howard pointed out, it’s remarkably difficult to understand the ebook market, and the market is quite different among different constituencies. Ebook sales accounted for 23% of the book market this year, yet 50% of romance books are sold digitally.

And let’s not even get into readership statistics for novels that are out of copyright, or sold used, or illegally obtained: they’re pretty much impossible to count. Consider It’s a Wonderful Life (yes, the 1946 Christmas movie). A clerical accident pushed the movie into the public domain (sort of) in 1974. It had never really been popular before then, but once TV stations could play it without paying royalties, and VHS companies could legally produce and sell copies for free, the movie shot to popularity. Importantly, it shot to popularity in a way that was impossible to see on official license reports, but which Google ngrams reveals quite clearly.

Google ngram count of It’s a Wonderful Life, showing its rise to popularity after the 1974 copyright lapse.

This ngram visualization does reveal one good use for The Great Tsundoku, and that’s to use what authors are writing about as a finger on the pulse of what people care to write about. This can also be used to track things like linguistic influence. It’s likely no coincidence, for example, that American searches for the word “folks” doubled during the first months of President Obama’s bid for the White House in 2007. 5

American searches for the word “folks” during Obama’s first presidential bid.

Matthew Jockers has picked up on this capability of The Great Tsundoku for literary history in his analyses of 19th century literature. He compares books by various similar features, and uses that in a discussion of literary influence. Obviously the causal chain is a bit muddled in these cases, culture being ouroboric as it is, and containing a great deal more influencing factors than published books, but it’s a good set of first steps.

But this brings us back to the question of The Great Tsundoku vs. The Forgotten Read, or, what are we learning about when we distant read giant messy corpora like Google Books? This is by no means a novel question. Ted Underwood, Matt Jockers, Ben Schmidt, and I had an ongoing discussion on corpus representativeness a few years back, and it’s been continuously pointed to by corpus linguists 6 and literary historians for some time.

Surely there’s some appreciable difference when analyzing what’s often read versus what’s written?

Surprise! It’s not so simple. Ted Underwood points out:

we could certainly measure “what was printed,” by including one record for every volume in a consortium of libraries like HathiTrust. If we do that, a frequently-reprinted work like Robinson Crusoe will carry about a hundred times more weight than a novel printed only once.

He continues

if we’re troubled by the difference between “what was written” and “what was read,” we can simply create two different collections — one limited to first editions, the other including reprints and duplicate copies. Neither collection is going to be a perfect mirror of print culture. Counting the volumes of a novel preserved in libraries is not the same thing as counting the number of its readers. But comparing these collections should nevertheless tell us whether the issue of popularity makes much difference for a given research question.

While his claim skirts the sorts of issues raised by Ellenberg’s Hawking Index, it does present a very reasonable natural experiment: if you ask the same question of three databases (1. The entire messy, reprint-ridden corpus; 2. Single editions of The Forgotten Read, those books which were popular whether canonized or not; 3. The entire Great Tsundoku, everything that was printed at least once, regardless of whether it was read), what will you find?
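The bookkeeping behind two of those three collections is not complicated. Here is a minimal sketch, assuming each digitized volume is a record with a title and a printing year. The field layout and the sample rows are hypothetical, not HathiTrust’s actual schema. One collection keeps every volume, so reprints carry extra weight; the other keeps only the earliest printing of each title.

```python
from collections import Counter

# Hypothetical volume records: (title, year printed). Not real catalogue data.
volumes = [
    ("Robinson Crusoe", 1719),
    ("Robinson Crusoe", 1790),
    ("Robinson Crusoe", 1815),
    ("An Obscure Novel", 1802),
]


def all_volumes(records):
    """Every volume counts once, so reprinted titles are weighted more heavily."""
    return list(records)


def first_editions(records):
    """Keep only the earliest printing of each title."""
    earliest = {}
    for title, year in records:
        if title not in earliest or year < earliest[title]:
            earliest[title] = year
    return sorted(earliest.items())


weights_with_reprints = Counter(title for title, _ in all_volumes(volumes))
weights_first_only = Counter(title for title, _ in first_editions(volumes))
```

Running the same analysis over both collections and comparing the results is the whole experiment; the hard part is assembling trustworthy records, not the code.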

Underwood performed two-thirds of this experiment, comparing The Forgotten Read against the entire HathiTrust corpus on an analysis of the emergence of literary diction. He found that the trend results across both were remarkably similar.

Underwood’s analysis of all HathiTrust prose (47,549 volumes, left), vs. The Forgotten Read (773 volumes, right).

Clearly they’re not precisely the same, but the similarity of their trends suggests that the HathiTrust corpus at least shares some traits with The Forgotten Read. The jury is out on the extent of those shared traits, or whether it shares as much with The Great Tsundoku.

The cause of the similarities between historically popular books and books that made it into HathiTrust should be apparent: 7 historically popular books were more frequently reprinted and thus, eventually, more editions made it into the HathiTrust corpus. Also, as Allen Riddell showed, it’s likely that fewer than 60% of published prose from that period has been scanned, and novels with multiple editions are more likely to appear in the HathiTrust corpus.

This wasn’t actually what I was expecting. I figured the HathiTrust corpus would track more closely to what’s written than to what’s read—and we need more experiments to confirm that’s not the case. But as it stands now, we may actually expect these corpora to reflect The Forgotten Read, a continuously evolving measurement of readership and popularity. 8

Lastly, we can’t assume that greater popularity results in larger print runs in every case, or that those larger print runs would be preserved. Ephemera such as zines and comics, digital works produced in the 1980s, and brittle books printed on acidic paper in the 19th century all have their own increased likelihoods of vanishing. So too does work written by minorities, by the subjected, by the conquered.

The Great Unreads

There are, then, quite a few Great Unreads. The Great Tsundoku was coined with tongue planted firmly in cheek, but we do need a way of talking about the many varieties of Great Unreads, which include but aren’t limited to:

  • Everything ever written or published, along with size of print run, number of editions, etc. (Presumably Moretti’s The Great Unread.)
  • The set of writings which by historical accident ended up digitized.
  • The set of writings which by historical accident ended up digitized, cleaned up with duplicates removed, multiple editions connected and encoded, etc. (The Great Tsundoku.)
  • The set of writings which by historical accident ended up digitized, adjusted for disparities in literacy, class, document preservation, etc. (What we might see if history hadn’t stifled so many voices.)
  • The set of things read proportional to what everyone actually read. (The Forgotten Read.)
  • The set of things read proportional to what everyone actually read, adjusted for disparities in literacy, class, etc.
  • The set of writings adjusted proportionally by their influence, such that highly influential writings are over-represented, no matter how often they’re actually read. (This will look different over time; in today’s context this would be closest to The Canon. Historically it might track closer to a Zeitgeist.)
  • The set of writings which attained mass popularity but little readership and, perhaps, little influence. (Ellenberg’s Hawking Index.)

And these are all confounded by hazy definitions of publication; slowly changing publication culture; geographic, cultural, or other differences which influence what is being written and read; and so forth.

The important point is that reading at scale is not clear-cut. This isn’t a neglected topic, but nor have we laid much groundwork for formal, shared notions of “corpus”, “collection”, “sample”, and so forth in the realm of large-scale cultural analysis. We need to, if we want to get into serious discussions of validity. Valid with respect to what?

This concludes Part 1. Part 2 will get into the finer questions of validity, surrounding syuzhet and similar projects, and will introduce three new terms (Appreciability, Agreement, and Appropriateness) to approach validity in a more humanities-centric fashion.


  1. Coming in a few weeks because we just received our proofs for The Historian’s Macroscope and I need to divert attention there before finishing this.
  2. And anyway I don’t need to explain myself to you, okay? This post begins where it begins. Syuzhet.
  3. The phrase was originally coined by Margaret Cohen.
  4. (see illustration below)
  5. COCA and other corpus tools show the same trend.
  6. Heather Froehlich always has good commentary on this matter.
  7. Although I may be reading this as a just-so story, as Matthew Lincoln pointed out.
  8. This is a huge oversimplification. I’m avoiding getting into regional, class, racial, etc. differences, because popularity obviously isn’t universal. We can also argue endlessly about representativeness, e.g. whether the fact that men published more frequently than women should result in a corpus that includes more male-authored works than female-authored, or whether we ought to balance those scales.

Networks Demystified 9: Bimodal Networks

What do you think, is a year long enough to wait between Networks Demystified posts? I don’t think so, which is why it’s been a year and a month. Welcome back! A recent twitter back-and-forth culminated in a request for a discussion of “bimodal networks”, and my Networks Demystified series seemed like a perfect place for just such a discussion.

What’s a bimodal network, you ask? (Go on, ask aloud at your desk. Nobody will look at you funny, this is the age of Siri!) A bimodal network is one which connects two varieties of things. It’s also called a bipartite, 2-partite, or 2-mode network. A network of authors connected to the papers they write is bimodal, as are networks of books to topics, and people to organizations they are affiliated with.

A bimodal network.

This is a bimodal network which connects people and the clubs they belong to. Alice is a member of the Network Club and the We Love History Society, Bob’s in the Network Club and the No Adults Allowed Club, and Carol’s in the No Adults Allowed Club.

If this makes no sense, read my earlier Networks Demystified posts (the first two posts), or our Historian’s Macroscope chapter, for a primer on networks. If it does make sense, excellent! The rest of this post will hopefully take you out of your comfort zone, but remain understandable to someone who doesn’t speak math.

k-partite Networks & Projections

Bimodal networks are part of a larger class of k-partite networks. Unipartite/unimodal networks have only one type of node (remember, nodes are the stuff being connected by the edges), bipartite/bimodal networks have two types of nodes, tripartite/trimodal networks have three types of nodes, and so on to infinity.

The most common networks you’ll see being researched are unipartite. Who follows whom on Twitter? Who’s writing to whom in early modern Europe? What articles cite which other articles? All are examples of unipartite networks. It’s important to realize this isn’t necessarily determined by the dataset, but by the researcher doing the studying. For example, you can use the same organization affiliation dataset to create a unipartite network of who is in a club with whom, or a bipartite network of which person is affiliated with each organization.

The same dataset used to create a unipartite (left) and a bipartite (right) network.

The above illustration shows the same dataset used to create a unimodal and a bimodal network. The process of turning a pre-existing bimodal network into a unimodal network is called a bimodal projection. This process collapses one set of nodes into edges connecting the other set. In this case, because Alice and Bob are both members of the Network Club, the Network Club collapses into becoming an edge between those two people. The No Adults Allowed Club collapses into an edge between Bob and Carol. Because only Alice is a member of the We Love History Society, it does not collapse into an edge connecting any people.

You can also collapse the network in the opposite direction, connecting organizations that share people. No Adults Allowed and Network Club would share an edge (Bob), as would Network Club and We Love History Society (Alice).
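If you prefer to see the projection spelled out, here is a minimal sketch in plain Python. The data are the three people and three clubs from the illustration; the function names (invert, project) are my own for this example, not part of any network library.

```python
from itertools import combinations

# The affiliation network from the illustration: person -> clubs.
memberships = {
    "Alice": {"Network Club", "We Love History Society"},
    "Bob": {"Network Club", "No Adults Allowed Club"},
    "Carol": {"No Adults Allowed Club"},
}


def invert(mapping):
    """Flip person -> clubs into club -> people (or the reverse)."""
    flipped = {}
    for key, values in mapping.items():
        for value in values:
            flipped.setdefault(value, set()).add(key)
    return flipped


def project(mapping):
    """Connect two keys whenever they share at least one value.

    Returns {frozenset({a, b}): set of shared values}.
    """
    edges = {}
    for a, b in combinations(sorted(mapping), 2):
        shared = mapping[a] & mapping[b]
        if shared:
            edges[frozenset((a, b))] = shared
    return edges


people_edges = project(memberships)          # people linked by shared clubs
club_edges = project(invert(memberships))    # clubs linked by shared people
```

Each projected edge keeps a record of what collapsed into it, so people_edges remembers that the Network Club is what links Alice and Bob. Most software throws that detail away and keeps only the edge.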

Why Bimodal Networks?

If the same dataset can be described with unimodal networks, which are less complex, why go to bi-, tri-, or multimodal? The answer to that is in your research question: different network representations suit different questions better.

Collaboration is a hot topic in bibliometrics. Who collaborates with whom? Why? Do your collaborators affect your future collaborations? Co-authorship networks are well-suited to some of these questions, since they directly connect collaborators who author a piece together. This is a unimodal network: I wrote The Historian’s Macroscope with Shawn Graham and Ian Milligan, so we draw an edge connecting each of us together.

Some of the more focused questions of collaboration, however, require a more nuanced view of the data. Let’s say you want to know how individual instances of collaboration affect individual research patterns going forward. In this case, you want to know more than the fact that I’ve co-authored two pieces with Shawn and Ian, and they’ve co-authored three pieces together.

For this added nuance, we can draw an edge from each of us to The Historian’s Macroscope (rather than each other), then another set of edges to the piece we co-authored in The Programming Historian, and a last set of edges going from Shawn and Ian to the piece they wrote in the Journal of Digital Humanities. That’s three people nodes and three publication nodes.

Scott, Ian, and Shawn’s co-authorship network
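The same co-authorship data can be written down both ways. This is a small sketch, with placeholder labels standing in for the two article titles, showing the bimodal edge list next to the weighted unimodal network it collapses into.

```python
from collections import Counter
from itertools import combinations

# Publication -> authors. The two "piece" labels are placeholders, not titles.
authorship = {
    "The Historian's Macroscope": {"Scott", "Shawn", "Ian"},
    "Programming Historian piece": {"Scott", "Shawn", "Ian"},
    "Journal of Digital Humanities piece": {"Shawn", "Ian"},
}

# Bimodal edge list: one edge per (author, publication) pair.
bimodal_edges = sorted(
    (author, publication)
    for publication, authors in authorship.items()
    for author in authors
)

# Unimodal projection: count how many publications each pair of authors shares.
coauthor_weights = Counter()
for authors in authorship.values():
    for pair in combinations(sorted(authors), 2):
        coauthor_weights[pair] += 1
```

The projection tells you that Shawn and Ian have written together three times, but only the bimodal edge list can tell you which three pieces those were.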

Why Not Bimodal Networks?

Humanities data are often a rich array of node types: people, places, things, ideas, all connected to each other via a complex network. The trade-off is, the more complex and multimodal your dataset, the less you can reasonably do with it. This is one of the fundamental tensions between computational and traditional humanities. More categories lead to a richer understanding of the diversity of human experience, but are incredibly unhelpful when you want to count things.

Consider two pie-charts showing the religious makeup of the United States. The first chart groups together religions that fall under a similar umbrella, and the second does not. That is, the first chart groups religions like Calvinists and Lutherans together into the same pie slice (Protestants), and the second splits them into separate slices. The second, more complex chart obviously presents a richer picture of religious diversity in the United States, but it’s also significantly more difficult to read. It might trick you into thinking there are more Catholics than Protestants in the country, due to how the pie is split.

The same is true in network analysis. By creating a dataset with a hundred varieties of nodes, you lose your ability to see a bigger picture through meaningful aggregations.
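A quick sketch makes the aggregation effect concrete. The percentages and the umbrella assignments below are invented purely for illustration; they are not real survey figures. The point is only that the largest named slice changes depending on how finely you cut the pie.

```python
# Invented shares for illustration only; these are NOT real survey figures.
detailed = {
    "Catholic": 25,
    "Baptist": 15,
    "Methodist": 10,
    "Lutheran": 8,
    "Calvinist": 7,
    "Other": 35,
}

# Hypothetical umbrella assignments for the grouped chart.
umbrella = {
    "Baptist": "Protestant",
    "Methodist": "Protestant",
    "Lutheran": "Protestant",
    "Calvinist": "Protestant",
}


def aggregate(shares, groups):
    """Sum shares under their umbrella label; ungrouped labels pass through."""
    totals = {}
    for label, share in shares.items():
        key = groups.get(label, label)
        totals[key] = totals.get(key, 0) + share
    return totals


grouped = aggregate(detailed, umbrella)
largest_detailed = max((k for k in detailed if k != "Other"), key=detailed.get)
largest_grouped = max((k for k in grouped if k != "Other"), key=grouped.get)
```

With these made-up numbers the detailed chart’s biggest named slice is Catholic, while the grouped chart’s is Protestant. Neither chart is wrong; they answer different questions.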

Surely, you’re thinking, bimodal networks, with only two categories, should be fine! Wellllll, yes and no. You don’t bump into the same aggregation problem you do with very multimodal networks; instead, you bump into technical and mathematical issues. These issues are why I often warn non-technical researchers away from bimodal networks in research. They’re not theoretically unsound, they’re just difficult to work with properly unless you know what changes when you’re working with these complex networks.

The following section will discuss a few network metrics you may be familiar with, and what they mean for bimodal networks.

Network Metrics and Bimodality

The easiest thing to measure in a network is a node’s degree centrality. You’ll recall this is a measurement of how many edges are attached to a node, which gives a rough proxy for this concept we’ve come to call network “centrality”. It means different things depending on your data and your question: the most important or well-connected person in your social network; the point in the U.S. electrical grid which is most vulnerable to attack; the book that shares the most concepts with other books (the encyclopedia?); the city that the most traders pass through to get to their destination. These are all highly “central” in the networks they occupy.

A network with each node labeled with its degree centrality, via Wikipedia.

Degree centrality is the easiest such proxy to compute: how many connections does a node have? The idea is that nodes that are more highly connected are more central. The assumption only goes so far, and it’s easy to come up with nodes that are central that do not have a high degree, as with the network below.

The blue node is highly central, but only has a degree centrality of 3. [via]
That’s the thing with these metrics: if you know how they work, you know which networks they apply well to, and which they do not. If what you mean by “centrality” is “has more friends”, and we’re talking about a Facebook network, then degree centrality is a perfect metric for the job.

If what you mean is “an important stop for river trade”, and we’re talking about 12th century Russia, then degree centrality sucks. The below is an illustration of such a network by Pitts (1978):

Russian river trade routes. Numbers/nodes are cities, and edges are rivers between them.

Moscow is number 35, and pretty clearly the most central according to the above criteria (you’ll likely pass through it to reach other destinations). But it only has a degree centrality of four! Node 9 also has a degree centrality of four, but clearly doesn’t play as important a structural role as Moscow in this network.
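To see how degree and “you’ll likely pass through it” can come apart, here is a sketch on a hypothetical toy network (not the Pitts data): two triangles joined by a single bridge node, X. Alongside degree, it counts how many shortest paths between other nodes run through each node, a quantity usually called betweenness centrality. The node names and helper functions are invented for this example.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy network (not the Pitts data): two triangles joined by "X".
edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a1", "X"), ("X", "b1"),
]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def shortest_paths(source):
    """Breadth-first search: distance and number of shortest paths to each node."""
    dist, count = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                count[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                count[v] += count[u]
    return dist, count


info = {node: shortest_paths(node) for node in graph}


def pass_through(node):
    """Sum, over pairs of other nodes, the share of shortest paths crossing node."""
    total = 0.0
    for s, t in combinations([n for n in graph if n != node], 2):
        dist_s, count_s = info[s]
        dist_t, count_t = info[t]
        if dist_s[node] + dist_t[node] == dist_s[t]:
            total += count_s[node] * count_t[node] / count_s[t]
    return total


degree = {node: len(neighbors) for node, neighbors in graph.items()}
through = {node: pass_through(node) for node in graph}
```

In this toy network X has the lowest degree of any node, yet it scores highest on the pass-through count, because every trip between the two triangles has to cross it.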

We already see that depending on your question, your definitions, and your dataset, specific metrics will either be useful or not. Metrics may change meanings entirely from one network to the next – for example, looking at bimodal rather than unimodal networks.

Consider what degree centrality means for Alice, Bob, and Carol’s bimodal affiliation network above, where each is associated with a different set of clubs. Calculate the degree centralities in your head (hint: if you can’t, you haven’t learned what degree centrality means yet. Try again.).

Alice and Bob have a degree of 2, and Carol has a degree of 1. Is this saying anything about how central each is to the network? Not at all. Compare this to the unimodal projection, and you’ll see Bob is clearly the only structurally central actor in the network. In a bimodal network, degree centrality is nothing more than a count of affiliations with the other half of the network. It is much less likely to tell you something structurally useful than if you were looking at a unimodal network.
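For anyone who would rather check than trust me, this short sketch computes both numbers for the three people: their degree in the bimodal network, and their degree in the unimodal projection. The variable names are my own.

```python
from itertools import combinations

memberships = {
    "Alice": {"Network Club", "We Love History Society"},
    "Bob": {"Network Club", "No Adults Allowed Club"},
    "Carol": {"No Adults Allowed Club"},
}

# Bimodal degree: simply the number of clubs each person belongs to.
bimodal_degree = {person: len(clubs) for person, clubs in memberships.items()}

# Projected degree: the number of other people who share a club with each person.
projected_degree = {person: 0 for person in memberships}
for a, b in combinations(memberships, 2):
    if memberships[a] & memberships[b]:
        projected_degree[a] += 1
        projected_degree[b] += 1
```

Alice drops from 2 to 1 in the projection because one of her two clubs has no other members. The bimodal number was counting her affiliations, not her connections to anyone.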

Consider another common measurement: clustering coefficient. You’ll recall that a node’s local clustering coefficient is the extent to which its neighbors are neighbors to one another. If all my Facebook friends know each other, I have a high clustering coefficient; if none of them know each other, I have a low clustering coefficient. If all of a power plant’s neighbors directly connect to one another, it has a high clustering coefficient, and if they don’t, it has a low clustering coefficient.

Clustering coefficient, from largest to smallest. [via]
This measurement winds up being important for all sorts of reasons, but one way to interpret its meaning is as a proxy for the extent to which a node bridges diverse communities, the extent to which it is an important broker. In the 17th century, Henry Oldenburg was an important broker between disparate scholarly communities, in that he corresponded with people all across Europe, many of whom would never meet one another. The fact that they’d never meet is represented by the local clustering coefficient. It’s low, so we know his neighbors were unlikely to be neighbors of one another.

You can get creative (and network scientists often are) with what this metric means in the context of your own dataset. As long as you know how the algorithm works (taking the fraction of neighbors who are neighbors to one another), and the structural assumptions underlying your dataset, you can argue why clustering coefficient is a useful proxy for answering whatever question you’re asking.

Your argument may be pretty good, like if you say clustering coefficient is a decent (but not the best) proxy for revealing nodes that broker between disparate sections of a unimodal social network. Or your argument may be bad, like if you say clustering coefficient is a good proxy for organizational cohesion on the bimodal Alice, Bob, and Carol affiliation network above.

A thorough glance at the network, and a realization of our earlier definition of clustering coefficient (taking the fraction of neighbors who are neighbors to one another), should reveal why this is a bad justification. Alice’s clustering coefficient is zero. As is Bob’s. As is the Network Club’s. Every node has a clustering coefficient of zero, because no node’s neighbors connect to each other. That’s just the nature of bimodal networks: they connect across, rather than within, modes. Alice can never connect directly with Bob, and the Network Club can never connect directly with the We Love History Society.

Bob’s neighbors (the organizations) can never be neighbors with each other. There will never be a clustering coefficient as we defined it.
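Here is that calculation as a minimal sketch, using the plain definition from earlier (the fraction of a node’s neighbor pairs that are themselves connected). Treating nodes with fewer than two neighbors as zero is a convention I have assumed for this example.

```python
from itertools import combinations

# The bimodal network as plain undirected edges between people and clubs.
edges = [
    ("Alice", "Network Club"),
    ("Alice", "We Love History Society"),
    ("Bob", "Network Club"),
    ("Bob", "No Adults Allowed Club"),
    ("Carol", "No Adults Allowed Club"),
]
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)


def local_clustering(node):
    """Fraction of a node's neighbor pairs that are directly connected."""
    pairs = list(combinations(sorted(neighbors[node]), 2))
    if not pairs:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    linked = sum(1 for a, b in pairs if b in neighbors[a])
    return linked / len(pairs)


coefficients = {node: local_clustering(node) for node in neighbors}
```

Every value in coefficients comes out as zero, people and clubs alike, and no amount of extra data will change that as long as edges only run between a person and a club.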

In short, the simplest definition of clustering coefficient doesn’t work on bimodal networks. It’s obvious if you know how your network works, and how clustering coefficient is calculated, but if you don’t think about it before you press the easy “clustering coefficient” button in Gephi, you’ll be led astray.

Gephi doesn’t know if your network is bimodal or unimodal or ∞modal. Gephi doesn’t care. Gephi just does what you tell it to. You want Gephi to tell you the degree centralities in a bimodal network? Here ya go! You want it to give you the local clustering coefficients of nodes in a bimodal network? Voila! Everything still works as though these metrics would produce meaningful, sensible results.

But they won’t be meaningful on your network. You need to be your own network’s sanity check, and not rely on software to tell you something’s a bad idea. Think about your network, think about your algorithm, and try to work through what an algorithm means in the context of your data.

Using Bimodal Networks

This doesn’t mean you should stop using bimodal networks. Most of the easy network software out there comes with algorithms made for unimodal networks, but other algorithms exist and are available for more complex networks. Very occasionally, but by no means always, you can project your bimodal network to a unimodal network, as described above, and run your unimodal algorithms on that new network projection.

There are a number of times when this doesn’t work well. At 2,300 words, this tutorial is already too long, so I’ll leave thinking through why as an exercise for the reader. It’s less complicated than you’d expect, if you have a pen and paper and know how fractions work.

The better solution, usually, is to use an algorithm meant for bi- or multimodal networks. Tore Opsahl has put together a good primer on the subject with regard to clustering coefficient (slightly mathy, but you can get through it with ample use of Wikipedia). He argues that projection isn’t an optimal solution, but gives a simple algorithm for finding bimodal clustering coefficients, and directions to do so in R. Essentially the algorithm extends the visibility of the clustering coefficient, asking whether a node’s neighbors 2 hops away can reach the others via 2 hops as well. Put another way, I don’t want to know what clubs Bob belongs to, but rather whether Alice and Carol can also connect to one another through a club.
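What follows is my own rough reading of that idea in Python, not a port of Opsahl’s R code, so consult his primer for the exact definition. For a given person, it looks at every pair of other people reached through two different clubs, and asks what share of those pairs also share some third club. The function names and the Chess Club are invented for this example.

```python
from itertools import combinations


def invert(memberships):
    """Flip person -> clubs into club -> people."""
    members = {}
    for person, clubs in memberships.items():
        for club in clubs:
            members.setdefault(club, set()).add(person)
    return members


def two_mode_clustering(memberships, person):
    """Share of 4-paths centered on person that are closed by a third club.

    A 4-path here is: someone - club - person - other club - someone else.
    It is closed when the two end people also share a club not on the path.
    Returns None when the person sits at the center of no 4-paths.
    """
    members = invert(memberships)
    paths = closed = 0
    for c1, c2 in combinations(sorted(memberships[person]), 2):
        for j in members[c1] - {person}:
            for k in members[c2] - {person, j}:
                paths += 1
                if (memberships[j] & memberships[k]) - {c1, c2}:
                    closed += 1
    return closed / paths if paths else None


original = {
    "Alice": {"Network Club", "We Love History Society"},
    "Bob": {"Network Club", "No Adults Allowed Club"},
    "Carol": {"No Adults Allowed Club"},
}
# Hypothetical variant: add an invented Chess Club joining Alice and Carol.
variant = {person: set(clubs) for person, clubs in original.items()}
variant["Alice"].add("Chess Club")
variant["Carol"].add("Chess Club")

bob_before = two_mode_clustering(original, "Bob")
bob_after = two_mode_clustering(variant, "Bob")
```

In the original network Bob scores zero, since Alice and Carol have no club in common. Add the invented Chess Club and his score jumps to one, which is much closer to what we wanted clustering to mean in the first place.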

It’s a bit difficult to write without the use of formulae, but looking at the bimodal network and thinking about what clustering coefficient ought to mean should get you on the right track.

Bimodal networks aren’t an unsolved problem. If you search Google Scholar for bimodal, bipartite, and 2-mode networks, you’ll discover all sorts of clever methods for analyzing bimodal networks, including some great introductory texts by Borgatti and Everett.

The issue is there aren’t easy solutions through platforms like Gephi, and that’s probably on us as Digital Humanists. I’ve found that DHers are much more likely to have bi- or multimodal datasets than most network researchers. If we want to be able to analyze them easily, we need to start developing our own plugins to Gephi, or our own tools, to do so. Push-button solutions are great if you know what’s happening when you push the button.

So let this be an addendum to my previous warnings against using bimodal networks: by all means, use them, but make sure you really think about the algorithms and your data, and what any given metric might imply when run on your network specifically. There are all sorts of free resources online you can find by googling your favorite algorithm. Use them.

For more information, read up on specific algorithms, methods, interpretations, etc. for two-mode networks from Tore Opsahl.


Submissions to Digital Humanities 2015 (pt. 3)

This is the third post in a three-part series analyzing submissions to the 2015 Digital Humanities conference in Australia. In parts 1 & 2, I covered submission volumes, topical coverage, and comparisons to conferences in previous years. This post will briefly address the geography of submissions, further exploring my criticism that this global-themed conference doesn’t feel so global after all. My geographic analysis shows the conference to be more international than I originally suspected.

I’d like to explore whether submissions to DH2015 are more broad in scope than those to previous conferences as well, but given time constraints, I’ll leave that exploration to a later post in this series, which has covered submissions and acceptances at DH conferences since 2013.

For this analysis, I looked at the universities of the submitting (usually lead) author on every submission, and used a geocoder to extract country, region, and continent data for each given university. This means that every submission is attached to one and only one location, even if other authors are affiliated with other places. Not perfect, but good enough for union work. After the geocoding, I corrected the results by hand 1, and present those results here.
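The counting step after geocoding is straightforward. Below is a minimal sketch of it, assuming each submission has already been reduced to a conference year and the lead author’s country. The rows and the country-to-region lookup are invented stand-ins, not my actual dataset or the geocoder’s output.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (conference year, lead author's country). Invented data.
submissions = [
    (2013, "United States"),
    (2013, "United States"),
    (2013, "Germany"),
    (2015, "Australia"),
    (2015, "United States"),
    (2015, "Japan"),
]

# Hypothetical lookup standing in for the geocoder plus hand correction.
region_of = {
    "United States": "Americas",
    "Germany": "Europe",
    "Australia": "Oceania",
    "Japan": "Asia",
}


def regional_shares(rows, lookup):
    """Return {year: {region: share of that year's submissions}}."""
    counts = defaultdict(Counter)
    for year, country in rows:
        counts[year][lookup[country]] += 1
    return {
        year: {region: n / sum(tally.values()) for region, n in tally.items()}
        for year, tally in counts.items()
    }


shares = regional_shares(submissions, region_of)
```

Because each submission contributes exactly one country, every year’s shares sum to one, which is what makes the years comparable despite different submission volumes.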

It is immediately apparent that the DH2015 authors represent a more diverse geographical distribution than those in previous years. DH2013 in Nebraska was the only conference of the three where over half of submissions were concentrated in one continental region, the Americas. The Switzerland conference in 2014 had a slightly more even regional distribution, but still had very few contributions (11%) from Asia or Oceania. Contrast these heavily skewed numbers against DH2015 in Australia, with a third of the contributions coming from Asia or Oceania.

DH submissions broken down by UN macro-continental regions.

The trend continues when broken down by UN micro-continental regions. The trends are not unexpected, but they are encouraging. When the conference was in Switzerland, Northern and Western Europe were much better represented, as was (surprisingly?) Eastern Asia. This may suggest that Eastern Asia’s involvement in DH is on the rise even without taking conference locations into account. Submissions for 2015 in Sydney are well-represented by Australia, New Zealand, Eastern Asia, and even Eastern Europe and Southern Asia.

DH conferences broken down by % covered from region in a given year.

One trend is pretty clear: the dominance of North America. Even at its lowest point in 2015, authors from North America comprise over a third of submissions. This becomes even more stark in the animation below, on which every submitting author’s country is represented.

DH2013-2015 with dots sized by the percent coverage that year.

The coverage from the United States over the course of the last three years barely changes, and from Canada shrinks only slightly when the conference moves off of North America. The UK also pretty much retains its coverage 2013-2015, hovering around 10% of submissions. Everywhere else the trends are pretty clear: a slow move eastward as the conference moves east. It’ll be interesting to see how things change in Poland in 2016, and wherever it winds up going in 2017.

In sum, it turns out “Global Digital Humanities 2015” is, at least geographically, much more global than the conferences of the previous two years. While the most popular topics are pretty similar to those in earlier years, I haven’t yet done an analysis of the diversity of the less popular topics, and it may be that they actually prove more diverse than those in earlier years. I’ll save that analysis for when the acceptances come in, though.


  1. It’s a small enough dataset. There are 648 unique institutional affiliations listed on submissions from 2013-2015, which resolved to 49 unique countries in 14 regions on 4 continents.