Encouraging Misfits

tl;dr Academics’ individual policing of disciplinary boundaries at the expense of intellectual merit does a disservice to our global research community, which is already structured to reinforce disciplinarity at every stage. We should work harder to encourage research misfits to offset this structural pull.


The academic game is stacked to reinforce old community practices. PhDs aren’t only about specialization, but about teaching you to think, act, write, and cite like the discipline you’ll soon join. Tenure is about proving to your peers you are like them. Publishing and winning grants are as much about goodness of fit as about quality of work.

This isn’t bad. One of science’s most important features is that it’s often cumulative or at least agglomerative, that scientists don’t start from scratch with every new project, but build on each other’s work to construct an edifice that often resembles progress. The scientific pipeline uses PhDs, tenure, journals, and grants as built-in funnels, ensuring everyone is squeezed snugly inside the pipes at every stage of their career. It’s a clever institutional trick to keep science cumulative.

But the funnels work too well. Or at least, there’s no equally entrenched clever institutional mechanism for building new pipes, for allowing the development of new academic communities that break the mold. Publishing in established journals that enforce their community boundaries is necessary for your career; most of the world’s scholarly grant programs are earmarked for and evaluated by specific academic communities. It’s easy to be disciplinary, and hard to be a misfit.

To be sure, this is a known problem. Patches abound. Universities set aside funds for “interdisciplinary research” or “underfunded areas”; postdoc positions, centers, and antidisciplinary journals exist to encourage exactly the sort of weird research I’m claiming has little place in today’s university. These solutions are insufficient.

University or even external grant programs fostering “interdisciplinarity” for its own sake become mostly useless because of the laws of Goodhart & Campbell. They’re usually designed to bring disciplines together rather than to sidestep disciplinarity altogether, which, while admirable, makes for a system that’s pretty easy to game and often leads to awkward alliances of convenience.

Dramatic rendition of types of -disciplinarity from Lotrecchiano in 2010, shown here never actually getting outside disciplines.

Universities do a bit better in encouraging certain types of centers that, rather than being “interdisciplinary”, are focused on a specific goal, method, or topic that doesn’t align easily with the local department structure. A new pipe, to extend my earlier bad metaphor. The problems arise here because centers often lack the institutional benefits available to departments: they rely on soft money, don’t get kickback from grant overheads, don’t get money from cross-listed courses, and don’t get tenure lines. Antidisciplinary postdoc positions suffer a similar fate, allowing misfits to thrive for a year or so before having to go back on the job market to rinse & repeat.

In short, the overwhelming inertial force of academic institutions pulls towards disciplinarity despite frequent but half-assed or poorly-supported attempts to remedy the situation. Even when new disciplinary configurations break free of institutional inertia, presenting themselves as means to knowledge every bit as legitimate as traditional departments (chemistry, history, sociology, etc.), it can take decades for them to even be given the chance to fail.

It is perhaps unsurprising that the community which taught us about autopoiesis proved incapable of sustaining itself, though half a century on its influences are glaringly apparent and far-reaching across today’s research universities. I wonder, if we reconfigured the organization of colleges and departments from scratch today, whether there would be more departments of environmental studies and fewer departments of [redacted] 1.

I bring this all up to raise awareness of the difficulty facing good work with no discernible home, and to advocate for some individual action which, though it won’t change the system overnight, will hopefully make the world a bit easier for those who deserve it.

It is this: relax the reflexive disciplinary boundary drawing, and foster programs or communities which celebrate misfits. I wrote a bit about this last year in the context of history and culturomics; historians clamored to show that culturomics was bad history, but culturomics never attempted to be good history—it attempted to be good culturomics. Though I’d argue it often failed at that as well, it should have been evaluated by its own criteria, not the criteria of some related but different discipline.

Some potential ways to move forward:

  • If you are reviewing for a journal or grant and the piece is great, but doesn’t quite fit, and you can’t think of a better home for it, push against the editor to let it in anyway.
  • If you’re a journal editor or grant program officer, be more flexible with submissions which don’t fit your mold but don’t have easy homes elsewhere.
  • If you control funds for research grants, earmark half your money for good work that lacks a home. Not “good work that lacks a home but still looks like the humanities”, or “good work that looks like economics but happens to involve a computer scientist and a biologist”, but truly homeless work. I realize this won’t happen, but if I’m advocating, I might as well advocate big!
  • If you are training graduate students, hiring faculty, or evaluating tenure cases, relax the boundary-drawing urge to say “her work is fascinating, but it’s not exactly our department.”
  • If you have administrative and financial power at a university, commit to supporting nondisciplinary centers and agendas with the creation of tenure lines, the allocation of course & indirect funds, and some of the security offered to departments.

Ultimately, we need clever systems to foster nondisciplinary thinking which are as robust as those systems that foster cumulative research. This problem is above my paygrade. In the meantime, though, we can at least avoid the urge to equate disciplinary fitness with intellectual quality.

Notes:

  1. You didn’t seriously expect me to name names, did you?

Experience

Last week, I publicly outed myself as a non-tenure-track academic diagnosed on the autism spectrum, 1 hoping that doing so might help other struggling academics find solace knowing they are not alone. I was unprepared for the outpouring of private and public support. Friends, colleagues, and strangers thanked me for helping them feel a little less alone, which in turn helped me feel much less alone. Thank you all, deeply and sincerely.

In a similar spirit, for interested allies and struggling fellows, this post is about how my symptoms manifest in the academic world, and how I manage them. 2

Navigating the social world is tough—a fact that may surprise some of my friends and most of my colleagues. I do alright at conferences and in groups, when conversation is polite and skin-deep, but it requires careful concentration and a lot of smoke and mirrors. Inside, it feels like I’m translating from Turkish to Cantonese without knowing either language. Every time this is said, that is the appropriate reply, though I struggle to understand why. I just possess a translation book, and recite what is expected. Stimulus and response. This skill was only recently acquired.

Looking at the point between people’s eyes makes it appear as though I am making direct eye contact during conversations. Certain observations (“you look tired”) are apparently less well-received than others (“you look excited”), and I’ve mostly learned which are which.

After a long day keeping up this appearance, especially at conferences, I find a nice dark room and stay there. Sharing conference hotel rooms with fellow academics is never an option. Some strategies I figured out myself; others, like the eye contact trick, I built over extended discussions with an old girlfriend after she handed me a severely-highlighted copy of The Partner’s Guide to Asperger Syndrome.

ADHD and Autism Spectrum Disorder are highly co-morbid, and I have been diagnosed with either or both by several independent professionals in the last twenty years. Working is hard, and often takes at least twice as much time for me as it does for the peers with whom I have discussed this. When interested in something, I lose myself entirely in it for hours on end, but a single break in concentration will leave me scrambling. It may take hours or days to return to a task, if I do at all. My best work is done in marathon sessions, and work that takes longer than a few days may never get finished, or may drop in quality precipitously. Keeping the internet disconnected and my phone off during regular periods every day, locked in my windowless office, helps keep distractions at bay. But I have yet to discover a good strategy to manage long projects. A career in the book-driven humanities may have been a poor choice.

Paying bills on time, keeping schedules, and replying to emails are among the most stressful tasks in my life. When I don’t adequately handle all of these mundane tasks, it sets in motion a cycle of horror that paralyzes my ability to get anything done, until I eventually file for task bankruptcy and inevitably disappoint colleagues, friends, or creditors to whom action is owed. Poor time management and stress-cycles lead me to over-promise and under-deliver. On the bright side, I recently received help in strategies to improve that, and they work. Sometimes.

Friendships, surprisingly, are easy to maintain but difficult to nourish. My friends consider me trustworthy and willing to help (if not necessarily always dependable), but I lose track of friends or family who aren’t geographically close. Deeper emotional relationships are rare or, for swaths of my life, non-existent. I get no fits of anger or depression or elation or excitement. Indeed, my friends and family remark how impossible it is to see if I like a gift they’ve given me.

People occasionally describe my actions as offensive, rude, or short, and I get frustrated trying to understand exactly why what I’m doing fits into those categories. Apparently, early in grad school, I had a bit of a reputation for asking obnoxious questions in lectures. But I don’t like upsetting people, and actively (maybe successfully?) try to curb these traits when they are pointed out.

Thankfully, academic life allows me the freedom to lock myself in a room and focus on a task. Using work as a coping mechanism for social difficulties may be unhealthy, but hey, at least I found a career that rewards my peculiarities.

My life is pretty great. I have good friends, a loving family, and hobbies that challenge me. As long as I maintain the proper controlled environment, my fixations and obsessions are a perfect complement to an academic career, especially in a culture that (unfortunately) rewards workaholism. The same tenacity often compensates for difficulties in navigating romantic relationships, a few of which have been incredibly fulfilling and valuable over my life thus far.

Unfortunately, my experience on the autism spectrum is not shared by all academics. Some have enough difficulty managing the social world that they end up alienating colleagues who are on their tenure committees, to disastrous effect. From private conversations, it seems autistic women suffer more from this than men, as they are expected to perform more service work and to be more social. Supportive administrators can be vital in these situations, and autism-spectrum academics may want to negotiate accommodations for themselves as part of their hiring process.

Despite some frustrations, I have found my atypical way of interacting with the world to be a feature, not a bug. My atypicality presents as what used to be called Asperger Syndrome, and it is easier for me to interact with the world, and easier for the world to interact with me, than many other autistic individuals. That said, whether or not my friends and colleagues notice, I still struggle with many aspects common to those diagnosed on the autism spectrum: social-emotional difficulties, alexithymia, intensity of focus, hypersensitivity, system-oriented thinking, etc.

Relationships or friendships with someone on the spectrum can be tough, even with someone who doesn’t outwardly present common characteristics, like me. An old partner once vented her frustrations that she couldn’t turn to her friends for advice, because: “everyone just said Scott is so normal and I was thinking [no], he’s just very very good at passing [as socially aware].” Like many who grow up non-neurotypical, I learned a complex set of coping strategies to help me fit in and succeed in a neurotypical world. To concentrate on work, I create an office cave to shut out the world. I use a complicated set of journals, calendars, and apps to keep me on task and ensure I pay bills on time. To stay attentive, I sit at the front of a lecture hall—it even works, sometimes. Some ADHD symptoms are managed pharmacologically.

These strategies give me the 80% push I need to be a functioning member of society, to become someone who can sustain relationships, not get kicked out of his house for forgetting rent, and can almost finish a PhD. Almost. It’s not quite enough to prevent me from a dozen incompletes on my transcripts, but I make do. A host of unrealistically patient and caring friends, family, and colleagues helps. (If you’re someone to whom I still owe work, but am too scared to reply to because of how delinquent I am, thanks for understanding! waves and runs away). Caring allies help. A lot.

My life so far has been a series of successes and confusions. Not unlike anybody else’s life, I suppose. I occupy my own corner of weirdness, which is itself unique enough, but everyone has their own corner. I doubt my writing this will help anyone understand themselves any better, but hopefully it will help fellow academics feel a bit safer in their own weirdness. And if this essay helps our neurotypical colleagues be a bit more understanding of our struggles, and better-informed as allies, all the better.

Notes:

  1. The original article, Stigma, was written for the Conditionally Accepted column of Inside Higher Ed. Jeana Jorgensen, Eric Grollman and Sarah Bray provided invaluable feedback, and I wouldn’t have written it without them. They invited me to write this second article for Inside Higher Ed as well, which was my original intent. I wound up posting it on my blog instead because their posting schedule didn’t quite align with my writing schedule. This shouldn’t be counted as a negative reflection on the process of publishing with that fine establishment.
  2. Let me be clear: I know very little about autism, beyond that I have been diagnosed with it. I’m still learning a lot. This post is about me. Knowing other people face similar struggles has been profoundly helpful, regardless of what causes those struggles.

The Turing Point

Below are some crazy, uninformed ramblings about the least-complex possible way to trick someone into thinking a computer is a human, for the purpose of history research. I’d love some genuine AI/Machine Intelligence researchers to point me to the actual discussions on the subject. These aren’t original thoughts; they spring from countless sci-fi novels and AI research from the ’70s-’90s. Humanists beware: this is super sci-fi speculative, but maybe an interesting thought experiment.


If someone’s chatting with a computer, but doesn’t realize her conversation partner isn’t human, that computer passes the Turing Test. Unrelatedly, if a robot or piece of art is just close enough to reality to be creepy, but not close enough to be convincingly real, it lies in the Uncanny Valley. I argue there is a useful concept in the simplest possible computer which is still convincingly human, and that computer will be at the Turing Point. 1

By Smurrayinchester - self-made, based on image by Masahiro Mori and Karl MacDorman at http://www.androidscience.com/theuncannyvalley/proceedings2005/uncannyvalley.html, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2041097

Forgive my twisting Turing Tests and Uncanny Valleys away from their normal use, for the sake of outlining the Turing Point concept:

  • A human simulacrum is a simulation of a human, or some aspect of a human, in some medium, which is designed to be as-close-as-possible to that which is being modeled, within the scope of that medium.
  • A Turing Test winner is any human simulacrum which humans consistently mistake for the real thing.
  • An occupant of the Uncanny Valley is any human simulacrum which humans consistently doubt as representing a “real” human.
  • Between the Uncanny Valley and Turing Test winners lies the Turing Point, occupied by the least-sophisticated human simulacrum that can still consistently pass as human in a given medium. The Turing Point is a hyperplane in a hypercube, such that there are many points of entry for the simulacrum to “phase-transition” from uncanny to convincing.

Extending the Turing Test

The classic Turing Test scenario is a text-only chatbot which must, in free conversation, be convincing enough for a human to think it is speaking with another human. A piece of software named Eugene Goostman sort-of passed this test in 2014, convincing a third of judges it was a 13-year-old Ukrainian boy.

There are many possible modes in which a computer can act convincingly human. It is easier to make a convincing simulacrum of a 13-year-old non-native English speaker who is confined to text messages than to make a convincing college professor, for example. Thus the former has a lower Turing Point than the latter.

Playing with the constraints of the medium will also affect the Turing Point threshold. The Turing Point for a flesh-covered robot is incredibly difficult to surpass, since so many little details (movement, design, voice quality, etc.) may place it into the Uncanny Valley. A piece of software posing as a Twitter user, however, would have a significantly easier time convincing fellow users it is human.

The Turing Point, then, is flexible to the medium in which the simulacrum intends to deceive, and the sort of human it simulates.

From Type to Token

Convincing the world a simulacrum is any old human is different than convincing the world it is some specific human. This is the token/type distinction; convincingly simulating a specific person (token) is much more difficult than convincingly simulating any old person (type).

Simulations of specific people are all over the place, even if they don’t intend to deceive. Several Twitter-bots exist as simulacra of Donald Trump, reading his tweets and creating new ones in a similar style. Perhaps imitating Poe’s Law, certain people’s styles, or certain types of media (e.g. Twitter), may provide such a low Turing Point that it is genuinely difficult to distinguish humans from machines.

Put differently, the way some Turing Tests may be designed, humans could easily lose.

It’ll be useful to make up and define two terms here. I imagine the concepts already exist, but couldn’t find them, so please comment if they do so I can use less stupid words:

  • type-bot is a machine designed to represent something at the type-level. For example, a bot that can be mistaken for some random human, but not some specific human.
  • token-bot is a machine designed to represent something at the token-level. For example, a bot that can be mistaken for Donald Trump.

Replaying History

Using traces to recreate historical figures (or at least things they could have done) as token-bots is not uncommon. The most recent high-profile example of this is a project to create a new Rembrandt painting in the original style. Shawn Graham and I wrote an article on using simulations to create new plausible histories, among many other examples old and new.

This all got me thinking, if we reach the Turing Point for some social media personalities (that is, it is difficult to distinguish between their social media presence, and a simulacrum of it), what’s to say we can’t reach it for an entire social media ecosystem? Can we take a snapshot of Twitter and project it several seconds/minutes/hours/days into the future, a bit like a meteorological model?

A few questions and obvious problems:

  • Much of Twitter’s dynamics are dependent upon exogenous forces: memes from other media, real world events, etc. Thus, no projection of Twitter alone would ever look like the real thing. One can, however, potentially use such a simulation to predict how certain types of events might affect the system.
  • This is way overkill, and impossibly computationally complex at this scale. You can simulate the dynamics of Twitter without simulating every individual user, because people on average act pretty systematically. That said, for the humanities-inclined, we may gain more insight from the ground-level of the system (individual agents) than macroscopic properties.
  • This is key. Would a set of plausibly-duplicate Twitter personalities on aggregate create a dynamic system that matches Twitter as an aggregate system? That is, just because the algorithms pass the Turing Test, because humans believe them to be humans, does that necessarily imply the algorithms have enough fidelity to accurately recreate the dynamics of a large scale social network? Or will small unnoticeable differences between the simulacrum and the original accrue atop each other, such that in aggregate they no longer act like a real social network?
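The worry in the last bullet can be made concrete with a toy calculation. This is only a sketch under made-up assumptions: it swaps in a simple branching-process model of resharing (not a model of Twitter, and not anything proposed above), and the function name, follower count, and share probabilities are all hypothetical. It shows how a per-user difference too small for a person to notice could still produce very different aggregate behavior.

```python
def expected_cascade_size(followers, share_prob):
    """Mean total size of a branching-process cascade started by one post.

    Assumption (toy model): each post is seen by `followers` users, each of
    whom reshares independently with probability `share_prob`. The mean
    number of reshares per post is r = followers * share_prob, and for r < 1
    the expected total is the geometric series 1 + r + r**2 + ... = 1 / (1 - r).
    """
    r = followers * share_prob
    if r >= 1:
        raise ValueError("r >= 1: expected cascade size is unbounded")
    return 1 / (1 - r)


# Hypothetical numbers: the "real" users and their token-bots differ by
# 0.05 percentage points in how often they reshare.
real = expected_cascade_size(followers=50, share_prob=0.0195)  # about 40
bots = expected_cascade_size(followers=50, share_prob=0.0190)  # about 20
print(f"real network: {real:.1f} posts per cascade, bot network: {bots:.1f}")
```

In this toy setup the two populations would be hard to tell apart one user at a time, yet cascades in the bot network come out roughly half the size. Whether real social systems sit close enough to such a tipping point for this to matter is exactly the open question.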

The last point is, I think, a theoretically and methodologically fertile one for people working in DH, AI, and Cognitive Science: whether reducing human-appreciable differences between machines and people is sufficient to simulate aggregate social behavior, or whether human-appreciability (i.e., the Turing Test) is a strict enough criterion for making accurate predictions about societies.

These points aside, if we ever do manage to simulate specific people (even in a very limited scope) as token-bots based on the traces they leave, it opens up interesting pedagogical and research opportunities for historians. Scott Enderle tweeted a great metaphor for this:

Imagine, as a student, being able to have a plausible discussion with Marie Curie, or sitting in an Enlightenment-era salon. 2 Or imagine, as a researcher (if individual Turing Point machines do aggregate well), being able to do well-grounded counterfactual history that works at the token level rather than at the type level.

Turing Point Simulations

Bringing this slightly back into the realm of the sane, the interesting thing here is the interplay between appreciability (a person’s ability to appreciate enough difference to notice something wrong with a simulacrum) and fidelity.

We can specifically design simulation conditions with incredibly low-threshold Turing Points, even for token-bots. That is to say, we can create a condition where the interactions are simple enough to make a bot that acts indistinguishably from the specific human it is simulating.

At the most extreme end, this is obviously pointless. If our system is one in which a person can only answer “yes” or “no” to pre-selected preference questions (“Do you like ice-cream?”), making a bot to simulate that person convincingly would be trivial.

Putting that aside (lest we get into questions of the Turing Point of a set of Turing Points), we can potentially design reasonably simplistic test scenarios that would allow for an easy-to-reach Turing Point while still being historiographically or sociologically useful. It’s sort of a minimization problem in topological optimizations. Such a goal would limit the burden of the simulation while maximizing the potential research benefit (but only if, as mentioned before, the difference between true fidelity and the ability to win a token-bot Turing Test is small enough to allow for generalization).

In short, the concept of a Turing Point can help us conceptualize and build token-simulacra that are useful for research or teaching. It helps us ask the question: what’s the least-complex-but-still-useful token-simulacrum? It’s also kind-of maybe sort-of like Kolmogorov complexity for human appreciability of other humans: that is, the simplest possible representation of a human that is convincing to other humans.

I’ll end by saying, once again, I realize how insane this sounds, and how far-off. And also how much an interloper I am to this space, having never so much as designed a bot. Still, as Bill Hart-Davidson wrote,

the possibility seems more plausible than ever, even if not soon-to-come. I’m not even sure why I posted this on the Irregular, but it seemed like it’d be relevant enough to some regular readers’ interests to be worth spilling some ink.

Notes:

  1. The name itself is maybe too on-the-nose, being a pun for turning point and thus connected to the rhetoric of singularity, but ¯\_(ツ)_/¯
  2. Yes yes I know, this is SecondLife all over again, but hopefully much more useful.

Work with me! CMU is hiring a DH Developer

Carnegie Mellon University is hiring a DH Developer!

I’ve had a blast since starting as Digital Humanities Specialist at CMU. Enough administrators, faculty, and students are on board to make building a DH strength here pretty easy, and we’re neighbors to Pitt DHRX, a really supportive supercomputing center, and great allies in the Mayor’s Office keen on a city rich with art, data, and both combined.

We want a developer to help jump-start our research efforts. You’ll be working as a full collaborator on projects from all sorts of domains, and as a review board member you’ll have a strong say in which projects they are and how they get implemented. You and I will work together in achievable rapid prototyping, analyzing data, and web deployment.

The idea is we build or do stuff that’s scholarly, interesting, and can have a proof-of-concept or article done in a semester or two. With that, the project can go on to seek additional funding and a full-time specialized programmer, or we can finish there and all be proud authors or creators of something we enjoyed making.

Ideally, you have a social science, humanities, journalism, or similar research background, and the broad tech chops to create a d3 viz, DeepDream some dogs into a work of art, manage a NoSQL database, and whatever else seems handy. Ruby on Rails, probably.

We’re looking for someone who loves playing with new tech stacks, isn’t afraid to get their hands dirty, and knows how to talk to humans. You probably have a static site and a github account. You get excited by interactive data stories, and want to make them with us. This job values breadth over depth and done over perfect.

The job isn’t as insane as it sounds—you don’t actually need to be able to do all this already, just be the sort of person who can learn on the fly. A bachelor’s degree or similar experience is required, with a strong preference for candidates with some research background. You’ll need to submit or point to some examples of work you’ve done.

We’re an equal opportunity employer, and would love to see applications from women, minorities, or other groups who often have a tough time getting developer jobs. If you work here you can take two free classes a semester. Say, who wants a fancy CMU computer science graduate degree? We can offer an awesome city, friendly coworkers, and a competitive salary (also Pittsburgh’s cheap so you wouldn’t live in a closet, like in SF or NYC).

What I’m saying is you should apply ’cause we love you.


The ad, if you’re too lazy to click the link, or are scared CMU hosts viruses:

Job Description
Digital Humanities Developer, Dietrich College of Humanities and Social Sciences

Summary
The Dietrich College of Humanities and Social Sciences at Carnegie Mellon University (CMU) is undertaking a long-term initiative to foster digital humanities research among its faculty, staff, and students. As part of this initiative, CMU seeks an experienced Developer to collaborate on cutting edge interdisciplinary projects.

CMU is a world leader in technology-oriented research, and a highly supportive environment for cross-departmental teams. The Developer would work alongside researchers from Dietrich and elsewhere to plan and implement digital humanities projects, from statistical analyses of millions of legal documents to websites that crowdsource grammars of endangered languages. Located in the Office of The Dean under CMU’s Digital Humanities Specialist, the developer will help start up faculty projects into functioning prototypes where they can acquire sustaining funding to hire specialists for more focused development.

The position emphasizes rapid, iterative deployment and the ability to learn new techniques on the job, with a focus on technologies intersecting data science and web development, such as D3.js, NoSQL, Shiny (R), IPython Notebooks, APIs, and Ruby on Rails. Experience with digital humanities or computational social sciences is also beneficial, including work with machine learning, GIS, or computational linguistics.

The individual in this position will work with clients and the digital humanities specialist to determine achievable short-term prototypes in web development or data analysis/presentation, and will be responsible for implementing the technical aspects of these goals in a timely fashion. As a collaborator, the Digital Humanities Developer will play a role in project decision-making, where appropriate, and will be credited on final products to which they extensively contribute.

Please submit a cover letter, phone numbers and email addresses for two references, a résumé or cv, and a page describing how your previous work fits the job, including links to your github account or other relevant previous work examples.

Qualifications

  • Bachelor’s Degree in humanities computing, digital humanities, informatics, computer science, related field, or equivalent combination of training and experience.
  • At least one year of experience in modern web development and/or data science, preferably in a research and development team setting.
  • Demonstrated knowledge of modern machine learning and web development languages and environments, such as some combination of Ruby on Rails, LAMP, Relational Databases or NoSQL (MongoDB, Cassandra, etc.), MV* & JavaScript (including D3.js), PHP, HTML5, Python/R, as well as familiarity with open source project development.
  • Some system administration.

Preferred Qualifications

  • Advanced degree in digital humanities, computational social science, informatics, or data science. Coursework in data visualization, machine learning, statistics, or MVC web applications.
  • Three or more years at the intersection of web development/deployment and machine learning (e.g. data journalism or digital humanities) in an agile software environment.
  • Ability to assess client needs and offer creative research or publication solutions.
  • Any combination of GIS, NLTK, statistical models, ABMs, web scraping, mahout/hadoop, network analysis, data visualization, RESTful services, testing frameworks, XML, HPC.

Job Function: Research Programming

Primary Location: United States-Pennsylvania-Pittsburgh

Time Type: Full Time

Organization: DIETRICH DEAN’S OFFICE

Minimum Education Level: Bachelor’s Degree or equivalent

Salary: Negotiable

Ghosts in the Machine

Musings on materiality and cost after a tour of The Shoah Foundation.

Forgetting The Holocaust

As the only historian in my immediate family, I’m responsible for our genealogy, saved in a massive GEDCOM file. Through the wonders of the web, I now manage quite the sprawling tree: over 100,000 people, hundreds of photos, thousands of census records & historical documents. The majority came from distant relations managing their own trees, with whom I share.

Such a massive well-kept dataset is catnip for a digital humanist. I can analyze my family! The obvious first step is basic stats, like the most common last name (Aber), average number of kids (2), average age at death (56), or most-frequently named location (New York). As an American Jew, I wasn’t shocked to see New York as the most-common place name in the list. But I was unprepared for the second-most-common named location: Auschwitz.

I’m lucky enough to write this because my great grandparents all left Europe before 1915. My grandparents don’t have tattoos on their arms or horror stories about concentration camps, though I’ve met survivors their age. I never felt so connected to The Holocaust, HaShoah, until I took time to explore the hundreds of branches of my family tree that simply stopped growing in the 1940s.

Aerial photo of Auschwitz-Birkenau. [via wikipedia]
1 of every 16 Jews in the entire world was murdered in Auschwitz, about a million in all. Another 5 million were killed elsewhere. The global Jewish population before the Holocaust was 16.5 million, a number we’re only now approaching again, 70 years later. And yet, somehow, last month a school official and national parliamentary candidate in Canada admitted she “didn’t know what Auschwitz was”.

I grew up hearing “Never Forget” as a mantra to honor the 11 million victims of hate and murder at the hands of Nazis, and to ensure it never happens again. That a Canadian official has forgotten—that we have all forgotten many of the other genocides that haunt human history—suggests how easy it is to forget. And how much work it is to remember.

The material cost of remembering 50,000 Holocaust survivors & witnesses

Yad Vashem (“a place and a name”) represents the attempt to inscribe, preserve, and publicize the names of Jewish Holocaust victims who have no-one to remember them. Over four million names have been collected to date.

The USC Shoah Foundation, founded by Steven Spielberg in 1994 to remember Holocaust survivors and witnesses, is both smaller and larger than Yad Vashem. Smaller because the number of survivors and witnesses still alive in 1994 numbered far fewer than Yad Vashem’s 4.3 million; larger because the foundation conducted video interviews: 100,000 hours of testimony from 50,000 individuals, plus recent additions of witnesses and survivors of other genocides around the world. Where Yad Vashem remembers those killed, the Shoah Foundation remembers those who survived. What does it take to preserve the memories of 50,000 people?

I got a taste of the answer to that question at a workshop this week hosted by USC’s Digital Humanities Program, who were kind enough to give us a tour of the Shoah Foundation facilities. Sam Gustman, the foundation’s CTO and Associate Dean of USC’s Libraries, gave the tour.

Shoah Foundation Digitization Facility [via my camera]
Digital preservation is a complex process. In this case, it began by digitizing 235,000 analog Betacam SP Videocassettes, on which the original interviews had been recorded, a process which took from 2008 to 2012. This had to be done quickly (automatically/robotically), given that cassette tapes are prone to become sticky, brittle, and unplayable within a few decades due to hydrolysis. They digitized about 30,000 hours per year. The process eventually produced 8 petabytes (link to more technical details) of lossless JPEG 2000 videos, roughly the equivalent of 2 million DVDs. Stacked on top of each other, those DVDs would reach three times higher than Burj Khalifa, the world’s tallest tower.

From there, the team spent quite some time correcting errors that existed in the original tapes, and ones that were introduced in the process of digitization. They employed a small army of signal processing students, patented new technologies for automated error detection & processing/cleaning, and wound up cleaning video from about 12,000 tapes. According to our tour guide, cleaning is still happening.

Lest you feel safe knowing that digitization lengthens the preservation time, turns out you’re wrong. Film lasts longer than most electronic storage, but making film copies would have cost the foundation $140,000,000 and made access incredibly difficult. Digital copies would only cost tens of millions of dollars, even though hard-drives couldn’t be trusted to last more than a decade. Their solution was a RAID hard-drive system in an Oracle StorageTek SL8500 (of which they have two), and a nightly process of checking video files for even the slightest of errors. If an error is found, a backup is loaded to a new cartridge, and the old cartridge is destroyed. Their two StorageTeks each fit over 10,000 drive cartridges, have 55 petabytes worth of storage space, weigh about 4,000 lbs, and are about the size of a New York City apartment. If a drive isn’t backed up and replaced within three years, they throw it out and replace it anyway, just in case. And this setup apparently saved the Shoah Foundation $6 million.

StorageTek SL8500 [via CERN]
Oh, and they have another facility a few states away, connected directly via high-bandwidth fiber optic cables, where everything just described is duplicated in case California falls into the ocean.

Not bad for something that costs libraries $15,000 per year, which is about the same as a library would pay for one damn chemistry journal.

So how much does it cost to remember 50,000 Holocaust witnesses and survivors for, say, 20 years? I mean, above and beyond the cost of building a cutting edge facility, developing new technologies of preservation, cooling and housing a freight container worth of hard drives, laying fiber optic cables below ground across several states, etc.? I don’t know. But I do know how much the Shoah Foundation would charge you to save 8 petabytes worth of videos for 20 years, if you were a USC Professor. They’d charge you $1,000/TB/20 years.

The Foundation’s videos take up 8,000 terabytes, which at $1,000 each would cost you $8 million per 20 years, or about half a million dollars per year. Combine that with all the physical space it takes up, and never forgetting the Holocaust is sounding rather prohibitive. And what about after 20 years, when modern operating systems forget how to read JPEG 2000 or interface with StorageTek T10000C Tape Drives, and the Shoah Foundation needs to undertake another massive data conversion? I can see why that Canadian official didn’t manage it.
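For anyone who wants to check that back-of-the-envelope arithmetic, here it is as a few lines of Python. This is only a sketch, not the Foundation's pricing model: the inputs are the figures quoted above (8 petabytes at $1,000 per terabyte per 20 years), decimal units are assumed (1 PB = 1,000 TB), and the function and variable names are invented for illustration.

```python
# Back-of-the-envelope storage cost, using only the figures quoted above.
# Assumption: decimal units, so 1 petabyte = 1,000 terabytes.
TB_PER_PB = 1_000


def storage_cost(petabytes, usd_per_tb_per_term, term_years):
    """Return (cost for the whole term, cost per year) in US dollars."""
    terabytes = petabytes * TB_PER_PB
    total = terabytes * usd_per_tb_per_term
    return total, total / term_years


total, per_year = storage_cost(petabytes=8, usd_per_tb_per_term=1_000, term_years=20)
print(f"${total:,.0f} per 20 years")  # $8,000,000 per 20 years
print(f"${per_year:,.0f} per year")   # $400,000 per year
```

The per-year figure comes to $400,000, which the paragraph above rounds to about half a million.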

The Reconcentration of Holocaust Survivors

While I appreciated the guided tour of the exhibit, and am thankful for the massive amounts of money, time, and effort scholars and donors are putting into remembering Holocaust survivors, I couldn’t help but be creeped out by the experience.

Our tour began by entering a high security facility. We signed our names on little pieces of paper and were herded through several layers of locked doors and small rooms. Not quite the way one expects to enter the project tasked with remembering and respecting the victims of genocide.

The Nazis’ assembly-line techniques for mass extermination led to starkly regular camps, like Auschwitz pictured above, laid out in efficient grids for the purpose of efficient control and killings. “Concentration camp”, by the way, refers to the concentration of people into small spaces, coming from “reconcentration camps” in Cuba. Now we’re concentrating 50,000 testimonies into a couple of closets with production line efficiency, reconcentrating the stories of people who dispersed across the world, so they’re all in one easy-to-access place.

Server farm [via wikipedia]
We’ve squeezed 100,000 hours of testimony into a server farm that consists of a series of boxes embedded in a series of larger boxes, all aligned to a grid; input, output, and eventual destruction of inferior entities handled by robots. Audits occur nightly.

The Shoah Foundation materials were collected, developed, and preserved with the utmost respect. The goal is just, the cause respectable, and the efforts incredibly important. And by reconcentrating survivors’ stories, they can now be accessed by the world. I don’t blame the Foundation for the parallels which are as much a construct of my mind as they are of the society in which this technology developed. Still, on Halloween, it’s hard to avoid reflecting on the material, monetary, and ultimately dehumanizing costs of processing ghosts into the machine.

What’s Counted Counts

tl;dr. Don’t rely on data to fix the world’s injustices. An unusually self-reflective and self-indulgent post.

[Edit: this question was prompted by a series of analyses and visualizations I’ve done in collaboration with Nickoal Eichmann, but I purposefully left her out of the majority of this post, as it was one of self-reflection about my own personal choices. A respected colleague pointed out in private that by doing so, I nullified my female collaborator’s contributions to the project, for which I apologize deeply. Nickoal’s input has been integral to all of this, and she and many others, including particularly Jeana Jorgensen and Heather Froehlich (who has written on this very subject), have played vital roles in my own learning about these issues. Recent provocations by Miriam Posner helped solidify a lot of these thoughts and inspired this post. What follows is a self-exploration, recapping what many people have already said, but hopefully still useful to some. Mistakes below shouldn’t reflect poorly on those who influenced or inspired me. The post from this point on is as it originally appeared.]


Someone asked yesterday why I cared enough [1] about gender equality in academia to make this chart (with Nickoal Eichmann).

Gender representation as authors at DH conferences over the last decade. Context. (Women consistently represent around 33% of authors)

I didn’t know how to answer the question. Our culture gives some more and better opportunities than others, so in order to make things better for more people, we must reveal and work towards resolving points of inequality. “Why do I care?” Don’t most of us want to make things better, even if we go about it in different ways and have different ideas of what’s “better”?

But the question did make me consider why I’d started with gender equality, when there are clearly so many other equally important social issues to tackle, within and outside academia. The answer was immediately obvious: ease. I’d attempted to explore racial and ethnic diversity as well, but it was simply more fraught, complicated, and less amenable to my methods than gender, so I started with gender and figured I’d work my way into the weeds from there. [2]

I’ll cut to the chase. My well-intentioned attempts at battling inequality suffer their own sort of bias: by focusing on measurements of inequality, I bias that which is easily measured. It’s not that gender isn’t complex (see Miriam Posner’s wonderful recent keynote on these and related issues), but at least it’s a little easier to measure than race & ethnicity, when all you have available to you is what you can look up on the internet.

[scroll down]

Saturday Morning Breakfast Cereal. [source]
While this problem is far from new, it takes special significance in a data-driven world. That which is countable counts, and damn the rest. At its heart, this problem is one of classification and categorization: those social divides which have the clearest seams are those most easily counted. And in a data-driven world, it’s inequality along these clear divides which gets noticed first, even when injustice elsewhere is far greater.

Sex is easy, compared to gender. At most 2% of people are born intersex according to most standards (but not accounting for dysmorphia & similar). And gender is relatively easy compared to race and ethnicity. Nationality is pretty easy because of bureaucratic requirements for passports and citizenship, and country of residence is even easier, unless you live somewhere like Palestine.

But even the Palestine issue isn’t completely problematic, because counting still works fine when one thing exists in multiple categories, or may be categorized differently in different systems. That’s okay.

[source]
Where math gets lost is where there are simply no good borders to draw around entities—or worse, there are borders, but those borders themselves are drawn by insensitive outgroups. We see this a lot in the history of colonialism. Have you ever been to the Pitt Rivers Museum in Oxford? It’s a 19th century museum that essentially shows what the 19th century British mind felt about the world: everything that looks like a flute is in the flute cabinet, everything that looks like a gun is in the gun cabinet, and everything that looks like a threatening foreign religious symbol is in the threatening foreign religious symbol cabinet. Counting such a system doesn’t reveal any injustice except that of the counters themselves.

Pitt Rivers Museum [source]
And I’ll be honest here: I want to help make the world a better place, but I’ve got to work to my strengths and know my limits. I’m a numbers guy. I’m at my best when counting stuff, and when there are no sensitive ways to classify, I avoid counting, because I don’t want to be That Colonizing White Dude who tries to fit everything into boxes of his own invention to make himself feel better about what he’s doing for the world. I probably still fall into that trap a lot anyway.

So why did I care enough to count gender at DH conferences? It was (relatively) easy. And it’s needed, as we saw at DH2015 and we’ve seen throughout the digital humanities – we have a gender issue, and a feminism issue, and they both need to be pointed out and addressed. But we also have lots of other issues that I’ll simply never be able to approach, and don’t know how to approach, and am in danger of ignoring entirely if I only rely on quantitative evidence of inequality.

useless by xkcd

Of course, only relying on non-quantitative evidence has its own pitfalls. People evolved and are socialized to spot patterns, to extrapolate from limited information, even when those extrapolations aren’t particularly meaningful or lead to Jesus in a slice of toast. I’m not advocating we avoid metrics entirely (for one, I’d be out of a job), but echoing Miriam Posner’s recent provocation, we need to engage with techniques, approaches, and perspectives that don’t rely on easy classification schemes. Especially, we need to listen when people notice injustice that isn’t easily classified or counted.

“Uh, yes, Scott, who are you writing this for? We already knew this!” most of you are likely asking if you’ve read this far. I’m writing to myself in early college, an engineering student obsessed with counting, who’s slowly learned the holes in a worldview that only relies on quantitative evidence. The one who spent years quantifying his health issues, only to discover the pursuit of a number eventually took precedence over the pursuit of his own health. [3]

Hopefully this post helps balance all the bias implicit in my fighting for a better world from a data-driven perspective, by suggesting “data-driven” is only one of many valuable perspectives.

Notes:

  1. Upon re-reading the original question, it was actually “Why did you do it? (or why are you interested?)”. Still, this post remains relevant.
  2. I’m light on details here because I don’t want this to be an overlong post, but you can read some more of the details on what Nickoal and I are doing, and the decisions we make, in this blog series.
  3. A blog post on mental & physical health in academia is forthcoming.

Down the Rabbit Hole

WHEREIN I get angry at the internet and yell at it to get off my lawn.

You know what’s cool? Ryan Cordell and friends’ Viral Texts project. It tracks how 19th-century U.S. newspapers used to copy texts from each other, little snippets of news or information, and republish them in their own publications. A single snippet of text could wind its way all across the country, sometimes changing a bit like a game of telephone, rarely-if-ever naming the original author.

Which newspapers copied from one another, from the Viral Texts project.

Isn’t that a neat little slice of journalistic history? Different copyright laws, different technologies of text, different constraints of the medium, they all led to an interesting moment of textual virality in 19th-century America. If I weren’t a historian who knew better, I’d call it something like “quaint” or “charming”.

You know what isn’t quaint or charming? Living in the so-called “information age”, where everything is intertwingled, with hyperlinks and text costing pretty much zilch, and seeing the same gorram practices.

What follows is a rant. They say never to blog in anger. But seriously.

Inequality in Science

Tonight Alex Vespignani, notable network scientist, tweeted a link to an interesting-sounding study about inequality in scientific publishing. In Quartz! I like Quartz, it’s where Christopher Mims used to post awesome science things. Part of their mission statement reads:

In all that we do at Quartz, we embrace openness: open source code, an open newsroom, and open access to the data behind our journalism.

Pretty cool, right?

Anyway, here’s the tweet:

It links to this article on a “map of the world’s scientific research”. Because Vespignani tweeted it, I took it seriously (yes yes I know rt≠endorsement), and read the article. It describes a cartogram map of scientific research publications which shows how the U.S. and Western Europe (and a bit of China) dominate the research world, making the point that such a disparity is “disturbingly unequal”.

Map of scientific research, by how many published articles are produced in a country, pulled from qz.com

“What’s driving the inequality?” they ask. Money & tech play a big role. So does what counts as “high impact” in science. What’s worse, the journalist writes,

In the worst cases, the global south simply provides novel empirical sites and local academics may not become equal partners in these projects about their own contexts.

The author points out an issue with the data: it only covers journals, not monographs, grey literature, edited volumes, etc. This often excludes the humanities and social sciences. The author also raises the issue of journal paywalls and how they decrease access for researchers in countries without large research budgets. We need to do better on “open dissemination”, the article claims.

Sources

Hey, that was a good read! I agree with everything the author said. What’s more, it speaks to my research, because I’ve done a fair deal of science mapping myself at the Cyberinfrastructure for Network Science Center under Katy Börner. Great, I think, let’s take a look at the data they’re using, given Quartz’s mission statement about how they always use open data.

I want to see the data because I know a lot of scientific publication indexing sites do a poor job of indexing international publications, and I want to see how it accounts for that bias. I look at the bottom of the page.

Crap.

This post originally appeared at The Conversation. Follow @US_conversation on Twitter. We welcome your comments at ideas@qz.com.

Alright, no biggie, time to look at the original article on The Conversation, a website whose slogan is “Academic rigor, journalistic flair”. Neat, academic rigor, I like the sound of that.

I scroll to the bottom, looking for the source.

A longer version of this article originally appeared on the London School of Economics’ Impact Blog.

Hey, the LSE Impact blog! They usually publish great stuff surrounding metrics and the like. Cool, I’ll click the link to read the longer version. The author writes something interesting right up front:

What would it take to redraw the knowledge production map to realise a vision of a more equitable and accurate world of knowledge?

A more accurate world of knowledge? Was this map inaccurate in a way the earlier articles didn’t report? I read on.

Well, this version of the article goes on a little to say that people in the global south aren’t always publishing in “international” journals. That’s getting somewhere, maybe the map only shows “international journals”! (Though she never actually makes that claim). Interestingly, the author writes of literature in the global south:

Even when published, this kind of research is often not attributed to its actual authors. It has the added problem of often being embargoed, with researchers even having to sign confidentiality agreements or “official secrets acts” when they are given grants. This is especially bizarre in an era where the mantra of publically funded research being made available to the public has become increasingly accepted.

Amen to that. Authorship information and openness all the way!

So who made this map?

Oh, the original article (though not the one in Quartz or The Conversation) has a link right up front to something called “The World of Science”. The link doesn’t actually take you to the map pictured, it just takes you to a website called worldmapper that’s filled with maps, letting you fend for yourself. That’s okay, my google-fu is strong.

www.worldmapper.org

I type “science” in the search bar.

Found it! Map #205, created by no-author-name-listed. The caption reads:

Territory size shows the proportion of all scientific papers published in 2001 written by authors living there.

Also, it only covers “physics, biology, chemistry, mathematics, clinical medicine, biomedical research, engineering, technology, and earth and space sciences.” I dunno about you, but I can name at least 2.3 other types of science, but that’s cool.

In tiny letters near the bottom of the page, there are a bunch of options, including the ability to see the poster or download the data in Excel.

SUCCESS. ish.

Map of Science Poster from worldmapper.org

Ahhhhh I found the source! I mean, it took a while, but here it is. You apparently had to click “Open PDF poster, designed for printing.” It takes you to a 2006 poster, which marks that it was made by the SASI Group from Sheffield and Mark Newman, famous and awesome complex systems scientist from Michigan. An all-around well-respected dude.

To recap, that’s a 7/11/2015 tweet, pointing to a 7/11/2015 article on Quartz, pointing to a 7/8/2015 article on The Conversation, pointing to a 4/29/2013 article on the LSE Impact Blog, pointing to a website made Thor-knows-when, pointing to a poster made in 2006 with data from 2001. And only the poster cites the name of the creative team who originally made the map. Blood and bloody ashes.

Intermission

Please take a moment out of your valuable time to watch this video clip from the BBC’s television adaptation of Douglas Adams’s Hitchhiker’s Guide to the Galaxy. I’ll wait.

If you’re hard-of-hearing, read some of the transcript instead.

What I’m saying is, the author of this map was “on display at the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying beware of the leopard.”

The Saga Continues

Okay, at least I now can trust the creation process of the map itself, knowing Mark Newman had a hand in it. What about the data?

Helpfully, worldmapper.org has a link to the data as an Excel Spreadsheet. Let’s download and open it!

Frak. Frak frak frak frak frak.

My eyes.

Excel data for the science cartogram from worldmapper.org

Okay Scott. Deep breaths. You can brave the unicornfarts color scheme and find the actual source of the data. Be strong.

“See the technical notes” it says. Okay, I can do that. It reads:

Nearly two thirds of a million papers were published in enumerated science journals in 2001

Enumerated science journals? What does enumerated mean? Whatever, let’s read on.

The source of this data is the World Bank’s 2005 World Development Indicators, in the series on Scientific and technical journal articles (IP.JRN.ARTC.SC).

Okay, sweet, IP.JRN.ARTC.SC at the World Bank. I can Google that!

It brings me to the World Bank’s site on Scientific and technical journal articles. About the data it says:

Scientific and technical journal articles refer to the number of scientific and engineering articles published in the following fields: physics, biology, chemistry, mathematics, clinical medicine, biomedical research, engineering and technology, and earth and space sciences

Yep, knew that already, but it’s good to see the sources agreeing with each other.
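For anyone retracing these steps without the Googling, the same series can in principle be requested by its code. The sketch below assumes the World Bank's public v2 REST API at api.worldbank.org and its JSON layout of a two-element list (metadata first, records second). The function names are invented for illustration, nothing here comes from worldmapper, and the numbers served today may not match the 2005 edition the map was built from.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

INDICATOR = "IP.JRN.ARTC.SC"  # series code quoted in the technical notes


def build_url(indicator, year, per_page=400):
    """Build a World Bank v2 API URL for one indicator and one year, all countries."""
    query = urlencode({"format": "json", "date": year, "per_page": per_page})
    return f"https://api.worldbank.org/v2/country/all/indicator/{indicator}?{query}"


def parse_records(payload):
    """Turn the API's [metadata, records] list into {country name: value}.

    Entries with no reported value (None) are skipped.
    """
    records = payload[1] or []
    return {
        rec["country"]["value"]: rec["value"]
        for rec in records
        if rec["value"] is not None
    }


def fetch_counts(indicator=INDICATOR, year=2001):
    """Download and parse the series (needs network access; not called here)."""
    with urlopen(build_url(indicator, year)) as response:
        return parse_records(json.load(response))


def shares(counts):
    """Each entry's proportion of the total, which is what the cartogram draws."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}
```

One caveat on the design: the `country/all` endpoint also returns regional and income-group aggregates alongside individual countries, so those rows would need filtering out before `shares` gives anything comparable to the map.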

I look for the data source to no avail, but eventually do see a small subtitle “National Science Foundation, Science and Engineering Indicators.”

Alright /me *rolls sleeves*, IRC-style.

Eventually, through the Googles, I find my way to what I assume is the original data source website, although at this point who the hell knows? NSF Science and Engineering Indicators 2006.

Want to know what I find? A 1,092-page report (honestly, see the pdfs, volumes 1 & 2) within which, presumably, I can find exactly what I need to know. In the 1,092-page report.

I start with Chapter 5: Academic Research and Development. Seems promising.

Three-quarters-of-the-way-down-the-page, I see it. It’s shimmering in blue and red and gold to my Excel-addled eyes.

S&E

Could this be it? Could this be the data source I was searching for, the Science Citation Index and the Social Sciences Citation Index? It sounds right! Remember the technical notes, which state “Nearly two thirds of a million papers were published in enumerated science journals in 2001”? That fits with the number in the picture above! Let’s click on the link to the data.

There is no link to the data.

There is no reference to the data.

That’s OKAY. WE’RE ALRIGHT. THERE ARE DATA APPENDICES IT MUST BE THERE. EVEN THOUGH THIS IS A REAL WEBSITE WITH HYPERTEXT LINKS AND THEY DIDN’T LINK TO DATA IT’S PROBABLY IN THE APPENDICES RIGHT?

Do you think the data are in the section labeled “Tables” or “Appendix Tables”? Don’t you love life’s little mysteries?

(Hint: I checked. After looking at 14 potential tables in the “Tables” section, I decided it was in the “Appendix Tables” section.)

Success! The World Bank data is from Appendix Table 5-41, “S&E articles, by region and country/economy: 1988–2003”.

Wait a second, friends, this can’t be right. If this is from the Science Citation Index and the Social Science Citation Index, then we can’t really use these metrics as a good proxy for global scientific output, because the criteria for national inclusion in the index are apparently kind of weird and can skew the output results.

Also, and let me be very clear about this,

This dataset actually covers both science and social science. It is, you’ll recall, the Science Citation Index and the Social Sciences Citation Index. [edit: at least as far as I can tell. Maybe they used different data, but if they did, it’s World Bank’s fault for not making it clear. This is the best match I could find.]

In Short

Which brings us back to Do. The article on Quartz made (among other things) two claims: that the geographic inequality of scientific output is troubling, and that the map really ought to include social scientific output.

And I agree with both of these points! And all the nuanced discussion is respectable and much-needed.

But by looking at the data, I just learned that A) the data the map draws from is not really a great representation of global output, and B) social scientific output is actually included.

I leave you with the first gif I’ve ever posted on my blog:

source: http://s569.photobucket.com/user/SuperFlame64/media/kramer_screaming.gif.html
real source: Seinfeld. Seriously, people.

You know what’s cool? Ryan Cordell and friends’ Viral Texts project. It tracks how 19th-century U.S. newspapers used to copy texts from each other, little snippets of news or information, and republish them in their own publications. A single snippet of text could wind its way all across the country, sometimes changing a bit like a game of telephone, rarely-if-ever naming the original author.

—————————————————————————————————

(p.s. I don’t blame the people involved, doing the linking. It’s just the tumblr-world of 19th century newspapers we live in.)

[edit: I’m noticing some tweets are getting the wrong idea, so let me clarify: this post isn’t a negative reflection on the research therein, which is needed and done by good people. It’s frustration at the fact that we write in an environment that affords full references and rich hyperlinking, and yet we so often revert to context-free tumblr-like reblogging which separates text from context and data. We’re reverting to the affordances of 18th century letters, 19th century newspapers, 20th century academic articles, etc., and it’s frustrating.]


The moral role of DH in a data-driven world

This is the transcript from my closing keynote address at the 2014 DH Forum in Lawrence, Kansas. It’s the result of my conflicted feelings on the recent Facebook emotional contagion controversy, and despite my earlier tweets, I conclude the study was important and valuable specifically because it was so controversial.

For the non-Digital-Humanities (DH) crowd, a quick glossary. Distant Reading is our new term for reading lots of books at once using computational assistance; Close Reading is the traditional term for reading one thing extremely minutely, exhaustively.


Networked Society

Distant reading is a powerful thing, an important force in the digital humanities. But so is close reading. Over the next 45 minutes, I’ll argue that distant reading occludes as much as it reveals, resulting in significant ethical breaches in our digital world. Network analysis and the humanities offer us a way out, a way to bridge personal stories with the big picture, and to bring a much-needed ethical eye to the modern world.

Today, by zooming in and out, from the distant to the close, I will outline how networks shape our world and our lives, and what we in this room can do to set a path going forward.

Let’s begin locally.

1. Pale Blue Dot


You are here. That’s a picture of Kansas, from four billion miles away.

In February 1990, after years of campaigning, Carl Sagan convinced NASA to turn the Voyager 1 spacecraft around to take a self-portrait of our home, the Earth. This is the most distant reading of humanity that has ever been produced.

I’d like to begin my keynote with Carl Sagan’s own words, his own distant reading of humanity. I’ll spare you my attempt at the accent:

Consider again that dot. That’s here. That’s home. That’s us. On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives. The aggregate of our joy and suffering, thousands of confident religions, ideologies, and economic doctrines, every hunter and forager, every hero and coward, every creator and destroyer of civilization, every king and peasant, every young couple in love, every mother and father, hopeful child, inventor and explorer, every teacher of morals, every corrupt politician, every ‘superstar,’ every ‘supreme leader,’ every saint and sinner in the history of our species lived there – on a mote of dust suspended in a sunbeam.

What a lonely picture Carl Sagan paints. We live and die in isolation, alone in a vast cosmic darkness.

I don’t like this picture. From too great a distance, everything looks the same. Every great work of art, every bomb, every life is reduced to a single point. And our collective human experience loses all definition. If we want to know what makes us, us, we must move a little closer.

2. Black Rock City

We’ve zoomed into Black Rock City, more popularly known as Burning Man, a city of 70,000 people that exists for only a week in a Nevada desert, before disappearing back into the sand until the following year. Here life is apparent; the empty desert is juxtaposed against a network of camps and cars and avenues, forming a circle with some ritualistic structure at its center.

The success of Burning Man is contingent on collaboration and coordination; on the careful allocation of resources like water to keep its inhabitants safe; on the explicit planning of organizers to keep the city from descending into chaos year after year.

And the creation of order from chaos, the apparent reversal of entropy, is an essential feature of life. Organisms and societies function through the careful coordination and balance of their constituent parts. As these parts interact, patterns and behaviors emerge which take on a life of their own.

3. Complex Systems

Thus cells combine to form organs, organs to form animals, and animals to form flocks.

We call these networks of interactions complex systems, and we study complex systems using network analysis. Network analysis as a methodology takes as a given that nothing can be properly understood in total isolation. Carl Sagan’s pale blue dot, though poignant and beautiful, is too lonely and too distant to reveal anything of us creatures who inhabit it.

We are not alone.

4. Connecting the Dots

When looking outward rather than inward, we find we are surrounded on all sides by a hundred billion galaxies each with a hundred billion stars. And for as long as we can remember, when we’ve stared up into the night sky, we’ve connected the dots. We’ve drawn networks in the stars in order to make them feel more like us, more familiar, more comprehensible.

Nothing exists in isolation. We use networks to make sense of our place in the vast complex system that contains protons and trees and countries and galaxies. The beauty of network analysis is its ability to transcend differences in scale, such that there is a place for you and for me, and our pieces interact with other pieces to construct the society we occupy. Networks allow us to see the forest and the trees, to give definition to the microcosms and macrocosms which describe the world around us.

5. Networked World

Networks open up the world. Over the past four hundred years, the reach of the West extended to the globe, overtaking trade routes created first by eastern conquerors. From these explorations, we produced new medicines and technologies. Concomitant with this expansion came unfathomable genocide and a slave trade that spanned many continents and far too many centuries.

Despite the efforts of the Western World, it could only keep the effects of globalization to itself for so long. Roads can be traversed in either direction, and the network created by Western explorers, businesses, slave traders, and militaries eventually undermined or superseded the Western centers of power. In short order, the African slave trade in the Americas led to a rich exchange of knowledge of plants and medicines between Native Americans and Africans.

In Southern and Southeast Asia, trade routes set up by the Dutch East India Company unintentionally helped bolster economies and trade routes within Asia. Captains with the company, seeking extra profits, would illicitly trade goods between Asian cities. This created more tightly-knit internal cultural and economic networks than had existed before, and contributed to a global economy well beyond the reach of the Dutch East India Company.

In the 1960s, the U.S. military began funding what would later become the Internet, a global communication network which could transfer messages at unfathomable speeds. The infrastructure provided by this network would eventually become a tool for control and surveillance by governments around the world, as well as a distribution mechanism for fuel that could topple governments in the Middle East or spread state secrets in the United States. The very pervasiveness which makes the internet particularly effective in government surveillance is also what makes it especially dangerous to governments through sites like WikiLeaks.

In short, science and technology lay the groundwork for our networked world, and these networks can be great instruments of creation, or terrible conduits of destruction.

6. Macro Scale

So here we are, occupying this tiny mote of dust suspended in a sunbeam. In the grand scheme of things, how does any of this really matter? When we see ourselves from so great a distance, it’s as difficult to be enthralled by the Sistine Chapel as it is to be disgusted by the havoc we wreak upon our neighbors.

7. Meso Scale

But networks let us zoom in, they let us keep the global system in mind while examining the parts. Here, once again, we see Kansas, quite a bit closer than before. We see how we are situated in a national and international set of interconnections. These connections come in every form, from physical transportation to electronic communication. From this scale, wars and national borders are visible. Over time, cultural migration patterns and economic exchange become apparent. This scale shows us the networks which surround and are constructed by us.

And this is the scale which is seen by the NSA and the CIA, by Facebook and Google, by social scientists and internet engineers. Close enough to provide meaningful aggregations, but far enough that individual lives remain private and difficult to discern. This scale teaches us how epidemics spread, how minorities interact, how likely some city might be a target for the next big terrorist attack.

From here, though, it’s impossible to see the hundred hundred towns whose factories have closed down, leaving many unable to feed their families. It’s difficult to see the small but endless inequalities that leave women and minorities systematically underappreciated and exploited.

8. Micro Scale

We can zoom in further still, to Lawrence, Kansas, at a few hundred feet, and if we watch closely we can spot traffic patterns, couples holding hands, how the seasons affect people’s activities. This scale is better at betraying the features of communities, rather than societies.

But for tech companies, governments, and media distributors, it’s all-too-easy to miss the trees for the forest. When they look at the networks of our lives, they do so in aggregate. Indeed, privacy standards dictate that the individual be suppressed in favor of the community, of the statistical average that can deliver the right sort of advertisement to the right sort of customer, without ever learning the personal details of that customer.

This strange mix of individual personalization and impersonal aggregation drives quite a bit of the modern world. Carefully micro-targeted campaigning is credited with President Barack Obama’s recent presidential victories, driven by a hundred data scientists in an office in Chicago in lieu of thousands of door-to-door canvassers. Three hundred million individually crafted advertisements without ever having to look a voter in the face.

9. Target

And this mix of impersonal and individual is how Target makes its way into the wombs of its shoppers. We saw this play out a few years ago when a furious father went to complain to a Target store manager. Why, he asked the manager, is my high school daughter getting ads for maternity products in the mail? After returning home, the father spoke to his daughter to discover she was, indeed, pregnant. How did this happen? How’d Target know?

It turns out, Target uses credit cards, phone numbers, and e-mail addresses to give every customer a unique ID. Target discovered a list of about 25 products that, if purchased in a certain sequence by a single customer, are pretty indicative of a customer’s pregnancy. What’s more, the date of the purchased products can pretty accurately predict the date the baby would be delivered. Unscented lotion, magnesium, cotton balls, and washcloths are all on that list.

When Target’s system learns one of its customers is probably pregnant, it does its best to profit from that pregnancy, sending appropriately timed coupons for diapers and bottles. This backfired, creeping out customers and invading their privacy, as with the angry father who didn’t know his daughter was pregnant. To remedy the situation, rather than ending the personalized advertising, Target began interspersing ads for unrelated products with personalized products in order to trick the customer into thinking the ads were random or general. All the while, a good portion of the coupons in the book were still targeted directly towards those customers.

One Target executive told a New York Times reporter:

We found out that as long as a pregnant woman thinks she hasn’t been spied on, she’ll use the coupons. She just assumes that everyone else on her block got the same mailer for diapers and cribs. As long as we don’t spook her, it works.

The scheme did work, raising Target’s profits by billions of dollars by subtly matching their customers with coupons they were likely to use. 

10. Presidential Elections

Political campaigns have also enjoyed the successes of microtargeting. President Bush’s 2004 campaign pioneered this technique, targeting socially conservative Democratic voters in key states in order to either convince them not to vote, or to push them over the line to vote Republican. This strategy is credited with increasing the pro-Bush African American vote in Ohio from 9% in 2000 to 16% in 2004, appealing to anti-gay marriage sentiments and other conservative values.

The strategy is also celebrated for President Obama’s 2008 and especially 2012 campaigns, where his staff maintained a connected and thorough database of a large portion of American voters. They knew, for instance, that people who drink Dr. Pepper, watch the Golf Channel, drive a Land Rover, and eat at Cracker Barrel are both very likely to vote, and very unlikely to vote Democratic. These insights led to the right political ads targeted exactly at those they were most likely to sway.

So what do these examples have to do with networks? These examples utilize, after all, the same sorts of statistical tools that have always been available to us, only with a bit more data and power to target individuals thrown in the mix.

It turns out that networks are the next logical step in the process of micronudging, the mass targeting of individuals based on their personal lives in order to influence them toward some specific action.

In 2010, a Facebook study, piggy-backing on social networks, influenced about 340,000 additional people to vote in the US mid-term elections. A team of social scientists at UCSD experimented on 61 million Facebook users in order to test the influence of social networks on political action.

A portion of American Facebook users who logged in on election day were given the ability to press an “I voted” button, which shared the fact that they voted with their friends. Facebook then presented users with pictures of their friends who voted, and it turned out that these messages increased voter turnout by about 0.4%. Further, those who saw that close friends had voted were more likely to go out and vote than those who had seen that distant friends voted. The study was framed as “voting contagion” – how well does the action of voting spread among close friends?

This large increase in voter turnout was prompted by a single message on Facebook spread among a relatively small subset of its users. Imagine that, instead of a research question, the study was driven by a particular political campaign. Or, instead, imagine that Facebook itself had some political agenda – it’s not too absurd a notion to imagine.

11. Blackout

In fact, on January 18, 2012, a great portion of the social web rallied under a single political agenda: an internet blackout. In protest of two proposed U.S. congressional laws that threatened freedom of speech on the Web, SOPA and PIPA, 115,000 websites voluntarily blacked out their homepages, replacing them with pleas to petition Congress to stop the bills.

Reddit, Wikipedia, Google, Mozilla, Twitter, Flickr, and others asked their users to petition Congress, and it worked. Over 3 million people emailed their congressional representatives directly, another million sent a pre-written message to Congress from the Electronic Frontier Foundation, a Google petition reached 4.5 million signatures, and lawmakers ultimately collected the names of over 14 million people who protested the bills. Unsurprisingly, the bills were never put up to a vote.

These techniques are increasingly being leveraged to influence consumers and voters into acting in line with whatever campaign is at hand. Social networks and the social web, especially, are becoming tools for advertisers and politicians.

12a. Facebook and Social Guessing

In 2010, Tim Tangherlini invited a few dozen computer scientists, social scientists, and humanists to a two-week intensive NEH-funded summer workshop on network analysis for the humanities. Math camp for nerds, we called it. The environment was electric with potential projects and collaborations, and I’d argue it was this workshop that really brought network analysis to the humanities in force.

During the course of the workshop, one speaker sticks out in my memory: a data scientist at Facebook. He reached the podium, like so many did during those two weeks, and described the amazing feats they were able to perform using basic linguistic and network analyses. We can accurately predict your gender and race, he claimed, regardless of whether you’ve told us. We can learn your political leanings, your sexuality, your favorite band.

Much like most talks from computer scientists at the event, the purpose was to show off the power of large-scale network analysis when applied to people, and didn’t focus much on its application. The speaker did note, however, that they used these measurements to effectively advertise to their users; electronics vendors could advertise to wealthy 20-somethings; politicians could target impoverished African Americans in key swing states.

It was a few throw-away lines in the presentation, but the force of the ensuing questions revolved around those specifically. How can you do this without any sort of IRB oversight? What about the ethics of all this? The Facebook scientist’s responses were telling: we’re not doing research, we’re just running a business.

And of course, Facebook isn’t the only business doing this. The Twitter analytics dashboard allows you to see your male-to-female follower ratio, even though users are never asked their gender. Gender is guessed based on features of language and interactions, and they claim around 90% accuracy.

Google, when it targets ads towards you as a user, makes some predictions based on your search activity. Google guessed, without my telling it, that I am a 25-34 year old male who speaks English and is interested in, among other things, Air Travel, Physics, Comics, Outdoors, and Books. Pretty spot-on.

12b. Facebook and Emotional Contagion

And, as we saw with the Facebook voting study, social web services are not merely capable of learning about you; they are capable of influencing your actions. Recently, this ethical question has pushed its way into the public eye in the form of another Facebook study, this one about “emotional contagion.”

A team of researchers and Facebook data scientists collaborated to learn the extent to which emotions spread through a social network. They selectively filtered the messages seen by about 700,000 Facebook users, making sure that some users only saw emotionally positive posts by their friends, and others only saw emotionally negative posts. After some time passed, they showed that users who were presented with positive posts tended to post positive updates, and those presented with negative posts tended to post negative updates.

The study stirred up quite the controversy, and for a number of reasons. I’ll unpack a few of them:

First of all, there were worries about the ethics of consent. How could Facebook do an emotional study of 700,000 users without first getting their consent? The EULA that everyone clicks through when signing up for Facebook only has one line saying that data may be used for research purposes, and even that line didn’t appear until several months after the study occurred.

A related issue raised was one of IRB approval: how could the editors at PNAS have approved the study given that the study took place under Facebook’s watch, without an external Institutional Review Board? Indeed, the university-affiliated researchers did not need to get approval, because the data were gathered before they ever touched the study. The counter-argument was that, well, Facebook conducts these sorts of studies all the time for the purposes of testing advertisements or interface changes, as does every other company, so what’s the problem?

A third issue discussed was one of repercussions: if the study showed that Facebook could genuinely influence people’s emotions, did anyone in the study physically harm themselves as a result of being shown a primarily negative newsfeed? Should Facebook be allowed to wield this kind of influence? Should they be required to disclose such information to their users?

The controversy spread far and wide, though I believe for the wrong reasons, which I’ll explain shortly. Social commentators decried the lack of consent, arguing that PNAS shouldn’t have published the paper without proper IRB approval. On the other side, social scientists argued the Facebook backlash was antiscience and would cause more harm than good. Both sides made valid points.

One well-known social scientist noted that the Age of Exploration, when scientists finally started exploring the further reaches of the Americas and Africa, was attacked by poets and philosophers and intellectuals as being dangerous and unethical. But, he argued, did not that exploration bring us new wonders? Miracle medicines and great insights about the world and our place in it?

I call bullshit. You’d be hard-pressed to find a period more rife with slavery and genocide and other horrible breaches of human decency than that Age of Exploration. We can’t sacrifice human decency in the name of progress. On the flip-side, though, we can’t sacrifice progress for the tiniest fears of misconduct. We must proceed with due diligence to ethics without being crippled by inefficacy.

But this is all a red herring. The issue here isn’t whether and to what extent these activities are ethical science, but to what extent they are ethical, period, and if they aren’t, what we should do about it. We can’t have one set of ethical standards for researchers, and another for businesses, but that’s what many of the arguments in recent months have boiled down to. Essentially, it was argued, Facebook does this all the time. It’s something called A/B testing: they make changes for some users and not others, and depending on how the users react, they change the site accordingly. It’s standard practice in web development.
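For readers who haven’t seen A/B testing up close, here is a minimal sketch of the idea in Python. It is not Facebook’s code or anyone’s production system: the experiment name, the hashing scheme, and the function names are all assumptions made up for illustration. Each user is quietly placed in one of two groups, and the site then compares how often each group reacts.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "newsfeed-test") -> str:
    """Deterministically place a user in group 'A' or 'B'.

    Hashing the (hypothetical) experiment name together with the user ID
    means the same user always lands in the same group, without the site
    having to store the assignment anywhere.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def response_rates(events):
    """Compare how often users in each group reacted.

    `events` is an iterable of (user_id, reacted) pairs, where `reacted`
    is True if the user did whatever the experimenter was hoping for.
    """
    totals = {"A": [0, 0], "B": [0, 0]}  # [users seen, users who reacted]
    for user_id, reacted in events:
        group = totals[assign_variant(user_id)]
        group[0] += 1
        group[1] += int(reacted)
    return {
        variant: (hits / seen if seen else 0.0)
        for variant, (seen, hits) in totals.items()
    }
```

Note that nothing in this sketch asks the user anything. The assignment is invisible to the person being assigned, which is exactly why the consent question is so hard to pin down.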

13. An FDA/FTC for Data?

It is surprising, then, that the crux of the anger revolved around the published research. Not that Facebook shouldn’t do A/B testing, but that researchers shouldn’t be allowed to publish on it. This seems to be the exact opposite of what should be happening: if indeed every major web company practices these methods already, then scholarly research on how such practices can sway emotions or voting practices is exactly what we need. We must bring these practices to light, in ways the public can understand, and decide as a society whether they cross ethical boundaries. A similar discussion occurred during the early decades of the 20th century, when the FDA and FTC were formed, in part, to prevent false advertising of snake oils and foods and other products.

We are at the cusp of a new era. The mix of big data, social networks, media companies, content creators, government surveillance, corporate advertising, and ubiquitous computing is a perfect storm for intense influence both subtle and far-reaching. Algorithmic nudging has the power to sell products, win elections, topple governments, and oppress a people, depending on how it is wielded and by whom. We have seen this work from the bottom-up, in Occupy Wall Street, the Revolutions in the Middle East, and the ALS Ice-Bucket Challenge, and from the top-down in recent presidential campaigns, Facebook studies, and coordinated efforts to preserve net neutrality. And these have been works of non-experts: people new to this technology, scrambling in the dark to develop the methods as they are deployed. As we begin to learn more about network-based control and influence, these examples will multiply in number and audacity.

14. Surveillance

And this story leaves out one of the biggest players of all: government. When Edward Snowden leaked the details of classified NSA surveillance programs, the world was shocked at the government’s interest in and capacity for omniscience. Data scientists, on the other hand, were mostly surprised that people didn’t realize this was happening. If the technology is there, you can bet it will be used.

And so here, in the NSA’s $1.5 billion data center in Utah, are the private phone calls, parking receipts, emails, and Google searches of millions of American citizens. It stores a few exabytes of our data, over a billion gigabytes and roughly equivalent to a hundred thousand times the size of the Library of Congress. More than enough space, really.

The humanities have played some role in this complex machine. During the Cold War, the U.S. government covertly supported artists and authors to create cultural works which would spread American influence abroad and improve American sentiment at home.

Today the landscape looks a bit different. For the last few years DARPA, the research branch of the U.S. Department of Defense, has been funding research and hosting conferences in what they call “Narrative Networks.” Computer scientists, statisticians, linguists, folklorists, and literary scholars have come together to discuss how ideas spread and, possibly, how to inject certain sentiments within specific communities. It’s a bit like the science of memes, or of propaganda.

Beyond this initiative, DARPA funds have gone toward several humanities-supported projects to develop actionable plans for the U.S. military. One project, for example, creates as-complete-as-possible simulations of cultures overseas, which can model how groups might react to the dropping of bombs or the spread of propaganda. These models can be used to aid in the decision-making processes of officers making life-and-death decisions on behalf of troops, enemies, and foreign citizens. Unsurprisingly, these initiatives, as well as NSA surveillance at home, all rely heavily on network analysis.

In fact, when the news broke on how Osama bin Laden and Saddam Hussein were tracked down via network analysis, some of my family called me after reading the newspapers claiming “we finally understand what you do!” This wasn’t the reaction I was hoping for.

In short, the world is changing incredibly rapidly, in large part driven by the availability of data, network science and statistics, and the ever-increasing role of technology in our lives. Are these corporate, political, and grassroots efforts overstepping their bounds? We honestly don’t know. We are only beginning to have sustained, public discussions about the new role of technology in society, and the public rarely has enough access to information to make informed decisions. Meanwhile, media and web companies may be forgiven for overstepping ethical boundaries, as our culture hasn’t quite gotten around to drawing those boundaries yet.

15. The Humanities’ Place

This is where the humanities come in – not because we have some monopoly on ethics (goodness knows the way we treat our adjuncts is proof we do not) – but because we are uniquely suited to the small scale. To close reading. While what often sets the digital humanities apart from its analog counterpart is the distant reading, the macroanalysis, what sets us all apart is our unwillingness to stray too far from the source. We intersperse the distant with the close, attempting to reintroduce the individual into the aggregate.

Network analysis, not coincidentally, is particularly suited to this endeavor. While recent efforts in sociophysics have stressed the importance of the grand scale, let us not forget that network theory was built on the tiniest of pieces in psychology and sociology, used as a tool to explore individuals and their personal relationships. In the intervening years, all manner of methods have been created to bridge macro and micro, from Granovetter’s theory of weak ties to Milgram’s small worlds, and the way in which people navigate the networks they find themselves in. Networks work at every scale, situating the macro against the meso against the micro.

But we find ourselves in a world that does not adequately utilize this feature of networks, and is increasingly making decisions based on convenience and money and politics and power without taking the human factor into consideration. And it’s not particularly surprising: it’s easy, in the world of exabytes of data, to lose the trees for the forest.

This is not a humanities problem. It is not a network scientist problem. It is not a question of the ethics of research, but of the ethics of everyday life. Everyone is a network scientist. From Twitter users to newscasters, the boundary between people who consume and people who are aware of and influence the global social network is blurring, and we need to deal with that. We must collaborate with industries, governments, and publics to become ethical stewards of this networked world we find ourselves in.

16. Big and Small

Your challenge, as researchers on the forefront of network analysis and the humanities, is to tie the very distant to the very close. To do the research and outreach that is needed to make companies, governments, and the public aware of how perturbations of the great mobile that is our society affect each individual piece.

We have a number of routes available to us, in this respect. The first is in basic research: the sort that got those Facebook study authors in such hot water. We need to learn and communicate the ways in which pervasive surveillance and algorithmic influence can affect people’s lives and steer societies.

A second path towards influencing an international discussion is in the development of new methods that highlight the place of the individual in the larger network. We seem to have a critical mass of humanists collaborating with or becoming computer scientists, and this presents a perfect opportunity to create algorithms which highlight a node’s uniqueness, rather than its similarity.
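To make that less abstract, here is one toy way such an algorithm could look, sketched in Python. This is an assumption-laden illustration rather than an established method: the `uniqueness` score and the tiny example network are invented for this purpose. It scores each node by how little its set of neighbors overlaps with anyone else’s, so a high score means nobody else in the network is connected quite the way you are.

```python
def jaccard(a: set, b: set) -> float:
    """Share of neighbors two nodes have in common (1.0 = identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def uniqueness(graph: dict) -> dict:
    """Score each node by how unlike every other node its connections are.

    `graph` maps each node to the set of its neighbors. A node's score is
    one minus its highest neighbor-set overlap with any other node, so a
    node with an exact structural twin scores 0.0.
    """
    scores = {}
    for node, neighbors in graph.items():
        overlaps = [jaccard(neighbors, graph[other])
                    for other in graph if other != node]
        scores[node] = 1.0 - max(overlaps, default=0.0)
    return scores


# A hypothetical four-person network.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"cat"},
}
print(uniqueness(friends))
```

In this made-up network, "cat" comes out as the most distinctive node, because she is the only bridge to "dan". A similarity-driven method would fold her in with her friends; a uniqueness-driven one puts her front and center.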

Another step to take is one of public engagement that extends beyond the academy, and takes place online, in newspapers or essays, in interviews, in the creation of tools or museum exhibits. The MIT Media Lab, for example, created a tool after the Snowden leaks that allows users to download their email metadata to reveal the networks they form. The tool was a fantastic example of a way to show the public exactly what “simply metadata” can reveal about a person, and its viral spread was a testament to its effectiveness. Mike Widner of Stanford called for exactly this sort of engagement from digital humanists a few years ago, and it is remarkable how little that call has been heeded.

Pedagogy is a fourth option. While people cry that the humanities are dying, every student in the country will have taken many humanities-oriented courses by the time they graduate. These courses, ostensibly, teach them about what it means to be human in our complex world. Alongside the history, the literature, the art, let’s teach what it means to be part of a global network, constantly contributing to and being affected by its shadow.

With luck, reconnecting the big with the small will hasten a national discussion of the ethical norms of big data and network analysis. This could result in new government regulating agencies, ethical standards for media companies, or changes in ways people interact with and behave on the social web.

17. Going Forward

When you zoom out far enough, everything looks the same. Occupy Wall Street; Ferguson Riots; the ALS Ice Bucket Challenge; the Iranian Revolution. They’re all just grassroots contagion effects across a social network. Rhetorically, presenting everything as a massive network is the same as photographing the earth from four billion miles away: beautiful, sobering, and homogenizing. I challenge you to compare network visualizations of Ferguson Tweets with the ALS Ice Bucket Challenge, and see if you can make out any differences. I couldn’t. We need to zoom in to make meaning.

The challenge of network analysis in the humanities is to bring our close reading perspectives to the distant view, so media companies and governments don’t see everyone as just some statistic, some statistical blip floating on this pale blue dot.

I will end as I began, with a quote from Carl Sagan, reflecting on a time gone by but every bit as relevant for the moment we face today:

I know that science and technology are not just cornucopias pouring good deeds out into the world. Scientists not only conceived nuclear weapons; they also took political leaders by the lapels, arguing that their nation — whichever it happened to be — had to have one first. … There’s a reason people are nervous about science and technology. And so the image of the mad scientist haunts our world—from Dr. Faust to Dr. Frankenstein to Dr. Strangelove to the white-coated loonies of Saturday morning children’s television. (All this doesn’t inspire budding scientists.) But there’s no way back. We can’t just conclude that science puts too much power into the hands of morally feeble technologists or corrupt, power-crazed politicians and decide to get rid of it. Advances in medicine and agriculture have saved more lives than have been lost in all the wars in history. Advances in transportation, communication, and entertainment have transformed the world. The sword of science is double-edged. Rather, its awesome power forces on all of us, including politicians, a new responsibility — more attention to the long-term consequences of technology, a global and transgenerational perspective, an incentive to avoid easy appeals to nationalism and chauvinism. Mistakes are becoming too expensive.

Let us take Carl Sagan’s advice to heart. Amidst cries from commentators on the irrelevance of the humanities, it seems there is a large void which we are both well-suited and morally bound to fill. This is the path forward.

Thank you.


Thanks to Nickoal Eichmann and Elijah Meeks for editing & inspiration.

Stanford Musings

It’s official: I am Stanford’s new DH data scientist from May to August. What does that mean? I haven’t the foggiest idea – I think figuring that out is part of my job description. Over the next few months, I’ll be assisting a small platoon of Stanfordites with their networks, their visualizations, their data, and who knows, maybe their love lives. I’m reporting to the inimitable Glen Worthey and the indomitable Elijah Meeks, who will keep me on the straight and narrow. I’ll also be blogging, teaching workshops, writing papers, and crunching numbers, all under the Stanford banner.

This announcement is on the heels of my recent trip to Stanford, and I have to say, I was incredibly impressed by the operation they had going there. The library has at least three branches under which DH projects occur, and of particular interest are the Academic Technology Specialists like Mike Widner. Half a dozen of them are embedded in different schools around campus, and they act as technology liaisons and researchers within those schools, supporting faculty projects, developing their own research, and just generally fostering a fantastic digital humanities presence on the Stanford campus.

Stanford! Did you know it’s actually “Leland Stanford Junior University”? Weird, right?

Then there’s Elijah Meeks and Karl Grossner. Do you know those TV shows where contestants vie for a fancy house from some team of super creative builders? They basically do that, except instead of offering cool new digs, they offer their impressive technical services for a few months. There’s also the Lit Lab, CESTA, the DH Focal Group, and probably a dozen other projects which do DH on campus in some way or another.

As far as I can tell, I’ll be just one more chaotic agent in this complex DH environment. Many of the big projects going on at Stanford rely in some way on networks, and I’m going to try to bring them all together and set agendas for how they can best utilize and analyze the networks at hand. I’ll also design some tools that’ll make it easier for future network-y projects to get off the ground. There’s also a bunch of Famous Network Scientists who operate out of Stanford, and I plan on nurturing some collaborations between them, the DH community, and some humanities-curious tenants of Silicon Valley.

It will be interesting to see how this position unfolds. As far as I’m aware, the “resident data scientist” model for DH is an untried one at any university, and I’m lucky and honored that Stanford has decided to take a chance on such a new position with me at the helm. If this proves successful, it will provide even more proof that the role of libraries in fostering DH on campus can be a powerful one. Of course there’s also the chance I could fail spectacularly, but in true DH tradition, I believe such a public failure would also be a worthy outcome. If the process works, great; if not, we’ll know what to fix for the next try.

Barriers to Scholarship & Iterative Writing

This post is mostly just thinking out loud, musing about two related barriers to scholarship: a stigma related to self-plagiarism, and various copyright concerns. It includes a potential way to get past them.

Self-Plagiarism

When Jonah Lehrer’s plagiarism scandal first broke, it sounded a bit silly. Lehrer, it turned out, had taken some sentences he’d used in earlier articles, and reused them in a few New Yorker blog posts. Without citing himself. Oh no, I thought. Surely, this represents the height of modern journalistic moral depravity.

Of course, later it was revealed that he’d bent facts, and plagiarized from others without reference, and these were all legitimately upsetting. And plagiarizing himself without reference was mildly annoying, though certainly not something that should have attracted national media attention. But it raises an interesting question: why is self-plagiarism wrong? And it’s as wrong in academia as it is in journalism.

Lehrer chart from Slate [via].
I can’t speak for journalists (though Alberto Cairo can, and he lists some of the good reasons why non-referenced self-plagiarism is bad and links to not one, but two articles about it), but for academia, the reasons behind the wrongness seem pretty clear.

  1. It’s wrong to directly lift from any source without adequate citation. This only applies to non-cited self-plagiarism, obviously.
  2. It’s wrong to double-dip. The currency of the academy is publications / CV lines, and if you reuse work to fill your CV, you’re getting an unfair advantage.
  3. Confusion. Which version should people reference if you have so many versions of a similar work?
  4. Copyright. You just can’t reuse stuff, because your previous publishers own the copyright on your earlier work.

That about covers it. Let’s pretend academics always cite their own works (because, hell, it gives them more citations), so we can do away with #1. Regular readers will know my position on publisher-owned copyright, so I just won’t get into #4 here to save you my preaching. The others are a bit more difficult to write off, but before I go on to try to do that, I’d like to talk a bit about my own experience of self-plagiarism as a barrier to scholarship.

I was recently invited to speak at the Universal Decimal Classification seminar, where I presented on the history of trees as a visual metaphor for knowledge classification. It’s not exactly my research area, but it was such a fun subject, I’ve decided to write an article about it. The problem is, the proceedings of the UDC seminar were published, and about 50% of what I wanted to write is already sitting in a published proceedings that, let’s face it, not many people will ever read. And if I ever want to add to it, I have to change the already-published material significantly if I want to send it out again.

Since I presented, my thesis has changed slightly, I’ve added a good chunk more material, and I’ve fleshed out the theoretical underpinnings. I now have a pretty good article that’s ready to be sent out for peer review, but if I want to do that, I can’t just have a reference saying “half of this came from a published proceeding.” Well, I could, but apparently there’s a slight taboo against this. I was told to “be careful,” that I’d have to “rephrase” and “reword.” And, of course, I’d have to cite my earlier publication.

I imagine most of this comes from the fear of scholars double-dipping, or padding their CVs. Which is stupid. Good scholarship should come first, and our methods of scholarly attribution should mold themselves to it. Right now, scholarship is enslaved to the process of attribution and publication. It’s why we willingly donate our time and research to publishing articles, and then have our universities buy back our freely-given scholarship in expensive subscription packages, when we could just have the universities pay for the research upfront and then release it for free.

Copyright

The question of copyright is pretty clear: how much will the publisher charge if I want to reuse a significant portion of my work somewhere else? The publisher to which I refer, Ergon Verlag, I’ve heard is pretty lenient about such things, but what if I were reprinting from a different publisher?

There’s an additional, more external, concern about my materials. It’s a history of illustrations, and the manuscript itself contains 48 illustrations in all. If I want to use them in my article, for demonstrative purposes, I not only need to cite the original sources (of course), I need to get permission to use the illustrations from the publishers who scanned them – and this can be costly and time-consuming. I’ve priced a few of them so far, and they range from free to hundreds of dollars.

A Potential Solution – Iterative Writing

To recap, there are two things currently preventing me from sending out a decent piece of scholarship for peer-review:

  1. A taboo against self-plagiarism, which requires quite a bit of time for rewriting, permission from the original publisher to reuse material, and/or the dissolution of such a taboo.
  2. The cost and time commitment of tracking down copyright holders to get permission to reproduce illustrations.

I believe the first issue is largely a historical artifact of print-based media. Scholars have this sense of citing the source because, for hundreds of years, nearly every print of a single text was largely identical. Sure, there were occasionally a handful of editions, some small textual changes, some page number changes, but citing a text could easily be done, and so we developed a huge infrastructure around citations and publications that exists to this day. It was costly and difficult to change a printed text, and so it wasn’t done often, and now our scholarly practices are based around the idea that scholarly materials have to be permanent and unchanging, finished, if they are to enter into the canon and become citeable sources.

In the age of Wikipedia, this is a weird idea. Texts grow organically, they change, they revert. Blog posts get updated. A scholarly article, though, is relatively constant, even those in online-only publications. One major exception is ArXiv-like pre-print repositories, which allow an article to go through several versions before the final one goes off to print. But generally, once the final version goes to print, no further changes are made.

The reasons behind this seem logical: it’s the way we’ve always done it, so why change a good thing? It’s hard to cite something that’s constantly changing; how do we know the version we cited will be preserved?

In an age of cheap storage and easily tracked changes, this really shouldn’t be a concern. Wikipedia does this very well: you can easily cite the version of an article from a specific date and, if you want, easily see how the article changed between then and any other date.

Changes between versions of the Wikipedia entry on History.

This would be more difficult to implement in academia because article hosting isn’t centralized. It’s difficult to be certain that the URL hosting a journal article now will persist for 50 years, because of both ownership and design changes, and it’s difficult to trust that whoever owns the article or the site won’t change the content without preserving every single version, or at least a detailed description of the changes they’ve made.

There’s an easy solution: don’t just reference everything you cite, embed everything you cite. If you cite a picture, include the picture. If you cite a book, include the book. If you cite an article, include the article. Storage is cheap: if your book cites a thousand sources, and includes a copy of every single one, it’ll be at most a gigabyte. Probably, it would be a good deal smaller. That way, if the material changes down the line, everyone reading your research will still be able to refer to the original material. Further, because you include a full reference, people can go and look the material up to see if it has changed or been updated in the time since you cited it.

Of course, this idea can’t work – copyright wouldn’t let it. But again, this is a situation where the industry of academia is getting in the way of potential improvements to the way scholarship can work.

The important thing, though, is that self-plagiarization would become a somewhat irrelevant concept. Want to write more about what you wrote before? Just iterate your article. Add some new references, a paragraph here or there, change the thesis slightly. Make sure to keep a log of all your changes.
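
To make the idea a bit more concrete, here is a minimal sketch in Python of what an iterated article with a change log might look like. Everything in it is my own illustration rather than an existing system: the class names, the short content hash used to pin a citation to one exact version, and the example text and dates are all hypothetical.

```python
import difflib
import hashlib
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Version:
    text: str
    released: date
    note: str  # human-readable log entry describing what changed

    @property
    def digest(self) -> str:
        # Short content hash; a citation could quote it to pin an exact version.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


@dataclass
class IterativeArticle:
    title: str
    versions: list = field(default_factory=list)

    def iterate(self, text: str, released: date, note: str) -> Version:
        """Append a new version instead of overwriting the old one."""
        version = Version(text, released, note)
        self.versions.append(version)
        return version

    def as_of(self, when: date) -> Version:
        """Return the latest version released on or before `when`.

        Assumes versions were appended in chronological order.
        """
        candidates = [v for v in self.versions if v.released <= when]
        if not candidates:
            raise LookupError(f"no version of {self.title!r} existed on {when}")
        return candidates[-1]

    def changes(self, old: Version, new: Version) -> list:
        """Line-by-line diff between two versions, labeled by their hashes."""
        return list(difflib.unified_diff(
            old.text.splitlines(), new.text.splitlines(),
            fromfile=old.digest, tofile=new.digest, lineterm="",
        ))

    def changelog(self) -> list:
        """One log line per version: date, hash, and the author's note."""
        return [f"{v.released.isoformat()} {v.digest} {v.note}" for v in self.versions]


# Hypothetical usage: a proceedings version, later iterated with new material.
article = IterativeArticle("Example article")
v1 = article.iterate("Trees are a visual metaphor.", date(2012, 1, 1), "proceedings version")
v2 = article.iterate("Trees are a visual metaphor.\nNew theoretical section.",
                     date(2012, 6, 1), "added theory")
cited = article.as_of(date(2012, 3, 1))  # what a reader citing in March would have seen
```

The design choice here is simply that nothing is ever overwritten: a citation names a date or a hash, and anyone can recover that exact text or diff it against what came later.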

I don’t know if this is a good solution, but it’s one of many improvements to scholarship – or at least, removals of barriers to publishing interesting things in a timely and inexpensive fashion – that are currently impossible because of copyright concerns and institutional barriers to change. Cameron Neylon, from PLOS, recently discussed how copyright put up some barriers to his own interesting ideas. Academia is not a nimble beast, and because of it, we are stuck with a lot of scholarly practices which are, in part, due to the constraints of old media.

In short: academic writing is tough. There are ways it could be easier, that would allow good scholarship to flow more freely, but we are constrained by path dependency from choices we made hundreds of years ago. It’s time to be a bit more flexible and be more willing to try out new ideas. This isn’t anywhere near a novel concept on my part, but it’s worth repeating.

The last big barrier to self-plagiarism, double dipping to pad one’s CV, still seems tricky to get past. I’m not thrilled with the way we currently assess scholarship, and “CV size” is just one of the things I don’t like about it, but I don’t have any particularly clever fixes on that end.