The Index of Digital Humanities Conferences

Check out The Index of Digital Humanities Conferences, the largest extant collection of DH conference metadata.

The Data

The Index focuses on the flagship ADHO conference, but encompasses other events as well. As of this release, it indexes 489 conferences from 39 countries dating back to 1960. We entered individual work metadata for 59 of those conferences, including titles, keywords, authorships, institutional affiliations, and so on. In all, there are 7,082 conference presentations by 8,392 authors, hailing from 1,830 institutions and 86 countries.

The Index of Digital Humanities Conferences homepage.

The conference where/when data was crowdsourced through a Google Sheet and Twitter. Conference presentation metadata comes from a variety of public sources. Recent ADHO conference metadata generally comes from XML files on ADHO’s GitHub page, and earlier ADHO / ACH / ALLC conference metadata was entered by hand from old websites, PDFs, listservs, and printed conference programs contributed by Joe Rudman (ACH Treasurer 1985-1989). Non-ADHO conference data was entered by hand from public programs (usually PDFs).

To ease browsing and analysis, we cleaned and merged as much as we could. That means connecting “Cambridge University” with “University of Cambridge” and the occasional “Camridge University”, as well as “Scott B. Weingart” with “Scott Weingart”.

When possible, we also indexed the full texts of conference abstracts / program entries. Currently, the full text is only used to power the search engine and is generally not visible on conference work pages. We hope to be able to display full text in the coming months, as we secure permission to republish it from the original rights holders.

The attribute assignment model diagram for the Index of Digital Humanities Conferences.

The Good

Julianne Nyhan and Andrew Flinn recently wrote

A crucial obstacle to the writing of histories of the field is that much of DH’s archival evidence has either not been preserved or is held by individuals (and so remains ‘hidden’ unless one can discover its existence and secure approval and the means to access it).

Nyhan & Flinn, 2016

This holds as true for DH conferences as for the sources Nyhan & Flinn were working with. There’s no single public archive for physical conference programs, most old conference websites no longer exist (often even absent from the Internet Archive), and even ADHO’s digital records are spread out across many sources or locked in byzantine and private ConfTool data dumps.

Ours isn’t the first attempt to put a DH conference database together, but it is the most extensive. It represents eight years of work by many collaborators and contributors, and builds off those earlier attempts. And it’s all open for anyone to use: anyone can download the data to browse or analyze.

In making this corner of our community’s history more accessible, we hope to help fill the archival gap pointed out by Nyhan and Flinn.

The Bad

This was always a passion project: never funded, nor supported by any scholarly organization. We work on it in empty moments, usually months apart. We don’t have resources for additional quality control, nor time to implement all the features we want on the website.

Errors are rampant. We made mistakes entering data, we made mistakes cleaning and merging data, and we inherited mistakes made by conference program editors or even, occasionally, authors themselves. Common issues include merging two entries that should be separate (John Smith of Florida vs. John Smith of Maryland, not the same person), or not merging two entries that should be connected (J.B. Smith vs. John Smith, the same person).

If you spot an error, please reach out to me, and I’ll do my best to fix it in a timely fashion.

For a dataset that spent its first six years as a hilariously complicated Excel spreadsheet shared on Dropbox, the web interface is amazing. It solves so many problems. But there’s still a lot missing, because we simply don’t have time to build it. Faceting works strangely, we’re missing a lot of useful search interfaces, and we just don’t have the infrastructure needed to turn this into a crowdsourcing project.

The Ugly

Errors in data cleaning and merging can be problematic, but they’re generally not show-stopping. Unfortunately, the data in the Index does have a few show-stopping issues; rather than keep everything private until we can fix them all, we’re releasing it publicly in the hope the community can help.

When merging people, unless we’re absolutely certain two people with different names are the same, we won’t connect them in the database. That means a Jane Smith who changes her name to Jane Doe after getting married will not be merged. The issue disproportionately affects women (who are more likely to change names during their careers), and as Jessica Otis and I point out, will put those women at a disadvantage in any eventual analysis.

The merging issue also affects people who for whatever reason (often gender/identity transition) decide to change their names. To complicate matters further, many who change their names do not want their birth name / deadname shared on the web, but since we don’t know that, their old names remain visible within our database.

Our data entry and cleaning gets worse the further we (the data collectors) get from our cultural comfort zones. This became a problem in the first and last name fields; figuring out what went where proved particularly difficult for us with respect to Hispanic and Indian authors. We reached out to friends for help, and used the separations entered by authors themselves when XML was available, but errors still abound.

Additionally, our department/institution database ontology (a strict hierarchy) fails for many Italian and French research units, which are often shared across many institutions. Italian and French affiliations are frankly in an abysmal state, and we’d appreciate input to help us untangle that mess.

Sir Not Appearing In This Film

The most apparent absence in the database and website is the full text of presentation abstracts (which are sometimes the entire conference paper).

We have full text for 75% of all works in the database (5,301 of them, to be exact), but because we either don’t know their copyright status or don’t have explicit permission to publish them, they are currently invisible. Searching in the works page will search through the full text of the abstracts, and present works with relevant hits, but visitors will have to find other sources to view the abstracts themselves. We’re working to secure permissions to share what we have, but it may take some time.

When I first started this project back in 2012, it was based on submissions to the DH conference, which I collected by scraping (without permission) the semi-private website for reviewers to select which submissions they were most qualified to review. I continued this for several years, comparing submitted abstracts to what made it to the final program, adding collaborators along the way. For privacy reasons, that data isn’t included here.

My collaborators and I also began collecting author demographic data, using it to point out biases, absences, and related issues of equity and inclusion in the DH conference community. Though reductive, the data served its purpose. Following the work of Miriam Posner and conversations with Shack Hackney, however, we believe the potential for harm in making demographic data part of this public database would outweigh any potential benefits.

Such demographic data would also likely put us at odds with GDPR. It’s worth pointing out that ADHO is keeping its distance from this project because it is worried about potential GDPR ramifications even without demographic data, which is understandable. From several sources (e.g., 1, 2, 3), it seems this is an “archive in the public interest” and thus doesn’t violate GDPR; however, I’m not a lawyer and I suppose anything can happen.

In the spirit of GDPR and the general “right to be forgotten”, we will happily take down any personal data for any reason. Just reach out to me if you’d like your materials to be removed.

Roll Credits

As alluded to earlier, this was a group effort over eight years. Nickoal Eichmann-Kalwara has been my partner in crime (sometimes literally, but only small crimes) for most of it. We’ve put in countless hours guiding the project, entering and cleaning data, and strong-arming friends into becoming collaborators. Jeana Jorgensen contributed her inimitable expertise and guidance for many years. Matt Lincoln single-handedly turned our janky spreadsheets into an actual database and website.

There are 55 additional names on our credits page, and every one of them is worth mention.

Appendix: Conferences with Presentation Metadata

Although there are 489 conferences in the Index, only 59 of them currently include presentation metadata. They are presented below, organized by conference series when applicable. Some series overlap (e.g., ACH/ALLC overlaps one year with ADHO), so the numbers add up to more than 59.

Conferences that aren’t part of a series

  • 1964 Conference on the Use of Computers in Humanistic Research (1)
  • 1964 Literary Data Processing Conference (1)
  • 1968 Computers and their potential applications in museums (1)
  • 1969 IBM Symposium on Introducing the Computer into the Humanities (1)
  • 1979 International Conference on Literary and Linguistic Computing (1)
  • 2019 The Arts, Knowledge, and Critique in the Digital Age in India: Addressing Challenges in the Digital Humanities (1)

Conferences that are part of a series

  • ADHO (the “DH” conference) (15)
  • ACH/ICCH (25)
  • ALLC/EADH (24)
  • Caribbean Digital (1)
  • Digital Humanities Alliance of India (DHAI) (1)
  • Digital Humanities Forum (1)
  • Encuentro de Humanistas Digitales (EHD) (1)
  • EADH (1)
  • ALLC IM/AGM (4)
  • Japanese Association for Digital Humanities Annual Conference (JADH) (1)
  • Joint ACH/ALLC (18)
  • KeystoneDH (1)

Seeking New Physics

Yesterday, OpenAI announced the results of a new experiment. 1 The AIs evolved to use tools to play hide-and-seek. More interestingly, they learned to exploit errors in the in-game physics engine to “cheat”, breaking physics to find their opponents.

Algorithms learning to exploit glitches to succeed at games are not uncommon. OpenAI also recently showed a video of an algorithm using a glitch in Sonic the Hedgehog to save Sonic from certain death. Victoria Krakovna has collected 50 or so similar examples, going back to 1998, explained in her blog post.

But what happens when algorithms learn to exploit actual physics? A quarter of a century ago, Adrian Thompson provided evidence of just that.

In An evolved circuit, intrinsic in silicon, entwined with physics (ICES 1996), Thompson used a genetic algorithm, quite similar to the ones used to find glitches in games, to teach a bunch of computer chips to discern the difference between sounds at two different pitches: 1 kHz (low-pitch) and 10 kHz (high-pitch).

Genetic algorithms work by evolution. You give them a task, and they keep trying different approaches that either work or don’t work. The ones that work well replicate themselves with slight variations, and this goes on for many generations until the algorithm learns an efficient solution.
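If you’d like to see the skeleton of the idea in code, here is a minimal sketch in Python (my own toy illustration, not anything Thompson or OpenAI actually ran): a population of bitstrings mutates, and the best scorers survive each generation.

import random

def evolve(fitness, length=16, pop_size=50, generations=200, mutation_rate=0.05):
    # Toy genetic algorithm: evolve bitstrings that score well on `fitness`.
    population = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better-scoring half of the population...
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        # ...and refill the other half with mutated copies of the survivors.
        children = [[bit ^ (random.random() < mutation_rate) for bit in parent]
                    for parent in survivors]
        population = survivors + children
    return max(population, key=fitness)

# Example task ("one-max"): evolve a bitstring with as many 1s as possible.
best = evolve(fitness=sum)
print(best, sum(best))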

Genetic algorithms are easier to understand in practice than in theory, so to understand a bit better, watch the below video by Johan Eliasson:

Thompson’s genetic algorithm worked the same way, but on a physical substrate. He trained a bunch of circuit boards over 5,000 generations to essentially reconfigure themselves into pitch-discerning machines. He got a bunch that worked really well, and really quickly. But when he tried to figure out how the efficient ones worked, he came back flummoxed.

Evolution inevitably leads to a lot of redundancies, mistakes, and other stupid design choices. It’s why we have vestigial organs like appendices, why flightless birds still have wings, and why we seem to have wide swaths of “junk” DNA. It’s not that these things are useless, per se, but in the randomness of natural selection, some things tend to stray.


So Thompson tried to excise the vestigial bits of circuitry that were no longer necessary, but happened to stick around after 5,000 algorithmic generations. He found the circuits that were disconnected from the circuitry that was actually solving the problem, and removed them.

After he removed the vestigial, disconnected circuitry, the most efficient algorithm slowed down considerably. Let me repeat that: the algorithms slowed down after Thompson removed vestigial parts of the circuit that had no actual effect on the algorithm. What was going on?

Thompson tried an experiment. He moved the efficient pitch-detecting algorithm to another identical circuit board. Same algorithm, identical circuit board.

The efficiency dropped by 7%.

What was happening, it turns out, is that the genetic algorithms actually learned to exploit the magnetic fields created when electrons flow through circuitry. The vestigial circuitry apparently boosted the performance of the algorithm just by existing next to the functional circuitry and emitting the appropriate physical signals.

When Thompson moved the algorithm to an identical board, the efficiency dropped because the boards weren’t actually identical, even though they were manufactured to be the same. Subtle physical differences in the circuitry actually contributed to the performance of the algorithm. Indeed, the algorithm evolved to exploit those differences.

Some scientists actually considered this a bit of a bummer. Oh no, they said, physics ruins our ability to get consistent results. But a bunch of others got quite excited.

For a while, I imagined the most exciting implications were for cognitive neuroscience.

Screenshot of a C. elegans simulation representing its general view, from “Towards a virtual C. elegans: A framework for simulation and visualization of the neuromuscular system in a 3D physical environment”.

One theory of how thinking works is that the brain is a vast network of neurons sending signals to each other, a bit like circuits. A branch of science called connectomics is founded on abstract models of these networks.

Thompson’s research is fascinating because, if the physical embodiment of electronic circuits winds up making such a big difference, imagine the importance of the physical embodiment of neurons in a brain. Evolution spent a long time building brains, and there’s a good chance their materiality, and the adjacency of one neuron to the next, is functionally meaningful. Indeed, this has been an active area of research for some time, alongside theories of embodied cognition.

We learn from Thompson’s work not to treat brains like abstract circuits, because we can’t even treat circuits like abstract circuits.

But now, I think there’s potentially an even more interesting implication of Thompson’s results, drawing a line from it to AIs learning to exploit physics for hide-and-seek. These experiments may pave the way for a new era of physics.

A New Physics

In the history of physics, practice occasionally outpaces theory. We build experiments expecting one result, but we see another instead. Physicists spend a while wondering what the hell is going on, and then sometimes invent new kinds of physics to deal with the anomalies. We have a theory of how the world works, and then we see things that don’t align with that theory, so we replace it. 2

For example, in the 1870s, scientists began experimenting with what would become known as a Crookes tube, which emits a mysterious light under certain conditions. Trying to figure out why led to the discovery of X-rays and other phenomena.

Crookes tube, via D-Kuru, https://en.wikipedia.org/wiki/Crookes_tube#/media/File:Crookes_tube_two_views.jpg

Genetic algorithms and their siblings are becoming terrifyingly powerful. And we’ve already seen they often reach their goals by exploiting peculiarities in physics and simulated physical environments. What happens when these algorithms are given more generous leave to control their physical substrate at very basic levels?

Let’s say we ask a set of embodied algorithms to race, to get from Point A to Point B in their little robot skeletons. Let’s also say we don’t just allow them control over levers and wheels and things, but the ability to reconfigure their own bodies and print new parts of any sort, down to the nano scale. 3

I suspect, after enough generations, these racing machines will start acting quite strangely. Maybe they’ll exploit quantum tunneling, superposition, or other weird subatomic principles. Maybe they’ll latch on to macroscopic complex particle interaction effects that scientists haven’t yet noticed. I have no idea.

Nobody has any idea. We’re poised to enter a brave new world of embodied algorithms ruthlessly, indiscriminately optimizing their way into strange physics.

In short, I wonder if physical AI bots will learn to exploit what we’d perceive to be glitches in physics. If that happens, and we start trying to figure out what the heck they’re doing to get from A to B so quickly, we may have to invent entirely new areas of physics to explain them.

Although this would be an interesting future, I’m not sure it would be a good one. It may, like the gray goo hypothesis people worried about with nano-engineering, have the potential to produce apocalyptic results. What if a thoughtless algorithm, experimenting with propulsion to optimize its speed, winds up accidentally setting off an uncontrollable nuclear reaction?

I don’t suspect that will happen, but I do seriously worry about what happens once the current class of learning algorithms everts into the physical world. Confined to the digital realm, we already see them wreaking havoc in unexpected ways. Recall, for example, the Amazon seller algorithms that artificially boost book prices to the point of absurdity, or the high-frequency stock trading algorithms that caused a financial panic. To say nothing of the ML models currently in use that disadvantage people by race, gender, and other classes.

The 2010 Flash Crash, via https://en.wikipedia.org/wiki/2010_Flash_Crash#/media/File:Flashcrash-2010.png

If allowed to proceed, and given the appropriate technological capacities, embodied algorithms would undoubtedly cause unintentional physical harm in their “value-free” hunt for optimization. They will cause harm in spite of any safety systems we put in place, for the same reason they may stumble on unexplored domains of physics: genetic algorithms are so very good at exploiting glitches or loopholes in systems.

I don’t know what the future holds. It’s entirely possible this is all off-base, and since I’m neither a physicist nor an algorithmic roboticist, I wouldn’t recommend putting any money behind this prediction.

All I know is that, in 1894, Albert Michelson famously said “it seems probable that most of the grand underlying principles have been firmly established and that further advances are to be sought chiefly in the rigorous application of these principles to all the phenomena which come under our notice.” And we all saw how that turned out.

With the recent results of the LHC and LIGO pretty much confirming what physicists already expected, at great expense, I’m betting the new frontier will come out of left field. I wouldn’t be so surprised if AI/ML opened the next set of floodgates.

Notes:

  1. You remember OpenAI. They’re the ones who recently trained a really good language model called GPT-2 and then didn’t release it on account of ethical concerns.
  2. The story is usually much more complicated than this, but that’s the best I can do in a paragraph.
  3. As far as I know this is currently implausible, but I bet it will feel more plausible in the not-too-distant future.

Releasing the Digital Humanities Literacy Guidebook

tl;dr There’s a new website called the Digital Humanities Literacy Guidebook, made for people just beginning their journeys into digital humanities, but hopefully still helpful for folks early in their career. It’s a crowdsourced resource that Carnegie Mellon University and the A.W. Mellon Foundation are offering to the world. We hope you will contribute!


Announcing the DHLG

Releasing a new website into the world is always a bit scary. Will people like it? Will they use it? Will they contribute?

The DHLG Logo

I hope the Digital Humanities Literacy Guidebook (DHLG) will help people. We made it for people who are still in their DH-curious phase, but I suspect it’ll also be helpful for folks in the first several years of their DH career.

The DHLG is an incomplete map of a territory that’s still actively growing. It doesn’t offer a definition of digital humanities; instead, it introduces by example. The site offers dozens of short videos describing DH projects, as well as lists of resources, a topical glossary, job advice, and other helpful entry-points.

We first designed the DHLG to serve our local community in Pittsburgh, for our new graduate students who aren’t yet sure if they’ll pursue DH. But a lot of this is nationally and internationally relevant, so although we built this for local needs, we’re presenting this as a gift to, well, everyone.

Crowdsourcing

In that spirit, we built the DHLG using Jekyll, on GitHub, for the community to contribute to as they like, for as long as I’m around to curate contributions. We welcome new videos and topical definitions, and edits to anything and everything, especially the lists of resources like grants and journals and recurring conferences. If you think a new page or navigation structure will help the community, get in touch, and we’ll figure out a way to make it work.

I realize there’s a bit of a barrier to entry. GitHub is more difficult to edit than a wiki. I’m sorry. This was the best solution we could come up with that would be inexpensive, unlikely to break, widely editable, and easily curatable. If you want to add something and you don’t know GitHub, reach out to me and I’ll try to help.

If you do know GitHub and you want to share editorial duties with me, dealing with pull requests and the like, please do let me know. I can use the help!

Credits

The DHLG is already crowdsourced. A lot of the lists come directly from Alex Gil and Dennis Tenen’s DHNotes page at Columbia, which is a very similar effort. We’ve slightly updated some links, and added some new information, but we rely heavily on them and their initial collaborators.

The job resources section includes editorial advice written by Matthew Hannah (Purdue University), slightly modified on import. Lauren Tilton (University of Richmond) offered the initial list of DH organizations. Zoe LeBlanc (Princeton University) collaborated on curating the hiring interviews, which are not yet released but will be soon. Hannah Alpert-Abrams (Program Officer in DH) wrote the entry on Black DH in the glossary.

What I’m trying to say is we’re standing on the shoulders of giants. Thank you everyone who has already contributed to the site. We certainly couldn’t have done it without you.

Who’s we? I steered the ship, and filled in all the cracks when necessary. Susan Grunewald (Pitt) wrote most of the original text and did the majority of content-related tasks, including fighting with markdown. Matt Lincoln (CMU) oversaw the technical development, which was implemented by the Agile Humanities Agency. From their team, Dean Irvine, Bill Kennedy, and Matt Milner were particularly helpful.

None of this would have been possible without the A.W. Mellon Foundation, who generously funded most DH that’s gone on at CMU over the last five years.

The DHLG is a gift from all of us to you. I hope it’s useful, and that you help us turn it from a Pittsburgh site that’s useful for others, into an internationally relevant resource for years to come.

Modeling Space Ships from Ocean Liners

Caveats

I love books, and libraries. In graduate school, most of my friends were studying to be librarians. A university library now partially pays my salary. Still, I don’t know much about them, being first and foremost a historian who uses computers. A lot of smart people have written a lot of smart things about libraries that I wish I’d read, or even knew about, but I’m still working on that one.

This is all to preface a blog post about a past and future of libraries, from my perspective, that has undoubtedly been articulated better elsewhere. Please read this post as it is intended: as a first public articulation of my thoughts that I hope my friends will read, so they can perhaps guide me to the interesting stuff already written on the subject, or explain to me why this is wrong-minded. This is not an expert opinion, and on that account I urge anyone taking it seriously as such to, erm, stop.

Context

We are closer in time to Jesus Christ than he was to the first documents written on paper-like materials. Our species has, collectively, had a lot of time to figure them out. After four thousand years, papery materials and the apparatuses around them have co-evolved into something akin to a natural ecosystem. Ink, page, index, spine, shelf, catalog, library building all fit together for the health of the collective.

This efficient system is the most consequential prosthesis for memory and discovery that humankind ever created. After paper and its cousins (papyrus, vellum, etc.) became external vessels for knowledge, their influence on religion, law, and science—on control and freedom—cannot be overestimated. 

The system’s success has as much to do with the materials themselves as the socio-technical-architectural apparatuses around them. They allowed the written word to become portable, reproducible, findable, preservable, and accessible. 

Chief among these apparatuses, of course, is the library. Librarie (n.): a collection of books. Liber (n.): the inner bark of a tree.

A library’s shelves are perfectly sized to fit the standard dimensions of books (and vice-versa). Climate control keeps the words shelf-stable. Neatly ordered rows and catalog systems help us find the right words, and a combination of personnel and inventory technologies allow us to borrow them for as long as we need, and then return them for someone else’s use. Libraries physically centralize books, allowing us access to great swaths of memory without needing to leave a single building. This centralization makes everything more efficient, especially when it comes to expert knowledge in the form of librarians. A few of these professionals can keep the whole system running smoothly, and ensure anyone who walks in the door will find exactly what they need.

Plan of the Bodleian Library, Oxford (from Geoffrey Tyack, Bodleian Library, University of Oxford: A History [Oxford: Oxford University Press, 2000], 28. Bodleian Library, University of Oxford, R. Ref. 357/3).

It’s no accident the beating heart of the university has always been its library. It has, traditionally, been one of the largest draws of an institution: join our faculty, and you’ll have easy access to all the knowledge that’s ever been written down. Through libraries, scholars found their gateway to the entire academic world, and could join a conversation that spanned geography and time. It’s not everything a researcher needs, for sure—chemists need glassblowers and fume hoods and reagents—but without it, a university traditionally cannot function.

This is changing, because our reliance on paper is changing. Money, science, and news are going digital. We’re not getting rid of paper—people still read books, use toilet paper, and ship things in boxes—but its direct role in knowledge production and circulation has shrunk dramatically over the last several decades. Everything from lab notes to journals is moving online, a trend as true in the humanities as it is in the sciences. Humanists still use printed books en masse, but increasingly the physical page is the object of study more than the means of conversation, indicating a subtle but important shift in how the library is being used.

Crisis

Our shifting relationship to paper has led to a crisis in the university library, an institution that evolved over thousands of years around the written word. As researchers increasingly turn to Google for pretty much everything, books are dropping in circulation, and the stacks are emptying out of people. Libraries, once the beating heart of the university, are scrambling to remain so. 

As I see it, the university library has two options: 

  1. Become something else to retain its centrality, or 
  2. Continue doing the papery tasks it evolved to do very well, as that task’s relevance slowly diminishes (but will, I hope and suspect, never disappear).

My lay impression, as an outsider until recently, is that most well-respected libraries are trying to do both, while institutional pressures push them to favor option #1. This pressure exists, in part, because universities’ successes have been tied to the success of their libraries, so universities need libraries to evolve if they hope to maintain the same relationship.

And option #1 is doing pretty well, it seems. The same architectural features that allowed us to centralize books (namely big, climate-controlled buildings) are also good for other things: meeting rooms, cafes, makerspaces, and the like. Libraries that clear out books for this get a lot of new feet through their doors, though the institutions are perhaps straying a bit from why they were so important in the first place.

James B. Hunt Jr. Library, https://commons.wikimedia.org/wiki/File:Hunt_Library_Commons_Area.JPG

Libraries are also making changes that hew spiritually closer to the papery tasks of yore, by becoming virtual information hubs. Through online catalogs/search, VPN-enabled digital subscriptions, data repositories, and the like, libraries are offering the same sorts of services they used to, connecting researchers to the larger scholarly world, without the need of a physical building. With the exception of negotiated subscriptions to digital journals, these efforts appear slightly less successful. 1 In large part this is because libraries are often playing catch-up to tech giants with massive budgets. 

To bridge the gap between the physical and the digital, libraries wind up paying large subscription fees to tech vendors which provide them search interfaces, hosting services, and other digital information solutions. Libraries do this largely because they were honed for thousands of years around written media, and are ill-equipped to handle most digital tasks themselves. 2 

Now libraries are paying outside vendors remarkable sums of money every year so they can play the same informational role they’ve always played, as the ground shifts beneath them. And even when this effort succeeds, as it occasionally does, it’s not clear that this informational role is as centrally valued in the university as it used to be. When new faculty choose where to work, I rarely hear them considering a library’s digital services in their decision, as important as they might be. The one exception is journal subscriptions, and with the Open Access movement and sci-hub, even access to articles seems to be decreasingly on researchers’ minds.

Choice

Don’t get me wrong. I believe we desperately need an apparatus that does for digital information what libraries do for the written word. Whereas we have thousands of years of experience dealing with geographically situated collections of knowledge, with the preservation, organization, and access of written words, we have little such experience with respect to digitally-embedded knowledge. It’s still difficult to find, and nearly impossible to preserve. We don’t have robust systems for allocating or evaluating trust, and the economic and legal apparatuses around digital knowledge are murky, disputed, and often deeply immoral.

Whatever institution evolves to deal with digital knowledge won’t just be the beating heart of the university, it will be an organizing body for the world. Specifically because everything is so interconnected and geographically diffuse, its eventual home will not be a university (unless universities, too, lose their tethers to geography). And because of this reach and the power it implies, such an institution will be incredibly difficult to form, both politically and technologically. It may take another few thousand years to get it right, if we have that long.

But perhaps we can start at home, with universities or university consortia. That’s how we got the internet to begin with, after all, though as it grew it privatized away from universities. Today, the web is mostly hosted by big tech companies like Amazon (originally a bookseller, no surprise there), discovered and reached by other big tech companies like Google, and accessed through infrastructure provided by other other big tech companies, like AT&T. When libraries plug into this world, they do so through external tech vendors, which plug directly into for-profit publishing houses like Elsevier. Those publishing houses, too, squeeze universities out of quite a lot of money.

The economies and politics of this system are grossly exploitative. They often undermine privacy, intellectual property, individual agency, and financial independence at every angle. The infrastructure upon which scholars work and across which they communicate is a constant point of friction against the ideals of academia.

Publishing science, for example, is ostensibly about making science public, in order to coordinate a global conversation. In an era of inexpensive communication, however, many scientific publishers spend considerable resources blocking the availability of scientific publications in order to secure their business’s financial viability. This makes sense for publishers, but doesn’t quite make sense for science. 3

Perhaps universities could band together to construct institutions that short-circuit this chain, from Amazon to Google to AT&T to Elsevier to Ex Libris. Perhaps we can construct an infrastructure for research that doesn’t barter on privacy, whose mission is the same as the mission of universities themselves. Universities are one of the few types of institutions that can afford to focus on long-term goals rather than short-term needs. 4 If there were a federated, academic alternative to our quagmire of an information economy, perhaps that would be the university’s “killer app”, the reason scholars decide to spend their time at one university over another, or in academia rather than a corporate research unit. We figured out eduroam and interlibrary loans, maybe we can figure out this too. 

Perhaps the answer to “what institution can be the necessary beating heart of centers of knowledge for the next 2,000 years?” is the same as the answer to “how can we organize information in a digital world?”. Or maybe it isn’t. 5 The second question is important enough that it’s worth a shot regardless, and universities as well as everyone else should be trying to figure it out.

But this post is about libraries, with tree pulp ground into their etymology. Libraries are so very good at solving paper problems, and if they cease to function in that capacity, we’ll have a lot of paper problems on our hands. Paper isn’t disappearing anytime soon, but institutional pressures to keep libraries as the heart of the university are the same pressures pushing them away from paper. So now, libraries are trying to solve problems they’re not terribly well-equipped to solve, which is why they rely so heavily on vendors, while the problems they are good at solving move further away from their core. This worries me.

via https://www.flickr.com/photos/ralf-steinberger/32929137350

I’m certainly not implying here that librarians aren’t good at computers, or at digital information. I recently became a librarian who happens to be a digital information professional, and I think I’m reasonably good at it. If I am good at it, it’s because I’ve learned from the small army of expert information professionals now employed in libraries.

The issues I’m raising aren’t ones of personnel, but of institutional function. I’m not convinced that (1) libraries being the best versions of themselves, (2) libraries being diffuse information hubs, and (3) libraries continuing to function as the infrastructure that makes a university successful are compatible goals. Or, even if they are compatible goals, that a single institution is the best choice for pursuing all three at once. And it certainly seems that our vendor-heavy way of going about things might be a necessary outcome of this tripartite split, which heavily benefits the short-term at the expense of the long-term. 

I’m at somewhat of a personal impasse. I find wordhouses and information hubs and academic infrastructure to be important, and I’d like to contribute to all three. As a librarian in a modern university library, I can do that. I seem to have evolved to fit well inside a university library, just as university libraries evolved to fit well around books. There are a lot of librarians who are perfectly shaped for the institutions in which we find ourselves. But I continue to wonder, are our institutions the right shape for the future we need to build, or are we trying to model space ships from ocean liners?

Caveats #2

One critique to this post came early, in the form of a tweet. I’m sharing it here to ensure readers take the most skeptical possible eye to my post:

Tweet critique, well-placed.

The name is redacted because I don’t want folks who disagree with it to project any negativity on the original poster. Their point is well-taken, and I hope everyone reading this blog post will take it with the grain of salt it deserves.

Notes:

  1. Slightly less successful doesn’t mean not successful! A lot of these projects do quite well.
  2. This is also due to a serious lack of funding.
  3. I’m a historian of science, and I’ll say the story is a good deal more complicated than this, but for the purposes of this post let’s keep it at that.
  4. One astute reply suggests this isn’t how universities actually function. I’d agree with that. I don’t believe universities do function like this, but they ought to be able to, more than any other institution out there. Part of the reason we’re in this mess to begin with is rampant academic short-termism.
  5. This possibility is one it seems more people need to seriously reckon with.

The Route of a Text Message

[Note: Find the more polished, professionally illustrated version of this piece at Motherboard|Vice!]

This is the third post in my full-stack dev (f-s d) series on the secret life of data. This installment is about a single text message: how it was typed, stored, sent, received, and displayed. I sprinkle in some history and context to break up the alphabet soup of protocols, but though the piece gets technical, it should all be easily understood.

The first two installments of this series are Cetus, about the propagation of errors in a 17th century spreadsheet, and Down the Rabbit Hole, about the insane lengths one occasionally needs to go through to track down the source of a dataset.

A Love Story

My leg involuntarily twitches with vibration—was it my phone, or just a phantom feeling?—and a quick inspection reveals a blinking blue notification. “I love you”, my wife texted me. I walk downstairs to wish her goodnight, because I know the difference between the message and the message, you know?

It’s a bit like encryption, or maybe steganography: anyone can see the text, but only I can decode the hidden data.

My translation, if we’re being honest, is just one extra link in a remarkably long chain of data events, all to send a message (“come downstairs and say goodnight”) in under five seconds across about 40 feet.

The message presumably began somewhere in my wife’s brain and somehow ended up in her thumbs, but that’s a signal for a different story. Ours begins as her thumb taps a translucent screen, one letter at a time, and ends as light strikes my retinas.

Through the Looking Glass

With each tap, a small electrical current passes from the screen to her hand. Because electricity flows easily through human bodies, sensors on the phone register a change in voltage wherever her thumb presses against the screen. But the world is messy, and the phone senses random fluctuations in voltage across the rest of the screen, too, so an algorithm determines the biggest, thumbiest-looking voltage fluctuations and assumes that’s where she intended to press.


Figure 0. Capacitive touch.

So she starts tap-tap-tapping on the keyboard, one letter at a time.

I-spacebar-l-o-v-e-spacebar-y-o-u.

She’s not a keyboard swiper (I am, but somehow she still types faster than me). The phone reliably records the (x,y) coordinates of each thumbprint and aligns them with the coordinates of each key on the screen. It’s harder than you think; sometimes her thumb slips, yet somehow the phone realizes she’s not trying to swipe, that it was just a messy press.

Deep in the metal guts of the device, an algorithm tests whether each thumb-shaped voltage disruption moves more than a certain number of pixels, called touch slop. If the movement is sufficiently small, the phone registers a keypress rather than a swipe.

Fig 1. Android’s code for detecting ‘touch slop’. Notice the developers had my wife’s gender in mind.
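For the curious, the gist of that check looks something like the sketch below (hypothetical names and threshold, written in Python rather than the actual Android code pictured above):

import math

TOUCH_SLOP_PX = 8  # hypothetical threshold; the real value varies by device

def classify_touch(down_x, down_y, up_x, up_y):
    # A press that drifts less than the touch slop counts as a tap, not a swipe.
    drift = math.hypot(up_x - down_x, up_y - down_y)
    return "tap" if drift < TOUCH_SLOP_PX else "swipe"

print(classify_touch(100, 200, 103, 201))  # tiny drift  -> "tap"
print(classify_touch(100, 200, 160, 205))  # large drift -> "swipe"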

She finishes her message, a measly 10 characters of her allotted 160.

The allotment of 160 characters is a carefully chosen number, if you believe the legend: In 1984, German telephone engineer Friedhelm Hillebrand sat at his typewriter and wrote as many random sentences as came to his mind. His team then looked at postcards and telex messages, and noticed most fell below 160 characters. “Eureka!”, they presumably yelled in German, before setting the character limit of text messages in stone for the next three-plus decades.

Character Limits & Legends

Legends rarely tell the whole story, and the legend of SMS is no exception. Hillebrand and his team hoped to relay messages over a secondary channel that phones were already using to exchange basic information with their local stations.

Signalling System no. 7 (SS7) is a set of protocols used by cell phones to stay in constant contact with their local tower; they need this continuous connection to know when to ring, to get basic location tracking, to check for voicemail, and to communicate other non-internet-reliant messages. Since the protocol’s creation in 1980, it has had a hard limit of 279 bytes of information. If Hillebrand wanted text messages to piggyback on the SS7 protocol, he had to deal with this pretty severe limit.

Normally, 279 bytes equals 279 characters. A byte is eight bits (each bit is a 0 or 1), and in common encodings, a single letter is equivalent to eight 0s and 1s in a row.

‘A’ is

0100 0001

‘B’ is

0100 0010

‘C’ is

0100 0011

and so on.
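You can verify those bit patterns yourself; in Python, for instance:

for letter in "ABC":
    print(letter, format(ord(letter), "08b"))
# A 01000001
# B 01000010
# C 01000011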

Unfortunately, getting messages across the SS7 protocol isn’t a simple matter of sending 2,232 (that’s 279 bytes at 8 bits each) 0s or 1s through radio signals from my phone to yours. Part of that 279-byte signal needs to contain your phone number, and part of it needs to contain my phone number. Part of it needs to let the cell tower know “hey, this is a message, not a call, don’t ring the phone!”.

By the time Hillebrand and his team finished cramming all the necessary contextual bits into the 279-byte signal, they were left with only enough space for 140 characters at 1 byte (8 bits) a piece, or 1,120 bits.

But what if they could encode a character in only 7 bits? At 7 bits per character, they could squeeze 160 (1,120 / 7 = 160) characters into each SMS, but those extra twenty characters demanded a sacrifice: fewer possible letters.

An 8-bit encoding allows 256 possible characters: lowercase ‘a’ takes up one possible space, uppercase ‘A’ another space, a period takes up a third space, an ‘@’ symbol takes up a fourth space, a line break takes up a fifth space, and so on up to 256. To squeeze an alphabet down to 7 bits, you need to remove some possible characters: the 1/2 symbol (½), the degree symbol (°), the pi symbol (π), and so on. But assuming people will never use those symbols in text messages (a poor assumption, to be sure), this allowed Hillebrand and his colleagues to stuff 160 characters into a 140-byte space, which in turn fit neatly into a 279-byte SS7 signal: exactly the number of characters they claim to have discovered was the perfect length of a message. (A bit like the miracle of Hanukkah, if you ask me.)
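If you want to double-check the arithmetic, the bit budget works out like this (a quick sketch in Python, using the 140-byte payload described above):

payload_bits = 140 * 8        # 140 bytes left over for the message itself
print(payload_bits // 8)      # 140 characters at 8 bits apiece
print(payload_bits // 7)      # 160 characters at 7 bits apiece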

Fig 2. The GSM-7 character set.

So there my wife is, typing “I love you” into a text message, all the while the phone converts those letters into this 7-bit encoding scheme, called GSM-7.

“I” (notice it’s at the intersection of 4x and x9 above) =

49 

Spacebar (notice it’s at the intersection of 2x and x0 above) =

20 

“l” =

6C

“o” =

6F

and so on down the line.

In all, her slim message becomes:

49 20 6C 6F 76 65 20 79 6F 75 

(10 bytes combined). Each two-character code, called a hex code, is one 8-bit chunk, and together it spells “I love you”.

But this is actually not how the message is stored on her phone. The phone has to pack those 8-bit codes down into 7 bits apiece, which it does by essentially borrowing the unused bit at the end of every byte. The math is a bit more complicated than is worth getting into here, but the resulting message appears as

49 10 FB 6D 2F 83 F2 EF 3A 

(9 bytes in all) in her phone.
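If you’d like to watch that bit-borrowing happen, here’s a short Python sketch of the packing scheme (my own illustration, not production telecom code); it reproduces exactly the nine bytes above from the ten GSM-7 values.

# GSM-7 values for "I love you", read off the character table in Fig 2
septets = [0x49, 0x20, 0x6C, 0x6F, 0x76, 0x65, 0x20, 0x79, 0x6F, 0x75]

def pack_septets(septets):
    # Pack 7-bit values into 8-bit bytes, lowest bits first.
    buffer, bit_count, packed = 0, 0, []
    for s in septets:
        buffer |= (s & 0x7F) << bit_count   # stack 7 new bits on top of the leftovers
        bit_count += 7
        while bit_count >= 8:               # peel off full bytes as they accumulate
            packed.append(buffer & 0xFF)
            buffer >>= 8
            bit_count -= 8
    if bit_count:                           # flush whatever bits remain
        packed.append(buffer & 0xFF)
    return packed

print(" ".join(format(b, "02X") for b in pack_septets(septets)))
# 49 10 FB 6D 2F 83 F2 EF 3A  (ten 7-bit characters squeezed into nine 8-bit bytes)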

When my wife finally finishes her message (it takes only a few seconds), she presses ‘send’ and a host of tiny angels retrieve the encoded message, flutter their invisible wings the 40 feet up to the office, and place it gently into my phone. The process isn’t entirely frictionless, which is why my phone vibrates lightly upon delivery.

The so-called “telecommunication engineers” will tell you a different story, and for the sake of completeness I’ll relay it to you, but I wouldn’t trust them if I were you.

SIM-to-Send

The engineers would say that, when the phone senses voltage fluctuations over the ‘send’ button, it sends the encoded message to the SIM card (that tiny card your cell provider puts in your phone so it knows what your phone number is), and in the process it wraps it in all sorts of useful contextual data. By the time it reaches my wife’s SIM, it goes from a 140-byte message (just the text) to a 176-byte message (text + context).

The extra 36 bytes are used to encode all sorts of information, seen below.

Fig 3. Here, bytes are called octets (8 bits). Counting all possible bytes yields 174 (10+1+1+12+1+1+7+1+140). The other two bytes are reserved for some SIM card bookkeeping.

The first ten bytes are reserved for the telephone number (service center address, or SCA) of the SMS service center (SMSC), tasked with receiving, storing, forwarding, and delivering text messages. It’s essentially a switchboard: my wife’s phone sends out a signal to the local cell tower and gives it the number of the SMSC, which forwards her text message from the tower to the SMSC. The SMSC, which in our case is operated by AT&T, routes the text to the mobile station nearest to my phone. Because I’m sitting three rooms away from my wife, the text just bounces back to the same mobile station, and then to my phone.

Fig 4. SMS cellular network

The next byte (PDU-type) encodes some basic housekeeping on how the phone should interpret the message, including whether it was sent successfully, whether the carrier requests a status report, and (importantly) whether this is a single text or part of a string of connected messages.

The byte after the PDU-Type is the message reference (MR). It’s a number between 0 and 255, used essentially as a short-term ID to let the phone and the carrier know which text message they’re dealing with. In my wife’s case the number is set to 0, because her phone has its own message ID system independent of this particular file.

The next twelve bytes or so are reserved for the recipient’s phone number, called the destination address (DA). With the exception of the 7-bit character encoding I mentioned earlier, the one that helps us stuff 160 letters into a 140-byte space, the phone number encoding is made up of the stupidest, most confusing bits you’ll encounter in this SMS. It’s called reverse nibble notation, and it reverses every other digit in a large number. (Get it? Part of a byte is a nibble, hahahahaha, nobody’s laughing, engineers.)

My number, which is usually 1-352-537-8376, is logged in my wife’s phone as:

3125358773f6

The 1-3 is represented by

31

The 52 is represented by

25

The 53 is represented by

35

The 7-8 is represented by

87

The 37 is represented by

73

And the 6 is represented by…

f6

Where the fuck did the ‘f’ come from? It means it’s the end of the phone number, but for some awful reason (again, reverse nibble notation) it’s one character before the final digit.

It’s like pig latin for numbers.

tIs'l ki eip galit nof runbmre.s

But I’m not bitter.

[Edit: Sean Gies points out that reverse nibble notation is an inevitable artifact of representing 4-bit little-endian numbers in 8-bit chunks. That doesn’t invalidate the above description, but it does add some context for those who know what it means, and makes the decision seem more sensible.]
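If you’d like to try reverse nibble notation yourself, here’s a small Python sketch of the rule (again, just my own illustration); it reproduces the encoding of my number above.

def reverse_nibble(number):
    # Keep only the digits, pad odd-length numbers with 'f', then swap each pair.
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) % 2:
        digits += "f"
    return "".join(digits[i + 1] + digits[i] for i in range(0, len(digits), 2))

print(reverse_nibble("1-352-537-8376"))  # 3125358773f6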

The Protocol Identifier (PID) byte is honestly, at this point, mostly wasted space. It takes about 40 possible values, and it tells the service provider how to route the message. A value of

22 

means my wife is sending “I love you” to a fax machine; a value of

24 

means she’s sending it to a voice line, somehow. Since she’s sending it as an SMS to my phone, which receives texts, the PID is set to

0

(Like every other text sent in the modern world.)

Fig 5. Possible PID Values

The next byte is the Data Coding Scheme (DCS, see this doc for details), which tells the carrier and the receiving phone which character encoding scheme was used. My wife used GSM-7, the 7-bit alphabet I mentioned above that allows her to stuff 160 letters into a 140-byte space, but you can easily imagine someone wanting to text in Chinese, or someone texting a complex math equation (ok, maybe you can’t easily imagine that, but a guy can dream, right?).

In my wife’s text, the DCS byte was set to

0

meaning she used a 7-bit alphabet, but she may have changed that value to use an 8- or 16-bit alphabet, which would allow her many more possible letters, but a much smaller space to fit them. Incidentally, this is why when you text emoji to your friend, you have fewer characters to work with.

There’s also a little flag in the DCS byte that tells the phone whether to self-destruct the message after sending it, Mission Impossible style, so that’s neat.

The validity period (VP) space can take up to seven bytes, and sends us into another aspect of how text messages actually work. Take another look at Figure 4, above. It’s okay, I’ll wait.

When my wife finally hits ‘send’, the text gets sent to the SMS Service Center (SMSC), which then routes the message to me. I’m upstairs and my phone is on, so I receive the text in a handful of seconds, but what if my phone were off? Surely my phone can’t accept a message when it’s not receiving any power, so the SMSC has to do something with the text.

If the SMSC can’t find my phone, my wife’s message will just bounce around in its system until the moment my phone reconnects, at which point it sends the text out immediately. I like to think of the SMSC continuously checking every online phone to see if its mine like a puppy waiting for its human by the door: is that smell my human? No. Is that smell my human? No. Is this smell my human? YESYESJUMPNOW.

The validity period (VP) bytes tell the carrier how long the puppy will wait before it gets bored and finds a new home. It’s either a timestamp or a duration, and it basically says “if you don’t see the recipient phone pop online in the next however-many days, just don’t bother sending it.” The default validity period for a text is 10,080 minutes, which means if it takes me more than seven days to turn my phone back on, I’ll never receive her text.

Because there’s often a lot of empty space in an SMS, a few bits here or there are dedicated to letting the phone and carrier know exactly which bytes are unused. If my wife’s SIM card expects a 176-byte SMS, but because she wrote an exceptionally short message it only receives a 45-byte SMS, it may get confused and assume something broke along the way. The user data length (UDL) byte solves this problem: it relays exactly how many bytes the text in the text message actually take up.

In the case of “I love you”, the UDL claims the subsequent message is 9 bytes. You’d expect it to be 10 bytes, one for each of the 10 characters in

I-spacebar-l-o-v-e-spacebar-y-o-u

but because each character is 7 bits rather than 8 bits (a full byte), we’re able to shave an extra byte off in the translation. That’s because 7 bits * 10 characters = 70 bits, divided by 8 (the number of bits in a byte) = 8.75 bytes, rounded up to 9 bytes.

Which brings us to the end of every SMS: the message itself, or the UD (User Data). The message can take up to 140 bytes, though as I just mentioned, “I love you” will pack into a measly 9. Amazing how much is packed into those 9 bytes—not just the message (my wife’s presumed love for me, which is already difficult enough to compress into 0s and 1s), but also the message (I need to come downstairs and wish her goodnight). Those bytes are:

49 10 FB 6D 2F 83 F2 EF 3A.

In all, then, this is the text message stored on my wife’s SIM card:

SCA[1-10]-PDU[1]-MR[1]-DA[1-12]-PID[1]-DCS[1]-VP[0, 1, or 7]-UDL[1]-UD[0-140]

00 - 11 - 00 - 07 31 25 35 87 73 F6 - ?? 00 ?? - ?? - 09 - 49 10 FB 6D 2F 83 F2 EF 3A

(Note: to get the full message, I need to do some more digging. Alas, you only see most of the message here, hence the ??s.)

Waves in the Æther

Somehow [he says in David Attenborough’s voice], the SMS must now begin its arduous journey from the SIM card to the nearest base station.  To do that, my wife’s phone must convert a string of 176 bytes to the 279 bytes readable by the SS7 protocol, convert those digital bytes to an analog radio signal, and then send those signals out into the æther at a frequency of somewhere between 800 and 2000 megahertz. That means each wave is between 6 and 14 inches from one peak to the next.

Fig 6. Wavelength

In order to efficiently send and receive signals, antennas should be no smaller than half the size of the radio waves they’re dealing with. If cell waves are 6 to 14 inches, their antennas need to be 3-7 inches. Now stop and think about the average height of a mobile phone, and why they never seem to get much smaller.
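The arithmetic behind those numbers is straightforward, assuming free-space propagation (wavelength = speed of light / frequency); here’s the quick check in Python:

SPEED_OF_LIGHT = 3.0e8   # metres per second
INCHES_PER_METRE = 39.37

for freq_mhz in (800, 2000):
    wavelength_in = SPEED_OF_LIGHT / (freq_mhz * 1e6) * INCHES_PER_METRE
    print(f"{freq_mhz} MHz: wave ~{wavelength_in:.1f} in, half-wave antenna ~{wavelength_in / 2:.1f} in")
# 800 MHz:  wave ~14.8 in, half-wave antenna ~7.4 in
# 2000 MHz: wave ~5.9 in, half-wave antenna ~3.0 in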

Through some digital gymnastics that would take entirely too long to explain, suddenly my wife’s phone shoots a 279-byte information packet containing “I love you” at the speed of light in every direction, eventually fizzling into nothing after about 30 miles.

Well before getting that far, her signal strikes the AT&T HSPA Base Station ID199694204 LAC21767. This base transceiver station (BTS) is about 5 blocks from my favorite bakery in Hazelwood, La Gourmandine, and though I was able to find its general location using an android app called OpenSignal, the antenna is camouflaged beyond my ability to find it.

The really fascinating bit here is that it reaches the base transceiver station at all, given everything else going on. Not only is my wife texting me “I love you” in the 1,000-ish MHz band of the electromagnetic spectrum; tens of thousands of other people are likely talking on the phone or texting within the 30-mile radius around my house, beyond which cell signals disintegrate. On top of that, a slew of radio and TV signals are jostling for attention in our immediate airspace, alongside visible light bouncing this way and that, to name a few of the many electromagnetic waves that seem like they ought to be getting in the way.

As Richard Feynman eloquently put it in 1983, it’s a bit like the cell tower is a little blind bug resting gently atop the water on one end of a pool, and based only on the frequency and direction of waves that cause it to bounce up and down, it’s able to reconstruct who’s swimming and where.

Feynman discussing waves.

In part due to the complexity of competing signals, each base transceiver station generally can’t handle more than 200 active users (using voice or data) at a time. So “I love you” pings my local base transceiver station, about half a mile away, and then shouts itself into the void in every direction until it fades into the noise of everyone else.

Switching

I’m pretty lucky, all things considered. Were my wife and I on different cell providers, or were we in different cities, the route of her message to me would be a good deal more circuitous.

My wife’s message is massaged into the 279-byte SS7 channel, and sent along to the local base transceiver station (BTS) near the bakery. From there, it gets routed to the base station controller (BSC), which is the brain of not just our antenna, but several other local antennas besides. The BSC flings the text to AT&T Pittsburgh’s mobile switching center (MSC), which relies on the text message’s SCA (remember the service center address embedded within every SMS? That’s where this comes in) to get it to the appropriate short message service center (SMSC).

This alphabet soup is easier to understand with the diagram from figure 7; I just described steps 1 through 3. If my wife were on a different carrier, we’d continue through steps 4-7, because that’s where the mobile carriers all talk to each other. The SMS has to go from the SMSC to a global switchboard and then potentially bounce around the world before finding its way to my phone.

Fig 7. SMS routed through a GSM network.

But she’s on AT&T and I’m on AT&T, and our phones are connected to the same tower, so after step 3 the 279-byte packet of love just does an about-face and returns through the same mobile service center, through the same base station, and now to my phone instead of hers. A trip of a few dozen miles in the blink of an eye.

Sent-to-SIM

Buzzzzz. My pocket vibrates. A notification lets me know an SMS has arrived through my nano-SIM card, a circuit board about the size of my pinky nail. Like Bilbo Baggins or any good adventurer, it changed a bit in its trip there and back again.

Fig 8. Received message, as opposed to sent message (figure 3).

Figure 8 shows the structure of the message “I love you” now stored on my phone. Comparing figures 3 and 8, we see a few differences. The SCA (phone number of the short message service center), the PDU (some mechanical housekeeping), the PID (phone-to-phone rather than, say, phone-to-fax), the DCS (character encoding scheme), the UDL (length of message), and the UD (the message itself) are all mostly the same, but the VP (the text’s expiration date), the MR (the text’s ID number), and the DA (my phone number) are missing.

Instead, on my phone, there are two new pieces of information: the OA (originating address, or my wife’s phone number), and the SCTS (service center time stamp, or when my wife sent the message).

My wife’s phone number is stored in the same annoying reverse nibble notation (like dyslexia but for computers) that my phone number was stored in on her phone, and the timestamp is stored in the same format as the expiration date was stored in on her phone.
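To make “reverse nibble” concrete, here’s a toy decoder in Python. The number in the example is made up (555-0123), not anyone’s actual address; the 0xF nibble is the filler that pads odd-length numbers.

    def decode_reverse_nibble(octets):
        """Read swapped-nibble BCD: each octet stores one digit in its low nibble
        and the next digit in its high nibble."""
        digits = []
        for byte in octets:
            digits.append(byte & 0x0F)   # first digit of the pair
            digits.append(byte >> 4)     # second digit of the pair
        return "".join(str(d) for d in digits if d != 0xF)  # drop the filler nibble

    # A made-up number, 555-0123, as it would sit in reverse nibble notation:
    print(decode_reverse_nibble([0x55, 0x05, 0x21, 0xF3]))  # -> 5550123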

These two information inversions make perfect contextual sense. Her phone needed to reach me by a certain time at a certain address, and I now need to know who sent the message and when. Without the home address, so to speak, I wouldn’t know whether the “I love you” came from my wife or my mother, and the difference would change my interpretation of the message fairly significantly.

Through a Glass Brightly

In much the same way that any computer translates a stream of bytes into a series of (x,y) coordinates with specific color assignments, my phone’s screen gets the signal to render

49 10 FB 6D 2F 83 F2 EF 3A

on the screen in front of me as “I love you” in backlit black-and-white. It’s an interesting process, but as it’s not particularly unique to smartphones, you’ll have to look it up elsewhere. Let’s instead focus on how those instructions become points of light.
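Before we do, one small aside for the curious: running the earlier packing sketch in reverse turns those nine octets back into the ten characters (again assuming the GSM 7-bit codes coincide with ASCII, which they do for this particular message).

    def unpack_gsm7(packed):
        """Unpack GSM 03.38 7-bit octets back into characters."""
        text, bits, nbits = [], 0, 0
        for byte in packed:
            bits |= byte << nbits     # append 8 new bits above any leftovers
            nbits += 8
            while nbits >= 7:         # peel off septets as they become whole
                text.append(chr(bits & 0x7F))
                bits >>= 7
                nbits -= 7
        return "".join(text)

    print(unpack_gsm7(bytes.fromhex("4910FB6D2F83F2EF3A")))
    # I love you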

The friendly marketers at Samsung call my screen a Super AMOLED (Active Matrix Organic Light-Emitting Diode) display, which is somehow both redundant and not particularly informative, so we’ll ignore unpacking the acronym as yet another distraction, and dive right into the technology.

There are about 330,000 tiny sources of light, or pixels, crammed inside each of my phone screen’s 13 square inches. For that many pixels, each needs to be about 45µm (micrometers) wide: thinner than a human hair. There’s 4 million of ‘em in all packed into the palm of my hand.
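The arithmetic, for anyone who wants to check my figures (330,000 pixels per square inch works out to roughly 575 pixels per linear inch, which is plausible for a Samsung flagship of that period):

    import math

    pixels_per_sq_inch = 330_000
    screen_sq_inches = 13

    ppi = math.sqrt(pixels_per_sq_inch)        # pixels per linear inch
    pitch_um = 25_400 / ppi                    # 25,400 micrometers per inch
    total_pixels = pixels_per_sq_inch * screen_sq_inches

    print(f"{ppi:.0f} ppi, {pitch_um:.0f} um per pixel, {total_pixels / 1e6:.1f} million pixels")
    # 574 ppi, 44 um per pixel, 4.3 million pixels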

But you already know how screens work. You know that every point of light, like the Christian God or Musketeers (minus d’Artagnan), is always a three-for-one sort of deal. Red, green, and blue combine to form white light in a single pixel. Fiddle with the luminosity of each channel, and you get every color in the rainbow. And since 4 x 3 = 12, that’s 12 million tiny sources of light sitting innocently dormant behind my black mirror, waiting for me to press the power button to read my wife’s text.

Fig 9. The subpixel array of a Samsung OLED display.

Each pixel, as the acronym suggests, is an organic light-emitting diode. That’s fancy talk for an electricity sandwich:

Fig 10. An electricity sandwich.

The layers aren’t too important, beyond the fact that it’s a cathode plate (negatively charged), below a layer of organic molecules (remember back to high school: it’s just some atoms strung together with carbon), below an anode plate (positively charged).

When the phone wants the screen on, it sends electrons from the cathode plate to the anode plate. The sandwiched molecules intercept the energy, and in response they start emitting visible light, photons, up through the transparent anode, up through the screen, and into my waiting eyes.

Since each pixel is three points of light (red, green, and blue), there’s actually three of these sandwiches per pixel. They’re all essentially the same, except the organic molecule is switched out: poly(p-phenylene) for blue light, polythiophene for red light, and poly(p-phenylene vinylene) for green light. Because each is slightly different, they shine different colors when electrified.

(Fun side fact: blue subpixels burn out much faster, due to a process called “exciton-polaron annihilation”, which sounds really exciting, doesn’t it?)

All 4 million pixels are laid out on an indexed matrix. An index works in a computer much the same way it works in a book: when my phone wants a specific pixel to light a certain color, it looks that pixel up in the index, and then sends a signal to the address it finds. Let there be light, and there was light.
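As a toy model of that lookup (nothing like the real display controller, just the book-index idea, with a hypothetical resolution): treat the screen as one long list of subpixel brightnesses and compute each pixel’s address from its row and column.

    WIDTH, HEIGHT = 1440, 2560                    # hypothetical resolution, for illustration
    framebuffer = bytearray(WIDTH * HEIGHT * 3)   # three subpixels (R, G, B) per pixel

    def set_pixel(x, y, r, g, b):
        """Look the pixel up in the 'index' and write its three channel values."""
        addr = (y * WIDTH + x) * 3                # row-major address in the long list
        framebuffer[addr:addr + 3] = bytes((r, g, b))

    set_pixel(720, 1280, 255, 255, 255)           # light one pixel full white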

(Fun side fact: now you know what “Active Matrix Organic Light-Emitting Diode” means, and you didn’t even try.)

My phone’s operating system interprets my wife’s text message, figures out the shape of each letter, and maps those shapes to the indexed matrix. It sends just the right electric pulses through the Super AMOLED screen to render those three little words that have launched ships and vanquished curses.

The great strangeness here is that my eyes never see “I love you” in bright OLED lights; it appears on the screen black-on-white. The phone creates the illusion of text through negative space, washing the screen white by setting every red, green, & blue to maximum brightness, then turning off the bits where letters should be. Its complexity is offensively mundane.

Fig 11. Negative space.

In displaying everything but my wife’s text message, and letting me read it in the gaps, my phone succinctly betrays the lie at the heart of the information age: that communication is simple. Speed and ease hide a mountain of mediation.

And that mediation isn’t just technical. My wife’s text wouldn’t have reached me had I not paid the phone bill on time, had there not been a small army of workers handling financial systems behind the scenes. Technicians keep the phone towers in working order, which they reach via a network of roads partially subsidized by federal taxes collected from hundreds of millions of Americans across 50 states. Because so many transactions still occur via mail, if the U.S. postal system collapsed tomorrow, my phone service would falter. Exploited factory workers in South America and Asia assembled parts in both our phones, and exhausted programmers renting expensive Silicon Valley closets are as-you-read-this pushing out code ensuring our phones communicate without interruption.

All of this underneath a 10-character text. A text which, let’s be honest, means much more than it says. My brain subconsciously peels back years of interactions with my wife to decode the message appearing on my phone, but between her and me there’s still a thicket of sociotechnical mediation, a stew of people and history and parts, that can never be untangled.

The Aftermath

So here I am, in the office late one Sunday night. “I love you,” my wife texted from the bedroom downstairs, before the message traversed 40 or so feet to my phone in a handful of seconds. I realize what it means: it’s time to wish her goodnight, and perhaps wrap up this essay. I tap away the last few words, now slightly more cognizant of the complex layering of miles, signals, years of history, and human sweat it took to keep my wife from having to shout upstairs that it’s about damn time I get some rest.

Thanks to Christopher Warren, Vika Zafrin, and Nechama Weingart for comments on earlier drafts.

Encouraging Misfits

tl;dr Academics’ individual policing of disciplinary boundaries at the expense of intellectual merit does a disservice to our global research community, which is already structured to reinforce disciplinarity at every stage. We should work harder to encourage research misfits to offset this structural pull.


The academic game is stacked to reinforce old community practices. PhDs aren’t only about specialization, but about teaching you to think, act, write, and cite like the discipline you’ll soon join. Tenure is about proving to your peers you are like them. Publishing and winning grants are as much about goodness of fit as about quality of work.

This isn’t bad. One of science’s most important features is that it’s often cumulative or at least agglomerative, that scientists don’t start from scratch with every new project, but build on each other’s work to construct an edifice that often resembles progress. The scientific pipeline uses PhDs, tenure, journals, and grants as built-in funnels, ensuring everyone is squeezed snugly inside the pipes at every stage of their career. It’s a clever institutional trick to keep science cumulative.

But the funnels work too well. Or at least, there’s no equally entrenched clever institutional mechanism for building new pipes, for allowing the development of new academic communities that break the mold. Publishing in established journals that enforce their community boundaries is necessary for your career; most of the world’s scholarly grant programs are earmarked for and evaluated by specific academic communities. It’s easy to be disciplinary, and hard to be a misfit.

To be sure, this is a known problem. Patches abound. Universities set aside funds for “interdisciplinary research” or “underfunded areas”; postdoc positions, centers, and antidisciplinary journals exist to encourage exactly the sort of weird research I’m claiming has little place in today’s university. These solutions are insufficient.

University and even external grant programs fostering “interdisciplinarity” for its own sake become mostly useless because of the laws of Goodhart & Campbell. They’re usually designed to bring disciplines together rather than to sidestep disciplinarity altogether, which, while admirable, makes them pretty easy to game and often leads to awkward alliances of convenience.

Dramatic rendition of types of -disciplinarity from Lotrecchiano in 2010, shown here never actually getting outside disciplines.

Universities do a bit better in encouraging certain types of centers that, rather than being “interdisciplinary”, are focused on a specific goal, method, or topic that doesn’t align easily with the local department structure. A new pipe, to extend my earlier bad metaphor. The problems arise here because centers often lack the institutional benefits available to departments: they rely on soft money, don’t get kickback from grant overheads, don’t get money from cross-listed courses, and don’t get tenure lines. Antidisciplinary postdoc positions suffer a similar fate, allowing misfits to thrive for a year or so before having to go back on the job market to rinse & repeat.

In short, the overwhelming inertial force of academic institutions pulls towards disciplinarity despite frequent but half-assed or poorly-supported attempts to remedy the situation. Even when new disciplinary configurations break free of institutional inertia, presenting themselves as means to knowledge every bit as legitimate as traditional departments (chemistry, history, sociology, etc.), it can take decades for them to even be given the chance to fail.

It is perhaps unsurprising that the community which taught us about autopoiesis proved incapable of sustaining itself, though half a century on its influences are glaringly apparent and far-reaching across today’s research universities. I wonder if we reconfigured the organization of colleges and departments from scratch today, whether there would be more departments of environmental studies and fewer departments of [redacted] 1.

I bring this all up to raise awareness of the difficulty facing good work with no discernible home, and to advocate for some individual action which, though it won’t change the system overnight, will hopefully make the world a bit easier for those who deserve it.

It is this: relax the reflexive disciplinary boundary drawing, and foster programs or communities which celebrate misfits. I wrote a bit about this last year in the context of history and culturomics; historians clamored to show that culturomics was bad history, but culturomics never attempted to be good history—it attempted to be good culturomics. Though I’d argue it often failed at that as well, it should have been evaluated by its own criteria, not the criteria of some related but different discipline.

Some potential ways to move forward:

  • If you are reviewing for a journal or grant and the piece is great, but doesn’t quite fit, and you can’t think of a better home for it, push against the editor to let it in anyway.
  • If you’re a journal editor or grant program officer, be more flexible with submissions which don’t fit your mold but don’t have easy homes elsewhere.
  • If you control funds for research grants, earmark half your money for good work that lacks a home. Not “good work that lacks a home but still looks like the humanities”, or “good work that looks like economics but happens to involve a computer scientist and a biologist”, but truly homeless work. I realize this won’t happen, but if I’m advocating, I might as well advocate big!
  • If you are training graduate students, hiring faculty, or evaluating tenure cases, relax the boundary-drawing urge to say “her work is fascinating, but it’s not exactly our department.”
  • If you have administrative and financial power at a university, commit to supporting nondisciplinary centers and agendas with the creation of tenure lines, the allocation of course & indirect funds, and some of the security offered to departments.

Ultimately, we need clever systems to foster nondisciplinary thinking which are as robust as those systems that foster cumulative research. This problem is above my paygrade. In the meantime, though, we can at least avoid the urge to equate disciplinary fitness with intellectual quality.

Notes:

  1. You didn’t seriously expect me to name names, did you?

Experience

Last week, I publicly outed myself as a non-tenure-track academic diagnosed on the autism spectrum, 1 hoping that doing so might help other struggling academics find solace knowing they are not alone. I was unprepared for the outpouring of private and public support. Friends, colleagues, and strangers thanked me for helping them feel a little less alone, which in turn helped me feel much less alone. Thank you all, deeply and sincerely.

In a similar spirit, for interested allies and struggling fellows, this post is about how my symptoms manifest in the academic world, and how I manage them. 2

Navigating the social world is tough—a fact that may surprise some of my friends and most of my colleagues. I do alright at conferences and in groups, when conversation is polite and skin-deep, but it requires careful concentration and a lot of smoke and mirrors. Inside, it feels like I’m translating from Turkish to Cantonese without knowing either language. Every time this is said, that is the appropriate reply, though I struggle to understand why. I just possess a translation book, and recite what is expected. Stimulus and response. This skill was only recently acquired.

Looking at the point between people’s eyes makes it appear as though I am making direct eye contact during conversations. Certain observations (“you look tired”) are apparently less well-received than others (“you look excited”), and I’ve mostly learned which are which.

After a long day keeping up this appearance, especially at conferences, I find a nice dark room and stay there. Sharing conference hotel rooms with fellow academics is never an option. Some strategies I figured out myself; others, like the eye contact trick, I built over extended discussions with an old girlfriend after she handed me a severely-highlighted copy of The Partner’s Guide to Asperger Syndrome.

ADHD and Autism Spectrum Disorder are highly co-morbid, and I have been diagnosed with either or both by several independent professionals in the last twenty years. Working is hard, and often takes at least twice as much time for me as it does for the peers with whom I have discussed this. When interested in something, I lose myself entirely in it for hours on end, but a single break in concentration will leave me scrambling. It may take hours or days to return to a task, if I do at all. My best work is done in marathon, and work that takes longer than a few days may never get finished, or may drop in quality precipitously. Keeping the internet disconnected and my phone off during regular periods every day, locked in my windowless office, helps keep distractions at bay. But, I have yet to discover a good strategy to manage long projects. A career in the book-driven humanities may have been a poor choice.

Paying bills on time, keeping schedules, and replying to emails are among the most stressful tasks in my life. When I don’t adequately handle all of these mundane tasks, it sets in motion a cycle of horror that paralyzes my ability to get anything done, until I eventually file for task bankruptcy and inevitably disappoint colleagues, friends, or creditors to whom action is owed. Poor time management and stress-cycles lead me to over-promise and under-deliver. On the bright side, I recently received help in strategies to improve that, and they work. Sometimes.

Friendships, surprisingly, are easy to maintain but difficult to nourish. My friends consider me trustworthy and willing to help (if not necessarily always dependable), but I lose track of friends or family who aren’t geographically close. Deeper emotional relationships are rare or, for swaths of my life, non-existent. I get no fits of anger or depression or elation or excitement. Indeed, my friends and family remark how impossible it is to see if I like a gift they’ve given me.

People occasionally describe my actions as offensive, rude, or short, and I get frustrated trying to understand exactly why what I’m doing fits into those categories. Apparently, early in grad school, I had a bit of a reputation for asking obnoxious questions in lectures. But I don’t like upsetting people, and actively (maybe successfully?) try to curb these traits when they are pointed out.

Thankfully, academic life allows me the freedom to lock myself in a room and focus on a task. Using work as a coping mechanism for social difficulties may be unhealthy, but hey, at least I found a career that rewards my peculiarities.

My life is pretty great. I have good friends, a loving family, and hobbies that challenge me. As long as I maintain the proper controlled environment, my fixations and obsessions are a perfect complement to an academic career, especially in a culture that (unfortunately) rewards workaholism. The same tenacity often compensates for difficulties in navigating romantic relationships, of which I’ve had a few incredibly fulfilling and valuable ones over my life thus far.

Unfortunately, my experience on the autism spectrum is not shared by all academics. Some have enough difficulty managing the social world that they end up alienating colleagues who are on their tenure committees, to disastrous effect. From private conversations, it seems autistic women suffer more from this than men, as they are expected to perform more service work and to be more social. Supportive administrators can be vital in these situations, and autism-spectrum academics may want to negotiate accommodations for themselves as part of their hiring process.

Despite some frustrations, I have found my atypical way of interacting with the world to be a feature, not a bug. My atypicality presents as what used to be called Asperger Syndrome, and it is easier for me to interact with the world, and easier for the world to interact with me, than many other autistic individuals. That said, whether or not my friends and colleagues notice, I still struggle with many aspects common to those diagnosed on the autism spectrum: social-emotional difficulties, alexithymia, intensity of focus, hypersensitivity, system-oriented thinking, etc.

Relationships or friendships with someone on the spectrum can be tough, even with someone who doesn’t outwardly present common characteristics, like me. An old partner once vented her frustrations that she couldn’t turn to her friends for advice, because: “everyone just said Scott is so normal and I was thinking [no], he’s just very very good at passing [as socially aware].” Like many who grow up non-neurotypical, I learned a complex set of coping strategies to help me fit in and succeed in a neurotypical world. To concentrate on work, I create an office cave to shut out the world. I use a complicated set of journals, calendars, and apps to keep me on task and ensure I pay bills on time. To stay attentive, I sit at the front of a lecture hall—it even works, sometimes. Some ADHD symptoms are managed pharmacologically.

These strategies give me the 80% push I need to be a functioning member of society, to become someone who can sustain relationships, not get kicked out of his house for forgetting rent, and can almost finish a PhD. Almost. It’s not quite enough to prevent me from a dozen incompletes on my transcripts, but I make do. A host of unrealistically patient and caring friends, family, and colleagues helps. (If you’re someone to whom I still owe work, but am too scared to reply to because of how delinquent I am, thanks for understanding! waves and runs away). Caring allies help. A lot.

My life so far has been a series of successes and confusions. Not unlike anybody else’s life, I suppose. I occupy my own corner of weirdness, which is itself unique enough, but everyone has their own corner. I doubt my writing this will help anyone understand themselves any better, but hopefully it will help fellow academics feel a bit safer in their own weirdness. And if this essay helps our neurotypical colleagues be a bit more understanding of our struggles, and better-informed as allies, all the better.

Notes:

  1. The original article, Stigma, was written for the Conditionally Accepted column of Inside Higher Ed. Jeana Jorgensen, Eric Grollman and Sarah Bray provided invaluable feedback, and I wouldn’t have written it without them. They invited me to write this second article for Inside Higher Ed as well, which was my original intent. I wound up posting it on my blog instead because their posting schedule didn’t quite align with my writing schedule. This shouldn’t be counted as a negative reflection on the process of publishing with that fine establishment.
  2. Let me be clear: I know very little about autism, beyond that I have been diagnosed with it. I’m still learning a lot. This post is about me. Knowing other people face similar struggles has been profoundly helpful, regardless of what causes those struggles.

The Turing Point

Below are some crazy, uninformed ramblings about the least-complex possible way to trick someone into thinking a computer is a human, for the purpose of history research. I’d love some genuine AI/Machine Intelligence researchers to point me to the actual discussions on the subject. These aren’t original thoughts; they spring from countless sci-fi novels and AI research from the ’70s-’90s. Humanists beware: this is super sci-fi speculative, but maybe an interesting thought experiment.


If someone’s chatting with a computer, but doesn’t realize her conversation partner isn’t human, that computer passes the Turing Test. Unrelatedly, if a robot or piece of art is just close enough to reality to be creepy, but not close enough to be convincingly real, it lies in the Uncanny Valley. I argue there is a useful concept in the simplest possible computer which is still convincingly human, and that computer will be at the Turing Point. 1

By Smurrayinchester – self-made, based on image by Masahiro Mori and Karl MacDorman at http://www.androidscience.com/theuncannyvalley/proceedings2005/uncannyvalley.html, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2041097

Forgive my twisting Turing Tests and Uncanny Valleys away from their normal use, for the sake of outlining the Turing Point concept:

  • A human simulacrum is a simulation of a human, or some aspect of a human, in some medium, which is designed to be as-close-as-possible to that which is being modeled, within the scope of that medium.
  • A Turing Test winner is any human simulacrum which humans consistently mistake for the real thing.
  • An occupant of the Uncanny Valley is any human simulacrum which humans consistently doubt as representing a “real” human.
  • Between the Uncanny Valley and Turing Test winners lies the Turing Point, occupied by the least-sophisticated human simulacrum that can still consistently pass as human in a given medium. The Turing Point is a hyperplane in a hypercube, such that there are many points of entry for the simulacrum to “phase-transition” from uncanny to convincing.

Extending the Turing Test

The classic Turing Test scenario is a text-only chatbot which must, in free conversation, be convincing enough for a human to think it is speaking with another human. A piece of software named Eugene Goostman sort-of passed this test in 2014, convincing a third of judges it was a 13-year-old Ukrainian boy.

There are many possible modes in which a computer can act convincingly human. It is easier to make a convincing simulacrum of a 13-year-old non-native English speaker who is confined to text messages than to make a convincing college professor, for example. Thus the former has a lower Turing Point than the latter.

Playing with the constraints of the medium will also affect the Turing Point threshold. The Turing Point for a flesh-covered robot is incredibly difficult to surpass, since so many little details (movement, design, voice quality, etc.) may place it into the Uncanny Valley. A piece of software posing as a Twitter user, however, would have a significantly easier time convincing fellow users it is human.

The Turing Point, then, is flexible to the medium in which the simulacrum intends to deceive, and the sort of human it simulates.

From Type to Token

Convincing the world a simulacrum is any old human is different than convincing the world it is some specific human. This is the token/type distinction; convincingly simulating a specific person (token) is much more difficult than convincingly simulating any old person (type).

Simulations of specific people are all over the place, even if they don’t intend to deceive. Several Twitter-bots exist as simulacra of Donald Trump, reading his tweets and creating new ones in a similar style. Perhaps imitating Poe’s Law, certain people’s styles, or certain types of media (e.g. Twitter), may provide such a low Turing Point that it is genuinely difficult to distinguish humans from machines.

Put differently, the way some Turing Tests may be designed, humans could easily lose.

It’ll be useful to make up and define two terms here. I imagine the concepts already exist, but couldn’t find them, so please comment if they do so I can use less stupid words:

  • A type-bot is a machine designed to represent something at the type level: for example, a bot that can be mistaken for some random human, but not for any specific human.
  • A token-bot is a machine designed to represent something at the token level: for example, a bot that can be mistaken for Donald Trump.

Replaying History

Using traces to recreate historical figures (or at least things they could have done) as token-bots is not uncommon. The most recent high-profile example of this is a project to create a new Rembrandt painting in the original style. Shawn Graham and I wrote an article on using simulations to create new plausible histories, among many other examples old and new.

This all got me thinking, if we reach the Turing Point for some social media personalities (that is, it is difficult to distinguish between their social media presence, and a simulacrum of it), what’s to say we can’t reach it for an entire social media ecosystem? Can we take a snapshot of Twitter and project it several seconds/minutes/hours/days into the future, a bit like a meteorological model?

A few questions and obvious problems:

  • Much of Twitter’s dynamics are dependent upon exogenous forces: memes from other media, real world events, etc. Thus, no projection of Twitter alone would ever look like the real thing. One can, however, potentially use such a simulation to predict how certain types of events might affect the system.
  • This is way overkill, and impossibly computationally complex at this scale. You can simulate the dynamics of Twitter without simulating every individual user, because people on average act pretty systematically. That said, for the humanities-inclined, we may gain more insight from the ground level of the system (individual agents) than from its macroscopic properties.
  • This is key. Would a set of plausibly-duplicate Twitter personalities on aggregate create a dynamic system that matches Twitter as an aggregate system? That is, just because the algorithms pass the Turing Test, because humans believe them to be humans, does that necessarily imply the algorithms have enough fidelity to accurately recreate the dynamics of a large scale social network? Or will small unnoticeable differences between the simulacrum and the original accrue atop each other, such that in aggregate they no longer act like a real social network?

The last point is, I think, a theoretically and methodologically fertile one for people working in DH, AI, and Cognitive Science: whether reducing the human-appreciable differences between machines and people is sufficient to simulate aggregate social behavior, or whether human-appreciability (i.e., the Turing Test) is a strict enough criterion for making accurate predictions about societies.

These points aside, if we ever do manage to simulate specific people (even in a very limited scope) as token-bots based on the traces they leave, it opens up interesting pedagogical and research opportunities for historians. Scott Enderle tweeted a great metaphor for this:

Imagine, as a student, being able to have a plausible discussion with Marie Curie, or sitting in an Enlightenment-era salon. 2 Or imagine, as a researcher (if individual Turing Point machines do aggregate well), being able to do well-grounded counterfactual history that works at the token level rather than at the type level.

Turing Point Simulations

Bringing this slightly back into the realm of the sane, the interesting thing here is the interplay between appreciability (a person’s ability to appreciate enough difference to notice something wrong with a simulacrum) and fidelity.

We can specifically design simulation conditions with incredibly low-threshold Turing Points, even for token-bots. That is to say, we can create a condition where the interactions are simple enough to make a bot that acts indistinguishably from the specific human it is simulating.

At the most extreme end, this is obviously pointless. If our system is one in which a person can only answer “yes” or “no” to pre-selected preference questions (“Do you like ice-cream?”), making a bot to simulate that person convincingly would be trivial.

Putting that aside (lest we get into questions of the Turing Point of a set of Turing Points), we can potentially design reasonably simplistic test scenarios that would allow for an easy-to-reach Turing Point while still being historiographically or sociologically useful. It’s sort of a minimization problem in topological optimizations. Such a goal would limit the burden of the simulation while maximizing the potential research benefit (but only if, as mentioned before, the difference between true fidelity and the ability to win a token-bot Turing Test is small enough to allow for generalization).

In short, the concept of a Turing Point can help us conceptualize and build token-simulacra that are useful for research or teaching. It helps us ask the question: what’s the least-complex-but-still-useful token-simulacra? It’s also kind-of maybe sort-of like Kolmogorov complexity for human appreciability of other humans: that is, the simplest possible representation of a human that is convincing to other humans.

I’ll end by saying, once again, I realize how insane this sounds, and how far-off. And also how much an interloper I am to this space, having never so much as designed a bot. Still, as Bill Hart-Davidson wrote,

the possibility seems more plausible than ever, even if not soon-to-come. I’m not even sure why I posted this on the Irregular, but it seemed like it’d be relevant enough to some regular readers’ interests to be worth spilling some ink.

Notes:

  1. The name itself is maybe too on-the-nose, being a pun for turning point and thus connected to the rhetoric of singularity, but ¯\_(ツ)_/¯
  2. Yes yes I know, this is SecondLife all over again, but hopefully much more useful.

Work with me! CMU is hiring a DH Developer

Carnegie Mellon University is hiring a DH Developer!

I’ve had a blast since starting as Digital Humanities Specialist at CMU. Enough administrators, faculty, and students are on board to make building a DH strength here pretty easy, and we’re neighbors to Pitt DHRX, a really supportive supercomputing center, and great allies in the Mayor’s Office keen on a city rich with art, data, and both combined.

We want a developer to help jump-start our research efforts. You’ll be working as a full collaborator on projects from all sorts of domains, and as a review board member you’ll have a strong say in which projects we take on and how they get implemented. You and I will work together on achievable rapid prototyping, analyzing data, and web deployment.

The idea is we build or do stuff that’s scholarly, interesting, and can have a proof-of-concept or article done in a semester or two. With that, the project can go on to seek additional funding and a full-time specialized programmer, or we can finish there and all be proud authors or creators of something we enjoyed making.

Ideally, you have a social science, humanities, journalism, or similar research background, and the broad tech chops to create a d3 viz, DeepDream some dogs into a work of art, manage a NoSQL database, and whatever else seems handy. Ruby on Rails, probably.

We’re looking for someone who loves playing with new tech stacks, isn’t afraid to get their hands dirty, and knows how to talk to humans. You probably have a static site and a GitHub account. You get excited by interactive data stories, and want to make them with us. This job values breadth over depth and done over perfect.

The job isn’t as insane as it sounds—you don’t actually need to be able to do all this already, just be the sort of person who can learn on the fly. A bachelor’s degree or similar experience is required, with a strong preference for candidates with some research background. You’ll need to submit or point to some examples of work you’ve done.

We’re an equal opportunity employer, and would love to see applications from women, minorities, or other groups who often have a tough time getting developer jobs. If you work here you can take two free classes a semester. Say, who wants a fancy CMU computer science graduate degree? We can offer an awesome city, friendly coworkers, and a competitive salary (also Pittsburgh’s cheap so you wouldn’t live in a closet, like in SF or NYC).

What I’m saying is you should apply ’cause we love you.


The ad, if you’re too lazy to click the link, or are scared CMU hosts viruses:

Job Description
Digital Humanities Developer, Dietrich College of Humanities and Social Sciences

Summary
The Dietrich College of Humanities and Social Sciences at Carnegie Mellon University (CMU) is undertaking a long-term initiative to foster digital humanities research among its faculty, staff, and students. As part of this initiative, CMU seeks an experienced Developer to collaborate on cutting edge interdisciplinary projects.

CMU is a world leader in technology-oriented research, and a highly supportive environment for cross-departmental teams. The Developer would work alongside researchers from Dietrich and elsewhere to plan and implement digital humanities projects, from statistical analyses of millions of legal documents to websites that crowdsource grammars of endangered languages. Located in the Office of the Dean under CMU’s Digital Humanities Specialist, the developer will help turn faculty projects into functioning prototypes that can then acquire sustaining funding to hire specialists for more focused development.

The position emphasizes rapid, iterative deployment and the ability to learn new techniques on the job, with a focus on technologies intersecting data science and web development, such as D3.js, NoSQL, Shiny (R), IPython Notebooks, APIs, and Ruby on Rails. Experience with digital humanities or computational social sciences is also beneficial, including work with machine learning, GIS, or computational linguistics.

The individual in this position will work with clients and the digital humanities specialist to determine achievable short-term prototypes in web development or data analysis/presentation, and will be responsible for implementing the technical aspects of these goals in a timely fashion. As a collaborator, the Digital Humanities Developer will play a role in project decision-making, where appropriate, and will be credited on final products to which they extensively contribute.

Please submit a cover letter, phone numbers and email addresses for two references, a résumé or CV, and a page describing how your previous work fits the job, including links to your GitHub account or other relevant previous work examples.

Qualifications

  • Bachelor’s Degree in humanities computing, digital humanities, informatics, computer science, related field, or equivalent combination of training and experience.
  • At least one year of experience in modern web development and/or data science, preferably in a research and development team setting.
  • Demonstrated knowledge of modern machine learning and web development languages and environments, such as some combination of Ruby on Rails, LAMP, relational databases or NoSQL (MongoDB, Cassandra, etc.), MV* & JavaScript (including D3.js), PHP, HTML5, Python/R, as well as familiarity with open source project development.
  • Some system administration.

Preferred Qualifications

  • Advanced degree in digital humanities, computational social science, informatics, or data science. Coursework in data visualization, machine learning, statistics, or MVC web applications.
  • Three or more years at the intersection of web development/deployment and machine learning (e.g. data journalism or digital humanities) in an agile software environment.
  • Ability to assess client needs and offer creative research or publication solutions.
  • Any combination of GIS, NLTK, statistical models, ABMs, web scraping, mahout/hadoop, network analysis, data visualization, RESTful services, testing frameworks, XML, HPC.

Job Function: Research Programming

Primary Location: United States-Pennsylvania-Pittsburgh

Time Type: Full Time

Organization: DIETRICH DEAN’S OFFICE

Minimum Education Level: Bachelor’s Degree or equivalent

Salary: Negotiable

Ghosts in the Machine

Musings on materiality and cost after a tour of The Shoah Foundation.

Forgetting The Holocaust

As the only historian in my immediate family, I’m responsible for our genealogy, saved in a massive GEDCOM file. Through the wonders of the web, I now manage quite the sprawling tree: over 100,000 people, hundreds of photos, thousands of census records & historical documents. The majority came from distant relations managing their own trees, with whom I share.

Such a massive well-kept dataset is catnip for a digital humanist. I can analyze my family! The obvious first step is basic stats, like the most common last name (Aber), average number of kids (2), average age at death (56), or most-frequently named location (New York). As an American Jew, I wasn’t shocked to see New York as the most-common place name in the list. But I was unprepared for the second-most-common named location: Auschwitz.

I’m lucky enough to write this because my great grandparents all left Europe before 1915. My grandparents don’t have tattoos on their arms or horror stories about concentration camps, though I’ve met survivors their age. I never felt so connected to The Holocaust, HaShoah, until I took the time to explore the hundreds of branches of my family tree that simply stopped growing in the 1940s.

Aerial photo of Auschwitz-Birkenau. [via wikipedia]
1 of every 16 Jews in the entire world was murdered in Auschwitz, about a million in all. Another 5 million were killed elsewhere. The global Jewish population before the Holocaust was 16.5 million, a number we’re only now approaching again, 70 years later. And yet, somehow, last month a school official and national parliamentary candidate in Canada admitted she “didn’t know what Auschwitz was”.

I grew up hearing “Never Forget” as a mantra to honor the 11 million victims of hate and murder at the hands of Nazis, and to ensure it never happens again. That a Canadian official has forgotten—that we have all forgotten many of the other genocides that haunt human history—suggests how easy it is to forget. And how much work it is to remember.

The material cost of remembering 50,000 Holocaust survivors & witnesses

Yad Vashem (“a place and a name”) represents the attempt to inscribe, preserve, and publicize the names of Jewish Holocaust victims who have no-one to remember them. Over four million names have been collected to date.

The USC Shoah Foundation, founded by Steven Spielberg in 1994 to remember Holocaust survivors and witnesses, is both smaller and larger than Yad Vashem. Smaller because the number of survivors and witnesses still alive in 1994 numbered far fewer than Yad Vashem‘s 4.3 million; larger because the foundation conducted video interviews: 100,000 hours of testimony from 50,000 individuals, plus recent additions of witnesses and survivors of other genocides around the world. Where Yad Vashem remembers those killed, the Shoah Foundation remembers those who survived.  What does it take to preserve the memories of 50,000 people?

I got a taste of the answer to that question at a workshop this week hosted by USC’s Digital Humanities Program, who were kind enough to give us a tour of the Shoah Foundation facilities. Sam Gustman, the foundation’s CTO and Associate Dean of USC’s Libraries, gave the tour.

Shoah Foundation Digitization Facility
Shoah Foundation Digitization Facility [via my camera]
Digital preservation is a complex process. In this case, it began by digitizing 235,000 analog Betacam SP videocassettes, on which the original interviews had been recorded, a process which ran from 2008 to 2012. This had to be done quickly (automatically/robotically), given that cassette tapes are prone to become sticky, brittle, and unplayable within a few decades due to hydrolysis. They digitized about 30,000 hours per year. The process eventually produced 8 petabytes of lossless JPEG 2000 videos, roughly the equivalent of 2 million DVDs. Stacked on top of each other, those DVDs would reach three times higher than Burj Khalifa, the world’s tallest tower.

From there, the team spent quite some time correcting errors that existed in the original tapes, and ones that were introduced in the process of digitization. They employed a small army of signal processing students, patented new technologies for automated error detection & processing/cleaning, and wound up cleaning video from about 12,000 tapes. According to our tour guide, cleaning is still happening.

Lest you feel safe knowing that digitization lengthens the preservation time, turns out you’re wrong. Film lasts longer than most electronic storage, but making film copies would have cost the foundation $140,000,000 and made access incredibly difficult. Digital copies would only cost tens of millions of dollars, even though hard-drives couldn’t be trusted to last more than a decade. Their solution was a RAID hard-drive system in an Oracle StorageTek SL8500 (of which they have two), and a nightly process of checking video files for even the slightest of errors. If an error is found, a backup is loaded to a new cartridge, and the old cartridge is destroyed. Their two StorageTeks each fit over 10,000 drive cartridges, have 55 petabytes worth of storage space, weigh about 4,000 lbs, and are about the size of a New York City apartment. If a drive isn’t backed up and replaced within three years, they throw it out and replace it anyway, just in case. And this setup apparently saved the Shoah Foundation $6 million.

StorageTek SL8500 [via CERN]
Oh, and they have another facility a few states away, connected directly via high-bandwidth fiber optic cables, where everything just described is duplicated in case California falls into the ocean.

Not bad for something that costs libraries $15,000 per year, which is about what a library would pay for one damn chemistry journal.

So how much does it cost to remember 50,000 Holocaust witnesses and survivors for, say, 20 years? I mean, above and beyond the cost of building a cutting edge facility, developing new technologies of preservation, cooling and housing a freight container worth of hard drives, laying fiber optic cables below ground across several states, etc.? I don’t know. But I do know how much the Shoah Foundation would charge you to save 8 petabytes worth of videos for 20 years, if you were a USC Professor. They’d charge you $1,000/TB/20 years.

The Foundation’s videos take up 8,000 terabytes, which at $1,000 each would cost you $8 million per 20 years, or $400,000 a year. Combine that with all the physical space it takes up, and never forgetting the Holocaust is sounding rather prohibitive. And what about after 20 years, when modern operating systems forget how to read JPEG 2000 or interface with StorageTek T10000C Tape Drives, and the Shoah Foundation needs to undertake another massive data conversion? I can see why that Canadian official didn’t manage it.

The Reconcentration of Holocaust Survivors

While I appreciated the guided tour of the exhibit, and am thankful for the massive amounts of money, time, and effort scholars and donors are putting into remembering Holocaust survivors, I couldn’t help but be creeped out by the experience.

Our tour began by entering a high security facility. We signed our names on little pieces of paper and were herded through several layers of locked doors and small rooms. Not quite the way one expects to enter the project tasked with remembering and respecting the victims of genocide.

The Nazis’ assembly-line techniques for mass extermination led to starkly regular camps, like Auschwitz pictured above, laid out in efficient grids for the purpose of efficient control and killings. “Concentration camp”, by the way, refers to the concentration of people into small spaces, coming from “reconcentration camps” in Cuba. Now we’re concentrating 50,000 testimonies into a couple of closets with production-line efficiency, reconcentrating the stories of people who dispersed across the world, so they’re all in one easy-to-access place.

Server farm [via wikipedia]
We’ve squeezed 100,000 hours of testimony into a server farm that consists of a series of boxes embedded in a series of larger boxes, all aligned to a grid; input, output, and eventual destruction of inferior entities handled by robots. Audits occur nightly.

The Shoah Foundation materials were collected, developed, and preserved with the utmost respect. The goal is just, the cause respectable, and the efforts incredibly important. And by reconcentrating survivors’ stories, they can now be accessed by the world. I don’t blame the Foundation for the parallels which are as much a construct of my mind as they are of the society in which this technology developed. Still, on Halloween, it’s hard to avoid reflecting on the material, monetary, and ultimately dehumanizing costs of processing ghosts into the machine.