The Route of a Text Message

This is the third post in my full-stack dev (f-s d) series on the secret life of data. This installment is about a single text message: how it was typed, stored, sent, received, and displayed. I sprinkle in some history and context to break up the alphabet soup of protocols, and though the piece gets technical, it should all be easy to follow.

The first two installments of this series are Cetus, about the propagation of errors in a 17th century spreadsheet, and Down the Rabbit Hole, about the insane lengths one occasionally must go to in order to track down the source of a dataset.

A Love Story

My leg involuntarily twitches with vibration—was it my phone, or just a phantom feeling?—and a quick inspection reveals a blinking blue notification. “I love you”, my wife texted me. I walk downstairs to wish her goodnight, because I know the difference between the message and the message, you know?

It’s a bit like encryption, or maybe steganography: anyone can see the text, but only I can decode the hidden data.

My translation, if we’re being honest, is just one extra link in a remarkably long chain of data events, all to send a message (“come downstairs and say goodnight”) in under five seconds across about 40 feet.

The message presumably began somewhere in my wife’s brain and somehow ended up in her thumbs, but that’s a signal for a different story. Ours begins as her thumb taps a translucent screen, one letter at a time, and ends as light strikes my retinas.

Through the Looking Glass

With each tap, a small electrical current passes from the screen to her hand. Because electricity flows easily through human bodies, sensors on the phone register a change in voltage wherever her thumb presses against the screen. But the world is messy, and the phone senses random fluctuations in voltage across the rest of the screen, too, so an algorithm determines the biggest, thumbiest-looking voltage fluctuations and assumes that’s where she intended to press.


Figure 0. Capacitive touch.
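If you’re curious what “biggest, thumbiest-looking” means in practice, here is a toy sketch of the idea in Python—nothing like the actual touch-controller firmware, which runs far fancier filtering—comparing each sensor cell against its resting baseline, discarding small noise, and picking the strongest remaining disturbance:

import numpy as np

def locate_touch(baseline, reading, noise_floor=5.0):
    # Change at every sensor cell relative to the untouched screen.
    delta = np.abs(reading - baseline)
    # Throw away the random fluctuations across the rest of the screen.
    delta[delta < noise_floor] = 0
    if not delta.any():
        return None                        # nothing thumb-like pressed the screen
    # The biggest remaining disturbance is where we assume she pressed.
    row, col = np.unravel_index(np.argmax(delta), delta.shape)
    return int(row), int(col)

baseline = np.full((16, 9), 100.0)         # a 16x9 grid of sensors at rest
reading = baseline + np.random.normal(0, 1, (16, 9))
reading[11, 4] += 40                       # a thumb lands near the bottom middle
print(locate_touch(baseline, reading))     # -> (11, 4)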

So she starts tap-tap-tapping on the keyboard, one letter at a time.

I-spacebar-l-o-v-e-spacebar-y-o-u.

She’s not a keyboard swiper (I am, but somehow she still types faster than me). The phone reliably records the (x,y) coordinates of each thumbprint and aligns them with the coordinates of each key on the screen. It’s harder than you think; sometimes her thumb slips, yet somehow the phone realizes she’s not trying to swipe, that it was just a messy press.

Deep in the metal guts of the device, an algorithm tests whether each thumb-shaped voltage disruption moves more than a certain number of pixels, called touch slop. If the movement is sufficiently small, the phone registers a keypress rather than a swipe.

Fig 1. Android’s code for detecting ‘touch slop’. Notice the developers had my wife’s gender in mind.
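Android’s real check lives deep in its input pipeline (that’s the code in the figure above); stripped of all its machinery, the idea is roughly this, with the slop threshold here made up for illustration—the real value is scaled to the screen’s pixel density:

import math

TOUCH_SLOP_PX = 24   # illustrative; Android derives this from screen density

def classify_gesture(down, current):
    # Tap if the touch wandered less than the slop threshold, otherwise a swipe.
    dx = current[0] - down[0]
    dy = current[1] - down[1]
    return "tap" if math.hypot(dx, dy) < TOUCH_SLOP_PX else "swipe"

print(classify_gesture((210, 580), (214, 583)))   # a messy press     -> tap
print(classify_gesture((210, 580), (120, 310)))   # a deliberate drag -> swipe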

She finishes her message, a measly 10 characters of her allotted 160.

The allotment of 160 characters is a carefully chosen number, if you believe the legend: In 1984, German telephone engineer Friedhelm Hillebrand sat at his typewriter and wrote as many random sentences as came to his mind. His team then looked at postcards and telex messages, and noticed most fell below 160 characters. “Eureka!”, they presumably yelled in German, before setting the character limit of text messages in stone for the next three-plus decades.

Character Limits & Legends

Legends rarely tell the whole story, and the legend of SMS is no exception. Hillebrand and his team hoped to relay messages over a secondary channel that phones were already using to exchange basic information with their local stations.

Signalling System no. 7 (SS7) is a set of protocols cell phones use to stay in constant contact with their local tower; they need this continuous connection to know when to ring, to get basic location tracking, to check for voicemail, and to exchange other non-internet-reliant messages. Since the protocol’s creation in 1980, it has had a hard limit of 279 bytes of information per message. If Hillebrand wanted text messages to piggyback on the SS7 protocol, he had to deal with this pretty severe limit.

Normally, 279 bytes equals 279 characters. A byte is eight bits (each bit is a 0 or 1), and in common encodings, a single letter is equivalent to eight 0s and 1s in a row.

‘A’ is

0100 0001

‘B’ is

0100 0010

‘C’ is

0100 0011

and so on.

Unfortunately, getting messages across the SS7 protocol isn’t a simple matter of sending 2,232 (that’s 279 bytes at 8 bits each) 0s or 1s through radio signals from my phone to yours. Part of that 279-byte signal needs to contain your phone number, and part of it needs to contain my phone number. Part of it needs to let the cell tower know “hey, this is a message, not a call, don’t ring the phone!”.

By the time Hillebrand and his team finished cramming all the necessary contextual bits into the 279-byte signal, they were left with only enough space for 140 characters at 1 byte (8 bits) a piece, or 1,120 bits.

But what if they could encode a character in only 7 bits? At 7 bits per character, they could squeeze 160 (1,120 / 7 = 160) characters into each SMS, but those extra twenty characters demanded a sacrifice: fewer possible letters.

An 8-bit encoding allows 256 possible characters: lowercase ‘a’ takes up one possible space, uppercase ‘A’ another space, a period takes up a third space, an ‘@’ symbol takes up a fourth space, a line break takes up a fifth space, and so on up to 256. To squeeze an alphabet down to 7 bits, you need to remove some possible characters: the 1/2 symbol (½), the degree symbol (°), the pi symbol (π), and so on. But assuming people will never use those symbols in text messages (a poor assumption, to be sure), this allowed Hillebrand and his colleagues to stuff 160 characters into a 140-byte space, which in turn fit neatly into a 279-byte SS7 signal: exactly the number of characters they claim to have discovered was the perfect length of a message. (A bit like the miracle of Hanukkah, if you ask me.)

Fig 2. The GSM-7 character set.

So there my wife is, typing “I love you” into a text message, all the while the phone converts those letters into this 7-bit encoding scheme, called GSM-7.

“I” (notice it’s at the intersection of 4x and x9 above) =

49 

Spacebar (notice it’s at the intersection of 2x and x0 above) =

20 

“l” =

6C

“o” =

6F

and so on down the line.

In all, her slim message becomes:

49 20 6C 6F 76 65 20 79 6F 75 

(10 bytes combined). Each two-character code is one byte (8 bits) written in hexadecimal, and together they spell “I love you”.

But this is not quite how the message is stored on her phone. The phone packs the 7-bit codes end to end, essentially borrowing the unused bit at the top of every byte to hold part of the next character. The math is a bit more complicated than is worth getting into here, but the resulting message appears as

49 10 FB 6D 2F 83 F2 EF 3A 

(9 bytes in all) in her phone.
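If you’d like to watch the borrowing happen, here’s a small sketch of the standard GSM 7-bit packing rule (laying the septets end to end, least-significant bit first); real phones add headers and padding on top of this, but the arithmetic is the same:

def pack_gsm7(septets):
    # Lay the 7-bit codes end to end in one long bitstream, LSB first.
    bitstream = 0
    for i, septet in enumerate(septets):
        bitstream |= (septet & 0x7F) << (7 * i)
    # Slice the stream back into 8-bit bytes, rounding up to a whole byte.
    nbytes = (7 * len(septets) + 7) // 8
    return bytes((bitstream >> (8 * i)) & 0xFF for i in range(nbytes))

# "I love you" as GSM-7 code points (the same values as ASCII for these characters)
codes = [0x49, 0x20, 0x6C, 0x6F, 0x76, 0x65, 0x20, 0x79, 0x6F, 0x75]
print(pack_gsm7(codes).hex(" ").upper())   # 49 10 FB 6D 2F 83 F2 EF 3A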

When my wife finally finishes her message (it takes only a few seconds), she presses ‘send’ and a host of tiny angels retrieve the encoded message, flutter their invisible wings the 40 feet up to the office, and place it gently into my phone. The process isn’t entirely frictionless, which is why my phone vibrates lightly upon delivery.

The so-called “telecommunication engineers” will tell you a different story, and for the sake of completeness I’ll relay it to you, but I wouldn’t trust them if I were you.

SIM-to-Send

The engineers would say that, when the phone senses voltage fluctuations over the ‘send’ button, it sends the encoded message to the SIM card (that tiny card your cell provider puts in your phone so it knows what your phone number is), wrapping it along the way in all sorts of useful contextual data. By the time it reaches my wife’s SIM, it has grown from a 140-byte message (just the text) to a 176-byte message (text + context).

The extra 36 bytes are used to encode all sorts of information, seen below.

Fig 3. Here, bytes are called octets (8 bits). Counting all possible bytes yields 174 (10+1+1+12+1+1+7+1+140). The other two bytes are reserved for some SIM card bookkeeping.

The first ten bytes are reserved for the telephone number (service center address, or SCA) of the SMS service center (SMSC), tasked with receiving, storing, forwarding, and delivering text messages. It’s essentially a switchboard: my wife’s phone sends out a signal to the local cell tower and gives it the number of the SMSC, which forwards her text message from the tower to the SMSC. The SMSC, which in our case is operated by AT&T, routes the text to the mobile station nearest to my phone. Because I’m sitting three rooms away from my wife, the text just bounces back to the same mobile station, and then to my phone.

Fig 4. SMS cellular network

The next byte (PDU-type) encodes some basic housekeeping on how the phone should interpret the message, including whether it was sent successfully, whether the carrier requests a status report, and (importantly) whether this is a single text or part of a string of connected messages.

The byte after the PDU-Type is the message reference (MR). It’s a number between 0 and 255, and is essentially used as a short-term ID number to let the phone and the carrier know which text message it’s dealing with. In my wife’s case the number is set to 0, because her phone has its own message ID system independent of this particular file.

The next twelve bytes or so are reserved for the recipient’s phone number, called the destination address (DA). Setting aside the 7-bit character encoding I mentioned earlier, the one that helps us stuff 160 letters into a 140-byte space, the phone number encoding is the stupidest, most confusing set of bits you’ll encounter in this SMS. It’s called reverse nibble notation, and it swaps the two digits in every pair of a large number. (Get it? Part of a byte is a nibble, hahahahaha, nobody’s laughing, engineers.)

My number, which is usually 1-352-537-8376, is logged in my wife’s phone as:

3125358773f6

The 1-3 is represented by

31

The 52 is represented by

25

The 53 is represented by

35

The 7-8 is represented by

87

The 37 is represented by

73

And the 6 is represented by…

f6

Where the fuck did the ‘f’ come from? It’s filler, marking the end of the phone number: my number has an odd count of digits, so a spare ‘f’ nibble pads out the final byte, but for some awful reason (again, reverse nibble notation) it lands one character before the final digit.

It’s like pig latin for numbers.

tIs'l ki eip galit nof runbmre.s

But I’m not bitter.

[Edit: Sean Gies points out that reverse nibble notation is an inevitable artifact of representing 4-bit little-endian numbers in 8-bit chunks. That doesn’t invalidate the above description, but it does add some context for those who know what it means, and makes the decision seem more sensible.]
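For the morbidly curious, the whole encoding fits in a few lines; this is a sketch of the general rule (pad an odd-length number with an ‘f’, then swap the digits in every pair), not a peek at any phone’s actual firmware:

def reverse_nibble(number):
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) % 2:
        digits += "f"                                  # filler nibble for odd-length numbers
    pairs = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return "".join(pair[::-1] for pair in pairs)       # swap the digits in each pair

print(reverse_nibble("1-352-537-8376"))   # -> 3125358773f6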

The Protocol Identifier (PID) byte is honestly, at this point, mostly wasted space. It takes about 40 possible values, and it tells the service provider how to route the message. A value of

22 

means my wife is sending “I love you” to a fax machine; a value of

24 

means she’s sending it to a voice line, somehow. Since she’s sending it as an SMS to my phone, which receives texts, the PID is set to

0

(Like every other text sent in the modern world.)

Fig 5. Possible PID Values

The next byte is the Data Coding Scheme (DCS, see this doc for details), which tells the carrier and the receiving phone which character encoding scheme was used. My wife used GSM-7, the 7-bit alphabet I mentioned above that allows her to stuff 160 letters into a 140-character space, but you can easily imagine someone wanting to text in Chinese, or someone texting a complex math equation (ok, maybe you can’t easily imagine that, but a guy can dream, right?).

In my wife’s text, the DCS byte was set to

0

meaning she used a 7-bit alphabet, but she could have set that value to an 8- or 16-bit alphabet, which would allow her many more possible characters, but much less space to fit them. Incidentally, this is why when you text emoji to your friend, you have fewer characters to work with.
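The shrinking character budget is just division; assuming the usual 140-byte user-data field, each alphabet buys you:

USER_DATA_BYTES = 140                # what's left for the message after the headers

def max_chars(bits_per_char):
    return USER_DATA_BYTES * 8 // bits_per_char

print(max_chars(7))    # GSM-7 alphabet          -> 160 characters
print(max_chars(8))    # 8-bit data              -> 140 characters
print(max_chars(16))   # 16-bit alphabet (emoji) -> 70 characters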

There’s also a little flag in the DCS byte that tells the phone whether to self-destruct the message after sending it, Mission Impossible style, so that’s neat.

The validity period (VP) space can take up to seven bytes, and sends us into another aspect of how text messages actually work. Take another look at Figure 4, above. It’s okay, I’ll wait.

When my wife finally hits ‘send’, the text gets sent to the SMS Service Center (SMSC), which then routes the message to me. I’m upstairs and my phone is on, so I receive the text in a handful of seconds, but what if my phone were off? Surely my phone can’t accept a message when it’s not receiving any power, so the SMSC has to do something with the text.

If the SMSC can’t find my phone, my wife’s message will just bounce around in its system until the moment my phone reconnects, at which point it sends the text out immediately. I like to think of the SMSC continuously checking every online phone to see if it’s mine like a puppy waiting for its human by the door: is that smell my human? No. Is that smell my human? No. Is this smell my human? YESYESJUMPNOW.

The validity period (VP) bytes tell the carrier how long the puppy will wait before it gets bored and finds a new home. It’s either a timestamp or a duration, and it basically says “if you don’t see the recipient phone pop online in the next however-many days, just don’t bother sending it.” The default validity period for a text is 10,080 minutes, which means if it takes me more than seven days to turn my phone back on, I’ll never receive her text.
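A toy sketch of that puppy-at-the-door logic—nothing like the software an actual SMSC runs, but the store-and-forward idea in miniature:

import time

class ToySMSC:
    def __init__(self):
        self.queue = []   # (destination, message, expires_at)

    def submit(self, dest, message, validity_minutes=10080):    # default: seven days
        self.queue.append((dest, message, time.time() + validity_minutes * 60))

    def deliver_pending(self, online_phones):
        # online_phones maps a phone number to that phone's inbox (a list).
        still_waiting = []
        for dest, message, expires_at in self.queue:
            if time.time() > expires_at:
                continue                                # the puppy got bored; drop the message
            if dest in online_phones:
                online_phones[dest].append(message)     # YESYESJUMPNOW
            else:
                still_waiting.append((dest, message, expires_at))
        self.queue = still_waiting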

Because there’s often a lot of empty space in an SMS, a few bits here or there are dedicated to letting the phone and carrier know exactly which bytes are unused. If my wife’s SIM card expects a 176-byte SMS, but because she wrote an exceptionally short message it only receives a 45-byte SMS, it may get confused and assume something broke along the way. The user data length (UDL) byte solves this problem: it relays exactly how many bytes the text in the text message actually takes up.

In the case of “I love you”, the UDL claims the subsequent message is 9 bytes. You’d expect it to be 10 bytes, one for each of the 10 characters in

I-spacebar-l-o-v-e-spacebar-y-o-u

but because each character is 7 bits rather than 8 bits (a full byte), we’re able to shave an extra byte off in the translation. That’s because 7 bits * 10 characters = 70 bits, divided by 8 (the number of bits in a byte) = 8.75 bytes, rounded up to 9 bytes.

Which brings us to the end of every SMS: the message itself, or the UD (User Data). The message can take up to 140 bytes, though as I just mentioned, “I love you” will pack into a measly 9. Amazing how much is packed into those 9 bytes—not just the message (my wife’s presumed love for me, which is already difficult enough to compress into 0s and 1s), but also the message (I need to come downstairs and wish her goodnight). Those bytes are:

49 10 FB 6D 2F 83 F2 EF 3A.

In all, then, this is the text message stored on my wife’s SIM card:

SCA[1-10]-PDU[1]-MR[1]-DA[1-12]-PID[1]-DCS[1]-VP[0, 1, or 7]-UDL[1]-UD[0-140]

00 - 11 - 00 - 07 31 25 35 87 73 F6 - ?? 00 ?? - ?? - 09 - 49 10 FB 6D 2F 83 F2 EF 3A

(Note: to get the full message, I need to do some more digging. Alas, you only see most of the message here, hence the ??s.)

Waves in the Æther

Somehow [he says in David Attenborough’s voice], the SMS must now begin its arduous journey from the SIM card to the nearest base station.  To do that, my wife’s phone must convert a string of 176 bytes to the 279 bytes readable by the SS7 protocol, convert those digital bytes to an analog radio signal, and then send those signals out into the æther at a frequency of somewhere between 800 and 2000 megahertz. That means each wave is between 6 and 14 inches from one peak to the next.

Fig 6. Wavelength

In order to efficiently send and receive signals, antennas should be no smaller than half the size of the radio waves they’re dealing with. If cell waves are 6 to 14 inches, their antennas need to be 3-7 inches. Now stop and think about the average height of a mobile phone, and why they never seem to get much smaller.
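The back-of-the-envelope numbers, for anyone who wants to check the arithmetic:

C = 299_792_458                      # speed of light, metres per second

def wavelength_inches(freq_mhz):
    metres = C / (freq_mhz * 1_000_000)
    return metres * 39.37            # metres to inches

for mhz in (800, 2000):
    wave = wavelength_inches(mhz)
    print(f"{mhz} MHz: wave ~{wave:.1f} in, antenna ~{wave / 2:.1f} in")
# 800 MHz: wave ~14.8 in, antenna ~7.4 in
# 2000 MHz: wave ~5.9 in, antenna ~3.0 in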

Through some digital gymnastics that would take entirely too long to explain, suddenly my wife’s phone shoots a 279-byte information packet containing “I love you” at the speed of light in every direction, eventually fizzling into nothing after about 30 miles.

Well before getting that far, her signal strikes the AT&T HSPA Base Station ID199694204 LAC21767. This base transceiver station (BTS) is about 5 blocks from my favorite bakery in Hazelwood, La Gourmandine, and though I was able to find its general location using an Android app called OpenSignal, the antenna is camouflaged beyond my ability to find it.

The really fascinating bit here is that it reaches the base transceiver station at all, given everything else going on. Not only is my wife texting me “I love you” in the 1,000ish MHz band of the electromagnetic spectrum; tens of thousands of other people are likely talking on the phone or texting within the 30-mile radius around my house, beyond which cell signals disintegrate. On top of that, a slew of radio and TV signals are jostling for attention in our immediate airspace, alongside visible light bouncing this way and that, to name a few of the many electromagnetic waves that seem like they ought to be getting in the way.

As Richard Feynman eloquently put it in 1983, it’s a bit like the cell tower is a little blind bug resting gently atop the water on one end of a pool, and based only on the frequency and direction of waves that cause it to bounce up and down, it’s able to reconstruct who’s swimming and where.

Feynman discussing waves.

In part due to the complexity of competing signals, each base transceiver station generally can’t handle more than 200 active users (using voice or data) at a time. So “I love you” pings my local base transceiver station, about half a mile away, and then shouts itself into the void in every direction until it fades into the noise of everyone else.

Switching

I’m pretty lucky, all things considered. Were my wife and I on different cell providers, or were we in different cities, the route of her message to me would be a good deal more circuitous.

My wife’s message is massaged into the 279-byte SS7 channel, and sent along to the local base transceiver station (BTS) near the bakery. From there, it gets routed to the base station controller (BSC), which is the brain of not just our antenna, but several other local antennas besides. The BSC flings the text to AT&T Pittsburgh’s mobile switching center (MSC), which relies on the text message’s SCA (remember the service center address embedded within every SMS? That’s where this comes in) to get it to the appropriate short message service center (SMSC).

This alphabet soup is easier to understand with the diagram from figure 7; I just described steps 1 through 3. If my wife were on a different carrier, we’d continue through steps 4-7, because that’s where the mobile carriers all talk to each other. The SMS has to go from the SMSC to a global switchboard and then potentially bounce around the world before finding its way to my phone.

Fig 7. SMS routed through a GSM network.

But she’s on AT&T and I’m on AT&T, and our phones are connected to the same tower, so after step 3 the 279-byte packet of love just does an about-face and returns through the same mobile service center, through the same base station, and now to my phone instead of hers. A trip of a few dozen miles in the blink of an eye.

Sent-to-SIM

Buzzzzz. My pocket vibrates. A notification lets me know an SMS has arrived through my nano-SIM card, a circuit board about the size of my pinky nail. Like Bilbo Baggins or any good adventurer, it changed a bit in its trip there and back again.

Fig 8. Received message, as opposed to sent message (figure 3).

Figure 8 shows the structure of the message “I love you” now stored on my phone. Comparing figures 3 and 8, we see a few differences. The SCA (phone number of the short message service center), the PDU (some mechanical housekeeping), the PID (phone-to-phone rather than, say, phone-to-fax), the DCS (character encoding scheme), the UDL (length of message), and the UD (the message itself) are all mostly the same, but the VP (the text’s expiration date), the MR (the text’s ID number), and the DA (my phone number) are missing.

Instead, on my phone, there are two new pieces of information: the OA (originating address, or my wife’s phone number), and the SCTS (service center time stamp, or when my wife sent the message).

My wife’s phone number is stored in the same annoying reverse nibble notation (like dyslexia but for computers) that my phone number was stored in on her phone, and the timestamp is stored in the same format as the expiration date was stored in on her phone.

These two information inversions make perfect contextual sense. Her phone needed to reach me by a certain time at a certain address, and I now need to know who sent the message and when. Without the home address, so to speak, I wouldn’t know whether the “I love you” came from my wife or my mother, and the difference would change my interpretation of the message fairly significantly.

Through a Glass Brightly

In much the same way that any computer translates a stream of bytes into a series of (x,y) coordinates with specific color assignments, my phone’s screen gets the signal to render

49 10 FB 6D 2F 83 F2 EF 3A

on the screen in front of me as “I love you” in backlit black-and-white. It’s an interesting process, but as it’s not particularly unique to smartphones, you’ll have to look it up elsewhere. Let’s instead focus on how those instructions become points of light.

The friendly marketers at Samsung call my screen a Super AMOLED (Active Matrix Organic Light-Emitting Diode) display, which is somehow both redundant and not particularly informative, so we’ll ignore unpacking the acronym as yet another distraction, and dive right into the technology.

There are about 330,000 tiny sources of light, or pixels, crammed inside each of my phone screen’s 13 square inches. For that many pixels, each needs to be about 45µm (micrometers) wide: thinner than a human hair. There’s 4 million of ‘em in all packed into the palm of my hand.
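Roughly the arithmetic, taking the figures above at face value:

import math

per_sq_in = 330_000                        # pixels per square inch
screen_sq_in = 13                          # approximate screen area
total_pixels = per_sq_in * screen_sq_in    # ~4.3 million pixels
ppi = math.sqrt(per_sq_in)                 # pixels along one linear inch
pitch_um = 25_400 / ppi                    # micrometres per pixel (25,400 µm per inch)
print(f"{total_pixels:,} pixels, ~{pitch_um:.0f} µm each, x3 subpixels = {total_pixels * 3:,}")
# 4,290,000 pixels, ~44 µm each, x3 subpixels = 12,870,000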

But you already know how screens work. You know that every point of light, like the Christian God or Musketeers (minus d’Artagnan), is always a three-for-one sort of deal. Red, green, and blue combine to form white light in a single pixel. Fiddle with the luminosity of each channel, and you get every color in the rainbow. And since 4 x 3 = 12, that’s 12 million tiny sources of light sitting innocently dormant behind my black mirror, waiting for me to press the power button to read my wife’s text.

Fig 9. The subpixel array of a Samsung OLED display.

Each pixel, as the acronym suggests, is an organic light-emitting diode. That’s fancy talk for an electricity sandwich:

Fig 10. An electricity sandwich.

The layers aren’t too important, beyond the fact that it’s a cathode plate (negatively charged), below a layer of organic molecules (remember back to high school: it’s just some atoms strung together with carbon), below an anode plate (positively charged).

When the phone wants the screen on, it sends electrons from the cathode plate to the anode plate. The sandwiched molecules intercept the energy, and in response they start emitting visible light, photons, up through the transparent anode, up through the screen, and into my waiting eyes.

Since each pixel is three points of light (red, green, and blue), there’s actually three of these sandwiches per pixel. They’re all essentially the same, except the organic molecule is switched out: poly(p-phenylene) for blue light, polythiophene for red light, and poly(p-phenylene vinylene) for green light. Because each is slightly different, they shine different colors when electrified.

(Fun side fact: blue subpixels burn out much faster, due to a process called “exciton-polaron annihilation”, which sounds really exciting, doesn’t it?)

All 4 million pixels are laid out on an indexed matrix. An index works in a computer much the same way it works in a book: when my phone wants a specific pixel to light a certain color, it looks that pixel up in the index, and then sends a signal to the address it finds. Let there be light, and there was light.
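In code, the index is nothing more exotic than a two-dimensional array; here’s a cartoon of the bookkeeping (real display controllers scan rows and columns electrically rather than poking memory like this):

import numpy as np

# One (red, green, blue) triple per pixel, all dark to start.
framebuffer = np.zeros((1440, 2560, 3), dtype=np.uint8)

def set_pixel(row, col, rgb):
    # Look the pixel up by its index and set its three subpixels.
    framebuffer[row, col] = rgb

set_pixel(700, 1200, (255, 255, 255))      # let there be (white) light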

(Fun side fact: now you know what “Active Matrix Organic Light-Emitting Diode” means, and you didn’t even try.)

My phone’s operating system interprets my wife’s text message, figures out the shape of each letter, and maps those shapes to the indexed matrix. It sends just the right electric pulses through the Super AMOLED screen to render those three little words that have launched ships and vanquished curses.

The great strangeness here is that my eyes never see “I love you” in bright OLED lights; it appears on the screen black-on-white. The phone creates the illusion of text through negative space, washing the screen white by setting every red, green, & blue to maximum brightness, then turning off the bits where letters should be. Its complexity is offensively mundane.

Fig 11. Negative space.

In displaying everything but my wife’s text message, and letting me read it in the gaps, my phone succinctly betrays the lie at the heart of the information age: that communication is simple. Speed and ease hide a mountain of mediation.

And that mediation isn’t just technical. My wife’s text wouldn’t have reached me had I not paid the phone bill on time, had there not been a small army of workers handling financial systems behind the scenes. Technicians keep the phone towers in working order, which they reach via a network of roads partially subsidized by federal taxes collected from hundreds of millions of Americans across 50 states. Because so many transactions still occur via mail, if the U.S. postal system collapsed tomorrow, my phone service would falter. Exploited factory workers in South America and Asia assembled parts in both our phones, and exhausted programmers renting expensive Silicon Valley closets are as-you-read-this pushing out code ensuring our phones communicate without interruption.

All of this underneath a 10-character text. A text which, let’s be honest, means much more than it says. My brain subconsciously peels back years of interactions with my wife to decode the message appearing on my phone, but between her and me there’s still a thicket of sociotechnical mediation, a stew of people and history and parts, that can never be untangled.

The Aftermath

So here I am, in the office late one Sunday night. “I love you,” my wife texted from the bedroom downstairs, before the message traversed 40 or so feet to my phone in a handful of seconds. I realize what it means: it’s time to wish her goodnight, and perhaps wrap up this essay. I tap away the last few words, now slightly more cognizant of the complex layering of miles, signals, years of history, and human sweat it took to keep my wife from having to shout upstairs that it’s about damn time I get some rest.

Thanks to Christopher Warren, Vika Zafrin, and Nechama Weingart for comments on earlier drafts.

Encouraging Misfits

tl;dr Academics’ individual policing of disciplinary boundaries at the expense of intellectual merit does a disservice to our global research community, which is already structured to reinforce disciplinarity at every stage. We should work harder to encourage research misfits to offset this structural pull.


The academic game is stacked to reinforce old community practices. PhDs aren’t only about specialization, but about teaching you to think, act, write, and cite like the discipline you’ll soon join. Tenure is about proving to your peers you are like them. Publishing and winning grants are as much about goodness of fit as about quality of work.

This isn’t bad. One of science’s most important features is that it’s often cumulative or at least agglomerative, that scientists don’t start from scratch with every new project, but build on each other’s work to construct an edifice that often resembles progress. The scientific pipeline uses PhDs, tenure, journals, and grants as built-in funnels, ensuring everyone is squeezed snugly inside the pipes at every stage of their career. It’s a clever institutional trick to keep science cumulative.

But the funnels work too well. Or at least, there’s no equally entrenched clever institutional mechanism for building new pipes, for allowing the development of new academic communities that break the mold. Publishing in established journals that enforce their community boundaries is necessary for your career; most of the world’s scholarly grant programs are earmarked for and evaluated by specific academic communities. It’s easy to be disciplinary, and hard to be a misfit.

To be sure, this is a known problem. Patches abound. Universities set aside funds for “interdisciplinary research” or “underfunded areas”; postdoc positions, centers, and antidisciplinary journals exist to encourage exactly the sort of weird research I’m claiming has little place in today’s university. These solutions are insufficient.

University or even external grant programs fostering “interdisciplinarity” for its own sake become mostly useless because of the laws of Goodhart & Campbell. They’re usually designed to bring disciplines together rather than to sidestep disciplinarity altogether, which, while admirable, is pretty easy to game and often leads to awkward alliances of convenience.

Dramatic rendition of types of -disciplinarity from Lotrecchiano in 2010, shown here never actually getting outside disciplines.

Universities do a bit better in encouraging certain types of centers that, rather than being “interdisciplinary”, are focused on a specific goal, method, or topic that doesn’t align easily with the local department structure. A new pipe, to extend my earlier bad metaphor. The problems arise here because centers often lack the institutional benefits available to departments: they rely on soft money, don’t get kickback from grant overheads, don’t get money from cross-listed courses, and don’t get tenure lines. Antidisciplinary postdoc positions suffer a similar fate, allowing misfits to thrive for a year or so before having to go back on the job market to rinse & repeat.

In short, the overwhelming inertial force of academic institutions pulls towards disciplinarity despite frequent but half-assed or poorly-supported attempts to remedy the situation. Even when new disciplinary configurations break free of institutional inertia, presenting themselves as means to knowledge every bit as legitimate as traditional departments (chemistry, history, sociology, etc.), it can take decades for them to even be given the chance to fail.

It is perhaps unsurprising that the community which taught us about autopoiesis proved incapable of sustaining itself, though half a century on its influences are glaringly apparent and far-reaching across today’s research universities. I wonder if we reconfigured the organization of colleges and departments from scratch today, whether there would be more departments of environmental studies and fewer departments of [redacted] 1.

I bring this all up to raise awareness of the difficulty facing good work with no discernible home, and to advocate for some individual action which, though it won’t change the system overnight, will hopefully make the world a bit easier for those who deserve it.

It is this: relax the reflexive disciplinary boundary drawing, and foster programs or communities which celebrate misfits. I wrote a bit about this last year in the context of history and culturomics; historians clamored to show that culturomics was bad history, but culturomics never attempted to be good history—it attempted to be good culturomics. Though I’d argue it often failed at that as well, it should have been evaluated by its own criteria, not the criteria of some related but different discipline.

Some potential ways to move forward:

  • If you are reviewing for a journal or grant and the piece is great, but doesn’t quite fit, and you can’t think of a better home for it, push against the editor to let it in anyway.
  • If you’re a journal editor or grant program officer, be more flexible with submissions which don’t fit your mold but don’t have easy homes elsewhere.
  • If you control funds for research grants, earmark half your money for good work that lacks a home. Not “good work that lacks a home but still looks like the humanities”, or “good work that looks like economics but happens to involve a computer scientist and a biologist”, but truly homeless work. I realize this won’t happen, but if I’m advocating, I might as well advocate big!
  • If you are training graduate students, hiring faculty, or evaluating tenure cases, relax the boundary-drawing urge to say “her work is fascinating, but it’s not exactly our department.”
  • If you have administrative and financial power at a university, commit to supporting nondisciplinary centers and agendas with the creation of tenure lines, the allocation of course & indirect funds, and some of the security offered to departments.

Ultimately, we need clever systems to foster nondisciplinary thinking which are as robust as those systems that foster cumulative research. This problem is above my paygrade. In the meantime, though, we can at least avoid the urge to equate disciplinary fitness with intellectual quality.

Notes:

  1. You didn’t seriously expect me to name names, did you?

Experience

Last week, I publicly outed myself as a non-tenure-track academic diagnosed on the autism spectrum, 1 hoping that doing so might help other struggling academics find solace knowing they are not alone. I was unprepared for the outpouring of private and public support. Friends, colleagues, and strangers thanked me for helping them feel a little less alone, which in turn helped me feel much less alone. Thank you all, deeply and sincerely.

In a similar spirit, for interested allies and struggling fellows, this post is about how my symptoms manifest in the academic world, and how I manage them. 2

Navigating the social world is tough—a fact that may surprise some of my friends and most of my colleagues. I do alright at conferences and in groups, when conversation is polite and skin-deep, but it requires careful concentration and a lot of smoke and mirrors. Inside, it feels like I’m translating from Turkish to Cantonese without knowing either language. Every time this is said, that is the appropriate reply, though I struggle to understand why. I just possess a translation book, and recite what is expected. Stimulus and response. This skill was only recently acquired.

Looking at the point between people’s eyes makes it appear as though I am making direct eye contact during conversations. Certain observations (“you look tired”) are apparently less well-received than others (“you look excited”), and I’ve mostly learned which are which.

After a long day keeping up this appearance, especially at conferences, I find a nice dark room and stay there. Sharing conference hotel rooms with fellow academics is never an option. Some strategies I figured out myself; others, like the eye contact trick, I built over extended discussions with an old girlfriend after she handed me a severely-highlighted copy of The Partner’s Guide to Asperger Syndrome.

ADHD and Autism Spectrum Disorder are highly co-morbid, and I have been diagnosed with either or both by several independent professionals in the last twenty years. Working is hard, and often takes at least twice as much time for me as it does for the peers with whom I have discussed this. When interested in something, I lose myself entirely in it for hours on end, but a single break in concentration will leave me scrambling. It may take hours or days to return to a task, if I do at all. My best work is done in marathon, and work that takes longer than a few days may never get finished, or may drop in quality precipitously. Keeping the internet disconnected and my phone off during regular periods every day, locked in my windowless office, helps keep distractions at bay. But, I have yet to discover a good strategy to manage long projects. A career in the book-driven humanities may have been a poor choice.

Paying bills on time, keeping schedules, and replying to emails are among the most stressful tasks in my life. When I don’t adequately handle all of these mundane tasks, it sets in motion a cycle of horror that paralyzes my ability to get anything done, until I eventually file for task bankruptcy and inevitably disappoint colleagues, friends, or creditors to whom action is owed. Poor time management and stress-cycles lead me to over-promise and under-deliver. On the bright side, I recently received help in strategies to improve that, and they work. Sometimes.

Friendships, surprisingly, are easy to maintain but difficult to nourish. My friends consider me trustworthy and willing to help (if not necessarily always dependable), but I lose track of friends or family who aren’t geographically close. Deeper emotional relationships are rare or, for swaths of my life, non-existent. I get no fits of anger or depression or elation or excitement. Indeed, my friends and family remark how impossible it is to see if I like a gift they’ve given me.

People occasionally describe my actions as offensive, rude, or short, and I get frustrated trying to understand exactly why what I’m doing fits into those categories. Apparently, early in grad school, I had a bit of a reputation for asking obnoxious questions in lectures. But I don’t like upsetting people, and actively (maybe successfully?) try to curb these traits when they are pointed out.

Thankfully, academic life allows me the freedom to lock myself in a room and focus on a task. Using work as a coping mechanism for social difficulties may be unhealthy, but hey, at least I found a career that rewards my peculiarities.

My life is pretty great. I have good friends, a loving family, and hobbies that challenge me. As long as I maintain the proper controlled environment, my fixations and obsessions are a perfect complement to an academic career, especially in a culture that (unfortunately) rewards workaholism. The same tenacity often compensates for difficulties in navigating romantic relationships, of which I’ve had a few incredibly fulfilling and valuable ones over my life thus-far.

Unfortunately, my experience on the autism spectrum is not shared by all academics. Some have enough difficulty managing the social world that they end up alienating colleagues who are on their tenure committees, to disastrous effect. From private conversations, it seems autistic women suffer more from this than men, as they are expected to perform more service work and to be more social. Supportive administrators can be vital in these situations, and autism-spectrum academics may want to negotiate accommodations for themselves as part of their hiring process.

Despite some frustrations, I have found my atypical way of interacting with the world to be a feature, not a bug. My atypicality presents as what used to be called Asperger Syndrome, and it is easier for me to interact with the world, and easier for the world to interact with me, than many other autistic individuals. That said, whether or not my friends and colleagues notice, I still struggle with many aspects common to those diagnosed on the autism spectrum: social-emotional difficulties, alexithymia, intensity of focus, hypersensitivity, system-oriented thinking, etc.

Relationships or friendships with someone on the spectrum can be tough, even with someone who doesn’t outwardly present common characteristics, like me. An old partner once vented her frustrations that she couldn’t turn to her friends for advice, because: “everyone just said Scott is so normal and I was thinking [no], he’s just very very good at passing [as socially aware].” Like many who grow up non-neurotypical, I learned a complex set of coping strategies to help me fit in and succeed in a neurotypical world. To concentrate on work, I create an office cave to shut out the world. I use a complicated set of journals, calendars, and apps to keep me on task and ensure I pay bills on time. To stay attentive, I sit at the front of a lecture hall—it even works, sometimes. Some ADHD symptoms are managed pharmacologically.

These strategies give me the 80% push I need to be a functioning member of society, to become someone who can sustain relationships, not get kicked out of his house for forgetting rent, and can almost finish a PhD. Almost. It’s not quite enough to prevent me from a dozen incompletes on my transcripts, but I make do. A host of unrealistically patient and caring friends, family, and colleagues helps. (If you’re someone to whom I still owe work, but am too scared to reply to because of how delinquent I am, thanks for understanding! waves and runs away). Caring allies help. A lot.

My life so far has been a series of successes and confusions. Not unlike anybody else’s life, I suppose. I occupy my own corner of weirdness, which is itself unique enough, but everyone has their own corner. I doubt my writing this will help anyone understand themselves any better, but hopefully it will help fellow academics feel a bit safer in their own weirdness. And if this essay helps our neurotypical colleagues be a bit more understanding of our struggles, and better-informed as allies, all the better.

Notes:

  1. The original article, Stigma, was written for the Conditionally Accepted column of Inside Higher Ed. Jeana Jorgensen, Eric Grollman and Sarah Bray provided invaluable feedback, and I wouldn’t have written it without them. They invited me to write this second article for Inside Higher Ed as well, which was my original intent. I wound up posting it on my blog instead because their posting schedule didn’t quite align with my writing schedule. This shouldn’t be counted as a negative reflection on the process of publishing with that fine establishment.
  2. Let me be clear: I know very little about autism, beyond that I have been diagnosed with it. I’m still learning a lot. This post is about me. Knowing other people face similar struggles has been profoundly helpful, regardless of what causes those struggles.

The Turing Point

Below is some crazy, uninformed ramblings about the least-complex possible way to trick someone into thinking a computer is a human, for the purpose of history research. I’d love some genuine AI/Machine Intelligence researchers to point me to the actual discussions on the subject. These aren’t original thoughts; they spring from countless sci-fi novels and AI research from the ’70s-’90s. Humanists beware: this is super sci-fi speculative, but maybe an interesting thought experiment.


If someone’s chatting with a computer, but doesn’t realize her conversation partner isn’t human, that computer passes the Turing Test. Unrelatedly, if a robot or piece of art is just close enough to reality to be creepy, but not close enough to be convincingly real, it lies in the Uncanny Valley. I argue there is a useful concept in the simplest possible computer which is still convincingly human, and that computer will be at the Turing Point. 1

By Smurrayinchester – self-made, based on image by Masahiro Mori and Karl MacDorman, CC BY-SA 3.0

Forgive my twisting Turing Tests and Uncanny Valleys away from their normal use, for the sake of outlining the Turing Point concept:

  • A human simulacrum is a simulation of a human, or some aspect of a human, in some medium, which is designed to be as-close-as-possible to that which is being modeled, within the scope of that medium.
  • A Turing Test winner is any human simulacrum which humans consistently mistake for the real thing.
  • An occupant of the Uncanny Valley is any human simulacrum which humans consistently doubt as representing a “real” human.
  • Between the Uncanny Valley and Turing Test winners lies the Turing Point, occupied by the least-sophisticated human simulacrum that can still consistently pass as human in a given medium. The Turing Point is a hyperplane in a hypercube, such that there are many points of entry for the simulacrum to “phase-transition” from uncanny to convincing.

Extending the Turing Test

The classic Turing Test scenario is a text-only chatbot which must, in free conversation, be convincing enough for a human to think it is speaking with another human. A piece of software named Eugene Goostman sort-of passed this test in 2014, convincing a third of judges it was a 13-year-old Ukrainian boy.

There are many possible modes in which a computer can act convincingly human. It is easier to make a convincing simulacrum of a 13-year-old non-native English speaker who is confined to text messages than to make a convincing college professor, for example. Thus the former has a lower Turing Point than the latter.

Playing with the constraints of the medium will also affect the Turing Point threshold. The Turing Point for a flesh-covered robot is incredibly difficult to surpass, since so many little details (movement, design, voice quality, etc.) may place it into the Uncanny Valley. A piece of software posing as a Twitter user, however, would have a significantly easier time convincing fellow users it is human.

The Turing Point, then, is flexible to the medium in which the simulacrum intends to deceive, and the sort of human it simulates.

From Type to Token

Convincing the world a simulacrum is any old human is different than convincing the world it is some specific human. This is the token/type distinction; convincingly simulating a specific person (token) is much more difficult than convincingly simulating any old person (type).

Simulations of specific people are all over the place, even if they don’t intend to deceive. Several Twitter-bots exist as simulacra of Donald Trump, reading his tweets and creating new ones in a similar style. Perhaps imitating Poe’s Law, certain people’s styles or certain types of media (e.g. Twitter) may provide such a low Turing Point that it is genuinely difficult to distinguish humans from machines.

Put differently, the way some Turing Tests may be designed, humans could easily lose.
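For a sense of how low that bar can be, here’s a hedged sketch of the simplest sort of style-imitating bot: a word-level Markov chain trained on someone’s past posts (the one-line corpus below is a stand-in, not real data):

import random
from collections import defaultdict

def train(corpus):
    # Map each word to the list of words that have followed it.
    chain = defaultdict(list)
    words = corpus.split()
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length=12):
    word, output = start, [start]
    for _ in range(length):
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

tweets = "we are going to win so much we are going to be so great"   # stand-in corpus
print(generate(train(tweets), "we"))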

It’ll be useful to make up and define two terms here. I imagine the concepts already exist, but couldn’t find them, so please comment if they do so I can use less stupid words:

  • type-bot is a machine designed to represent something at the type-level. For example, a bot that can be mistaken for some random human, but not some specific human.
  • token-bot is a machine designed to represent something at the token-level. For example, a bot that can be mistaken for Donald Trump.

Replaying History

Using traces to recreate historical figures (or at least things they could have done) as token-bots is not uncommon. The most recent high-profile example of this is a project to create a new Rembrandt painting in the original style. Shawn Graham and I wrote an article on using simulations to create new plausible histories, among many other examples old and new.

This all got me thinking, if we reach the Turing Point for some social media personalities (that is, it is difficult to distinguish between their social media presence, and a simulacrum of it), what’s to say we can’t reach it for an entire social media ecosystem? Can we take a snapshot of Twitter and project it several seconds/minutes/hours/days into the future, a bit like a meteorological model?

A few questions and obvious problems:

  • Much of Twitter’s dynamics are dependent upon exogenous forces: memes from other media, real world events, etc. Thus, no projection of Twitter alone would ever look like the real thing. One can, however, potentially use such a simulation to predict how certain types of events might affect the system.
  • This is way overkill, and impossibly computationally complex at this scale. You can simulate the dynamics of Twitter without simulating every individual user, because people on average act pretty systematically. That said, for the humanities-inclined, we may gain more insight from the ground-level of the system (individual agents) than macroscopic properties.
  • This is key. Would a set of plausibly-duplicate Twitter personalities on aggregate create a dynamic system that matches Twitter as an aggregate system? That is, just because the algorithms pass the Turing Test, because humans believe them to be humans, does that necessarily imply the algorithms have enough fidelity to accurately recreate the dynamics of a large scale social network? Or will small unnoticeable differences between the simulacrum and the original accrue atop each other, such that in aggregate they no longer act like a real social network?

The last point is I think a theoretically and methodologically fertile one for people working in DH, AI, and Cognitive Science: whether reducing human-appreciable traits between machines and people is sufficient to simulate aggregate social behavior, or whether human-appreciability (i.e., the Turing Test) is a strict enough criterion for making accurate predictions about societies.

These points aside, if we ever do manage to simulate specific people (even in a very limited scope) as token-bots based on the traces they leave, it opens up interesting pedagogical and research opportunities for historians. Scott Enderle tweeted a great metaphor for this:

Imagine, as a student, being able to have a plausible discussion with Marie Curie, or sitting in an Enlightenment-era salon. 2 Or imagine, as a researcher (if individual Turing Point machines do aggregate well), being able to do well-grounded counterfactual history that works at the token level rather than at the type level.

Turing Point Simulations

Bringing this slightly back into the realm of the sane, the interesting thing here is the interplay between appreciability (a person’s ability to appreciate enough difference to notice something wrong with a simulacrum) and fidelity.

We can specifically design simulation conditions with incredibly low-threshold Turing Points, even for token-bots. That is to say, we can create a condition where the interactions are simple enough to make a bot that acts indistinguishably from the specific human it is simulating.

At the most extreme end, this is obviously pointless. If our system is one in which a person can only answer “yes” or “no” to pre-selected preference questions (“Do you like ice-cream?”), making a bot to simulate that person convincingly would be trivial.

Putting that aside (lest we get into questions of the Turing Point of a set of Turing Points), we can potentially design reasonably simplistic test scenarios that would allow for an easy-to-reach Turing Point while still being historiographically or sociologically useful. It’s sort of a minimization problem in topological optimizations. Such a goal would limit the burden of the simulation while maximizing the potential research benefit (but only if, as mentioned before, the difference between true fidelity and the ability to win a token-bot Turing Test is small enough to allow for generalization).

In short, the concept of a Turing Point can help us conceptualize and build token-simulacra that are useful for research or teaching. It helps us ask the question: what’s the least-complex-but-still-useful token-simulacra? It’s also kind-of maybe sort-of like Kolmogorov complexity for human appreciability of other humans: that is, the simplest possible representation of a human that is convincing to other humans.

I’ll end by saying, once again, I realize how insane this sounds, and how far-off. And also how much of an interloper I am in this space, having never so much as designed a bot. Still, as Bill Hart-Davidson wrote,

the possibility seems more plausible than ever, even if not soon-to-come. I’m not even sure why I posted this on the Irregular, but it seemed like it’d be relevant enough to some regular readers’ interests to be worth spilling some ink.

Notes:

  1. The name itself is maybe too on-the-nose, being a pun for turning point and thus connected to the rhetoric of singularity, but ¯\_(ツ)_/¯
  2. Yes yes I know, this is SecondLife all over again, but hopefully much more useful.

Work with me! CMU is hiring a DH Developer

Carnegie Mellon University is hiring a DH Developer!

I’ve had a blast since starting as Digital Humanities Specialist at CMU. Enough administrators, faculty, and students are on board to make building a DH strength here pretty easy, and we’re neighbors to Pitt DHRX, a really supportive supercomputing center, and great allies in the Mayor’s Office keen on a city rich with art, data, and both combined.

We want a developer to help jump-start our research efforts. You’ll be working as a full collaborator on projects from all sorts of domains, and as a review board member you’ll have a strong say in which projects we take on and how they get implemented. You and I will work together in achievable rapid prototyping, analyzing data, and web deployment.

The idea is we build or do stuff that’s scholarly, interesting, and can have a proof-of-concept or article done in a semester or two. With that, the project can go on to seek additional funding and a full-time specialized programmer, or we can finish there and all be proud authors or creators of something we enjoyed making.

Ideally, you have a social science, humanities, journalism, or similar research background, and the broad tech chops to create a d3 viz, DeepDream some dogs into a work of art, manage a NoSQL database, and whatever else seems handy. Ruby on Rails, probably.

We’re looking for someone who loves playing with new tech stacks, isn’t afraid to get their hands dirty, and knows how to talk to humans. You probably have a static site and a github account. You get excited by interactive data stories, and want to make them with us. This job values breadth over depth and done over perfect.

The job isn’t as insane as it sounds—you don’t actually need to be able to do all this already, just be the sort of person who can learn on the fly. A bachelor’s degree or similar experience is required, with a strong preference for candidates with some research background. You’ll need to submit or point to some examples of work you’ve done.

We’re an equal opportunity employer, and would love to see applications from women, minorities, or other groups who often have a tough time getting developer jobs. If you work here you can take two free classes a semester. Say, who wants a fancy CMU computer science graduate degree? We can offer an awesome city, friendly coworkers, and a competitive salary (also Pittsburgh’s cheap so you wouldn’t live in a closet, like in SF or NYC).

What I’m saying is you should apply ’cause we love you.


The ad, if you’re too lazy to click the link, or are scared CMU hosts viruses:

Job Description
Digital Humanities Developer, Dietrich College of Humanities and Social Sciences

Summary
The Dietrich College of Humanities and Social Sciences at Carnegie Mellon University (CMU) is undertaking a long-term initiative to foster digital humanities research among its faculty, staff, and students. As part of this initiative, CMU seeks an experienced Developer to collaborate on cutting edge interdisciplinary projects.

CMU is a world leader in technology-oriented research, and a highly supportive environment for cross-departmental teams. The Developer would work alongside researchers from Dietrich and elsewhere to plan and implement digital humanities projects, from statistical analyses of millions of legal documents to websites that crowdsource grammars of endangered languages. Located in the Office of the Dean under CMU’s Digital Humanities Specialist, the developer will help start up faculty projects into functioning prototypes where they can acquire sustaining funding to hire specialists for more focused development.

The position emphasizes rapid, iterative deployment and the ability to learn new techniques on the job, with a focus on technologies intersecting data science and web development, such as D3.js, NoSQL, Shiny (R), IPython Notebooks, APIs, and Ruby on Rails. Experience with digital humanities or computational social sciences is also beneficial, including work with machine learning, GIS, or computational linguistics.

The individual in this position will work with clients and the digital humanities specialist to determine achievable short-term prototypes in web development or data analysis/presentation, and will be responsible for implementing the technical aspects of these goals in a timely fashion. As a collaborator, the Digital Humanities Developer will play a role in project decision-making, where appropriate, and will be credited on final products to which they extensively contribute.

Please submit a cover letter, phone numbers and email addresses for two references, a résumé or cv, and a page describing how your previous work fits the job, including links to your github account or other relevant previous work examples.

Qualifications

  • Bachelor’s Degree in humanities computing, digital humanities, informatics, computer science, related field, or equivalent combination of training and experience.
  • At least one year of experience in modern web development and/or data science, preferably in a research and development team setting.
  • Demonstrated knowledge of modern machine learning and web development languages and environments, such as some combination of Ruby on Rails, LAMP, Relational Databases or NoSQL (MongoDB, Cassandra, etc.), MV* & JavaScript (including D3.js), PHP, HTML5, Python/R, as well as familiarity with open source project development.
  • Some system administration experience.

Preferred Qualifications

  • Advanced degree in digital humanities, computational social science, informatics, or data science. Coursework in data visualization, machine learning, statistics, or MVC web applications.
  • Three or more years at the intersection of web development/deployment and machine learning (e.g. data journalism or digital humanities) in an agile software environment.
  • Ability to assess client needs and offer creative research or publication solutions.
  • Any combination of GIS, NLTK, statistical models, ABMs, web scraping, mahout/hadoop, network analysis, data visualization, RESTful services, testing frameworks, XML, HPC.

Job Function: Research Programming

Primary Location: United States-Pennsylvania-Pittsburgh

Time Type: Full Time

Organization: DIETRICH DEAN’S OFFICE

Minimum Education Level: Bachelor’s Degree or equivalent

Salary: Negotiable

Ghosts in the Machine

Musings on materiality and cost after a tour of The Shoah Foundation.

Forgetting The Holocaust

As the only historian in my immediate family, I’m responsible for our genealogy, saved in a massive GEDCOM file. Through the wonders of the web, I now manage quite the sprawling tree: over 100,000 people, hundreds of photos, thousands of census records & historical documents. The majority came from distant relations managing their own trees, with whom I share.

Such a massive well-kept dataset is catnip for a digital humanist. I can analyze my family! The obvious first step is basic stats, like the most common last name (Aber), average number of kids (2), average age at death (56), or most-frequently named location (New York). As an American Jew, I wasn’t shocked to see New York as the most-common place name in the list. But I was unprepared for the second-most-common named location: Auschwitz.
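(If you’re wondering how one computes this sort of thing: GEDCOM is just structured text, one field per line, so the basic stats fall out of a few dozen lines of code. Below is a minimal sketch of the surname and place-name counts, assuming a hypothetical export called family_tree.ged; real files vary in their tags and quirks, and a proper parser would handle far more.)

```python
from collections import Counter

surnames, places = Counter(), Counter()

# GEDCOM stores one field per line: "<level> <TAG> <value>".
# Surnames sit between slashes in NAME lines; PLAC lines hold place names.
with open("family_tree.ged", encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.strip().split(" ", 2)
        if len(parts) < 3:
            continue
        _level, tag, value = parts
        if tag == "NAME" and "/" in value:
            surname = value.split("/")[1].strip()
            if surname:
                surnames[surname] += 1
        elif tag == "PLAC":
            places[value.strip()] += 1

print("Most common surnames:", surnames.most_common(5))
print("Most common places:  ", places.most_common(5))
```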

I’m lucky enough to write this because my great grandparents all left Europe before 1915. My grandparents don’t have tattoos on their arms or horror stories about concentration camps, though I’ve met survivors their age. I never felt so connected to The Holocaust, HaShoah, until I took the time to explore the hundreds of branches of my family tree that simply stopped growing in the 1940s.

Aerial photo of Auschwitz-Birkenau. [via wikipedia]
1 of every 16 Jews in the entire world was murdered in Auschwitz, about a million in all. Another 5 million were killed elsewhere. The global Jewish population before the Holocaust was 16.5 million, a number we’re only now approaching again, 70 years later. And yet, somehow, last month a school official and national parliamentary candidate in Canada admitted she “didn’t know what Auschwitz was”.

I grew up hearing “Never Forget” as a mantra to honor the 11 million victims of hate and murder at the hands of Nazis, and to ensure it never happens again. That a Canadian official has forgotten—that we have all forgotten many of the other genocides that haunt human history—suggests how easy it is to forget. And how much work it is to remember.

The material cost of remembering 50,000 Holocaust survivors & witnesses

Yad Vashem (“a place and a name”) represents the attempt to inscribe, preserve, and publicize the names of Jewish Holocaust victims who have no-one to remember them. Over four million names have been collected to date.

The USC Shoah Foundation, founded by Steven Spielberg in 1994 to remember Holocaust survivors and witnesses, is both smaller and larger than Yad Vashem. Smaller because the number of survivors and witnesses still alive in 1994 numbered far fewer than Yad Vashem’s 4.3 million; larger because the foundation conducted video interviews: 100,000 hours of testimony from 50,000 individuals, plus recent additions of witnesses and survivors of other genocides around the world. Where Yad Vashem remembers those killed, the Shoah Foundation remembers those who survived. What does it take to preserve the memories of 50,000 people?

I got a taste of the answer to that question at a workshop this week hosted by USC’s Digital Humanities Program, who were kind enough to give us a tour of the Shoah Foundation facilities. Sam Gustman, the foundation’s CTO and Associate Dean of USC’s Libraries, gave the tour.

Shoah Foundation Digitization Facility [via my camera]
Digital preservation is a complex process. In this case, it began by digitizing 235,000 analog Betacam SP Videocassettes, on which the original interviews had been recorded, a process which ran from 2008 to 2012. This had to be done quickly (automatically/robotically), given that cassette tapes are prone to become sticky, brittle, and unplayable within a few decades due to hydrolysis. They digitized about 30,000 hours per year. The process eventually produced 8 petabytes (link to more technical details) of lossless JPEG 2000 videos, roughly the equivalent of 2 million DVDs. Stacked on top of each other, those DVDs would reach three times higher than Burj Khalifa, the world’s tallest tower.

From there, the team spent quite some time correcting errors that existed in the original tapes, and ones that were introduced in the process of digitization. They employed a small army of signal processing students, patented new technologies for automated error detection & processing/cleaning, and wound up cleaning video from about 12,000 tapes. According to our tour guide, cleaning is still happening.

If you feel safe knowing that digitization lengthens the preservation time, turns out you’re wrong. Film lasts longer than most electronic storage, but making film copies would have cost the foundation $140,000,000 and made access incredibly difficult. Digital copies would only cost tens of millions of dollars, even though hard-drives couldn’t be trusted to last more than a decade. Their solution was a RAID hard-drive system in an Oracle StorageTek SL8500 (of which they have two), and a nightly process of checking video files for even the slightest of errors. If an error is found, a backup is loaded to a new cartridge, and the old cartridge is destroyed. Their two StorageTeks each fit over 10,000 drive cartridges, have 55 petabytes worth of storage space, weigh about 4,000 lbs, and are about the size of a New York City apartment. If a drive isn’t backed up and replaced within three years, they throw it out and replace it anyway, just in case. And this setup apparently saved the Shoah Foundation $6 million.
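I don’t know what the Foundation’s audit software actually looks like, but a nightly check of every file for even the slightest of errors is, at heart, a fixity check: record a checksum for each video at ingest, recompute it on a schedule, and flag anything that drifts so it can be restored from a backup copy. A minimal sketch of the idea, with the manifest format and paths entirely my own invention:

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("manifest.json")   # hypothetical: {"relative/path.mxf": "sha256 hex", ...}
ARCHIVE = Path("/archive/videos")  # hypothetical archive root

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so multi-gigabyte videos don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def nightly_audit() -> list[str]:
    """Return the files whose current checksum no longer matches the manifest."""
    manifest = json.loads(MANIFEST.read_text())
    damaged = []
    for rel_path, expected in manifest.items():
        if sha256_of(ARCHIVE / rel_path) != expected:
            damaged.append(rel_path)  # candidate for restore-from-mirror and cartridge retirement
    return damaged

if __name__ == "__main__":
    for rel_path in nightly_audit():
        print("checksum mismatch:", rel_path)
```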

StorageTek SL8500 [via CERN]
Oh, and they have another facility a few states away, connected directly via high-bandwidth fiber optic cables, where everything just described is duplicated in case California falls into the ocean.

Not bad for something that costs libraries $15,000 per year, which is about the same as a library would pay for one damn chemistry journal.

So how much does it cost to remember 50,000 Holocaust witnesses and survivors for, say, 20 years? I mean, above and beyond the cost of building a cutting edge facility, developing new technologies of preservation, cooling and housing a freight container worth of hard drives, laying fiber optic cables below ground across several states, etc.? I don’t know. But I do know how much the Shoah Foundation would charge you to save 8 petabytes worth of videos for 20 years, if you were a USC Professor. They’d charge you $1,000/TB/20 years.

The Foundation’s videos take up 8,000 terabytes, which at $1,000 each would cost you $8 million per 20 years, or about half a million dollars per year. Combine that with all the physical space it takes up, and never forgetting the Holocaust is sounding rather prohibitive. And what about after 20 years, when modern operating systems forget how to read JPEG 2000 or interface with StorageTek T10000C Tape Drives, and the Shoah Foundation needs to undertake another massive data conversion? I can see why that Canadian official didn’t manage it.
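The arithmetic, for anyone who wants to poke at it (only the figures quoted above go in; everything else is rounding):

```python
# Figures quoted above: 8 petabytes of video, $1,000 per terabyte per 20 years.
terabytes = 8_000             # 8 PB expressed in terabytes
rate_per_tb_per_20yr = 1_000  # USD

cost_per_20yr = terabytes * rate_per_tb_per_20yr
cost_per_year = cost_per_20yr / 20

print(f"Cost per 20 years: ${cost_per_20yr:,.0f}")  # $8,000,000
print(f"Cost per year:     ${cost_per_year:,.0f}")  # $400,000 -- roughly half a million
```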

The Reconcentration of Holocaust Survivors

While I appreciated the guided tour of the exhibit, and am thankful for the massive amounts of money, time, and effort scholars and donors are putting into remembering Holocaust survivors, I couldn’t help but be creeped out by the experience.

Our tour began by entering a high security facility. We signed our names on little pieces of paper and were herded through several layers of locked doors and small rooms. Not quite the way one expects to enter the project tasked with remembering and respecting the victims of genocide.

The Nazis’ assembly-line techniques for mass extermination led to starkly regular camps, like Auschwitz pictured above, laid out in grids for efficient control and killing. “Concentration camp”, by the way, refers to the concentration of people into small spaces, a term that comes from the “reconcentration camps” of Cuba. Now we’re concentrating 50,000 testimonies into a couple of closets with production-line efficiency, reconcentrating the stories of people who dispersed across the world, so they’re all in one easy-to-access place.

Server farm [via wikipedia]
We’ve squeezed 100,000 hours of testimony into a server farm that consists of a series of boxes embedded in a series of larger boxes, all aligned to a grid; input, output, and eventual destruction of inferior entities handled by robots. Audits occur nightly.

The Shoah Foundation materials were collected, developed, and preserved with the utmost respect. The goal is just, the cause respectable, and the efforts incredibly important. And by reconcentrating survivors’ stories, they can now be accessed by the world. I don’t blame the Foundation for the parallels which are as much a construct of my mind as they are of the society in which this technology developed. Still, on Halloween, it’s hard to avoid reflecting on the material, monetary, and ultimately dehumanizing costs of processing ghosts into the machine.

What’s Counted Counts

tl;dr. Don’t rely on data to fix the world’s injustices. An unusually self-reflective and self-indulgent post.

[Edit: this question was prompted by a series of analyses and visualizations I’ve done in collaboration with Nickoal Eichmann, but I purposefully left her out of the majority of this post, as it was one of self-reflection about my own personal choices. A respected colleague pointed out in private that by doing so, I nullified my female collaborator’s contributions to the project, for which I apologize deeply. Nickoal’s input has been integral to all of this, and she and many others, including particularly Jeana Jorgensen and Heather Froehlich (who has written on this very subject), have played vital roles in my own learning about these issues. Recent provocations by Miriam Posner helped solidify a lot of these thoughts and inspired this post. What follows is a self-exploration, recapping what many people have already said, but hopefully still useful to some. Mistakes below shouldn’t reflect poorly on those who influenced or inspired me. The post from this point on is as it originally appeared.]


Someone asked yesterday why I cared enough [1] about gender equality in academia to make this chart (with Nickoal Eichmann).

Gender representation as authors at DH conferences over the last decade. Context. (Women consistently represent around 33% of authors)

I didn’t know how to answer the question. Our culture gives some people more and better opportunities than others, so in order to make things better for more people, we must reveal and work towards resolving points of inequality. “Why do I care?” Don’t most of us want to make things better? We just go about it in different ways, and have different ideas of what “better” means.

But the question did make me consider why I’d started with gender equality, when there are clearly so many other equally important social issues to tackle, within and outside academia. The answer was immediately obvious: ease. I’d attempted to explore racial and ethnic diversity as well, but it was simply more fraught, complicated, and less amenable to my methods than gender, so I started with gender and figured I’d work my way into the weeds from there. [2]

I’ll cut to the chase. My well-intentioned attempts at battling inequality suffer their own sort of bias: by focusing on measurements of inequality, I bias that which is easily measured. It’s not that gender isn’t complex (see Miriam Posner’s wonderful recent keynote on these and related issues), but at least it’s a little easier to measure than race & ethnicity, when all you have available to you is what you can look up on the internet.

[scroll down]

Saturday Morning Breakfast Cereal. [source]
While this problem is far from new, it takes on special significance in a data-driven world. That which is countable counts, and damn the rest. At its heart, this problem is one of classification and categorization: those social divides which have the clearest seams are those most easily counted. And in a data-driven world, it’s inequality along these clear divides which gets noticed first, even when injustice elsewhere is far greater.

Sex is easy, compared to gender. At most 2% of people are born intersex according to most standards (but not accounting for dysmorphia & similar). And gender is relatively easy compared to race and ethnicity. Nationality is pretty easy because of bureaucratic requirements for passports and citizenship, and country of residence is even easier, unless you live somewhere like Palestine.

But even the Palestine issue isn’t completely problematic, because counting still works fine when one thing exists in multiple categories, or may be categorized differently in different systems. That’s okay.

Where math gets lost is where there are simply no good borders to draw around entities—or worse, there are borders, but those borders themselves are drawn by insensitive outgroups. We see this a lot in the history of colonialism. Have you ever been to the Pitt Rivers Museum in Oxford? It’s a 19th century museum that essentially shows what the 19th century British mind felt about the world: everything that looks like a flute is in the flute cabinet, everything that looks like a gun is in the gun cabinet, and everything that looks like a threatening foreign religious symbol is in the threatening foreign religious symbol cabinet. Counting such a system doesn’t reveal any injustice except that of the counters themselves.

Pitt Rivers Museum [source]
And I’ll be honest here: I want to help make the world a better place, but I’ve got to work to my strengths and know my limits. I’m a numbers guy. I’m at my best when counting stuff, and when there are no sensitive ways to classify, I avoid counting, because I don’t want to be That Colonizing White Dude who tries to fit everything into boxes of his own invention to make himself feel better about what he’s doing for the world. I probably still fall into that trap a lot anyway.

So why did I care enough to count gender at DH conferences? It was (relatively) easy. And it’s needed, as we saw at DH2015 and we’ve seen throughout the digital humanities – we have a gender issue, and a feminism issue, and they both need to be pointed out and addressed. But we also have lots of other issues that I’ll simply never be able to approach, and don’t know how to approach, and am in danger of ignoring entirely if I only rely on quantitative evidence of inequality.

useless by xkcd

Of course, only relying on non-quantitative evidence has its own pitfalls. People evolved and are socialized to spot patterns, to extrapolate from limited information, even when those extrapolations aren’t particularly meaningful or lead to Jesus in a slice of toast. I’m not advocating we avoid metrics entirely (for one, I’d be out of a job), but echoing Miriam Posner’s recent provocation, we need to engage with techniques, approaches, and perspectives that don’t rely on easy classification schemes. Especially, we need to listen when people notice injustice that isn’t easily classified or counted.

“Uh, yes, Scott, who are you writing this for? We already knew this!” most of you are likely asking if you’ve read this far. I’m writing to myself in early college, an engineering student obsessed with counting, who’s slowly learned the holes in a worldview that only relies on quantitative evidence. The one who spent years quantifying his health issues, only to discover the pursuit of a number eventually took precedence over the pursuit of his own health. [3]

Hopefully this post helps balance all the bias implicit in my fighting for a better world from a data-driven perspective, by suggesting “data-driven” is only one of many valuable perspectives.

Notes:

  1. Upon re-reading the original question, I see it was actually “Why did you do it? (or why are you interested?)”. Still, this post remains relevant.
  2. I’m light on details here because I don’t want this to be an overlong post, but you can read some more of the details on what Nickoal and I are doing, and the decisions we make, in this blog series.
  3. A blog post on mental & physical health in academia is forthcoming.

Down the Rabbit Hole

WHEREIN I get angry at the internet and yell at it to get off my lawn.

You know what’s cool? Ryan Cordell and friends’ Viral Texts project. It tracks how 19th-century U.S. newspapers used to copy texts from each other, little snippets of news or information, and republish them in their own publications. A single snippet of text could wind its way all across the country, sometimes changing a bit like a game of telephone, rarely-if-ever naming the original author.

Which newspapers copied from one another, from the Viral Texts project.

Isn’t that a neat little slice of journalistic history? Different copyright laws, different technologies of text, different constraints of the medium, they all led to an interesting moment of textual virality in 19th-century America. If I weren’t a historian who knew better, I’d call it something like “quaint” or “charming”.

You know what isn’t quaint or charming? Living in the so-called “information age”, where everything is intertwingled, with hyperlinks and text costing pretty much zilch, and seeing the same gorram practices.

What follows is a rant. They say never to blog in anger. But seriously.

Inequality in Science

Tonight Alex Vespignani, notable network scientist, tweeted a link to an interesting-sounding study about inequality in scientific publishing. In Quartz! I like Quartz, it’s where Christopher Mims used to post awesome science things. Part of their mission statement reads:

In all that we do at Quartz, we embrace openness: open source code, an open newsroom, and open access to the data behind our journalism.

Pretty cool, right?

Anyway, here’s the tweet:

It links to this article on a “map of the world’s scientific research”. Because Vespignani tweeted it, I took it seriously (yes yes I know rt≠endorsement), and read the article. It describes a cartogram of scientific research publications which shows how the U.S. and Western Europe (and a bit of China) dominate the research world, making the point that such a disparity is “disturbingly unequal”.

Map of scientific research, by how many published articles are produced in a country, pulled from qz.com

“What’s driving the inequality?” they ask. Money & tech play a big role. So does what counts as “high impact” in science. What’s worse, the journalist writes,

In the worst cases, the global south simply provides novel empirical sites and local academics may not become equal partners in these projects about their own contexts.

The author points out an issue with the data: it only covers journals, not monographs, grey literature, edited volumes, etc. This often excludes the humanities and social sciences. The author also raises the issue of journal paywalls and how they decrease access for researchers in countries without large research budgets. But we need to do better on “open dissemination”, the article claims.

Sources

Hey, that was a good read! I agree with everything the author said. What’s more, it speaks to my research, because I’ve done a fair deal of science mapping myself at the Cyberinfrastructure for Network Science Center under Katy Börner. Great, I think, let’s take a look at the data they’re using, given Quartz’s mission statement about how they always use open data.

I want to see the data because I know a lot of scientific publication indexing sites do a poor job of indexing international publications, and I want to see how it accounts for that bias. I look at the bottom of the page.

Crap.

This post originally appeared at The Conversation. Follow @US_conversation on Twitter. We welcome your comments at ideas@qz.com.

Alright, no biggie, time to look at the original article on The Conversation, a website whose slogan is “Academic rigor, journalistic flair”. Neat, academic rigor, I like the sound of that.

I scroll to the bottom, looking for the source.

A longer version of this article originally appeared on the London School of Economics’ Impact Blog.

Hey, the LSE Impact blog! They usually publish great stuff surrounding metrics and the like. Cool, I’ll click the link to read the longer version. The author writes something interesting right up front:

What would it take to redraw the knowledge production map to realise a vision of a more equitable and accurate world of knowledge?

A more accurate world of knowledge? Was this map inaccurate in a way the earlier articles didn’t report? I read on.

Well, this version of the article goes a little further, saying that people in the global south aren’t always publishing in “international” journals. That’s getting somewhere, maybe the map only shows “international journals”! (Though she never actually makes that claim). Interestingly, the author writes of literature in the global south:

Even when published, this kind of research is often not attributed to its actual authors. It has the added problem of often being embargoed, with researchers even having to sign confidentiality agreements or “official secrets acts” when they are given grants. This is especially bizarre in an era where the mantra of publically funded research being made available to the public has become increasingly accepted.

Amen to that. Authorship information and openness all the way!

So who made this map?

Oh, the original article (though not the one in Quartz or The Conversation) has a link right up front to something called “The World of Science”. The link doesn’t actually take you to the map pictured, it just takes you to a website called worldmapper that’s filled with maps, letting you fend for yourself. That’s okay, my google-fu is strong.

www.worldmapper.org

I type “science” in the search bar.

Found it! Map #205, created by no-author-name-listed. The caption reads:

Territory size shows the proportion of all scientific papers published in 2001 written by authors living there.

Also, it only covers “physics, biology, chemistry, mathematics, clinical medicine, biomedical research, engineering, technology, and earth and space sciences.” I dunno about you, but I can name at least 2.3 other types of science, but that’s cool.

In tiny letters near the bottom of the page, there are a bunch of options, including the ability to see the poster or download the data in Excel.

SUCCESS. ish.

Map of Science Poster from worldmapper.org

Ahhhhh I found the source! I mean, it took a while, but here it is. You apparently had to click “Open PDF poster, designed for printing.” It takes you to a 2006 poster, which notes it was made by the SASI Group from Sheffield and Mark Newman, famous and awesome complex systems scientist from Michigan. An all-around well-respected dude.

To recap, that’s a 7/11/2015 tweet, pointing to a 7/11/2015 article on Quartz, pointing to a 7/8/2015 article on The Conversation, pointing to a 4/29/2013 article on the LSE Impact Blog, pointing to a website made Thor-knows-when, pointing to a poster made in 2006 with data from 2001. And only the poster cites the name of the creative team who originally made the map. Blood and bloody ashes.

Intermission

Please take a moment out of your valuable time to watch this video clip from the BBC’s television adaptation of Douglas Adams’s Hitchhiker’s Guide to the Galaxy. I’ll wait.

If you’re hard-of-hearing, read some of the transcript instead.

What I’m saying is, the author of this map was “on display at the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying beware of the leopard.”

The Saga Continues

Okay, at least I now can trust the creation process of the map itself, knowing Mark Newman had a hand in it. What about the data?

Helpfully, worldmapper.org has a link to the data as an Excel Spreadsheet. Let’s download and open it!

Frak. Frak frak frak frak frak.

My eyes.

Excel data for the science cartogram from worldmapper.org

Okay Scott. Deep breaths. You can brave the unicornfarts color scheme and find the actual source of the data. Be strong.

“See the technical notes” it says. Okay, I can do that. It reads:

Nearly two thirds of a million papers were published in enumerated science journals in 2001

Enumerated science journals? What does enumerated mean? Whatever, let’s read on.

The source of this data is the World Bank’s 2005 World Development Indicators, in the series on Scientific and technical journal articles (IP.JRN.ARTC.SC).

Okay, sweet, IP.JRN.ARTC.SC at the World Bank. I can Google that!

It brings me to the World Bank’s site on Scientific and technical journal articles. About the data it says:

Scientific and technical journal articles refer to the number of scientific and engineering articles published in the following fields: physics, biology, chemistry, mathematics, clinical medicine, biomedical research, engineering and technology, and earth and space sciences

Yep, knew that already, but it’s good to see the sources agreeing with each other.

I look for the data source to no avail, but eventually do see a small subtitle “National Science Foundation, Science and Engineering Indicators.”

Alright /me *rolls sleeves*, IRC-style.

Eventually, through the Googles, I find my way to what I assume is the original data source website, although at this point who the hell knows? NSF Science and Engineering Indicators 2006.

Want to know what I find? A 1,092-page report (honestly, see the pdfs, volumes 1 & 2) within which, presumably, I can find exactly what I need to know. In the 1,092-page report.

I start with Chapter 5: Academic Research and Development. Seems promising.

Three-quarters-of-the-way-down-the-page, I see it. It’s shimmering in blue and red and gold to my Excel-addled eyes.

S&E

Could this be it? Could this be the data source I was searching for, the Science Citation Index and the Social Sciences Citation Index? It sounds right! Remember the technical notes, which state “Nearly two thirds of a million papers were published in enumerated science journals in 2001”? That fits with the number in the picture above! Let’s click on the link to the data.

There is no link to the data.

There is no reference to the data.

That’s OKAY. WE’RE ALRIGHT. THERE ARE DATA APPENDICES IT MUST BE THERE. EVEN THOUGH THIS IS A REAL WEBSITE WITH HYPERTEXT LINKS AND THEY DIDN’T LINK TO DATA IT’S PROBABLY IN THE APPENDICES RIGHT?

Do you think the data are in the section labeled “Tables” or “Appendix Tables”? Don’t you love life’s little mysteries?

(Hint: I checked. After looking at 14 potential tables in the “Tables” section, I decided it was in the “Appendix Tables” section.)

Success! The World Bank data is from Appendix Table 5-41, “S&E articles, by region and country/economy: 1988–2003”.

Wait a second, friends, this can’t be right. If this is from the Science Citation Index and the Social Science Citation Index, then we can’t really use these metrics as a good proxy for global scientific output, because the criteria for national inclusion in the index are apparently kind of weird and can skew the output results.

Also, and let me be very clear about this,

This dataset actually covers both science and social science. It is, you’ll recall, the Science Citation Index and the Social Sciences Citation Index. [edit: at least as far as I can tell. Maybe they used different data, but if they did, it’s World Bank’s fault for not making it clear. This is the best match I could find.]

In Short

Which brings us back to Do. The article on Quartz made (among other things) two claims: that the geographic inequality of scientific output is troubling, and that the map really ought to include social scientific output.

And I agree with both of these points! And all the nuanced discussion is respectable and well-needed.

But by looking at the data, I just learned that A) the data the map draws from is not really a great representation of global output, and B) social scientific output is actually included.

I leave you with the first gif I’ve ever posted on my blog:

source: http://s569.photobucket.com/user/SuperFlame64/media/kramer_screaming.gif.html
real source: Seinfeld. Seriously, people.

You know what’s cool? Ryan Cordell and friends’ Viral Texts project. It tracks how 19th-century U.S. newspapers used to copy texts from each other, little snippets of news or information, and republish them in their own publications. A single snippet of text could wind its way all across the country, sometimes changing a bit like a game of telephone, rarely-if-ever naming the original author.

—————————————————————————————————

(p.s. I don’t blame the people involved, doing the linking. It’s just the tumblr-world of 19th century newspapers we live in.)

[edit: I’m noticing some tweets are getting the wrong idea, so let me clarify: this post isn’t a negative reflection on the research therein, which is needed and done by good people. It’s frustration at the fact that we write in an environment that affords full references and rich hyperlinking, and yet we so often revert to context-free tumblr-like reblogging which separates text from context and data. We’re reverting to the affordances of 18th century letters, 19th century newspapers, 20th century academic articles, etc., and it’s frustrating.]

[edit 2: to further clarify, two recent tweets:

]

The moral role of DH in a data-driven world

This is the transcript from my closing keynote address at the 2014 DH Forum in Lawrence, Kansas. It’s the result of my conflicted feelings on the recent Facebook emotional contagion controversy, and despite my earlier tweets, I conclude the study was important and valuable specifically because it was so controversial.

For the non-Digital-Humanities (DH) crowd, a quick glossary. Distant Reading is our new term for reading lots of books at once using computational assistance; Close Reading is the traditional term for reading one thing extremely minutely, exhaustively.


Networked Society

Distant reading is a powerful thing, an important force in the digital humanities. But so is close reading. Over the next 45 minutes, I’ll argue that distant reading occludes as much as it reveals, resulting in significant ethical breaches in our digital world. Network analysis and the humanities offer us a way out, a way to bridge personal stories with the big picture, and to bring a much-needed ethical eye to the modern world.

Today, by zooming in and out, from the distant to the close, I will outline how networks shape our world and our lives, and what we in this room can do to set a path going forward.

Let’s begin locally.

1. Pale Blue Dot


You are here. That’s a picture of Kansas, from four billion miles away.

In February 1990, after years of campaigning, Carl Sagan convinced NASA to turn the Voyager 1 spacecraft around to take a self-portrait of our home, the Earth. This is the most distant reading of humanity that has ever been produced.

I’d like to begin my keynote with Carl Sagan’s own words, his own distant reading of humanity. I’ll spare you my attempt at the accent:

Consider again that dot. That’s here. That’s home. That’s us. On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives. The aggregate of our joy and suffering, thousands of confident religions, ideologies, and economic doctrines, every hunter and forager, every hero and coward, every creator and destroyer of civilization, every king and peasant, every young couple in love, every mother and father, hopeful child, inventor and explorer, every teacher of morals, every corrupt politician, every ‘superstar,’ every ‘supreme leader,’ every saint and sinner in the history of our species lived there – on a mote of dust suspended in a sunbeam.

What a lonely picture Carl Sagan paints. We live and die in isolation, alone in a vast cosmic darkness.

I don’t like this picture. From too great a distance, everything looks the same. Every great work of art, every bomb, every life is reduced to a single point. And our collective human experience loses all definition. If we want to know what makes us, us, we must move a little closer.

2. Black Rock City


We’ve zoomed into Black Rock City, more popularly known as Burning Man, a city of 70,000 people that exists for only a week in a Nevada desert, before disappearing back into the sand until the following year. Here life is apparent; the empty desert is juxtaposed against a network of camps and cars and avenues, forming a circle with some ritualistic structure at its center.

The success of Burning Man is contingent on collaboration and coordination; on the careful allocation of resources like water to keep its inhabitants safe; on the explicit planning of organizers to keep the city from descending into chaos year after year.

And the creation of order from chaos, the apparent reversal of entropy, is an essential feature of life. Organisms and societies function through the careful coordination and balance of their constituent parts. As these parts interact, patterns and behaviors emerge which take on a life of their own.

3. Complex Systems

Thus cells combine to form organs, organs to form animals, and animals to form flocks.

We call these networks of interactions complex systems, and we study complex systems using network analysis. Network analysis as a methodology takes as a given that nothing can be properly understood in total isolation. Carl Sagan’s pale blue dot, though poignant and beautiful, is too lonely and too distant to reveal anything of us creatures who inhabit it.

We are not alone.

4. Connecting the Dots

When looking outward rather than inward, we find we are surrounded on all sides by a hundred billion galaxies each with a hundred billion stars. And for as long as we can remember, when we’ve stared up into the night sky, we’ve connected the dots. We’ve drawn networks in the stars in order to make them feel more like us, more familiar, more comprehensible.

Nothing exists in isolation. We use networks to make sense of our place in the vast complex system that contains protons and trees and countries and galaxies. The beauty of network analysis is its ability to transcend differences in scale, such that there is a place for you and for me, and our pieces interact with other pieces to construct the society we occupy. Networks allow us to see the forest and the trees, to give definition to the microcosms and macrocosms which describe the world around us.

5. Networked World

Networks open up the world. Over the past four hundred years, the reach of the West extended to the globe, overtaking trade routes created first by eastern conquerors. From these explorations, we produced new medicines and technologies. Concomitant with this expansion came unfathomable genocide and a slave trade that spanned many continents and far too many centuries.

Despite the efforts of the Western World, it could only keep the effects of globalization to itself for so long. Roads can be traversed in either direction, and the network created by Western explorers, businesses, slave traders, and militaries eventually undermined or superseded the Western centers of power. In short order, the African slave trade in the Americas led to a rich exchange of knowledge of plants and medicines between Native Americans and Africans.

In Southern and Southeast Asia, trade routes set up by the Dutch East India Company unintentionally helped bolster economies and trade routes within Asia. Captains with the company, seeking extra profits, would illicitly trade goods between Asian cities. This created more tightly-knit internal cultural and economic networks than had existed before, and contributed to a global economy well beyond the reach of the Dutch East India Company.

In the 1960s, the U.S. military began funding what would later become the Internet, a global communication network which could transfer messages at unfathomable speeds. The infrastructure provided by this network would eventually become a tool for control and surveillance by governments around the world, as well as a distribution mechanism for fuel that could topple governments in the Middle East or spread state secrets in the United States. The very pervasiveness which makes the internet particularly effective in government surveillance is also what makes it especially dangerous to governments through sites like WikiLeaks.

In short, science and technology lay the groundwork for our networked world, and these networks can be great instruments of creation, or terrible conduits of destruction.

6. Macro Scale

So here we are, occupying this tiny mote of dust suspended in a sunbeam. In the grand scheme of things, how does any of this really matter? When we see ourselves from so great a distance, it’s as difficult to be enthralled by the Sistine Chapel as it is to be disgusted by the havoc we wreak upon our neighbors.

7. Meso Scale

But networks let us zoom in, they let us keep the global system in mind while examining the parts. Here, once again, we see Kansas, quite a bit closer than before. We see how we are situated in a national and international set of interconnections. These connections come in every form, from physical transportation to electronic communication. From this scale, wars and national borders are visible. Over time, cultural migration patterns and economic exchange become apparent. This scale shows us the networks which surround and are constructed by us.


And this is the scale which is seen by the NSA and the CIA, by Facebook and Google, by social scientists and internet engineers. Close enough to provide meaningful aggregations, but far enough that individual lives remain private and difficult to discern. This scale teaches us how epidemics spread, how minorities interact, how likely some city might be a target for the next big terrorist attack.

From here, though, it’s impossible to see the hundred hundred towns whose factories have closed down, leaving many unable to feed their families. It’s difficult to see the small but endless inequalities that leave women and minorities systematically underappreciated and exploited.

8. Micro Scale


We can zoom in further still, Lawrence Kansas at a few hundred feet, and if we watch closely we can spot traffic patterns, couples holding hands, how the seasons affect people’s activities. This scale is better at betraying the features of communities, rather than societies.

But for tech companies, governments, and media distributors, it’s all-too-easy to miss the trees for the forest. When they look at the networks of our lives, they do so in aggregate. Indeed, privacy standards dictate that the individual be suppressed in favor of the community, of the statistical average that can deliver the right sort of advertisement to the right sort of customer, without ever learning the personal details of that customer.

This strange mix of individual personalization and impersonal aggregation drives quite a bit of the modern world. Carefully micro-targeted campaigning is credited with President Barack Obama’s recent presidential victories, driven by a hundred data scientists in an office in Chicago in lieu of thousands of door-to-door canvassers. Three hundred million individually crafted advertisements without ever having to look a voter in the face.

9. Target

And this mix of impersonal and individual is how Target makes its way into the wombs of its shoppers. We saw this play out a few years ago when a furious father went to complain to a Target store manager. Why, he asked the manager, is my high school daughter getting ads for maternity products in the mail? After returning home, the father spoke to his daughter to discover she was, indeed, pregnant. How did this happen? How’d Target know?

It turns out, Target uses credit cards, phone numbers, and e-mail addresses to give every customer a unique ID. Target discovered about 25 products that, if purchased in a certain sequence by a single customer, are pretty indicative of a customer’s pregnancy. What’s more, the dates of those purchases can pretty accurately predict when the baby will be delivered. Unscented lotion, magnesium, cotton balls, and washcloths are all on that list.
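Target has never published the model, but the gist, as reported, is a weighted score over recent purchases: each flagged product nudges a customer’s score up, and past some threshold she lands on the maternity mailing list (the real version also uses purchase timing to estimate a due date, which this ignores). A purely illustrative sketch; the products, weights, and threshold below are all made up:

```python
# Purely illustrative: the real product list, weights, and cutoff are Target's secret.
PREGNANCY_WEIGHTS = {
    "unscented lotion": 2.0,
    "magnesium supplements": 1.5,
    "cotton balls": 1.0,
    "washcloths": 1.0,
    "large tote bag": 0.5,
}
THRESHOLD = 3.5  # made-up cutoff for "send the diaper coupons"

def pregnancy_score(purchases: list[str]) -> float:
    """Sum the weights of any flagged products in a customer's recent purchases."""
    return sum(PREGNANCY_WEIGHTS.get(item.lower(), 0.0) for item in purchases)

recent = ["Unscented Lotion", "cotton balls", "washcloths", "dish soap"]
score = pregnancy_score(recent)
print(score, "-> flag for baby coupons" if score >= THRESHOLD else "-> no flag")
```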

When Target’s system learns one of its customers is probably pregnant, it does its best to profit from that pregnancy, sending appropriately timed coupons for diapers and bottles. This backfired, creeping out customers and invading their privacy, as with the angry father who didn’t know his daughter was pregnant. To remedy the situation, rather than ending the personalized advertising, Target began interspersing ads for unrelated products with the personalized ones in order to trick the customer into thinking the ads were random or general. All the while, a good portion of the coupons in the book were still targeted directly towards those customers.

One Target executive told a New York Times reporter:

We found out that as long as a pregnant woman thinks she hasn’t been spied on, she’ll use the coupons. She just assumes that everyone else on her block got the same mailer for diapers and cribs. As long as we don’t spook her, it works.

The scheme did work, raising Target’s profits by billions of dollars by subtly matching their customers with coupons they were likely to use. 

10. Presidential Elections

Political campaigns have also enjoyed the successes of microtargeting. President Bush’s 2004 campaign pioneered this technique, targeting socially conservative Democratic voters in key states in order to either convince them not to vote, or to push them over the line to vote Republican. This strategy is credited with increasing the pro-Bush African American vote in Ohio from 9% in 2000 to 16% in 2004, appealing to anti-gay marriage sentiments and other conservative values.

The strategy is also celebrated for President Obama’s 2008 and especially 2012 campaigns, where his staff maintained a connected and thorough database of a large portion of American voters. They knew, for instance, that people who drink Dr. Pepper, watch the Golf Channel, drive a Land Rover, and eat at Cracker Barrel are both very likely to vote, and very unlikely to vote Democratic. These insights led to the right political ads targeted exactly at those they were most likely to sway.

So what do these examples have to do with networks? These examples utilize, after all, the same sorts of statistical tools that have always been available to us, only with a bit more data and power to target individuals thrown in the mix.

It turns out that networks are the next logical step in the process of micronudging, the mass targeting of individuals based on their personal lives in order to influence them toward some specific action.

In 2010, a Facebook study, piggy-backing on social networks, influenced about 340,000 additional people to vote in the US mid-term elections. A team of social scientists at UCSD experimented on 61 million Facebook users in order to test the influence of social networks on political action.

A portion of American Facebook users who logged in on election day were given the ability to press an “I voted” button, which shared the fact that they voted with their friends. Facebook then presented users with pictures of their friends who voted, and it turned out that these messages increased voter turnout by about 0.4%. Further, those who saw that close friends had voted were more likely to go out and vote than those who had seen that distant friends voted. The study was framed as “voting contagion” – how well does the action of voting spread among close friends?
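For what it’s worth, the headline number is just a difference in turnout rates between people who saw the social message and people who didn’t, scaled up by the size of the treated group. A toy illustration of that arithmetic, with round made-up numbers rather than the study’s:

```python
# Toy numbers, not the study's: compare turnout with and without the social message.
treated_users = 60_000_000
treated_voters = 21_840_000   # hypothetical: 36.4% turnout among those shown friends' faces

control_users = 600_000
control_voters = 216_000      # hypothetical: 36.0% turnout among those shown nothing

uplift = treated_voters / treated_users - control_voters / control_users
extra_votes = uplift * treated_users

print(f"Turnout uplift: {uplift:.2%}")               # 0.40%
print(f"Estimated extra votes: {extra_votes:,.0f}")  # 240,000 with these toy numbers
```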

This large increase in voter turnout was prompted by a single message on Facebook spread among a relatively small subset of its users. Imagine that, instead of a research question, the study was driven by a particular political campaign. Or, instead, imagine that Facebook itself had some political agenda – it’s not too absurd a notion to imagine.

11. Blackout


In fact, on January 18, 2012, a great portion of the social web rallied under a single political agenda. An internet blackout. In protest of two proposed U.S. congressional laws that threatened freedom of speech on the Web, SOPA and PIPA, 115,000 websites voluntarily blacked out their homepages, replacing them with pleas to petition Congress to stop the bills.

Reddit, Wikipedia, Google, Mozilla, Twitter, Flickr, and others asked their users to petition Congress, and it worked. Over 3 million people emailed their congressional representatives directly, another million sent a pre-written message to Congress from the Electronic Frontier Foundation, a Google petition reached 4.5 million signatures, and lawmakers ultimately collected the names of over 14 million people who protested the bills. Unsurprisingly, the bills were never put up to vote.

These techniques are increasingly being leveraged to influence consumers and voters into acting in-line with whatever campaign is at hand. Social networks and the social web, especially, are becoming tools for advertisers and politicians.

12a. Facebook and Social Guessing

In 2010, Tim Tangherlini invited a few dozen computer scientists, social scientists, and humanists to a two-week intensive NEH-funded summer workshop on network analysis for the humanities. Math camp for nerds, we called it. The environment was electric with potential projects and collaborations, and I’d argue it was this workshop that really brought network analysis to the humanities in force.

During the course of the workshop, one speaker sticks out in my memory: a data scientist at Facebook. He reached the podium, like so many did during those two weeks, and described the amazing feats they were able to perform using basic linguistic and network analyses. We can accurately predict your gender and race, he claimed, regardless of whether you’ve told us. We can learn your political leanings, your sexuality, your favorite band.

Much like most talks from computer scientists at the event, the purpose was to show off the power of large-scale network analysis when applied to people, and didn’t focus much on its application. The speaker did note, however, that they used these measurements to effectively advertise to their users; electronics vendors could advertise to wealthy 20-somethings; politicians could target impoverished African Americans in key swing states.

It was a few throw-away lines in the presentation, but the force of the ensuing questions revolved around those specifically. How can you do this without any sort of IRB oversight? What about the ethics of all this? The Facebook scientist’s responses were telling: we’re not doing research, we’re just running a business.

And of course, Facebook isn’t the only business doing this. The Twitter analytics dashboard allows you to see your male-to-female follower ratio, even though users are never asked their gender. Gender is guessed based on features of language and interactions, and they claim around 90% accuracy.
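The general recipe behind that kind of guessing is ordinary supervised text classification: take accounts whose gender is already known, turn their posts into word counts, and train a model to map one to the other. This is not Twitter’s actual pipeline, just a toy sketch of the technique using scikit-learn, with four fake posts standing in for the millions of labeled accounts a real system would train on:

```python
# Not Twitter's pipeline -- just the generic recipe: bag-of-words features from a
# user's posts, fed to a classifier trained on accounts with known gender labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "watching the game with the guys tonight",
    "my wife and I are grilling this weekend",
    "girls night out, so excited",
    "my husband never does the dishes",
]
labels = ["m", "m", "f", "f"]  # toy labels; real systems use self-reported profiles

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(posts, labels)

print(model.predict(["anyone else excited for girls night?"]))  # predicts 'f' on this toy data
```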

Google, when it targets ads towards you as a user, makes some predictions based on your search activity. Google guessed, without my telling it, that I am a 25-34 year old male who speaks English and is interested in, among other things, Air Travel, Physics, Comics, Outdoors, and Books. Pretty spot-on.

12b. Facebook and Emotional Contagion

And, as we saw with the Facebook voting study, social web services are not merely capable of learning about you; they are capable of influencing your actions. Recently, this ethical question has pushed its way into the public eye in the form of another Facebook study, this one about “emotional contagion.”

A team of researchers and Facebook data scientists collaborated to learn the extent to which emotions spread through a social network. They selectively filtered the messages seen by about 700,000 Facebook users, making sure that some users only saw emotionally positive posts by their friends, and others only saw emotionally negative posts. After some time passed, they showed that users who were presented with positive posts tended to post positive updates, and those presented with negative posts tended to post negative updates.

The study stirred up quite the controversy, and for a number of reasons. I’ll unpack a few of them:

First of all, there were worries about the ethics of consent. How could Facebook do an emotional study of 700,000 users without getting their consent, first? The EULA that everyone clicks through when signing up for Facebook only has one line saying that data may be used for research purposes, and even that line didn’t appear until several months after the study occurred.

A related issue raised was one of IRB approval: how could the editors at PNAS have approved the study given that the study took place under Facebook’s watch, without an external Institutional Review Board? Indeed, the university-affiliated researchers did not need to get approval, because the data were gathered before they ever touched the study. The counter-argument was that, well, Facebook conducts these sorts of studies all the time for the purposes of testing advertisements or interface changes, as does every other company, so what’s the problem?

A third issue discussed was one of repercussions: if the study showed that Facebook could genuinely influence people’s emotions, did anyone in the study physically harm themselves as a result of being shown a primarily negative newsfeed? Should Facebook be allowed to wield this kind of influence? Should they be required to disclose such information to their users?

The controversy spread far and wide, though I believe for the wrong reasons, which I’ll explain shortly. Social commentators decried the lack of consent, arguing that PNAS shouldn’t have published the paper without proper IRB approval. On the other side, social scientists argued the Facebook backlash was antiscience and would cause more harm than good. Both sides made valid points.

One well-known social scientist noted that the Age of Exploration, when scientists finally started exploring the further reaches of the Americas and Africa, was attacked by poets and philosophers and intellectuals as being dangerous and unethical. But, he argued, did not that exploration bring us new wonders? Miracle medicines and great insights about the world and our place in it?

I call bullshit. You’d be hard-pressed to find a period more rife with slavery and genocide and other horrible breaches of human decency than that Age of Exploration. We can’t sacrifice human decency in the name of progress. On the flip-side, though, we can’t sacrifice progress for the tiniest fears of misconduct. We must proceed with due diligence to ethics without being crippled by inefficacy.

But this is all a red herring. The issue here isn’t whether and to what extent these activities are ethical science, but to what extent they are ethical period, and if they aren’t, what we should do about it. We can’t have one set of ethical standards for researchers, and another for businesses, but that’s what many of the arguments in recent months have boiled down to. Essentially, it was argued, Facebook does this all the time. It’s something called A/B testing: they make changes for some users and not others, and depending on how the users react, they change the site accordingly. It’s standard practice in web development.

13. An FDA/FTC for Data?

It is surprising, then, that the crux of the anger revolved around the published research. Not that Facebook shouldn’t do A/B testing, but that researchers shouldn’t be allowed to publish on it. This seems to be the exact opposite of what should be happening: if indeed every major web company practices these methods already, then scholarly research on how such practices can sway emotions or voting practices are exactly what we need. We must bring these practices to light, in ways the public can understand, and decide as a society whether they cross ethical boundaries. A similar discussion occurred during the early decades of the 20th century, when the FDA and FTC were formed, in part, to prevent false advertising of snake oils and foods and other products.

We are at the cusp of a new era. The mix of big data, social networks, media companies, content creators, government surveillance, corporate advertising, and ubiquitous computing is a perfect storm for intense influence both subtle and far-reaching. Algorithmic nudging has the power to sell products, win elections, topple governments, and oppress a people, depending on how it is wielded and by whom. We have seen this work from the bottom-up, in Occupy Wall Street, the revolutions in the Middle East, and the ALS Ice Bucket Challenge, and from the top-down in recent presidential campaigns, Facebook studies, and coordinated efforts to preserve net neutrality. And these have been works of non-experts: people new to this technology, scrambling in the dark to develop the methods as they are deployed. As we begin to learn more about network-based control and influence, these examples will multiply in number and audacity.

14. Surveillance

And this story leaves out one of the biggest players of all: government. When Edward Snowden leaked the details of classified NSA surveillance programs, the world was shocked at the government’s interest in and capacity for omniscience. Data scientists, on the other hand, were mostly surprised that people didn’t realize this was happening. If the technology is there, you can bet it will be used.

And so here, in the NSA’s $1.5 billion data center in Utah, are the private phone calls, parking receipts, emails, and Google searches of millions of American citizens. It stores a few exabytes of our data, over a billion gigabytes and roughly equivalent to a hundred thousand times the size of the Library of Congress. More than enough space, really.

The humanities have played some role in this complex machine. During the Cold War, the U.S. government covertly supported artists and authors to create cultural works which would spread American influence abroad and improve American sentiment at home.

Today the landscape looks a bit different. For the last few years DARPA, the research branch of the U.S. Department of Defense, has been funding research and hosting conferences in what they call “Narrative Networks.” Computer scientists, statisticians, linguists, folklorists, and literary scholars have come together to discuss how ideas spread and, possibly, how to inject certain sentiments into specific communities. It’s a bit like the science of memes, or of propaganda.

Beyond this initiative, DARPA funds have gone toward several humanities-supported projects to develop actionable plans for the U.S. military. One project, for example, creates as-complete-as-possible simulations of cultures overseas, which can model how groups might react to the dropping of bombs or the spread of propaganda. These models can be used to aid officers as they make life-and-death decisions on behalf of troops, enemies, and foreign citizens. Unsurprisingly, these initiatives, as well as NSA surveillance at home, all rely heavily on network analysis.

In fact, when the news broke that Osama bin Laden and Saddam Hussein had been tracked down via network analysis, some of my family called me after reading the newspapers to say, “we finally understand what you do!” This wasn’t the reaction I was hoping for.

In short, the world is changing incredibly rapidly, in large part driven by the availability of data, network science and statistics, and the ever-increasing role of technology in our lives. Are these corporate, political, and grassroots efforts overstepping their bounds? We honestly don’t know. We are only beginning to have sustained, public discussions about the new role of technology in society, and the public rarely has enough access to information to make informed decisions. Meanwhile, media and web companies may be forgiven for overstepping ethical boundaries, as our culture hasn’t quite gotten around to drawing those boundaries yet.

15. The Humanities’ Place

This is where the humanities come in – not because we have some monopoly on ethics (goodness knows the way we treat our adjuncts is proof we do not) – but because we are uniquely suited to the small scale. To close reading. While what often sets the digital humanities apart from its analog counterpart is the distant reading, the macroanalysis, what sets us all apart is our unwillingness to stray too far from the source. We intersperse the distant with the close, attempting to reintroduce the individual into the aggregate.

Network analysis, not coincidentally, is particularly suited to this endeavor. While recent efforts in sociophysics have stressed the importance of the grand scale, let us not forget that network theory was built on the tiniest of pieces in psychology and sociology, used as a tool to explore individuals and their personal relationships. In the intervening years, all manner of methods have been created to bridge macro and micro, from Granovetter’s theory of weak ties to Milgram’s small-world experiments, to the ways in which people navigate the networks they find themselves in. Networks work at every scale, situating the macro against the meso against the micro.

But we find ourselves in a world that does not adequately utilize this feature of networks, and is increasingly making decisions based on convenience and money and politics and power without taking the human factor into consideration. And it’s not particularly surprising: it’s easy, in the world of exabytes of data, to lose the trees for the forest.

This is not a humanities problem. It is not a network scientist problem. It is not a question of the ethics of research, but of the ethics of everyday life. Everyone is a network scientist. From Twitter users to newscasters, the boundary between people who consume and people who are aware of and influence the global social network is blurring, and we need to deal with that. We must collaborate with industries, governments, and publics to become ethical stewards of this networked world we find ourselves in.

16. Big and Small

Your challenge, as researchers on the forefront of network analysis and the humanities, is to tie the very distant to the very close. To do the research and outreach that is needed to make companies, governments, and the public aware of how perturbations of the great mobile that is our society affect each individual piece.

We have a number of routes available to us, in this respect. The first is in basic research: the sort that got those Facebook study authors in such hot water. We need to learn and communicate the ways in which pervasive surveillance and algorithmic influence can affect people’s lives and steer societies.

A second path towards influencing an international discussion is in the development of new methods that highlight the place of the individual in the larger network. We seem to have a critical mass of humanists collaborating with or becoming computer scientists, and this presents a perfect opportunity to create algorithms which highlight a node’s uniqueness, rather than its similarity.
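As one hypothetical illustration of what such a method could look like (my own sketch, not an established algorithm), we might score each node by how little its neighborhood overlaps with its neighbors’ neighborhoods, flipping the usual similarity measures on their head:

```python
import networkx as nx

def uniqueness(G: nx.Graph, node) -> float:
    """Score a node by how little its neighborhood overlaps with its neighbors'
    neighborhoods: near 1 means structurally one-of-a-kind, near 0 means interchangeable."""
    neighbors = set(G[node])
    if not neighbors:
        return 1.0
    overlaps = []
    for other in neighbors:
        other_neighbors = set(G[other])
        union = neighbors | other_neighbors
        intersection = neighbors & other_neighbors
        overlaps.append(len(intersection) / len(union))
    return 1.0 - sum(overlaps) / len(overlaps)

# Toy example: a tiny correspondence network of five fictional letter-writers.
G = nx.Graph([("Ada", "Ben"), ("Ada", "Cleo"), ("Ben", "Cleo"),
              ("Cleo", "Dev"), ("Dev", "Elif")])
for person in G:
    print(person, round(uniqueness(G, person), 2))
```

The specifics matter far less than the orientation: the measure asks what makes this node unlike its peers, rather than which cluster it dissolves into.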

A third is public engagement that extends beyond the academy and takes place online, in newspapers or essays, in interviews, in the creation of tools or museum exhibits. The MIT Media Lab, for example, created a tool after the Snowden leaks that allows users to download their email metadata to reveal the networks they form. The tool was a fantastic example of a way to show the public exactly what “simply metadata” can reveal about a person, and its viral spread was a testament to its effectiveness. Mike Widner of Stanford called for exactly this sort of engagement from digital humanists a few years ago, and it is remarkable how little that call has been heeded.
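To give a sense of how little it takes to draw such a picture, here is a toy sketch (my own, not the Media Lab’s code) that turns nothing but hypothetical sender-and-recipient pairs into a social network:

```python
import networkx as nx

# Hypothetical email metadata: just sender and recipient, no message bodies.
metadata = [
    ("alice@example.com", "bob@example.com"),
    ("alice@example.com", "carol@example.com"),
    ("bob@example.com", "carol@example.com"),
    ("carol@example.com", "dana@example.com"),
]

G = nx.Graph()
for sender, recipient in metadata:
    # Weight each edge by how often the pair corresponds.
    if G.has_edge(sender, recipient):
        G[sender][recipient]["weight"] += 1
    else:
        G.add_edge(sender, recipient, weight=1)

# Even without reading a single message, centrality hints at who sits
# at the heart of someone's social world.
print(nx.degree_centrality(G))
```

No message bodies, no subject lines, and already the shape of a person’s social world falls out of a centrality score.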

Pedagogy is a fourth option. While people cry that the humanities are dying, every student in the country will have taken many humanities-oriented courses by the time they graduate. These courses, ostensibly, teach them about what it means to be human in our complex world. Alongside the history, the literature, the art, let’s teach what it means to be part of a global network, constantly contributing to and being affected by its shadow.

With luck, reconnecting the big with the small will hasten a national discussion of the ethical norms of big data and network analysis. This could result in new government regulatory agencies, ethical standards for media companies, or changes in the ways people interact with and behave on the social web.

17. Going Forward

When you zoom out far enough, everything looks the same. Occupy Wall Street; the Ferguson riots; the ALS Ice Bucket Challenge; the Iranian Revolution. They’re all just grassroots contagion effects across a social network. Rhetorically, presenting everything as a massive network is the same as photographing the earth from four billion miles away: beautiful, sobering, and homogenizing. I challenge you to compare network visualizations of Ferguson Tweets with the ALS Ice Bucket Challenge, and see if you can make out any differences. I couldn’t. We need to zoom in to make meaning.

The challenge of network analysis in the humanities is to bring our close reading perspectives to the distant view, so media companies and governments don’t see everyone as just another statistic, a blip floating on this pale blue dot.

I will end as I began, with a quote from Carl Sagan, reflecting on a time gone by but every bit as relevant for the moment we face today:

I know that science and technology are not just cornucopias pouring good deeds out into the world. Scientists not only conceived nuclear weapons; they also took political leaders by the lapels, arguing that their nation — whichever it happened to be — had to have one first. … There’s a reason people are nervous about science and technology. And so the image of the mad scientist haunts our world—from Dr. Faust to Dr. Frankenstein to Dr. Strangelove to the white-coated loonies of Saturday morning children’s television. (All this doesn’t inspire budding scientists.) But there’s no way back. We can’t just conclude that science puts too much power into the hands of morally feeble technologists or corrupt, power-crazed politicians and decide to get rid of it. Advances in medicine and agriculture have saved more lives than have been lost in all the wars in history. Advances in transportation, communication, and entertainment have transformed the world. The sword of science is double-edged. Rather, its awesome power forces on all of us, including politicians, a new responsibility — more attention to the long-term consequences of technology, a global and transgenerational perspective, an incentive to avoid easy appeals to nationalism and chauvinism. Mistakes are becoming too expensive.

Let us take Carl Sagan’s advice to heart. Amidst cries from commentators on the irrelevance of the humanities, it seems there is a large void which we are both well-suited and morally bound to fill. This is the path forward.

Thank you.


Thanks to Nickoal Eichmann and Elijah Meeks for editing & inspiration.

Stanford Musings

It’s official: I am Stanford’s new DH data scientist from May to August. What does that mean? I haven’t the foggiest idea – I think figuring that out is part of my job description. Over the next few months, I’ll be assisting a small platoon of Stanfordites with their networks, their visualizations, their data, and who knows, maybe their love lives. I’m reporting to the inimitable Glen Worthey and the indomitable Elijah Meeks, who will keep me on the straight and narrow. I’ll also be blogging, teaching workshops, writing papers, and crunching numbers, all under the Stanford banner.

This announcement is on the heels of my recent trip to Stanford, and I have to say, I was incredibly impressed by the operation they had going there. The library has at least three branches under which DH projects occur, and of particular interest are the Academic Technology Specialists like Mike Widner. Half a dozen of them are embedded in different schools around campus, and they act as technology liaisons and researchers within those schools, supporting faculty projects, developing their own research, and just generally fostering a fantastic digital humanities presence on the Stanford campus.

Stanford! Did you know it’s actually “Leland Stanford Junior University”? Weird, right?

Then there’s Elijah Meeks and Karl Grossner. Do you know those TV shows where contestants vie for a fancy house from some team of super creative builders? They basically do that, except instead of offering cool new digs, they offer their impressive technical services for a few months. There’s also the Lit Lab, CESTA, the DH Focal Group, and probably a dozen other projects which do DH on campus in some way or another.

As far as I can tell, I’ll be just one more chaotic agent in this complex DH environment. Many of the big projects going on at Stanford rely in some way on networks, and I’m going to try to bring them all together and set agendas for how they can best utilize and analyze the networks at hand. I’ll also design some tools that’ll make it easier for future network-y projects to get off the ground. There’s also a bunch of Famous Network Scientists who operate out of Stanford, and I plan on nurturing some collaborations between them, the DH community, and some humanities-curious tenants of Silicon Valley.

It will be interesting to see how this position unfolds. As far as I’m aware, the “resident data scientist” model for DH is an untried one at any university, and I’m lucky and honored that Stanford has decided to take a chance on such a new position with me at the helm. If this proves successful, it will be further evidence that libraries can play a powerful role in fostering DH on campus. Of course there’s also the chance I could fail spectacularly, but in true DH tradition, I believe even a public failure would be a worthy outcome. If the process works, great; if not, we’ll know what to fix for the next try.