The Route of a Text Message

[Note: Find the more polished, professionally illustrated version of this piece at Motherboard|Vice!]

This is the third post in my full-stack dev (f-s d) series on the secret life of data. This installment is about a single text message: how it was typed, stored, sent, received, and displayed. I sprinkle in some history and context to break up the alphabet soup of protocols, but though the piece gets technical, it should all be easily understood.

The first two installments of this series are Cetus, about the propagation of errors in a 17th century spreadsheet, and Down the Rabbit Hole, about the insane lengths one occasionally needs to go through to track down the source of a dataset.

A Love Story

My leg involuntarily twitches with vibration—was it my phone, or just a phantom feeling?—and a quick inspection reveals a blinking blue notification. “I love you”, my wife texted me. I walk downstairs to wish her goodnight, because I know the difference between the message and the message, you know?

It’s a bit like encryption, or maybe steganography: anyone can see the text, but only I can decode the hidden data.

My translation, if we’re being honest, is just one extra link in a remarkably long chain of data events, all to send a message (“come downstairs and say goodnight”) in under five seconds across about 40 feet.

The message presumably began somewhere in my wife’s brain and somehow ended up in her thumbs, but that’s a signal for a different story. Ours begins as her thumb taps a translucent screen, one letter at a time, and ends as light strikes my retinas.

Through the Looking Glass

With each tap, a small electrical current passes from the screen to her hand. Because electricity flows easily through human bodies, sensors on the phone register a change in voltage wherever her thumb presses against the screen. But the world is messy, and the phone senses random fluctuations in voltage across the rest of the screen, too, so an algorithm determines the biggest, thumbiest-looking voltage fluctuations and assumes that’s where she intended to press.


Figure 0. Capacitive touch.

So she starts tap-tap-tapping on the keyboard, one letter at a time.

I-spacebar-l-o-v-e-spacebar-y-o-u.

She’s not a keyboard swiper (I am, but somehow she still types faster than me). The phone reliably records the (x,y) coordinates of each thumbprint and aligns it with the coordinates of each key on the screen. It’s harder than you think; sometimes her thumb slips, yet somehow the phone realizes she’s not trying to swipe, that it was just a messy press.

Deep in the metal guts of the device, an algorithm tests whether each thumb-shaped voltage disruption moves more than a certain number of pixels, called touch slop. If the movement is sufficiently small, the phone registers a keypress rather than a swipe.

Fig 1. Android’s code for detecting ‘touch slop’. Notice the developers had my wife’s gender in mind.
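The slop test described above can be sketched in a few lines. This is only an illustration, not Android's actual code: the function name and the 8-pixel threshold are made up for the example, and real devices scale the threshold to the screen's pixel density.

```python
import math

# Hypothetical threshold in pixels; the real value depends on the device and OS.
TOUCH_SLOP_PX = 8.0

def classify_touch(down, up, slop=TOUCH_SLOP_PX):
    """Return 'tap' if the finger moved no farther than `slop` pixels, else 'swipe'."""
    dx = up[0] - down[0]
    dy = up[1] - down[1]
    distance = math.hypot(dx, dy)  # straight-line distance between press and release
    return "tap" if distance <= slop else "swipe"

print(classify_touch((100, 200), (103, 204)))  # thumb slipped 5 px: tap
print(classify_touch((100, 200), (160, 200)))  # thumb travelled 60 px: swipe
```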

She finishes her message, a measly 10 characters of her allotted 160.

The allotment of 160 characters is a carefully chosen number, if you believe the legend: In 1984, German telephone engineer Friedhelm Hillebrand sat at his typewriter and wrote as many random sentences as came to his mind. His team then looked at postcards and telex messages, and noticed most fell below 160 characters. “Eureka!”, they presumably yelled in German, before setting the character limit of text messages in stone for the next three-plus decades.

Character Limits & Legends

Legends rarely tell the whole story, and the legend of SMS is no exception. Hillebrand and his team hoped to relay messages over a secondary channel that phones were already using to exchange basic information with their local stations.

Signalling System no. 7 (SS7) is a set of protocols used by cell phones to stay in constant contact with their local tower; they need this continuous connection to know when to ring, to get basic location tracking, to check for voicemail, and to communicate other messages that don’t rely on the internet. Since the protocol’s creation in 1980, it has had a hard limit of 279 bytes of information. If Hillebrand wanted text messages to piggyback on the SS7 protocol, he had to deal with this pretty severe limit.

Normally, 279 bytes equals 279 characters. A byte is eight bits (each bit is a 0 or 1), and in common encodings, a single letter is equivalent to eight 0s and 1s in a row.

‘A’ is

0100 0001

‘B’ is

0100 0010

‘C’ is

0100 0011

and so on.
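If you'd rather not take those 0s and 1s on faith, a couple of lines of Python will print them. The helper name `to_bits` is just something chosen for this sketch.

```python
def to_bits(ch):
    """Format a character's code as eight bits, split into two groups of four."""
    bits = format(ord(ch), "08b")
    return bits[:4] + " " + bits[4:]

for ch in "ABC":
    print(ch, to_bits(ch))
# A 0100 0001
# B 0100 0010
# C 0100 0011
```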

Unfortunately, getting messages across the SS7 protocol isn’t a simple matter of sending 2,232 (that’s 279 bytes at 8 bits each) 0s or 1s through radio signals from my phone to yours. Part of that 279-byte signal needs to contain your phone number, and part of it needs to contain my phone number. Part of it needs to let the cell tower know “hey, this is a message, not a call, don’t ring the phone!”.

By the time Hillebrand and his team finished cramming all the necessary contextual bits into the 279-byte signal, they were left with only enough space for 140 characters at 1 byte (8 bits) apiece, or 1,120 bits.

But what if they could encode a character in only 7 bits? At 7 bits per character, they could squeeze 160 (1,120 / 7 = 160) characters into each SMS, but those extra twenty characters demanded a sacrifice: fewer possible letters.

An 8-bit encoding allows 256 possible characters: lowercase ‘a’ takes up one possible space, uppercase ‘A’ another space, a period takes up a third space, an ‘@’ symbol takes up a fourth space, a line break takes up a fifth space, and so on up to 256. To squeeze an alphabet down to 7 bits, you need to remove some possible characters: the 1/2 symbol (½), the degree symbol (°), the pi symbol (π), and so on. But assuming people will never use those symbols in text messages (a poor assumption, to be sure), this allowed Hillebrand and his colleagues to stuff 160 characters into a 140-byte space, which in turn fit neatly into a 279-byte SS7 signal: exactly the number of characters they claim to have discovered was the perfect length of a message. (A bit like the miracle of Hanukkah, if you ask me.)
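The trade-off is easy to check with a bit of arithmetic. Here's a small sketch using only the numbers quoted above; the constant and function names are mine, not anything from the SMS specification.

```python
SS7_LIMIT_BYTES = 279   # hard limit on an SS7 signal, as quoted above
PAYLOAD_BYTES = 140     # what's left for the text after the contextual bits

signal_bits = SS7_LIMIT_BYTES * 8   # 2232
payload_bits = PAYLOAD_BYTES * 8    # 1120

def capacity(bits_per_char):
    """How many characters fit in the payload at a given width."""
    return payload_bits // bits_per_char

def alphabet_size(bits_per_char):
    """How many distinct characters a given width can represent."""
    return 2 ** bits_per_char

for width in (8, 7):
    print(width, "bits:", capacity(width), "characters from an alphabet of", alphabet_size(width))
# 8 bits: 140 characters from an alphabet of 256
# 7 bits: 160 characters from an alphabet of 128
```

Twenty extra characters per message, paid for with half the alphabet.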

Fig 2. The GSM-7 character set.

So there my wife is, typing “I love you” into a text message, all the while the phone converts those letters into this 7-bit encoding scheme, called GSM-7.

“I” (notice it’s at the intersection of 4x and x9 above) =

49 

Spacebar (notice it’s at the intersection of 2x and x0 above) =

20 

“l” =

6C

“o” =

6F

and so on down the line.

In all, her slim message becomes:

49 20 6C 6F 76 65 20 79 6F 75 

(10 bytes combined). Each two-character code, called a hex code, is one 8-bit chunk, and together it spells “I love you”.
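You can reproduce those hex codes yourself. One caveat to flag: the sketch below uses Python's `ord`, which returns ASCII/Unicode values. That happens to match GSM-7 for plain letters and the space, but it would give the wrong answer for characters such as '@' or '$', which sit elsewhere in the GSM-7 table.

```python
message = "I love you"

# For letters and spaces, GSM-7 and ASCII assign the same codes,
# so ord() is good enough for this particular message.
hex_codes = " ".join(format(ord(ch), "02X") for ch in message)

print(hex_codes)       # 49 20 6C 6F 76 65 20 79 6F 75
print(len(message))    # 10
```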

But this is actually not how the message is stored on her phone. It has to convert the 8-bit text to 7-bit hex codes, which it does by essentially borrowing the remaining bit at the end of every byte. The math is a bit more complicated than is worth getting into here, but the resulting message appears as

49 10 FB 6D 2F 83 F2 EF 3A 

(9 bytes in all) in her phone.
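For the curious, the math is less scary in code than in prose. The sketch below lines up the 7-bit codes end to end, lowest bit first, and slices the result into 8-bit bytes. The function name is my own, and as before it leans on `ord`, so it's only trustworthy for characters whose GSM-7 and ASCII codes coincide.

```python
def pack_septets(text):
    """Pack 7-bit character codes into bytes, least significant bits first."""
    acc = 0      # bit accumulator
    nbits = 0    # how many valid bits are currently waiting in acc
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError("not a 7-bit character: %r" % ch)
        acc |= code << nbits   # append the new septet above the leftover bits
        nbits += 7
        while nbits >= 8:      # emit every complete byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                  # flush the final partial byte, zero-padded
        out.append(acc & 0xFF)
    return bytes(out)

packed = pack_septets("I love you")
print(" ".join(format(b, "02X") for b in packed))  # 49 10 FB 6D 2F 83 F2 EF 3A
```

Each byte "borrows" its top bits from the next character, which is why the first byte survives as 49 and everything after it looks scrambled.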

When my wife finally finishes her message (it takes only a few seconds), she presses ‘send’ and a host of tiny angels retrieve the encoded message, flutter their invisible wings the 40 feet up to the office, and place it gently into my phone. The process isn’t entirely frictionless, which is why my phone vibrates lightly upon delivery.

The so-called “telecommunication engineers” will tell you a different story, and for the sake of completeness I’ll relay it to you, but I wouldn’t trust them if I were you.

SIM-to-Send

The engineers would say that, when the phone senses voltage fluctuations over the ‘send’ button, it sends the encoded message to the SIM card (that tiny card your cell provider puts in your phone so it knows what your phone number is), and in the process it wraps it in all sorts of useful contextual data. By the time it reaches my wife’s SIM, it goes from a 140-byte message (just the text) to a 176-byte message (text + context).

The extra 36 bytes are used to encode all sorts of information, seen below.

Fig 3. Here, bytes are called octets (8 bits). Counting all possible bytes yields 174 (10+1+1+12+1+1+7+1+140). The other two bytes are reserved for some SIM card bookkeeping.
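The byte counting in that caption can be double-checked with a quick tally. The field abbreviations are the ones used in the walkthrough that follows; the sizes are the maximums from the caption, and the variable names are just for this sketch.

```python
# Maximum size of each field in bytes, in the order the caption adds them up.
FIELD_SIZES = [
    ("SCA", 10), ("PDU-type", 1), ("MR", 1), ("DA", 12), ("PID", 1),
    ("DCS", 1), ("VP", 7), ("UDL", 1), ("UD", 140),
]
SIM_BOOKKEEPING_BYTES = 2  # the two bytes reserved by the SIM card

message_bytes = sum(size for _, size in FIELD_SIZES)
record_bytes = message_bytes + SIM_BOOKKEEPING_BYTES
context_bytes = record_bytes - 140  # everything that isn't the text itself

print(message_bytes, record_bytes, context_bytes)  # 174 176 36
```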

The first ten bytes are reserved for the telephone number (service center address, or SCA) of the SMS service center (SMSC), tasked with receiving, storing, forwarding, and delivering text messages. It’s essentially a switchboard: my wife’s phone sends out a signal to the local cell tower and gives it the number of the SMSC, which forwards her text message from the tower to the SMSC. The SMSC, which in our case is operated by AT&T, routes the text to the mobile station nearest to my phone. Because I’m sitting three rooms away from my wife, the text just bounces back to the same mobile station, and then to my phone.

Fig 4. SMS cellular network

The next byte (PDU-type) encodes some basic housekeeping on how the phone should interpret the message, including whether it was sent successfully, whether the carrier requests a status report, and (importantly) whether this is a single text or part of a string of connected messages.

The byte after the PDU-Type is the message reference (MR). It’s a number between 0 and 255, and is essentially used as a short-term ID number to let the phone and the carrier know which text message it’s dealing with. In my wife’s case the number is set to 0, because her phone has its own message ID system independent of this particular file.

The next twelve bytes or so are reserved for the recipient’s phone number, called the destination address (DA). With the exception of the 7-bit character encoding I mentioned earlier, which helps us stuff 160 letters into a 140-character space, the phone number encoding is the stupidest, most confusing bit you’ll encounter in this SMS. It’s called reverse nibble notation, and it swaps each pair of adjacent digits in a large number. (Get it? Part of a byte is a nibble, hahahahaha, nobody’s laughing, engineers.)

My number, which is usually 1-352-537-8376, is logged in my wife’s phone as:

3125358773f6

The 1-3 is represented by

31

The 52 is represented by

25

The 53 is represented by

35

The 7-8 is represented by

87

The 37 is represented by

73

And the 6 is represented by…

f6

Where the fuck did the ‘f’ come from? It means it’s the end of the phone number, but for some awful reason (again, reverse nibble notation) it’s one character before the final digit.

It’s like pig latin for numbers.

tIs'l ki eip galit nof runbmre.s

But I’m not bitter.

[Edit: Sean Gies points out that reverse nibble notation is an inevitable artifact of representing 4-bit little-endian numbers in 8-bit chunks. That doesn’t invalidate the above description, but it does add some context for those who know what it means, and makes the decision seem more sensible.]
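Reverse nibble notation is easier to watch than to describe. The sketch below reproduces the example above: strip the dashes, pad an odd-length number with 'f', and swap each pair of digits. The function names are mine, chosen for illustration.

```python
def swap_nibbles(number):
    """Encode a phone number in reverse nibble notation."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) % 2:
        digits += "f"  # filler so the final digit has a partner to swap with
    return "".join(digits[i + 1] + digits[i] for i in range(0, len(digits), 2))

def unswap_nibbles(encoded):
    """Decode reverse nibble notation back into plain digits."""
    swapped = "".join(encoded[i + 1] + encoded[i] for i in range(0, len(encoded), 2))
    return swapped.rstrip("fF")  # drop the filler if there was one

print(swap_nibbles("1-352-537-8376"))   # 3125358773f6
print(unswap_nibbles("3125358773f6"))   # 13525378376
```

The 'f' really is padding at the end of the number; it only looks misplaced because it gets swapped along with everything else.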

The Protocol Identifier (PID) byte is honestly, at this point, mostly wasted space. It takes about 40 possible values, and it tells the service provider how to route the message. A value of

22 

means my wife is sending “I love you” to a fax machine; a value of

24 

means she’s sending it to a voice line, somehow. Since she’s sending it as an SMS to my phone, which receives texts, the PID is set to

0

(Like every other text sent in the modern world.)

Fig 5. Possible PID Values

The next byte is the Data Coding Scheme (DCS, see this doc for details), which tells the carrier and the receiving phone which character encoding scheme was used. My wife used GSM-7, the 7-bit alphabet I mentioned above that allows her to stuff 160 letters into a 140-character space, but you can easily imagine someone wanting to text in Chinese, or someone texting a complex math equation (ok, maybe you can’t easily imagine that, but a guy can dream, right?).

In my wife’s text, the DCS byte was set to

0

meaning she used a 7-bit alphabet, but she could have changed that value to use an 8- or 16-bit alphabet, which would allow her many more possible letters, but a much smaller space to fit them. Incidentally, this is why when you text emoji to your friend, you have fewer characters to work with.
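To put numbers on that shrinking space, here's a rough sketch of how many characters fit in the 140-byte message at each width. It's a simplification under stated assumptions: it ignores the headers used to chain long messages together, and it treats every character as exactly one unit, which isn't true of many emoji in a 16-bit encoding (they can take two).

```python
PAYLOAD_BITS = 140 * 8  # the 140-byte user data field

def max_chars(bits_per_char):
    """Characters that fit in one message at the given width."""
    return PAYLOAD_BITS // bits_per_char

for width in (7, 8, 16):
    print(width, "bit alphabet:", max_chars(width), "characters")
# 7 bit alphabet: 160 characters
# 8 bit alphabet: 140 characters
# 16 bit alphabet: 70 characters
```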

There’s also a little flag in the DCS byte that tells the phone whether to self-destruct the message after sending it, Mission Impossible style, so that’s neat.

The validity period (VP) space can take up to seven bytes, and sends us into another aspect of how text messages actually work. Take another look at Figure 4, above. It’s okay, I’ll wait.

When my wife finally hits ‘send’, the text gets sent to the SMS Service Center (SMSC), which then routes the message to me. I’m upstairs and my phone is on, so I receive the text in a handful of seconds, but what if my phone were off? Surely my phone can’t accept a message when it’s not receiving any power, so the SMSC has to do something with the text.

If the SMSC can’t find my phone, my wife’s message will just bounce around in its system until the moment my phone reconnects, at which point it sends the text out immediately. I like to think of the SMSC continuously checking every online phone to see if it’s mine, like a puppy waiting for its human by the door: is that smell my human? No. Is that smell my human? No. Is this smell my human? YESYESJUMPNOW.

The validity period (VP) bytes tell the carrier how long the puppy will wait before it gets bored and finds a new home. It’s either a timestamp or a duration, and it basically says “if you don’t see the recipient phone pop online in the next however-many days, just don’t bother sending it.” The default validity period for a text is 10,080 minutes, which means if it takes me more than seven days to turn my phone back on, I’ll never receive her text.
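When the validity period is a duration, it can be squeezed into a single byte. The sketch below follows my reading of the relative format in the SMS specification (3GPP TS 23.040), where small values count in 5-minute steps and large values count in weeks. Treat the mapping as an assumption to verify against the spec. The byte value 173 is an illustrative choice that works out to seven days, not a value read from the SIM card in this story.

```python
def relative_vp_minutes(vp):
    """Convert a one-byte relative validity period into minutes."""
    if not 0 <= vp <= 255:
        raise ValueError("validity period must fit in one byte")
    if vp <= 143:
        return (vp + 1) * 5                # 5-minute steps, up to 12 hours
    if vp <= 167:
        return 12 * 60 + (vp - 143) * 30   # 30-minute steps, up to 24 hours
    if vp <= 196:
        return (vp - 166) * 24 * 60        # whole days, 2 through 30
    return (vp - 192) * 7 * 24 * 60        # whole weeks, 5 through 63

print(relative_vp_minutes(173))             # 10080
print(relative_vp_minutes(173) / (24 * 60)) # 7.0 days
```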

Because there’s often a lot of empty space in an SMS, a few bits here or there are dedicated to letting the phone and carrier know exactly which bytes are unused. If my wife’s SIM card expects a 176-byte SMS, but because she wrote an exceptionally short message it only receives a 45-byte SMS, it may get confused and assume something broke along the way. The user data length (UDL) byte solves this problem: it relays exactly how many bytes the text in the text message actually takes up.

In the case of “I love you”, the UDL claims the subsequent message is 9 bytes. You’d expect it to be 10 bytes, one for each of the 10 characters in

I-spacebar-l-o-v-e-spacebar-y-o-u

but because each character is 7 bits rather than 8 bits (a full byte), we’re able to shave an extra byte off in the translation. That’s because 7 bits * 10 characters = 70 bits, divided by 8 (the number of bits in a byte) = 8.75 bytes, rounded up to 9 bytes.
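The rounding above generalizes to any message length. A minimal sketch, with a function name of my own choosing, that uses integer arithmetic to round up:

```python
def packed_length(n_chars, bits_per_char=7):
    """Bytes needed to hold n_chars characters, rounded up to a whole byte."""
    return (n_chars * bits_per_char + 7) // 8

print(packed_length(10))    # 9, for "I love you"
print(packed_length(160))   # 140, a completely full message
```

The second line is the 160-characters-in-140-bytes trick again, seen from the other side.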

Which brings us to the end of every SMS: the message itself, or the UD (User Data). The message can take up to 140 bytes, though as I just mentioned, “I love you” will pack into a measly 9. Amazing how much is packed into those 9 bytes—not just the message (my wife’s presumed love for me, which is already difficult enough to compress into 0s and 1s), but also the message (I need to come downstairs and wish her goodnight). Those bytes are:

49 10 FB 6D 2F 83 F2 EF 3A.

In all, then, this is the text message stored on my wife’s SIM card:

SCA[1-10]-PDU[1]-MR[1]-DA[1-12]-DCS[1]-VP[0, 1, or 7]-UDL[1]-UD[0-140]

00 - 11 - 00 - 07 31 25 35 87 73 F6 - ?? 00 ?? - ?? - 09 - 49 10 FB 6D 2F 83 F2 EF 3A

(Note: to get the full message, I need to do some more digging. Alas, you only see most of the message here, hence the ??s.)

Waves in the Æther

Somehow [he says in David Attenborough’s voice], the SMS must now begin its arduous journey from the SIM card to the nearest base station. To do that, my wife’s phone must convert a string of 176 bytes to the 279 bytes readable by the SS7 protocol, convert those digital bytes to an analog radio signal, and then send those signals out into the æther at a frequency of somewhere between 800 and 2000 megahertz. That means each wave is between 6 and 14 inches from one peak to the next.

Fig 6. Wavelength

In order to efficiently send and receive signals, antennas should be no smaller than half the size of the radio waves they’re dealing with. If cell waves are 6 to 14 inches, their antennas need to be 3-7 inches. Now stop and think about the average height of a mobile phone, and why they never seem to get much smaller.
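The wavelength figures come from dividing the speed of light by the frequency. Here's a quick sketch of that calculation, along with the half-wavelength antenna rule of thumb from the paragraph above. The function names are mine, and the results land within rounding distance of the numbers in the text.

```python
SPEED_OF_LIGHT_M_S = 299_792_458
METERS_PER_INCH = 0.0254

def wavelength_inches(freq_mhz):
    """Peak-to-peak wavelength in inches for a frequency given in megahertz."""
    wavelength_m = SPEED_OF_LIGHT_M_S / (freq_mhz * 1e6)
    return wavelength_m / METERS_PER_INCH

def half_wave_antenna_inches(freq_mhz):
    """Antenna length under the half-wavelength rule of thumb."""
    return wavelength_inches(freq_mhz) / 2

for freq in (800, 2000):
    print(freq, "MHz:", round(wavelength_inches(freq), 1), "inch wave,",
          round(half_wave_antenna_inches(freq), 1), "inch antenna")
```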

Through some digital gymnastics that would take entirely too long to explain, suddenly my wife’s phone shoots a 279-byte information packet containing “I love you” at the speed of light in every direction, eventually fizzling into nothing after about 30 miles.

Well before getting that far, her signal strikes the AT&T HSPA Base Station ID199694204 LAC21767. This base transceiver station (BTS) is about 5 blocks from my favorite bakery in Hazelwood, La Gourmandine, and though I was able to find its general location using an Android app called OpenSignal, the antenna is camouflaged beyond my ability to find it.

The really fascinating bit here is that it reaches the base transceiver station at all, given everything else going on. Not only is my wife texting me “I love you” in the 1000ish MHz band of the electromagnetic spectrum; tens of thousands of other people are likely talking on the phone or texting within the 30-mile radius around my house, beyond which cell signals disintegrate. On top of that, a slew of radio and TV signals are jostling for attention in our immediate airspace, alongside visible light bouncing this way and that, to name a few of the many electromagnetic waves that seem like they ought to be getting in the way.

As Richard Feynman eloquently put it in 1983, it’s a bit like the cell tower is a little blind bug resting gently atop the water on one end of a pool, and based only on the frequency and direction of waves that cause it to bounce up and down, it’s able to reconstruct who’s swimming and where.

Feynman discussing waves.

In part due to the complexity of competing signals, each base transceiver station generally can’t handle more than 200 active users (using voice or data) at a time. So “I love you” pings my local base transceiver station, about half a mile away, and then shouts itself into the void in every direction until it fades into the noise of everyone else.

Switching

I’m pretty lucky, all things considered. Were my wife and I on different cell providers, or were we in different cities, the route of her message to me would be a good deal more circuitous.

My wife’s message is massaged into the 279-byte SS7 channel, and sent along to the local base transceiver station (BTS) near the bakery. From there, it gets routed to the base station controller (BSC), which is the brain of not just our antenna, but several other local antennas besides. The BSC flings the text to AT&T Pittsburgh’s mobile switching center (MSC), which relies on the text message’s SCA (remember the service center address embedded within every SMS? That’s where this comes in) to get it to the appropriate short message service center (SMSC).

This alphabet soup is easier to understand with the diagram from figure 7; I just described steps 1 and 3. If my wife were on a different carrier, we’d continue through steps 4-7, because that’s where the mobile carriers all talk to each other. The SMS has to go from the SMSC to a global switchboard and then potentially bounce around the world before finding its way to my phone.

Fig 7. SMS routed through a GSM network.

But she’s on AT&T and I’m on AT&T, and our phones are connected to the same tower, so after step 3 the 279-byte packet of love just does an about-face and returns through the same mobile switching center, through the same base station, and now to my phone instead of hers. A trip of a few dozen miles in the blink of an eye.

Sent-to-SIM

Buzzzzz. My pocket vibrates. A notification lets me know an SMS has arrived through my nano-SIM card, a circuit board about the size of my pinky nail. Like Bilbo Baggins or any good adventurer, it changed a bit in its trip there and back again.

Fig 8. Received message, as opposed to sent message (figure 3).

Figure 8 shows the structure of the message “I love you” now stored on my phone. Comparing figures 3 and 8, we see a few differences. The SCA (phone number of the short message service center), the PDU (some mechanical housekeeping), the PID (phone-to-phone rather than, say, phone-to-fax), the DCS (character encoding scheme), the UDL (length of message), and the UD (the message itself) are all mostly the same, but the VP (the text’s expiration date), the MR (the text’s ID number), and the DA (my phone number) are missing.

Instead, on my phone, there are two new pieces of information: the OA (originating address, or my wife’s phone number), and the SCTS (service center time stamp, or when my wife sent the message).

My wife’s phone number is stored in the same annoying reverse nibble notation (like dyslexia but for computers) that my phone number was stored in on her phone, and the timestamp is stored in the same format as the expiration date was stored in on her phone.

These two information inversions make perfect contextual sense. Her phone needed to reach me by a certain time at a certain address, and I now need to know who sent the message and when. Without the home address, so to speak, I wouldn’t know whether the “I love you” came from my wife or my mother, and the difference would change my interpretation of the message fairly significantly.
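The swap between the sent and received versions can be spelled out as a comparison of two field lists. The field names come from the description above; the order of the received fields is my guess, since it only matters here which fields are present.

```python
SENT = ["SCA", "PDU", "MR", "DA", "PID", "DCS", "VP", "UDL", "UD"]
RECEIVED = ["SCA", "PDU", "OA", "PID", "DCS", "SCTS", "UDL", "UD"]  # order assumed

only_sent = [f for f in SENT if f not in RECEIVED]
only_received = [f for f in RECEIVED if f not in SENT]
shared = [f for f in SENT if f in RECEIVED]

print("dropped on arrival:", only_sent)      # ['MR', 'DA', 'VP']
print("added on arrival:", only_received)    # ['OA', 'SCTS']
print("kept:", shared)                       # ['SCA', 'PDU', 'PID', 'DCS', 'UDL', 'UD']
```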

Through a Glass Brightly

In much the same way that any computer translates a stream of bytes into a series of (x,y) coordinates with specific color assignments, my phone’s screen gets the signal to render

49 10 FB 6D 2F 83 F2 EF 3A

on the screen in front of me as “I love you” in backlit black-and-white. It’s an interesting process, but as it’s not particularly unique to smartphones, you’ll have to look it up elsewhere. Let’s instead focus on how those instructions become points of light.

The friendly marketers at Samsung call my screen a Super AMOLED (Active Matrix Organic Light-Emitting Diode) display, which is somehow both redundant and not particularly informative, so we’ll ignore unpacking the acronym as yet another distraction, and dive right into the technology.

There are about 330,000 tiny sources of light, or pixels, crammed inside each of my phone screen’s 13 square inches. For that many pixels, each needs to be about 45µm (micrometers) wide: thinner than a human hair. There’s 4 million of ‘em in all packed into the palm of my hand.

But you already know how screens work. You know that every point of light, like the Christian God or Musketeers (minus d’Artagnan), is always a three-for-one sort of deal. Red, green, and blue combine to form white light in a single pixel. Fiddle with the luminosity of each channel, and you get every color in the rainbow. And since 4 x 3 = 12, that’s 12 million tiny sources of light sitting innocently dormant behind my black mirror, waiting for me to press the power button to read my wife’s text.
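The pixel arithmetic from the last two paragraphs fits in a few lines. This sketch assumes square pixels tiled edge to edge with no gaps, which is a simplification of any real panel, and the variable names are mine.

```python
import math

PIXELS_PER_SQ_INCH = 330_000
SCREEN_AREA_SQ_IN = 13
MICROMETERS_PER_INCH = 25_400
SUBPIXELS_PER_PIXEL = 3  # red, green, and blue

total_pixels = PIXELS_PER_SQ_INCH * SCREEN_AREA_SQ_IN
pixel_width_um = MICROMETERS_PER_INCH / math.sqrt(PIXELS_PER_SQ_INCH)
total_subpixels = total_pixels * SUBPIXELS_PER_PIXEL

print(total_pixels)              # 4290000, the "4 million" above
print(round(pixel_width_um, 1))  # roughly 44 micrometers per pixel
print(total_subpixels)           # 12870000, the "12 million" above
```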

Fig 9. The subpixel array of a Samsung OLED display.

Each pixel, as the acronym suggests, is an organic light-emitting diode. That’s fancy talk for an electricity sandwich:

Fig 10. An electricity sandwich.

The layers aren’t too important, beyond the fact that it’s a cathode plate (negatively charged), below a layer of organic molecules (remember back to high school: it’s just some atoms strung together with carbon), below an anode plate (positively charged).

When the phone wants the screen on, it sends electrons from the cathode plate to the anode plate. The sandwiched molecules intercept the energy, and in response they start emitting visible light, photons, up through the transparent anode, up through the screen, and into my waiting eyes.

Since each pixel is three points of light (red, green, and blue), there’s actually three of these sandwiches per pixel. They’re all essentially the same, except the organic molecule is switched out: poly(p-phenylene) for blue light, polythiophene for red light, and poly(p-phenylene vinylene) for green light. Because each is slightly different, they shine different colors when electrified.

(Fun side fact: blue subpixels burn out much faster, due to a process called “exciton-polaron annihilation”, which sounds really exciting, doesn’t it?)

All 4 million pixels are laid out on an indexed matrix. An index works in a computer much the same way it works in a book: when my phone wants a specific pixel to light a certain color, it looks that pixel up in the index, and then sends a signal to the address it finds. Let there be light, and there was light.

(Fun side fact: now you know what “Active Matrix Organic Light-Emitting Diode” means, and you didn’t even try.)

My phone’s operating system interprets my wife’s text message, figures out the shape of each letter, and maps those shapes to the indexed matrix. It sends just the right electric pulses through the Super AMOLED screen to render those three little words that have launched ships and vanquished curses.

The great strangeness here is that my eyes never see “I love you” in bright OLED lights; it appears on the screen black-on-white. The phone creates the illusion of text through negative space, washing the screen white by setting every red, green, & blue to maximum brightness, then turning off the bits where letters should be. Its complexity is offensively mundane.

Fig 11. Negative space.

In displaying everything but my wife’s text message, and letting me read it in the gaps, my phone succinctly betrays the lie at the heart of the information age: that communication is simple. Speed and ease hide a mountain of mediation.

And that mediation isn’t just technical. My wife’s text wouldn’t have reached me had I not paid the phone bill on time, had there not been a small army of workers handling financial systems behind the scenes. Technicians keep the phone towers in working order, which they reach via a network of roads partially subsidized by federal taxes collected from hundreds of millions of Americans across 50 states. Because so many transactions still occur via mail, if the U.S. postal system collapsed tomorrow, my phone service would falter. Exploited factory workers in South America and Asia assembled parts in both our phones, and exhausted programmers renting expensive Silicon Valley closets are as-you-read-this pushing out code ensuring our phones communicate without interruption.

All of this underneath a 10-character text. A text which, let’s be honest, means much more than it says. My brain subconsciously peels back years of interactions with my wife to decode the message appearing on my phone, but between her and me there’s still a thicket of sociotechnical mediation, a stew of people and history and parts, that can never be untangled.

The Aftermath

So here I am, in the office late one Sunday night. “I love you,” my wife texted from the bedroom downstairs, before the message traversed 40 or so feet to my phone in a handful of seconds. I realize what it means: it’s time to wish her goodnight, and perhaps wrap up this essay. I tap away the last few words, now slightly more cognizant of the complex layering of miles, signals, years of history, and human sweat it took to keep my wife from having to shout upstairs that it’s about damn time I get some rest.

Thanks to Christopher Warren, Vika Zafrin, and Nechama Weingart for comments on earlier drafts.

[f-s d] Cetus

Quoting Liz Losh, Jacqueline Wernimont tweeted that behind every visualization is a spreadsheet.

But what, I wondered, is behind every spreadsheet?

Space whales.

Okay, maybe space whales aren’t behind every spreadsheet, but they’re behind this one, dated 1662, notable for the gigantic nail it hammered into the coffin of our belief that heaven above is perfect and unchanging. The following post is the first in my new series full-stack dev (f-s d), where I explore the secret life of data. 1

Hevelius. Mercurius in Sole visus (1662).

The Princess Bride teaches us a good story involves “fencing, fighting, torture, revenge, giants, monsters, chases, escapes, true love, miracles”. In this story, Cetus, three of those play a prominent role: (red) giants, (sea) monsters, and (cosmic) miracles. Also Greek myths, interstellar explosions, beer-brewing astronomers, meticulous archivists, and top-secret digitization facilities. All together, they reveal how technologies, people, and stars aligned to stick this 350-year-old spreadsheet in your browser today.

The Sea

When Aethiopian queen Cassiopeia claimed herself more beautiful than all the sea nymphs, Poseidon was, let’s say, less than pleased. Mildly miffed. He maybe sent a sea monster named Cetus to destroy Aethiopia.

Because obviously the best way to stop a flood is to drown a princess, Queen Cassiopeia chained her daughter to the rocks as a sacrifice to Cetus. Thankfully the hero Perseus just happened to be passing through Aethiopia, returning home after beheading Medusa, that snake-haired woman whose eyes turned living creatures to stone. Perseus (depicted below as the world’s most boring 2-ball juggler) revealed Medusa’s severed head to Cetus, turning the sea monster to stone and saving the princess. And then they got married because traditional gender roles I guess?

Corinthian vase depicting Perseus, Andromeda and Ketos.

Cetaceans, you may recall from grade school, are those giant carnivorous sea-mammals that Captain Ahab warned you about. Cetaceans, from Cetus. You may also remember we have a thing for naming star constellations and dividing the sky up into sections (see the Zodiac), and that we have a long history of comparing the sky to the ocean (see Carl Sagan or Star Trek IV).

It should come as no surprise, then, that we’ve designated a whole section of space as ‘The Sea‘, home of Cetus (the whale), Aquarius (the God) and Eridanus (the water pouring from Aquarius’ vase, source of river floods), Pisces (two fish tied together by a rope, which makes total sense I promise), Delphinus (the dolphin), and Capricornus (the goat-fish. Listen, I didn’t make these up, okay?).

Jamieson’s Celestial Atlas, Plate 21 (1822).
Jamieson’s Celestial Atlas, Plate 23 (1822).

Ptolemy listed most of these constellations in his Almagest (ca. 150 A.D.), including Cetus, along with descriptions of over a thousand stars. Ptolemy’s model, with Earth at the center and the constellations just past Saturn, set the course of cosmology for over a thousand years.

Ptolemy’s Cosmos [by Robert A. Hatch]

In this cosmos, reigning in Western Europe for centuries past Copernicus’ death in 1543, the stars were fixed and motionless. There was no vacuum of space; every planet was embedded in a shell made of aether or quintessence (quint-essence, the fifth element), and each shell sat atop the next until reaching the celestial sphere. This last sphere held the stars, each one fixed to it as with a pushpin. Of course, all of it revolved around the earth.

The domain of heavenly spheres was assumed perfect in all sorts of ways. They slid across each other without friction, and the planets and stars were perfect spheres which could not change and were unmarred by inconsistencies. One reason it was so difficult for even “great thinkers” to believe the earth orbited the sun, rather than vice-versa, was because such a system would be at complete odds with how people knew physics to work. It would break gravity, break motion, and break the outer perfection of the cosmos, which was essential (…heh) 2 to our notions of, well, everything.

Which is why, when astronomers with their telescopes and their spreadsheets started systematically observing imperfections in planets and stars, lots of people didn’t believe them—even other astronomers. Over the course of centuries, though, these imperfections became impossible to ignore, and helped launch the earth in rotation ’round the sun.

This is the story of one such imperfection.

A Star is Born (and then dies)

Around 1296 A.D., over the course of half a year, a red giant star some 2 quadrillion miles away grew from 300 to 400 times the size of our sun. Over the next half year, the star shrank back down to its previous size. Light from the star took 300 years to reach earth, eventually striking the retina of German pastor David Fabricius. It was very early Tuesday morning on August 13, 1596, and Pastor Fabricius was looking for Jupiter. 3

At that time of year, Jupiter would have been near the constellation Cetus (remember our sea monster?), but Fabricius noticed a nearby bright star (labeled ‘Mira’ in the figure below) which he did not remember from Ptolemy’s or Tycho Brahe’s star charts.

Mira Ceti and Jupiter. [via]
Spotting an unrecognized star wasn’t unusual, but one so bright in so common a constellation was certainly worthy of note. He wrote down some observations of the star throughout September and October, after which it seemed to have disappeared as suddenly as it appeared. The disappearance prompted Fabricius to write a letter about it to famed astronomer Tycho Brahe, who had described a similar appearing-then-disappearing star between 1572 and 1574. Brahe jotted Fabricius’ observations down in his journal. This sort of behavior, after all, was a bit shocking for a supposedly fixed and unchanging celestial sphere.

More shocking, however, was what happened 13 years later, on February 15, 1609. Once again searching for Jupiter, pastor Fabricius spotted another new star in the same spot as the last one. Tycho Brahe having recently died, Fabricius wrote a letter to his astronomical successor, Johannes Kepler, describing the miracle. This was unprecedented. No star had ever vanished and returned, and nobody knew what to make of it.

Unfortunately for Fabricius, nobody did make anything of it. His observations were either ignored or, occasionally, dismissed as an error. To add injury to insult, a local goose thief killed Fabricius with a shovel blow, thus ending his place in this star’s story, among other stories.

Mira Ceti

Three decades passed. On the winter solstice, 1638, Johannes Phocylides Holwarda prepared to view a lunar eclipse. He reported with excitement the star’s appearance and, by August 1639, its disappearance. The new star, Holwarda claimed, should be considered of the same class as Brahe’s, Kepler’s, and Fabricius’ new stars. As much a surprise to him as to Fabricius, Holwarda saw the star again on November 7, 1639. Although he was not aware of it, his new star was the same as the one Fabricius spotted 30 years prior.

Two more decades passed before the new star in the neck of Cetus would be systematically sought and observed, this time by Johannes Hevelius: local politician, astronomer, and brewer of fine beers. By that time many had seen the star, but it was difficult to know whether it was the same celestial body, or even what was going on.

Hevelius brought everything together. He found recorded observations from Holwarda, Fabricius, and others, from today’s Netherlands to Germany to Poland, and realized these disparate observations were of the same star. Befitting its puzzling and seemingly miraculous nature, Hevelius dubbed the star Mira (miraculous) Ceti. The image below, from Hevelius’ Firmamentum Sobiescianum sive Uranographia (1687), depicts Mira Ceti as the bright star in the sea monster’s neck.

Hevelius. Firmamentum Sobiescianum sive Uranographia (1687).

Going further, from 1659 to 1683, Hevelius observed Mira Ceti in a more consistent fashion than anyone before him. There were eleven recorded observations in the 65 years between Fabricius’ first sighting of the star and Hevelius’ undertaking; in the following three years, Hevelius recorded 75 more. Oddly, while Hevelius was a remarkably meticulous observer, he insisted the star was inherently unpredictable, with no regularity in its reappearances or variable brightness.

Beginning shortly after Hevelius, the astronomer Ismaël Boulliau also undertook a thirty-year search for Mira Ceti. He even published a prediction that the star would go through its vanishing cycle every 332 days, which turned out to be incredibly accurate. As today’s astronomers note, Mira Ceti’s brightness increases and decreases by several orders of magnitude every 331 days, caused by an interplay between radiation pressure and gravity in the star’s gaseous exterior.

Mira Ceti composite taken by NASA’s Galaxy Evolution Explorer. [via]
While of course Boulliau didn’t arrive at today’s explanation for Mira’s variability, his solution did require a rethinking of the fixity of stars, and eventually contributed to the notion that maybe the same physical laws that apply on Earth also rule the sun and stars.

Spreadsheet Errors

But we’re not here to talk about Boulliau, or Mira Ceti. We’re here to talk about this spreadsheet:

Hevelius. Mercurius in Sole visus (1662).

This snippet represents Hevelius’ attempt to systematically collect prior observations of Mira Ceti. Unreasonably meticulous readers of this post may note an inconsistency: I wrote that Johannes Phocylides Holwarda observed Mira Ceti on November 7th, 1639, yet Hevelius here shows Holwarda observing the star on December 7th, 1639, an entire month later. The little notes on the side are basically the observers saying: “wtf this star keeps reappearing???”

This mistake was not a simple printer’s error. It reappeared in Hevelius’ printed books three times: 1662, 1668, and 1685. This is an early example of what Raymond Panko and others call spreadsheet errors, which appear in nearly 90% of 21st century spreadsheets. Hand-entry is difficult, and mistakes are bound to happen. In this case, a game of telephone also played a part: Hevelius may have pulled some observations not directly from the original astronomers, but from the notes of Tycho Brahe and Johannes Kepler, to which he had access.

Unfortunately, with so few observations, and many of the early ones so sloppy, mistakes compound themselves. It’s difficult to predict a variable star’s periodicity when you don’t have the right dates of observation, which may have contributed to Hevelius’ continued insistence that Mira Ceti kept no regular schedule. The other contributing factor, of course, is that Hevelius worked without a telescope and under cloudy skies, and stars are hard to measure under even the best circumstances.
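For the curious, here is a minimal sketch of why a one-month slip matters when you are hunting for a cycle. It uses only dates mentioned in this post and Boulliau's 332-day period. The helper name `cycle_phase` is my own invention, Fabricius' first sighting is used as an arbitrary reference point (not a measured brightness peak), and Python's `date` assumes a proleptic Gregorian calendar, which glosses over the Julian/Gregorian mess of the period. Treat it as an illustration, not a reconstruction of anyone's actual method.

```python
from datetime import date

PERIOD_DAYS = 332  # Boulliau's predicted cycle length, as quoted in the post


def cycle_phase(reference: date, observed: date, period: int = PERIOD_DAYS) -> float:
    """Fraction of a cycle (0.0 up to 1.0) elapsed at `observed`, counted from `reference`."""
    elapsed = (observed - reference).days
    return (elapsed % period) / period


# Assumption: Fabricius' first sighting serves as the zero point of the cycle.
first_sighting = date(1596, 8, 13)

actual = date(1639, 11, 7)    # the date given for Holwarda's observation
recorded = date(1639, 12, 7)  # the date as it appears in Hevelius' table

phase_actual = cycle_phase(first_sighting, actual)
phase_recorded = cycle_phase(first_sighting, recorded)
shift_days = (recorded - actual).days

print(f"phase with the actual date:   {phase_actual:.3f}")
print(f"phase with the recorded date: {phase_recorded:.3f}")
print(f"the table is off by {shift_days} days, "
      f"or {shift_days / PERIOD_DAYS:.0%} of one cycle")
```

With only a dozen or so early observations to work from, moving one of them by roughly a tenth of a cycle is plenty to make a regular star look irregular.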

To Be Continued

Here ends the first half of Cetus. The second half will cover how Hevelius’ book was preserved, the labor behind its digitization, and a bit about the technologies involved in creating the image you see.

Early modern astronomy is a particularly good pre-digital subject for full-stack dev (f-s d), since it required vast international correspondence networks and distributed labor in order to succeed. Hevelius could not have created this table, compiled from the observations of several others, without access to cutting-edge astronomical instruments and the contemporary scholarly network.

You may ask why I included that whole section on Greek myths and Ptolemy’s constellations. Would as many early modern astronomers have noticed Mira Ceti had it not sat in the center of a familiar constellation, I wonder?

I promised this series would be about the secret life of data, answering the question of what’s behind a spreadsheet. Cetus is only the first story (well, second, I guess), but the idea is to upturn the iceberg underlying seemingly mundane datasets to reveal the complicated stories of their creation and usage. Stay tuned for future installments.

Notes:

  1. I’m retroactively adding my blog rant about data underlying an equality visualization to the f-s d series.
  2. this pun is only for historians of science
  3. Most of the historiography in this and the following section is summarized from Robert A. Hatch’s “Discovering Mira Ceti: Celestial Change and Cosmic Continuity.”

Down the Rabbit Hole

WHEREIN I get angry at the internet and yell at it to get off my lawn.

You know what’s cool? Ryan Cordell and friends’ Viral Texts project. It tracks how 19th-century U.S. newspapers used to copy texts from each other, little snippets of news or information, and republish them in their own publications. A single snippet of text could wind its way all across the country, sometimes changing a bit like a game of telephone, rarely-if-ever naming the original author.

Which newspapers copied from one another, from the Viral Texts project.

Isn’t that a neat little slice of journalistic history? Different copyright laws, different technologies of text, different constraints of the medium, they all led to an interesting moment of textual virality in 19th-century America. If I weren’t a historian who knew better, I’d call it something like “quaint” or “charming”.

You know what isn’t quaint or charming? Living in the so-called “information age”, where everything is intertwingled, with hyperlinks and text costing pretty much zilch, and seeing the same gorram practices.

What follows is a rant. They say never to blog in anger. But seriously.

Inequality in Science

Tonight Alex Vespignani, notable network scientist, tweeted a link to an interesting-sounding study about inequality in scientific publishing. In Quartz! I like Quartz, it’s where Christopher Mims used to post awesome science things. Part of their mission statement reads:

In all that we do at Quartz, we embrace openness: open source code, an open newsroom, and open access to the data behind our journalism.

Pretty cool, right?

Anyway, here’s the tweet:

It links to this article on a “map of the world’s scientific research”. Because Vespignani tweeted it, I took it seriously (yes yes I know rt≠endorsement), and read the article. It describes a cartogram map of scientific research publications which shows how the U.S. and Western Europe (and a bit of China) dominate the research world, making the point that such a disparity is “disturbingly unequal”.

Map of scientific research, by how many published articles are produced in a country, pulled from qz.com

“What’s driving the inequality?” they ask. Money & tech play a big role. So does what counts as “high impact” in science. What’s worse, the journalist writes,

In the worst cases, the global south simply provides novel empirical sites and local academics may not become equal partners in these projects about their own contexts.

The author points out an issue with the data: it only covers journals, not monographs, grey literature, edited volumes, etc. This often excludes the humanities and social sciences. The author also raises the issue of journal paywalls and how they decrease access for researchers in countries without large research budgets. But we need to do better on “open dissemination”, the article claims.

Sources

Hey, that was a good read! I agree with everything the author said. What’s more, it speaks to my research, because I’ve done a fair deal of science mapping myself at the Cyberinfrastructure for Network Science Center under Katy Börner. Great, I think, let’s take a look at the data they’re using, given Quartz’s mission statement about how they always use open data.

I want to see the data because I know a lot of scientific publication indexing sites do a poor job of indexing international publications, and I want to see how it accounts for that bias. I look at the bottom of the page.

Crap.

This post originally appeared at The Conversation. Follow @US_conversation on Twitter. We welcome your comments at ideas@qz.com.

Alright, no biggie, time to look at the original article on The Conversation, a website whose slogan is “Academic rigor, journalistic flair”. Neat, academic rigor, I like the sound of that.

I scroll to the bottom, looking for the source.

A longer version of this article originally appeared on the London School of Economics’ Impact Blog.

Hey, the LSE Impact blog! They usually publish great stuff surrounding metrics and the like. Cool, I’ll click the link to read the longer version. The author writes something interesting right up front:

What would it take to redraw the knowledge production map to realise a vision of a more equitable and accurate world of knowledge?

A more accurate world of knowledge? Was this map inaccurate in a way the earlier articles didn’t report? I read on.

Well, this version of the article goes on a little to say that people in the global south aren’t always publishing in “international” journals. That’s getting somewhere, maybe the map only shows “international journals”! (Though she never actually makes that claim). Interestingly, the author writes of literature in the global south:

Even when published, this kind of research is often not attributed to its actual authors. It has the added problem of often being embargoed, with researchers even having to sign confidentiality agreements or “official secrets acts” when they are given grants. This is especially bizarre in an era where the mantra of publically funded research being made available to the public has become increasingly accepted.

Amen to that. Authorship information and openness all the way!

So who made this map?

Oh, the original article (though not the one in Quartz or The Conversation) has a link right up front to something called “The World of Science”. The link doesn’t actually take you to the map pictured, it just takes you to a website called worldmapper that’s filled with maps, letting you fend for yourself. That’s okay, my google-fu is strong.

www.worldmapper.org

I type “science” in the search bar.

Found it! Map #205, created by no-author-name-listed. The caption reads:

Territory size shows the proportion of all scientific papers published in 2001 written by authors living there.

Also, it only covers “physics, biology, chemistry, mathematics, clinical medicine, biomedical research, engineering, technology, and earth and space sciences.” I dunno about you, but I can name at least 2.3 other types of science, but that’s cool.

In tiny letters near the bottom of the page, there are a bunch of options, including the ability to see the poster or download the data in Excel.

SUCCESS. ish.

Map of Science Poster from worldmapper.org

Ahhhhh I found the source! I mean, it took a while, but here it is. You apparently had to click “Open PDF poster, designed for printing.” It takes you to a 2006 poster, which marks that it was made by the SASI Group from Sheffield and Mark Newman, famous and awesome complex systems scientist from Michigan. An all-around well-respected dude.

To recap, that’s a 7/11/2015 tweet, pointing to a 7/11/2015 article on Quartz, pointing to a 7/8/2015 article on The Conversation, pointing to a 4/29/2013 article on the LSE Impact Blog, pointing to a website made Thor-knows-when, pointing to a poster made in 2006 with data from 2001. And only the poster cites the name of the creative team who originally made the map. Blood and bloody ashes.
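If you like your rants machine-readable, here is a small sketch of that chain as data. The `Hop` class and the `hops_to_attribution` helper are names I made up for illustration; the sources and dates are just the ones from the recap, and the "names creators" flags reflect what I found while clicking through, nothing more rigorous than that.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hop:
    source: str           # where I landed
    dated: str            # date as given in the recap
    names_creators: bool  # does this stop credit the people who made the map?


# The trail from the tweet to the poster, in the order I followed it.
chain: List[Hop] = [
    Hop("tweet", "7/11/2015", False),
    Hop("Quartz article", "7/11/2015", False),
    Hop("The Conversation article", "7/8/2015", False),
    Hop("LSE Impact Blog article", "4/29/2013", False),
    Hop("worldmapper website", "unknown", False),
    Hop("worldmapper PDF poster", "2006", True),
]


def hops_to_attribution(trail: List[Hop]) -> Optional[int]:
    """Number of links followed from the first stop before any creator is named."""
    for index, hop in enumerate(trail):
        if hop.names_creators:
            return index
    return None  # nobody along the trail credits the creators


print(hops_to_attribution(chain))
```

In a world where a hyperlink costs nothing, that number really ought to be zero or one.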

Intermission

Please take a moment out of your valuable time to watch this video clip from the BBC’s television adaptation of Douglas Adams’s Hitchhiker’s Guide to the Galaxy. I’ll wait.

If you’re hard-of-hearing, read some of the transcript instead.

What I’m saying is, the author of this map was “on display at the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying beware of the leopard.”

The Saga Continues

Okay, at least I now can trust the creation process of the map itself, knowing Mark Newman had a hand in it. What about the data?

Helpfully, worldmapper.org has a link to the data as an Excel Spreadsheet. Let’s download and open it!

Frak. Frak frak frak frak frak.

My eyes.

Excel data for the science cartogram from worldmapper.org

Okay Scott. Deep breaths. You can brave the unicornfarts color scheme and find the actual source of the data. Be strong.

“See the technical notes” it says. Okay, I can do that. It reads:

Nearly two thirds of a million papers were published in enumerated science journals in 2001

Enumerated science journals? What does enumerated mean? Whatever, let’s read on.

The source of this data is the World Bank’s 2005 World Development Indicators, in the series on Scientific and technical journal articles (IP.JRN.ARTC.SC).

Okay, sweet, IP.JRN.ARTC.SC at the World Bank. I can Google that!

It brings me to the World Bank’s site on Scientific and technical journal articles. About the data it says:

Scientific and technical journal articles refer to the number of scientific and engineering articles published in the following fields: physics, biology, chemistry, mathematics, clinical medicine, biomedical research, engineering and technology, and earth and space sciences

Yep, knew that already, but it’s good to see the sources agreeing with each other.

I look for the data source to no avail, but eventually do see a small subtitle “National Science Foundation, Science and Engineering Indicators.”

Alright /me *rolls sleeves*, IRC-style.

Eventually, through the Googles, I find my way to what I assume is the original data source website, although at this point who the hell knows? NSF Science and Engineering Indicators 2006.

Want to know what I find? A 1,092-page report (honestly, see the pdfs, volumes 1 & 2) within which, presumably, I can find exactly what I need to know. In the 1,092-page report.

I start with Chapter 5: Academic Research and Development. Seems promising.

Three-quarters-of-the-way-down-the-page, I see it. It’s shimmering in blue and red and gold to my Excel-addled eyes.

S&E

Could this be it? Could this be the data source I was searching for, the Science Citation Index and the Social Sciences Citation Index? It sounds right! Remember the technical notes, which state “Nearly two thirds of a million papers were published in enumerated science journals in 2001”? That fits with the number in the picture above! Let’s click on the link to the data.

There is no link to the data.

There is no reference to the data.

That’s OKAY. WE’RE ALRIGHT. THERE ARE DATA APPENDICES IT MUST BE THERE. EVEN THOUGH THIS IS A REAL WEBSITE WITH HYPERTEXT LINKS AND THEY DIDN’T LINK TO DATA IT’S PROBABLY IN THE APPENDICES RIGHT?

Do you think the data are in the section labeled “Tables” or “Appendix Tables”? Don’t you love life’s little mysteries?

(Hint: I checked. After looking at 14 potential tables in the “Tables” section, I decided it was in the “Appendix Tables” section.)

Success! The World Bank data is from Appendix Table 5-41, “S&E articles, by region and country/economy: 1988–2003”.

Wait a second, friends, this can’t be right. If this is from the Science Citation Index and the Social Sciences Citation Index, then we can’t really use these metrics as a good proxy for global scientific output, because the criteria for national inclusion in the index are apparently kind of weird and can skew the output results.

Also, and let me be very clear about this,

This dataset actually covers both science and social science. It is, you’ll recall, the Science Citation Index and the Social Sciences Citation Index. [edit: at least as far as I can tell. Maybe they used different data, but if they did, it’s World Bank’s fault for not making it clear. This is the best match I could find.]

In Short

Which brings us back to Do. The article on Quartz made (among other things) two claims: that the geographic inequality of scientific output is troubling, and that the map really ought to include social scientific output.

And I agree with both of these points! And all the nuanced discussion is respectable and much-needed.

But by looking at the data, I just learned that A) the data the map draws from is not really a great representation of global output, and B) social scientific output is actually included.

I leave you with the first gif I’ve ever posted on my blog:

source: http://s569.photobucket.com/user/SuperFlame64/media/kramer_screaming.gif.html
real source: Seinfeld. Seriously, people.

You know what’s cool? Ryan Cordell and friends’ Viral Texts project. It tracks how 19th-century U.S. newspapers used to copy texts from each other, little snippets of news or information, and republish them in their own publications. A single snippet of text could wind its way all across the country, sometimes changing a bit like a game of telephone, rarely-if-ever naming the original author.

—————————————————————————————————

(p.s. I don’t blame the people involved, doing the linking. It’s just the tumblr-world of 19th century newspapers we live in.)

[edit: I’m noticing some tweets are getting the wrong idea, so let me clarify: this post isn’t a negative reflection on the research therein, which is needed and done by good people. It’s frustration at the fact that we write in an environment that affords full references and rich hyperlinking, and yet we so often revert to context-free tumblr-like reblogging which separates text from context and data. We’re reverting to the affordances of 18th century letters, 19th century newspapers, 20th century academic articles, etc., and it’s frustrating.]

[edit 2: to further clarify, two recent tweets:

]