What’s Counted Counts

tl;dr. Don’t rely on data to fix the world’s injustices. An unusually self-reflective and self-indulgent post.

[Edit: this question was prompted by a series of analyses and visualizations I’ve done in collaboration with Nickoal Eichmann, but I purposefully left her out of the majority of this post, as it was one of self-reflection about my own personal choices. A respected colleague pointed out in private that by doing so, I nullified my female collaborator’s contributions to the project, for which I apologize deeply. Nickoal’s input has been integral to all of this, and she and many others, including particularly Jeana Jorgensen and Heather Froehlich (who has written on this very subject), have played vital roles in my own learning about these issues. Recent provocations by Miriam Posner helped solidify a lot of these thoughts and inspired this post. What follows is a self-exploration, recapping what many people have already said, but hopefully still useful to some. Mistakes below shouldn’t reflect poorly on those who influenced or inspired me. The post from this point on is as it originally appeared.]


Someone asked yesterday why I cared enough [1] about gender equality in academia to make this chart (with Nickoal Eichmann).

Gender representation as authors at DH conferences over the last decade. (Women consistently represent around 33% of authors.)

I didn’t know how to answer the question. Our culture gives some people more and better opportunities than others, so in order to make things better for more people, we must reveal and work toward resolving those points of inequality. “Why do I care?” Don’t most of us want to make things better? We just go about it in different ways, and have different ideas of what “better” means.

But the question did make me consider why I’d started with gender equality, when there are clearly so many other equally important social issues to tackle, within and outside academia. The answer was immediately obvious: ease. I’d attempted to explore racial and ethnic diversity as well, but it was simply more fraught, complicated, and less amenable to my methods than gender, so I started with gender and figured I’d work my way into the weeds from there. [2]

I’ll cut to the chase. My well-intentioned attempts at battling inequality suffer their own sort of bias: by focusing on measurements of inequality, I bias that which is easily measured. It’s not that gender isn’t complex (see Miriam Posner’s wonderful recent keynote on these and related issues), but at least it’s a little easier to measure than race & ethnicity, when all you have available to you is what you can look up on the internet.


Saturday Morning Breakfast Cereal. [source]
While this problem is far from new, it takes on special significance in a data-driven world. That which is countable counts, and damn the rest. At its heart, this problem is one of classification and categorization: those social divides which have the clearest seams are those most easily counted. And in a data-driven world, it’s inequality along these clear divides which gets noticed first, even when injustice elsewhere is far greater.

Sex is easy, compared to gender. At most 2% of people are born intersex according to most standards (but not accounting for dysmorphia & similar). And gender is relatively easy compared to race and ethnicity. Nationality is pretty easy because of bureaucratic requirements for passports and citizenship, and country of residence is even easier, unless you live somewhere like Palestine.

But even the Palestine issue isn’t completely problematic, because counting still works fine when one thing exists in multiple categories, or may be categorized differently in different systems. That’s okay.

Where math gets lost is where there are simply no good borders to draw around entities—or worse, there are borders, but those borders themselves are drawn by insensitive outgroups. We see this a lot in the history of colonialism. Have you ever been to the Pitt Rivers Museum in Oxford? It’s a 19th century museum that essentially shows what the 19th century British mind felt about the world: everything that looks like a flute is in the flute cabinet, everything that looks like a gun is in the gun cabinet, and everything that looks like a threatening foreign religious symbol is in the threatening foreign religious symbol cabinet. Counting such a system doesn’t reveal any injustice except that of the counters themselves.

Pitt Rivers Museum [source]
And I’ll be honest here: I want to help make the world a better place, but I’ve got to work to my strengths and know my limits. I’m a numbers guy. I’m at my best when counting stuff, and when there are no sensitive ways to classify, I avoid counting, because I don’t want to be That Colonizing White Dude who tries to fit everything into boxes of his own invention to make himself feel better about what he’s doing for the world. I probably still fall into that trap a lot anyway.

So why did I care enough to count gender at DH conferences? It was (relatively) easy. And it’s needed, as we saw at DH2015 and we’ve seen throughout the digital humanities – we have a gender issue, and a feminism issue, and they both need to be pointed out and addressed. But we also have lots of other issues that I’ll simply never be able to approach, and don’t know how to approach, and am in danger of ignoring entirely if I only rely on quantitative evidence of inequality.

useless by xkcd

Of course, relying only on non-quantitative evidence has its own pitfalls. People evolved and are socialized to spot patterns and to extrapolate from limited information, even when those extrapolations aren’t particularly meaningful, or when they lead to Jesus in a slice of toast. I’m not advocating we avoid metrics entirely (for one, I’d be out of a job), but echoing Miriam Posner’s recent provocation, we need to engage with techniques, approaches, and perspectives that don’t rely on easy classification schemes. Above all, we need to listen when people notice injustice that isn’t easily classified or counted.

“Uh, yes, Scott, who are you writing this for? We already knew this!” most of you are likely asking if you’ve read this far. I’m writing to myself in early college, an engineering student obsessed with counting, who’s slowly learned the holes in a worldview that only relies on quantitative evidence. The one who spent years quantifying his health issues, only to discover the pursuit of a number eventually took precedence over the pursuit of his own health. [3]

Hopefully this post helps balance all the bias implicit in my fighting for a better world from a data-driven perspective, by suggesting “data-driven” is only one of many valuable perspectives.

Notes:

  1. Upon re-reading the original question, it was actually “Why did you do it? (or why are you interested?)”. Still, this post remains relevant.
  2. I’m light on details here because I don’t want this to be an overlong post, but you can read some more of the details on what Nickoal and I are doing, and the decisions we make, in this blog series.
  3. A blog post on mental & physical health in academia is forthcoming.

Down the Rabbit Hole

WHEREIN I get angry at the internet and yell at it to get off my lawn.

You know what’s cool? Ryan Cordell and friends’ Viral Texts project. It tracks how 19th-century U.S. newspapers used to copy texts from each other, little snippets of news or information, and republish them in their own publications. A single snippet of text could wind its way all across the country, sometimes changing a bit like a game of telephone, rarely-if-ever naming the original author.

Which newspapers copied from one another, from the Viral Texts project.

Isn’t that a neat little slice of journalistic history? Different copyright laws, different technologies of text, different constraints of the medium, they all led to an interesting moment of textual virality in 19th-century America. If I weren’t a historian who knew better, I’d call it something like “quaint” or “charming”.

You know what isn’t quaint or charming? Living in the so-called “information age“, where everything is intertwingled, with hyperlinks and text costing pretty much zilch, and seeing the same gorram practices.

What follows is a rant. They say never to blog in anger. But seriously.

Inequality in Science

Tonight Alex Vespignani, notable network scientist, tweeted a link to an interesting-sounding study about inequality in scientific publishing. In Quartz! I like Quartz, it’s where Christopher Mims used to post awesome science things. Part of their mission statement reads:

In all that we do at Quartz, we embrace openness: open source code, an open newsroom, and open access to the data behind our journalism.

Pretty cool, right?

Anyway, here’s the tweet:

It links to this article on a “map of the world’s scientific research“. Because Vespignani tweeted it, I took it seriously (yes yes I know rt≠endorsement), and read the article. It describes a cartogram of scientific research publications which shows how the U.S. and Western Europe (and a bit of China) dominate the research world, making the point that the distribution of research is “disturbingly unequal”.

Map of scientific research, by how many published articles are produced in a country, pulled from qz.com

“What’s driving the inequality?” they ask. Money & tech play a big role. So does what counts as “high impact” in science. What’s worse, the journalist writes,

In the worst cases, the global south simply provides novel empirical sites and local academics may not become equal partners in these projects about their own contexts.

The author points out an issue with the data: it only covers journals, not monographs, grey literature, edited volumes, etc. This often excludes the humanities and social sciences. The author also raises the issue of journal paywalls and how they decrease access for researchers in countries without large research budgets. But we need to do better on “open dissemination”, the article claims.

Sources

Hey, that was a good read! I agree with everything the author said. What’s more, it speaks to my research, because I’ve done a fair deal of science mapping myself at the Cyberinfrastructure for Network Science Center under Katy Börner. Great, I think, let’s take a look at the data they’re using, given Quartz’s mission statement about how they always use open data.

I want to see the data because I know a lot of scientific publication indexing sites do a poor job of indexing international publications, and I want to see how it accounts for that bias. I look at the bottom of the page.

Crap.

This post originally appeared at The Conversation. Follow @US_conversation on Twitter. We welcome your comments at ideas@qz.com.

Alright, no biggie, time to look at the original article on The Conversation, a website whose slogan is “Academic rigor, journalistic flair“. Neat, academic rigor, I like the sound of that.

I scroll to the bottom, looking for the source.

A longer version of this article originally appeared on the London School of Economics’ Impact Blog.

Hey, the LSE Impact blog! They usually publish great stuff surrounding metrics and the like. Cool, I’ll click the link to read the longer version. The author writes something interesting right up front:

What would it take to redraw the knowledge production map to realise a vision of a more equitable and accurate world of knowledge?

A more accurate world of knowledge? Was this map inaccurate in a way the earlier articles didn’t report? I read on.

Well, this version of the article goes a little further, saying that people in the global south aren’t always publishing in “international” journals. That’s getting somewhere; maybe the map only shows “international journals”! (Though she never actually makes that claim.) Interestingly, the author writes of literature in the global south:

Even when published, this kind of research is often not attributed to its actual authors. It has the added problem of often being embargoed, with researchers even having to sign confidentiality agreements or “official secrets acts” when they are given grants. This is especially bizarre in an era where the mantra of publically funded research being made available to the public has become increasingly accepted.

Amen to that. Authorship information and openness all the way!

So who made this map?

Oh, the original article (though not the one in Quartz or The Conversation) has a link right up front to something called “The World of Science“. The link doesn’t actually take you to the map pictured; it just takes you to a website called worldmapper that’s filled with maps, letting you fend for yourself. That’s okay, my google-fu is strong.

www.worldmapper.org

I type “science” in the search bar.

Found it! Map #205, created by no-author-name-listed. The caption reads:

Territory size shows the proportion of all scientific papers published in 2001 written by authors living there.

Also, it only covers “physics, biology, chemistry, mathematics, clinical medicine, biomedical research, engineering, technology, and earth and space sciences.” I dunno about you, but I can name at least 2.3 other types of science, but that’s cool.

In tiny letters near the bottom of the page, there are a bunch of options, including the ability to see the poster or download the data in Excel.

SUCCESS. ish.

Map of Science Poster from worldmapper.org

Ahhhhh I found the source! I mean, it took a while, but here it is. You apparently had to click “Open PDF poster, designed for printing.” It takes you to a 2006 poster, which notes that it was made by the SASI Group from Sheffield and Mark Newman, famous and awesome complex systems scientist from Michigan. An all-around well-respected dude.

To recap, that’s a 7/11/2015 tweet, pointing to a 7/11/2015 article on Quartz, pointing to a 7/8/2015 article on The Conversation, pointing to a 4/29/2013 article on the LSE Impact Blog, pointing to a website made Thor-knows-when, pointing to a poster made in 2006 with data from 2001. And only the poster cites the name of the creative team who originally made the map. Blood and bloody ashes.

Intermission

Please take a moment out of your valuable time to watch this video clip from the BBC’s television adaptation of Douglas Adams’s Hitchhiker’s Guide to the Galaxy. I’ll wait.

If you’re hard-of-hearing, read some of the transcript instead.

What I’m saying is, the author of this map was “on display at the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying beware of the leopard.”

The Saga Continues

Okay, at least I now can trust the creation process of the map itself, knowing Mark Newman had a hand in it. What about the data?

Helpfully, worldmapper.org has a link to the data as an Excel Spreadsheet. Let’s download and open it!

Frak. Frak frak frak frak frak.

My eyes.

Excel data for the science cartogram from worldmapper.org

Okay Scott. Deep breaths. You can brave the unicornfarts color scheme and find the actual source of the data. Be strong.

“See the technical notes” it says. Okay, I can do that. It reads:

Nearly two thirds of a million papers were published in enumerated science journals in 2001

Enumerated science journals? What does enumerated mean? Whatever, let’s read on.

The source of this data is the World Bank’s 2005 World Development Indicators, in the series on Scientific and technical journal articles (IP.JRN.ARTC.SC).

Okay, sweet, IP.JRN.ARTC.SC at the World Bank. I can Google that!

It brings me to the World Bank’s site on Scientific and technical journal articles. About the data it says:

Scientific and technical journal articles refer to the number of scientific and engineering articles published in the following fields: physics, biology, chemistry, mathematics, clinical medicine, biomedical research, engineering and technology, and earth and space sciences

Yep, knew that already, but it’s good to see the sources agreeing with each other.
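(An aside that wasn’t part of the original chase: the indicator itself is machine-readable these days. Below is a minimal sketch of pulling IP.JRN.ARTC.SC directly, assuming the World Bank’s public v2 API and its usual [metadata, records] JSON layout; both of those are my assumptions, not anything the articles above provide.)

    import requests  # third-party HTTP library

    # Fetch the 2001 values of the World Bank indicator behind the map.
    # Note that "country/all" also returns regional aggregates, not just countries.
    URL = "https://api.worldbank.org/v2/country/all/indicator/IP.JRN.ARTC.SC"
    resp = requests.get(URL, params={"format": "json", "date": "2001", "per_page": "400"})
    resp.raise_for_status()
    metadata, rows = resp.json()  # the API returns a [metadata, records] pair

    articles = {r["country"]["value"]: r["value"] for r in rows if r["value"] is not None}
    for name, count in sorted(articles.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        print(f"{name:30s} {count:>10,.0f}")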

I look for the data source to no avail, but eventually do see a small subtitle “National Science Foundation, Science and Engineering Indicators.”

Alright /me *rolls sleeves*, IRC-style.

Eventually, through the Googles, I find my way to what I assume is the original data source website, although at this point who the hell knows? NSF Science and Engineering Indicators 2006.

Want to know what I find? A 1,092-page report (honestly, see the pdfs, volumes 1 & 2) within which, presumably, I can find exactly what I need to know. In the 1,092-page report.

I start with Chapter 5: Academic Research and Development. Seems promising.

Three-quarters-of-the-way-down-the-page, I see it. It’s shimmering in blue and red and gold to my Excel-addled eyes.

S&E [screenshot from the NSF report]

Could this be it? Could this be the data source I was searching for, the Science Citation Index and the Social Sciences Citation Index? It sounds right! Remember the technical notes, which state “Nearly two thirds of a million papers were published in enumerated science journals in 2001”? That fits with the number in the picture above! Let’s click on the link to the data.

There is no link to the data.

There is no reference to the data.

That’s OKAY. WE’RE ALRIGHT. THERE ARE DATA APPENDICES IT MUST BE THERE. EVEN THOUGH THIS IS A REAL WEBSITE WITH HYPERTEXT LINKS AND THEY DIDN’T LINK TO DATA IT’S PROBABLY IN THE APPENDICES RIGHT?

Do you think the data are in the section labeled “Tables” or “Appendix Tables“? Don’t you love life’s little mysteries?

(Hint: I checked. After looking at 14 potential tables in the “Tables” section, I decided it was in the “Appendix Tables” section.)

Success! The World Bank data is from Appendix Table 5-41, “S&E articles, by region and country/economy: 1988–2003”.

Wait a second, friends, this can’t be right. If this is from the Science Citation Index and the Social Sciences Citation Index, then we can’t really use these metrics as a good proxy for global scientific output, because the criteria for national inclusion in the index are apparently kind of weird and can skew the output results.

Also, and let me be very clear about this,

This dataset actually covers both science and social science. It is, you’ll recall, the Science Citation Index and the Social Sciences Citation Index. [edit: at least as far as I can tell. Maybe they used different data, but if they did, it’s World Bank’s fault for not making it clear. This is the best match I could find.]

In Short

Which brings us back to Do. The article on Quartz made (among other things) two claims: that the geographic inequality of scientific output is troubling, and that the map really ought to include social scientific output.

And I agree with both of these points! And all the nuanced discussion is respectable and much needed.

But by looking at the data, I just learned that A) the data the map draws from is not really a great representation of global output, and B) social scientific output is actually included.

I leave you with the first gif I’ve ever posted on my blog:

source: http://s569.photobucket.com/user/SuperFlame64/media/kramer_screaming.gif.html
real source: Seinfeld. Seriously, people.

You know what’s cool? Ryan Cordell and friends’ Viral Texts project. It tracks how 19th-century U.S. newspapers used to copy texts from each other, little snippets of news or information, and republish them in their own publications. A single snippet of text could wind its way all across the country, sometimes changing a bit like a game of telephone, rarely-if-ever naming the original author.

—————————————————————————————————

(p.s. I don’t blame the people involved, doing the linking. It’s just the tumblr-world of 19th century newspapers we live in.)

[edit: I’m noticing some tweets are getting the wrong idea, so let me clarify: this post isn’t a negative reflection on the research therein, which is needed and done by good people. It’s frustration at the fact that we write in an environment that affords full references and rich hyperlinking, and yet we so often revert to context-free tumblr-like reblogging which separates text from context and data. We’re reverting to the affordances of 18th century letters, 19th century newspapers, 20th century academic articles, etc., and it’s frustrating.]

[edit 2: to further clarify, two recent tweets (embedded in the original post).]

The moral role of DH in a data-driven world

This is the transcript from my closing keynote address at the 2014 DH Forum in Lawrence, Kansas. It’s the result of my conflicted feelings on the recent Facebook emotional contagion controversy, and despite my earlier tweets, I conclude the study was important and valuable specifically because it was so controversial.

For the non-Digital-Humanities (DH) crowd, a quick glossary. Distant Reading is our new term for reading lots of books at once using computational assistance; Close Reading is the traditional term for reading one thing extremely minutely, exhaustively.


Networked Society

Distant reading is a powerful thing, an important force in the digital humanities. But so is close reading. Over the next 45 minutes, I’ll argue that distant reading occludes as much as it reveals, resulting in significant ethical breaches in our digital world. Network analysis and the humanities offers us a way out, a way to bridge personal stories with the big picture, and to bring a much-needed ethical eye to the modern world.

Today, by zooming in and out, from the distant to the close, I will outline how networks shape our world and our lives, and what we in this room can do to set a path going forward.

Let’s begin locally.

1. Pale Blue Dot

Pale Blue Dot

You are here. That’s a picture of Kansas, from four billion miles away.

In February 1990, after years of campaigning, Carl Sagan convinced NASA to turn the Voyager 1 spacecraft around to take a self-portrait of our home, the Earth. This is the most distant reading of humanity that has ever been produced.

I’d like to begin my keynote with Carl Sagan’s own words, his own distant reading of humanity. I’ll spare you my attempt at the accent:

Consider again that dot. That’s here. That’s home. That’s us. On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives. The aggregate of our joy and suffering, thousands of confident religions, ideologies, and economic doctrines, every hunter and forager, every hero and coward, every creator and destroyer of civilization, every king and peasant, every young couple in love, every mother and father, hopeful child, inventor and explorer, every teacher of morals, every corrupt politician, every ‘superstar,’ every ‘supreme leader,’ every saint and sinner in the history of our species lived there – on a mote of dust suspended in a sunbeam.

What a lonely picture Carl Sagan paints. We live and die in isolation, alone in a vast cosmic darkness.

I don’t like this picture. From too great a distance, everything looks the same. Every great work of art, every bomb, every life is reduced to a single point. And our collective human experience loses all definition. If we want to know what makes us, us, we must move a little closer.

2. Black Rock City

Black Rock City

We’ve zoomed into Black Rock City, more popularly known as Burning Man, a city of 70,000 people that exists for only a week in a Nevada desert, before disappearing back into the sand until the following year. Here life is apparent; the empty desert is juxtaposed against a network of camps and cars and avenues, forming a circle with some ritualistic structure at its center.

The success of Burning Man is contingent on collaboration and coordination; on the careful allocation of resources like water to keep its inhabitants safe; on the explicit planning of organizers to keep the city from descending into chaos year after year.

And the creation of order from chaos, the apparent reversal of entropy, is an essential feature of life. Organisms and societies function through the careful coordination and balance of their constituent parts. As these parts interact, patterns and behaviors emerge which take on a life of their own.

3. Complex Systems

Thus cells combine to form organs, organs to form animals, and animals to form flocks.

We call these networks of interactions complex systems, and we study complex systems using network analysis. Network analysis as a methodology takes as a given that nothing can be properly understood in total isolation. Carl Sagan’s pale blue dot, though poignant and beautiful, is too lonely and too distant to reveal anything of we creatures who inhabit it.

We are not alone.

4. Connecting the Dots

When looking outward rather than inward, we find we are surrounded on all sides by a hundred billion galaxies each with a hundred billion stars. And for as long as we can remember, when we’ve stared up into the night sky, we’ve connected the dots. We’ve drawn networks in the stars in order to make them feel more like us, more familiar, more comprehensible.

Nothing exists in isolation. We use networks to make sense of our place in the vast complex system that contains protons and trees and countries and galaxies. The beauty of network analysis is its ability to transcend differences in scale, such that there is a place for you and for me, and our pieces interact with other pieces to construct the society we occupy. Networks allow us to see the forest and the trees, to give definition to the microcosms and macrocosms which describe the world around us.

5. Networked World

Networks open up the world. Over the past four hundred years, the reach of the West extended to the globe, overtaking trade routes created first by eastern conquerors. From these explorations, we produced new medicines and technologies. Concomitant with this expansion came unfathomable genocide and a slave trade that spanned many continents and far too many centuries.

Despite the efforts of the Western World, it could only keep the effects of globalization to itself for so long. Roads can be traversed in either direction, and the network created by Western explorers, businesses, slave traders, and militaries eventually undermined or superseded the Western centers of power. In short order, the African slave trade in the Americas led to a rich exchange of knowledge of plants and medicines between Native Americans and Africans.

In Southern and Southeast Asia, trade routes set up by the Dutch East India Company unintentionally helped bolster economies and trade routes within Asia. Captains with the company, seeking extra profits, would illicitly trade goods between Asian cities. This created more tightly-knit internal cultural and economic networks than had existed before, and contributed to a global economy well beyond the reach of the Dutch East India Company.

In the 1960s, the U.S. military began funding what would later become the Internet, a global communication network which could transfer messages at unfathomable speeds. The infrastructure provided by this network would eventually become a tool for control and surveillance by governments around the world, as well as a distribution mechanism for fuel that could topple governments in the Middle East or spread state secrets in the United States. The very pervasiveness which makes the internet particularly effective in government surveillance is also what makes it especially dangerous to governments through sites like WikiLeaks.

In short, science and technology lay the groundwork for our networked world, and these networks can be great instruments of creation, or terrible conduits of destruction.

6. Macro Scale

So here we are, occupying this tiny mote of dust suspended in a sunbeam. In the grand scheme of things, how does any of this really matter? When we see ourselves from so great a distance, it’s as difficult to be enthralled by the Sistine Chapel as it is to be disgusted by the havoc we wreak upon our neighbors.

7. Meso Scale

But networks let us zoom in, they let us keep the global system in mind while examining the parts. Here, once again, we see Kansas, quite a bit closer than before. We see how we are situated in a national and international set of interconnections. These connections come in every form, from physical transportation to electronic communication. From this scale, wars and national borders are visible. Over time, cultural migration patterns and economic exchange become apparent. This scale shows us the networks which surround and are constructed by us.

[slide 7]

And this is the scale which is seen by the NSA and the CIA, by Facebook and Google, by social scientists and internet engineers. Close enough to provide meaningful aggregations, but far enough that individual lives remain private and difficult to discern. This scale teaches us how epidemics spread, how minorities interact, how likely some city might be a target for the next big terrorist attack.

From here, though, it’s impossible to see the hundred hundred towns whose factories have closed down, leaving many unable to feed their families. It’s difficult to see the small but endless inequalities that leave women and minorities systematically underappreciated and exploited.

8. Micro Scale

[slide 8]

We can zoom in further still, Lawrence Kansas at a few hundred feet, and if we watch closely we can spot traffic patterns, couples holding hands, how the seasons affect people’s activities. This scale is better at betraying the features of communities, rather than societies.

But for tech companies, governments, and media distributors, it’s all-too-easy to miss the trees for the forest. When they look at the networks of our lives, they do so in aggregate. Indeed, privacy standards dictate that the individual be suppressed in favor of the community, of the statistical average that can deliver the right sort of advertisement to the right sort of customer, without ever learning the personal details of that customer.

This strange mix of individual personalization and impersonal aggregation drives quite a bit of the modern world. Carefully micro-targeted campaigning is credited with President Barack Obama’s recent presidential victories, driven by a hundred data scientists in an office in Chicago in lieu of thousands of door-to-door canvassers. Three hundred million individually crafted advertisements without ever having to look a voter in the face.

9. Target

And this mix of impersonal and individual is how Target makes its way into the wombs of its shoppers. We saw this play out a few years ago when a furious father went to complain to a Target store manager. Why, he asked the manager, is my high school daughter getting ads for maternity products in the mail? After returning home, the father spoke to his daughter and discovered she was, indeed, pregnant. How did this happen? How’d Target know?

It turns out, Target uses credit cards, phone numbers, and e-mail addresses to give every customer a unique ID. Target discovered a list of about 25 products that, if purchased in a certain sequence by a single customer, are a pretty strong indicator of pregnancy. What’s more, the dates of those purchases can pretty accurately predict the baby’s due date. Unscented lotion, magnesium, cotton balls, and washcloths are all on that list.
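To make the shape of that system concrete, here’s a toy sketch: score each customer’s purchase history against a small weighted list of signal products, and flag anyone above a threshold. The products echo the ones named above, but the weights, the threshold, and the helper function are all my inventions; this is the general idea, not Target’s actual model.

    # Toy pregnancy-prediction score: weights and threshold are invented.
    PREGNANCY_SIGNALS = {
        "unscented lotion": 0.4,
        "magnesium supplement": 0.3,
        "cotton balls": 0.2,
        "washcloths": 0.2,
    }

    def pregnancy_score(purchases):
        """Sum the weights of the signal products this customer has bought."""
        return sum(PREGNANCY_SIGNALS.get(item, 0.0) for item in purchases)

    history = ["shampoo", "unscented lotion", "magnesium supplement", "washcloths"]
    score = pregnancy_score(history)
    print(score, "-> likely pregnant" if score > 0.7 else "-> no flag")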

When Target’s system learns one of its customers is probably pregnant, it does its best to profit from that pregnancy, sending appropriately timed coupons for diapers and bottles. This backfired, creeping out customers and invading their privacy, as with the angry father who didn’t know his daughter was pregnant. To remedy the situation, rather than ending the personalized advertising, Target began interspersing ads for unrelated products with the personalized ones, in order to trick customers into thinking the ads were random or general. All the while, a good portion of the coupons in the book were still targeted directly at those customers.

One Target executive told a New York Times reporter:

We found out that as long as a pregnant woman thinks she hasn’t been spied on, she’ll use the coupons. She just assumes that everyone else on her block got the same mailer for diapers and cribs. As long as we don’t spook her, it works.

The scheme did work, raising Target’s profits by billions of dollars by subtly matching their customers with coupons they were likely to use. 

10. Presidential Elections

Political campaigns have also enjoyed the successes of microtargeting. President Bush’s 2004 campaign pioneered this technique, targeting socially conservative Democratic voters in key states in order to either convince them not to vote, or to push them over the line to vote Republican. This strategy is credited with increasing the pro-Bush African American vote in Ohio from 9% in 2000 to 16% in 2004, appealing to anti-gay marriage sentiments and other conservative values.

The strategy is also celebrated for President Obama’s 2008 and especially 2012 campaigns, where his staff maintained a connected and thorough database of a large portion of American voters. They knew, for instance, that people who drink Dr. Pepper, watch the Golf Channel, drive a Land Rover, and eat at Cracker Barrel are both very likely to vote, and very unlikely to vote Democratic. These insights led to political ads targeted exactly at the voters they were most likely to sway.

So what do these examples have to do with networks? These examples utilize, after all, the same sorts of statistical tools that have always been available to us, only with a bit more data and power to target individuals thrown in the mix.

It turns out that networks are the next logical step in the process of micronudging, the mass targeting of individuals based on their personal lives in order to influence them toward some specific action.

In 2010, a Facebook study, piggy-backing on social networks, influenced about 340,000 additional people to vote in the US mid-term elections. A team of social scientists at UCSD experimented on 61 million Facebook users in order to test the influence of social networks on political action.

A portion of American Facebook users who logged in on election day were given the ability to press an “I voted” button, which shared the fact that they voted with their friends. Facebook then presented users with pictures of their friends who voted, and it turned out that these messages increased voter turnout by about 0.4%. Further, those who saw that close friends had voted were more likely to go out and vote than those who had seen that distant friends voted. The study was framed as “voting contagion” – how well does the action of voting spread among close friends?
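For a sense of what that 0.4% means in practice, here is a minimal sketch of the arithmetic of a turnout “lift” between a treatment group and a control group. The counts below are hypothetical, chosen only to land near the published figure; they are not the study’s data.

    # Hypothetical counts: compare turnout with and without the social message.
    treatment = {"n": 60_000_000, "voted": 24_240_000}  # saw friends' "I voted" faces
    control   = {"n": 600_000,    "voted":    240_000}  # saw no such message

    rate_t = treatment["voted"] / treatment["n"]
    rate_c = control["voted"] / control["n"]
    lift = rate_t - rate_c                 # absolute difference in turnout rates
    extra_votes = lift * treatment["n"]    # votes attributable to the message, hypothetically

    print(f"turnout lift: {lift:.2%}, extra votes: {extra_votes:,.0f}")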

This large increase in voter turnout was prompted by a single message on Facebook spread among a relatively small subset of its users. Imagine that, instead of a research question, the study was driven by a particular political campaign. Or, instead, imagine that Facebook itself had some political agenda – it’s not too absurd a notion to imagine.

11. Blackout

[slide 11]

In fact, on January 18, 2012, a great portion of the social web rallied under a single political agenda. An internet blackout. In protest of two proposed U.S. congressional laws that threatened freedom of speech on the Web, SOPA and PIPA, 115,000 websites voluntarily blacked out their homepages, replacing them with pleas to petition Congress to stop the bills.

Reddit, Wikipedia, Google, Mozilla, Twitter, Flickr, and others asked their users to petition Congress, and it worked. Over 3 million people emailed their congressional representatives directly, another million sent a pre-written message to Congress from the Electronic Frontier Foundation, a Google petition reached 4.5 million signatures, and lawmakers ultimately collected the names of over 14 million people who protested the bills. Unsurprisingly, the bills were never put up to vote.

These techniques are increasingly being leveraged to influence consumers and voters into acting in line with whatever campaign is at hand. Social networks and the social web, especially, are becoming tools for advertisers and politicians.

12a. Facebook and Social Guessing

In 2010, Tim Tangherlini invited a few dozen computer scientists, social scientists, and humanists to a two-week intensive NEH-funded summer workshop on network analysis for the humanities. Math camp for nerds, we called it. The environment was electric with potential projects and collaborations, and I’d argue it was this workshop that really brought network analysis to the humanities in force.

During the course of the workshop, one speaker sticks out in my memory: a data scientist at Facebook. He reached the podium, like so many did during those two weeks, and described the amazing feats they were able to perform using basic linguistic and network analyses. We can accurately predict your gender and race, he claimed, regardless of whether you’ve told us. We can learn your political leanings, your sexuality, your favorite band.

Much like most talks from computer scientists at the event, the purpose was to show off the power of large-scale network analysis when applied to people, and didn’t focus much on its application. The speaker did note, however, that they used these measurements to effectively advertise to their users; electronics vendors could advertise to wealthy 20-somethings; politicians could target impoverished African Americans in key swing states.

It was a few throw-away lines in the presentation, but the force of the ensuing questions revolved around those specifically. How can you do this without any sort of IRB oversight? What about the ethics of all this? The Facebook scientist’s responses were telling: we’re not doing research, we’re just running a business.

And of course, Facebook isn’t the only business doing this. The Twitter analytics dashboard allows you to see your male-to-female follower ratio, even though users are never asked their gender. Gender is guessed based on features of language and interactions, and they claim around 90% accuracy.
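To make “guessed based on features of language” concrete, here is a deliberately crude sketch: a bag-of-words classifier trained on a handful of invented tweets with invented labels. Notice that the toy labels bake in exactly the kind of stereotyping this talk worries about, which is rather the point; nothing here reflects Twitter’s actual pipeline, and real systems use vastly more data and features.

    # A toy "guess an attribute from language" classifier. Data and labels are invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    tweets = [
        "heading to the game with the guys tonight",
        "new beard trimmer review up on the blog",
        "grabbing beers and wings after work",
        "loved this new mascara, full review soon",
        "girls night out, can't wait",
        "my new skirt came in the mail today, so cute",
    ]
    labels = ["m", "m", "m", "f", "f", "f"]  # invented, stereotyped labels, for illustration only

    model = make_pipeline(CountVectorizer(), LogisticRegression())
    model.fit(tweets, labels)
    print(model.predict(["review of the new eyeliner is up"]))  # the model's (crude) guess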

Google, when it targets ads towards you as a user, makes some predictions based on your search activity. Google guessed, without my telling it, that I am a 25-34 year old male who speaks English and is interested in, among other things, Air Travel, Physics, Comics, Outdoors, and Books. Pretty spot-on.

12b. Facebook and Emotional Contagion

And, as we saw with the Facebook voting study, social web services are not merely capable of learning about you; they are capable of influencing your actions. Recently, this ethical question has pushed its way into the public eye in the form of another Facebook study, this one about “emotional contagion.”

A team of researchers and Facebook data scientists collaborated to learn the extent to which emotions spread through a social network. They selectively filtered the messages seen by about 700,000 Facebook users, making sure that some users only saw emotionally positive posts by their friends, and others only saw emotionally negative posts. After some time passed, they showed that users who were presented with positive posts tended to post positive updates, and those presented with negative posts tended to post negative updates.
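Here is a minimal sketch of how the outcome of such an experiment might be measured: score the emotional tone of users’ own subsequent posts under each feed condition and compare averages. The word lists and posts are tiny inventions of mine; the actual study used the LIWC lexicon over millions of posts.

    # Toy sentiment scoring of posts under two experimental conditions.
    POSITIVE = {"great", "happy", "love", "wonderful"}
    NEGATIVE = {"sad", "awful", "hate", "terrible"}

    def positivity(post):
        words = post.lower().split()
        pos = sum(w in POSITIVE for w in words)
        neg = sum(w in NEGATIVE for w in words)
        return (pos - neg) / max(len(words), 1)

    conditions = {
        "saw_positive_feed": ["what a wonderful day", "love this happy song"],
        "saw_negative_feed": ["feeling sad today", "this weather is awful"],
    }
    for condition, posts in conditions.items():
        avg = sum(positivity(p) for p in posts) / len(posts)
        print(f"{condition}: average positivity {avg:+.2f}")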

The study stirred up quite the controversy, and for a number of reasons. I’ll unpack a few of them:

First of all, there were worries about the ethics of consent. How could Facebook do an emotional study of 700,000 users without getting their consent, first? The EULA that everyone clicks through when signing up for Facebook only has one line saying that data may be used for research purposes, and even that line didn’t appear until several months after the study occurred.

A related issue raised was one of IRB approval: how could the editors at PNAS have approved the study given that the study took place under Facebook’s watch, without an external Institutional Review Board? Indeed, the university-affiliated researchers did not need to get approval, because the data were gathered before they ever touched the study. The counter-argument was that, well, Facebook conducts these sorts of studies all the time for the purposes of testing advertisements or interface changes, as does every other company, so what’s the problem?

A third issue discussed was one of repercussions: if the study showed that Facebook could genuinely influence people’s emotions, did anyone in the study physically harm themselves as a result of being shown a primarily negative newsfeed? Should Facebook be allowed to wield this kind of influence? Should they be required to disclose such information to their users?

The controversy spread far and wide, though I believe for the wrong reasons, which I’ll explain shortly. Social commentators decried the lack of consent, arguing that PNAS shouldn’t have published the paper without proper IRB approval. On the other side, social scientists argued the Facebook backlash was antiscience and would cause more harm than good. Both sides made valid points.

One well-known social scientist noted that the Age of Exploration, when scientists finally started exploring the further reaches of the Americas and Africa, was attacked by poets and philosophers and intellectuals as being dangerous and unethical. But, he argued, did not that exploration bring us new wonders? Miracle medicines and great insights about the world and our place in it?

I call bullshit. You’d be hard-pressed to find a period more rife with slavery and genocide and other horrible breaches of human decency than that Age of Exploration. We can’t sacrifice human decency in the name of progress. On the flip side, though, we can’t sacrifice progress for the tiniest fears of misconduct. We must give ethics its due diligence without being paralyzed into inaction.

But this is all a red herring. The issue here isn’t whether and to what extent these activities are ethical science, but to what extent they are ethical period, and if they aren’t, what we should do about it. We can’t have one set of ethical standards for researchers, and another for businesses, but that’s what many of the arguments in recent months have boiled down to. Essentially, it was argued, Facebook does this all the time. It’s something called A/B testing: they make changes for some users and not others, and depending on how the users react, they change the site accordingly. It’s standard practice in web development.
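For anyone who hasn’t run into it, here is a minimal sketch of garden-variety A/B testing: deterministically bucket users by hashing their IDs, show each bucket a different variant, and compare a metric afterwards. The experiment name and the simulated click behavior are made up for illustration.

    import hashlib
    import random

    def bucket(user_id, experiment="newsfeed_tweak"):
        """Deterministically assign a user to variant A or B."""
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return "A" if int(digest, 16) % 2 == 0 else "B"

    random.seed(0)
    clicks = {"A": 0, "B": 0}
    totals = {"A": 0, "B": 0}
    for user_id in range(10_000):
        b = bucket(user_id)
        totals[b] += 1
        rate = 0.10 if b == "A" else 0.12   # pretend variant B nudges click-through up
        clicks[b] += random.random() < rate

    for b in ("A", "B"):
        print(f"variant {b}: {clicks[b] / totals[b]:.1%} click-through")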

13. An FDA/FTC for Data?

It is surprising, then, that the crux of the anger revolved around the published research. Not that Facebook shouldn’t do A/B testing, but that researchers shouldn’t be allowed to publish on it. This seems to be the exact opposite of what should be happening: if indeed every major web company practices these methods already, then scholarly research on how such practices can sway emotions or voting practices are exactly what we need. We must bring these practices to light, in ways the public can understand, and decide as a society whether they cross ethical boundaries. A similar discussion occurred during the early decades of the 20th century, when the FDA and FTC were formed, in part, to prevent false advertising of snake oils and foods and other products.

We are at the cusp of a new era. The mix of big data, social networks, media companies, content creators, government surveillance, corporate advertising, and ubiquitous computing is a perfect storm for intense influence both subtle and far-reaching. Algorithmic nudging has the power to sell products, win elections, topple governments, and oppress a people, depending on how it is wielded and by whom. We have seen this work from the bottom-up, in Occupy Wall Street, the Revolutions in the Middle East, and the ALS Ice-Bucket Challenge, and from the top-down in recent presidential campaigns, Facebook studies, and coordinated efforts to preserve net neutrality. And these have been works of non-experts: people new to this technology, scrambling in the dark to develop the methods as they are deployed. As we begin to learn more about network-based control and influence, these examples will multiply in number and audacity.

14. Surveillance

And this story leaves out one of the biggest players of all: government. When Edward Snowden leaked the details of classified NSA surveillance programs, the world was shocked at the government’s interest in and capacity for omniscience. Data scientists, on the other hand, were mostly surprised that people didn’t realize this was happening. If the technology is there, you can bet it will be used.

And so here, in the NSA’s $1.5 billion data center in Utah, are the private phone calls, parking receipts, emails, and Google searches of millions of American citizens. It stores a few exabytes of our data, over a billion gigabytes and roughly equivalent to a hundred thousand times the size of the Library of Congress. More than enough space, really.
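The back-of-the-envelope math, assuming the oft-quoted and very rough figure of about 10 terabytes for the Library of Congress’s print collection digitized as text:

    EXABYTE_IN_GB = 1_000_000_000      # one exabyte is a billion gigabytes
    LOC_IN_GB = 10_000                 # ~10 TB: a rough, conventional estimate
    print(EXABYTE_IN_GB / LOC_IN_GB)   # -> 100,000 Libraries of Congress per exabyte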

The humanities have played some role in this complex machine. During the Cold War, the U.S. government covertly supported artists and authors to create cultural works which would spread American influence abroad and improve American sentiment at home.

Today the landscape looks a bit different. For the last few years DARPA, the research branch of the U.S. Department of Defense, has been funding research and hosting conferences in what they call “Narrative Networks.” Computer scientists, statisticians, linguists, folklorists, and literary scholars have come together to discuss how ideas spread and, possibly, how to inject certain sentiments within specific communities. It’s a bit like the science of memes, or of propaganda.

Beyond this initiative, DARPA funds have gone toward several humanities-supported projects to develop actionable plans for the U.S. military. One project, for example, creates as-complete-as-possible simulations of cultures overseas, which can model how groups might react to the dropping of bombs or the spread of propaganda. These models can be used to aid in the decision-making processes of officers making life-and-death decisions on behalf of troops, enemies, and foreign citizens. Unsurprisingly, these initiatives, as well as NSA surveillance at home, all rely heavily on network analysis.

In fact, when the news broke about how Osama bin Laden and Saddam Hussein were tracked down via network analysis, some of my family called me after reading the newspapers, claiming “we finally understand what you do!” This wasn’t the reaction I was hoping for.

In short, the world is changing incredibly rapidly, in large part driven by the availability of data, network science and statistics, and the ever-increasing role of technology in our lives. Are these corporate, political, and grassroots efforts overstepping their bounds? We honestly don’t know. We are only beginning to have sustained, public discussions about the new role of technology in society, and the public rarely has enough access to information to make informed decisions. Meanwhile, media and web companies may be forgiven for overstepping ethical boundaries, as our culture hasn’t quite gotten around to drawing those boundaries yet.

15. The Humanities’ Place

This is where the humanities come in – not because we have some monopoly on ethics (goodness knows the way we treat our adjuncts is proof we do not) – but because we are uniquely suited to the small scale. To close reading. While what often sets the digital humanities apart from its analog counterpart is the distant reading, the macroanalysis, what sets us all apart is our unwillingness to stray too far from the source. We intersperse the distant with the close, attempting to reintroduce the individual into the aggregate.

Network analysis, not coincidentally, is particularly suited to this endeavor. While recent efforts in sociophysics have stressed the importance of the grand scale, let us not forget that network theory was built on the tiniest of pieces in psychology and sociology, used as a tool to explore individuals and their personal relationships. In the intervening years, all manner of methods have been created to bridge macro and micro, from Granovetter’s theory of weak ties to Milgram’s of Small Worlds, and the way in which people navigate the networks they find themselves in. Networks work at every scale, situating the macro against the meso against the micro.

But we find ourselves in a world that does not adequately utilize this feature of networks, and is increasingly making decisions based on convenience and money and politics and power without taking the human factor into consideration. And it’s not particularly surprising: it’s easy, in the world of exabytes of data, to lose the trees for the forest.

This is not a humanities problem. It is not a network scientist problem. It is not a question of the ethics of research, but of the ethics of everyday life. Everyone is a network scientist. From Twitter users to newscasters, the boundary between people who consume and people who are aware of and influence the global social network is blurring, and we need to deal with that. We must collaborate with industries, governments, and publics to become ethical stewards of this networked world we find ourselves in.

16. Big and Small

Your challenge, as researchers on the forefront of network analysis and the humanities, is to tie the very distant to the very close. To do the research and outreach that is needed to make companies, governments, and the public aware of how perturbations of the great mobile that is our society affect each individual piece.

We have a number of routes available to us, in this respect. The first is in basic research: the sort that got those Facebook study authors in such hot water. We need to learn and communicate the ways in which pervasive surveillance and algorithmic influence can affect people’s lives and steer societies.

A second path towards influencing an international discussion is in the development of new methods that highlight the place of the individual in the larger network. We seem to have a critical mass of humanists collaborating with or becoming computer scientists, and this presents a perfect opportunity to create algorithms which highlight a node’s uniqueness, rather than its similarity.

Another step to take is one of public engagement that extends beyond the academy, and takes place online, in newspapers or essays, in interviews, in the creation of tools or museum exhibits. The MIT Media Lab, for example, created a tool after the Snowden leaks that allows users to download their email metadata to reveal the networks they form. The tool was a fantastic example of a way to show the public exactly what “simply metadata” can reveal about a person, and its viral spread was a testament to its effectiveness. Mike Widner of Stanford called for exactly this sort of engagement from digital humanists a few years ago, and it is remarkable how little that call has been heeded.
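In the spirit of that tool, here is a minimal sketch of what “simply metadata” can do: build a network from nothing but From/To pairs in email headers, with no message bodies required. The sample headers are invented, and this is not the MIT tool’s actual code.

    import networkx as nx  # third-party network-analysis library

    # From/To pairs pulled from (invented) email headers.
    headers = [
        ("alice@example.com", "bob@example.com"),
        ("alice@example.com", "carol@example.com"),
        ("bob@example.com", "carol@example.com"),
        ("carol@example.com", "dave@example.com"),
    ]

    G = nx.Graph()
    for sender, recipient in headers:
        if G.has_edge(sender, recipient):
            G[sender][recipient]["weight"] += 1   # repeated contact = a stronger tie
        else:
            G.add_edge(sender, recipient, weight=1)

    # Who sits at the center of this person's social world?
    print(sorted(G.degree, key=lambda kv: kv[1], reverse=True))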

Pedagogy is a fourth option. While people cry that the humanities are dying, every student in the country will have taken many humanities-oriented courses by the time they graduate. These courses, ostensibly, teach them about what it means to be human in our complex world. Alongside the history, the literature, the art, let’s teach what it means to be part of a global network, constantly contributing to and being affected by its shadow.

With luck, reconnecting the big with the small will hasten a national discussion of the ethical norms of big data and network analysis. This could result in new government regulatory agencies, ethical standards for media companies, or changes in the ways people interact with and behave on the social web.

17. Going Forward

When you zoom out far enough, everything looks the same. Occupy Wall Street; Ferguson Riots; the ALS Ice Bucket Challenge; the Iranian Revolution. They’re all just grassroots contagion effects across a social network. Rhetorically, presenting everything as a massive network is the same as photographing the earth from four billion miles: beautiful, sobering, and homogenizing. I challenge you to compare network visualizations of Ferguson Tweets with the ALS Ice Bucket Challenge, and see if you can make out any differences. I couldn’t. We need to zoom in to make meaning.

The challenge of network analysis in the humanities is to bring our close reading perspectives to the distant view, so media companies and governments don’t see everyone as just some statistic, some statistical blip floating on this pale blue dot.

I will end as I began, with a quote from Carl Sagan, reflecting on a time gone by but every bit as relevant for the moment we face today:

I know that science and technology are not just cornucopias pouring good deeds out into the world. Scientists not only conceived nuclear weapons; they also took political leaders by the lapels, arguing that their nation — whichever it happened to be — had to have one first. … There’s a reason people are nervous about science and technology. And so the image of the mad scientist haunts our world—from Dr. Faust to Dr. Frankenstein to Dr. Strangelove to the white-coated loonies of Saturday morning children’s television. (All this doesn’t inspire budding scientists.) But there’s no way back. We can’t just conclude that science puts too much power into the hands of morally feeble technologists or corrupt, power-crazed politicians and decide to get rid of it. Advances in medicine and agriculture have saved more lives than have been lost in all the wars in history. Advances in transportation, communication, and entertainment have transformed the world. The sword of science is double-edged. Rather, its awesome power forces on all of us, including politicians, a new responsibility — more attention to the long-term consequences of technology, a global and transgenerational perspective, an incentive to avoid easy appeals to nationalism and chauvinism. Mistakes are becoming too expensive.

Let us take Carl Sagan’s advice to heart. Amidst cries from commentators on the irrelevance of the humanities, it seems there is a large void which we are both well-suited and morally bound to fill. This is the path forward.

Thank you.


Thanks to Nickoal Eichmann and Elijah Meeks for editing & inspiration.

Stanford Musings

It’s official: I am Stanford’s new DH data scientist from May to August. What does that mean? I haven’t the foggiest idea – I think figuring that out is part of my job description. Over the next few months, I’ll be assisting a small platoon of Stanfordites with their networks, their visualizations, their data, and who knows, maybe their love lives. I’m reporting to the inimitable Glen Worthey and the indomitable Elijah Meeks, who will keep me on the straight and narrow. I’ll also be blogging, teaching workshops, writing papers, and crunching numbers, all under the Stanford banner.

This announcement is on the heels of my recent trip to Stanford, and I have to say, I was incredibly impressed by the operation they had going there. The library has at least three branches under which DH projects occur, and of particular interest are the Academic Technology Specialists like Mike Widner. A half a dozen of them are embedded in different schools around campus, and they act as technology liaisons and researchers within those schools, supporting faculty projects, developing their own research, and just generally fostering a fantastic digital humanities presence on the Stanford campus.

Stanford! Did you know it’s actually “Leland Stanford Junior University”? Weird, right?

Then there’s Elijah Meeks and Karl Grossner. Do you know those TV shows where contestants vie for a fancy house from some team of super creative builders? They basically do that, except instead of offering cool new digs, they offer their impressive technical services for a few months. There’s also the Lit Lab, CESTA, the DH Focal Group, and probably a dozen other projects which do DH on campus in some way or another.

As far as I can tell, I’ll be just one more chaotic agent in this complex DH environment. Many of the big projects going on at Stanford rely in some way on networks, and I’m going to try to bring them all together and set agendas for how they can best utilize and analyze the networks at hand. I’ll also design some tools that’ll make it easier for future network-y projects to get off the ground. There’s also a bunch of Famous Network Scientists who operate out of Stanford, and I plan on nurturing some collaborations between them, the DH community, and some humanities-curious tenants of Silicon Valley.

It will be interesting to see how this position unfolds. As far as I’m aware, the “resident data scientist” model for DH is an untried one at any university, and I’m lucky and honored that Stanford has decided to take a chance on such a new position with me at the helm. If this proves successful, it will provide even more proof that the role of libraries in fostering DH on campus can be a powerful one. Of course there’s also the chance I could fail spectacularly, but in true DH tradition, I believe such a public failure would also be a worthy outcome. If the process works, great; if not, we’ll know what to fix for the next try.

Barriers to Scholarship & Iterative Writing

This post is mostly just thinking out loud, musing about two related barriers to scholarship: a stigma related to self-plagiarism, and various copyright concerns. It includes a potential way to get past them.

Self-Plagiarism

When Jonah Lehrer’s plagiarism scandal first broke, it sounded a bit silly. Lehrer, it turned out, had taken some sentences he’d used in earlier articles, and reused them in a few New Yorker blog posts. Without citing himself. Oh no, I thought. Surely, this represents the height of modern journalistic moral depravity.

Of course, later it was revealed that he’d bent facts, and plagiarized from others without reference, and these were all legitimately upsetting. And plagiarizing himself without reference was mildly annoying, though certainly not something that should have attracted national media attention. But it raises an interesting question: why is self-plagiarism wrong? And it’s as wrong in academia as it is in journalism.

Lehrer chart from Slate. [via]
I can’t speak for journalists (though Alberto Cairo can, and he lists some of the good reasons why non-referenced self-plagiarism is bad, linking to not one but two articles about it), but for academia, the reasons behind the wrongness seem pretty clear.

  1. It’s wrong to directly lift from any source without adequate citation. This only applies to non-cited self-plagiarism, obviously.
  2. It’s wrong to double-dip. The currency of the academy is publications / CV lines, and if you reuse work to fill your CV, you’re getting an unfair advantage.
  3. Confusion. Which version should people reference if you have so many versions of a similar work?
  4. Copyright. You just can’t reuse stuff, because your previous publishers own the copyright on your earlier work.

That about covers it. Let’s pretend academics always cite their own works (because, hell, it gives them more citations), so we can do away with #1. Regular readers will know my position on publisher-owned copyright, so I just won’t get into #4 here to save you my preaching. The others are a bit more difficult to write off, but before I go on to try to do that, I’d like to talk a bit about my own experience of self-plagiarism as a barrier to scholarship.

I was recently invited to speak at the Universal Decimal Classification seminar, where I presented on the history of trees as a visual metaphor for knowledge classification. It’s not exactly my research area, but it was such a fun subject, I’ve decided to write an article about it. The problem is, the proceedings of the UDC seminar were published, and about 50% of what I wanted to write is already sitting in a published proceedings that, let’s face it, not many people will ever read. And if I ever want to add to it, I have to change the already-published material significantly if I want to send it out again.

Since I presented, my thesis has changed slightly, I’ve added a good chunk of new material, and I’ve fleshed out the theoretical underpinnings. I now have a pretty good article that’s ready to be sent out for peer review, but if I want to do that, I can’t just have a reference saying “half of this came from a published proceeding.” Well, I could, but apparently there’s a slight taboo against this. I was told to “be careful,” that I’d have to “rephrase” and “reword.” And, of course, I’d have to cite my earlier publication.

I imagine most of this comes from the fear of scholars double-dipping, or padding their CVs. Which is stupid. Good scholarship should come first, and our methods of scholarly attribution should mold themselves to it. Right now, scholarship is enslaved to the process of attribution and publication. It’s why we willingly donate our time and research to publishing articles, and then have our universities buy back our freely-given scholarship in expensive subscription packages, when we could just have the universities pay for the research upfront and then release it for free.

Copyright

The question of copyright is pretty clear: how much will the publisher charge if I want to reuse a significant portion of my work somewhere else? The publisher to which I refer, Ergon Verlag, is, I’ve heard, pretty lenient about such things, but what if I were reprinting from a different publisher?

There’s an additional, more external, concern about my materials. It’s a history of illustrations, and the manuscript itself contains 48 illustrations in all. If I want to use them in my article, for demonstrative purposes, I not only need to cite the original sources (of course), I need to get permission to use the illustrations from the publishers who scanned them – and this can be costly and time-consuming. I priced a few of them so far, and they range from free to hundreds of dollars.

A Potential Solution – Iterative Writing

To recap, there are two things currently preventing me from sending out a decent piece of scholarship for peer-review:

  1. A taboo against self-plagiarism, which requires quite a bit of time for rewriting, permission from the original publisher to reuse material, and/or the dissolution of such a taboo.
  2. The cost and time commitment of tracking down copyright holders to get permission to reproduce illustrations.

I believe the first issue is largely a historical artifact of print-based media. Scholars trust citations because, for hundreds of years, nearly every printing of a single text was largely identical. Sure, there were occasionally a handful of editions, some small textual changes, some page number changes, but citing a text could easily be done, and so we developed a huge infrastructure around citations and publications that exists to this day. It was costly and difficult to change a printed text, and so it wasn’t done often, and now our scholarly practices are based around the idea that scholarly material has to be permanent and unchanging, finished, if it is to enter the canon and become a citeable source.

In the age of Wikipedia, this is a weird idea. Texts grow organically, they change, they revert. Blog posts get updated. A scholarly article, though, is relatively constant, even those in online-only publications. One of the major exceptions is arXiv-like pre-print repositories, which allow an article to go through several versions before the final one goes off to print. But generally, once the final version goes to print, no further changes are made.

The reasons behind this seem logical: it’s the way we’ve always done it, so why change a good thing? It’s hard to cite something that’s constantly changing; how do we know the version we cited will be preserved?

In an age of cheap storage and easily tracked changes, this really shouldn’t be a concern. Wikipedia does this very well: you can easily cite the version of an article from a specific date and, if you want, easily see how the article changed between then and any other date.

Changes between versions of the Wikipedia entry on History.
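
If you wanted to build that kind of version-pinned citation programmatically, a minimal sketch in Python might look like the following. It uses only the standard library and the public MediaWiki API; the article title “History” is just the example from the caption above, and the exact shape of the API response is as I recall it, so treat this as a sketch rather than gospel.

    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def latest_revision_id(title):
        # Ask the MediaWiki API for the newest revision id of an article.
        params = urllib.parse.urlencode({
            "action": "query",
            "prop": "revisions",
            "titles": title,
            "rvprop": "ids|timestamp",
            "rvlimit": 1,
            "format": "json",
        })
        request = urllib.request.Request(API + "?" + params,
                                         headers={"User-Agent": "citation-sketch/0.1"})
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        page = next(iter(data["query"]["pages"].values()))
        return page["revisions"][0]["revid"]

    revid = latest_revision_id("History")
    # A link that always points to exactly this version of the article:
    print("https://en.wikipedia.org/w/index.php?title=History&oldid=%d" % revid)
    # And a link showing every change between this version and the current one:
    print("https://en.wikipedia.org/w/index.php?title=History&diff=cur&oldid=%d" % revid)

The oldid permalink is the same thing Wikipedia’s own “Permanent link” sidebar tool hands you, so a citation built this way stays put no matter how much the article changes afterward.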

This would be more difficult to implement in academia because article hosting isn’t centralized. It’s difficult to be certain that the URL hosting a journal article now will persist for 50 years, both because of ownership and design changes, and it’s difficult to trust that whoever owns the article or the site won’t change the content without preserving every single version, or at least a detailed record of the changes they’ve made.

There’s an easy solution: don’t just reference everything you cite, embed everything you cite. If you cite a picture, include the picture. If you cite a book, include the book. If you cite an article, include the article. Storage is cheap: if your book cites a thousand sources, and includes a copy of every single one, it’ll be at most a gigabyte. Probably, it would be quite a bit smaller. That way, if the material changes down the line, everyone reading your research will still be able to refer to the original material. Further, because you include a full reference, people can go and look the material up to see if it has changed or updated in the time since you cited it.
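
As a sketch of what “embed everything you cite” could look like in practice, here’s a minimal Python script, standard library only, that fetches each cited source, stores a copy under its content hash, and writes a manifest tying citations to those copies. The URL list and directory name are placeholders of my own invention, not anything from a real bibliography.

    import hashlib
    import json
    import pathlib
    import urllib.request

    # Placeholder citations; in practice these would come from your bibliography.
    CITATIONS = [
        "https://en.wikipedia.org/w/index.php?title=History&oldid=123456789",  # hypothetical revision id
        "https://example.org/some-cited-article.pdf",                          # hypothetical source
    ]

    archive = pathlib.Path("cited_sources")
    archive.mkdir(exist_ok=True)
    manifest = []

    for url in CITATIONS:
        request = urllib.request.Request(url, headers={"User-Agent": "citation-archiver-sketch/0.1"})
        with urllib.request.urlopen(request) as response:
            body = response.read()
        digest = hashlib.sha256(body).hexdigest()
        # Store each source under its hash: identical sources aren't duplicated,
        # and any later change to the live version is immediately detectable.
        (archive / digest).write_bytes(body)
        manifest.append({"url": url, "sha256": digest, "bytes": len(body)})

    (archive / "manifest.json").write_text(json.dumps(manifest, indent=2))

A thousand sources in a gigabyte works out to about a megabyte each, roughly the size of a longish PDF, so the back-of-the-envelope estimate above holds up.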

Of course, this idea can’t work – copyright wouldn’t let it. But again, this is a situation where the industry of academia is getting in the way of potential improvements to the way scholarship can work.

The important thing, though, is that self-plagiarism would become a somewhat irrelevant concept. Want to write more about what you wrote before? Just iterate your article. Add some new references, a paragraph here or there, change the thesis slightly. Make sure to keep a log of all your changes.

I don’t know if this is a good solution, but it’s one of many improvements to scholarship – or at least, removals of barriers to publishing interesting things in a timely and inexpensive fashion – which are currently impossible because of copyright concerns and institutional barriers to change. Cameron Neylon, from PLOS, recently discussed how copyright put up some barriers to his own interesting ideas. Academia is not a nimble beast, and because of that, we are stuck with a lot of scholarly practices shaped, in part, by the constraints of old media.

In short: academic writing is tough. There are ways it could be easier, that would allow good scholarship to flow more freely, but we are constrained by path dependency from choices we made hundreds of years ago. It’s time to be a bit more flexible and be more willing to try out new ideas. This isn’t anywhere near a novel concept on my part, but it’s worth repeating.

The last big objection to self-plagiarism, double-dipping to pad one’s CV, still seems tricky to get past. I’m not thrilled with the way we currently assess scholarship, and “CV size” is just one of the things I don’t like about it, but I don’t have any particularly clever fixes on that end.

Understanding Special Relativity through History and Triangles (pt. 1)

We interrupt this usually-DH blog because I got into a discussion about Special Relativity with a friend, and promised it was easily understood using only the math we use for triangles. But I’m a historian, so I can’t leave a good description alone without some background.

If you just want to learn how relativity works, skip ahead to the next post, Relativity Made Simple [Note! I haven’t written it yet, this is a two-part post. Stay tuned for the next section]; if you hate science and don’t want to know how the universe functions, but love history, read only this post. If you have a month of time to kill, just skip this post entirely and read through my 122-item relativity bibliography on Zotero. Everyone else, disregard this paragraph.

An Oddly Selective History of Relativity

This is not a history of how Einstein came up with his Theory of Special Relativity as laid out in Zur Elektrodynamik bewegter Körper in 1905. That paper is filled with big words like aberration and electrodynamics, and equations with occult symbols. We don’t need to know that stuff. This is a history of how others understood relativity. Eventually, you’re going to understand relativity, but first I’m going to tell you how other people, much smarter than you, did not.

There’s an infamous (potentially mythical) story about how difficult it is to understand relativity: Arthur Eddington, a prominent astronomer, was asked whether it was true that only three people in the world understood relativity. After pausing for a moment, Eddington replied, “I’m trying to think who the third person is!” This was about General Relativity, but it was also a joke: good scientists know relativity isn’t incredibly difficult to grasp, and even early on, lots of people could claim to understand it.

Good historians, however, know that’s not the whole story. It turns out a lot of people who thought they understood Einstein’s conceptions of relativity actually did not, including those who agreed with him. This, in part, is that story.

Relativity Before Einstein

Einstein’s special theory of relativity relied on two assumptions: (1) you can’t ever tell whether you’re standing still or moving at a constant velocity (or, in physics-speak, the laws of physics in any inertial reference frame are indistinguishable from one another), and (2) light always looks like it’s moving at the same speed (in physics-speak, the speed of light is always constant no matter the velocity of the emitting body nor that of the observer’s inertial reference frame). Let’s trace these concepts back.

Our story begins in the 14th century. William of Occam, famous for his razor, claimed motion was merely the location of a body and its successive positions over time; motion itself was in the mind. Because position was simply defined in terms of the bodies that surround it, this meant motion was relative. Occam’s student, Buridan, pushed that claim forward, saying “If anyone is moved in a ship and imagines that he is at rest, then, should he see another ship which is truly at rest, it will appear to him that the other ship is moved.”

Galileo’s relativity [via]. The site where this comes from is a little crazy, but the figure is still useful, so here it is.
The story moves forward at irregular speed (much like the speed of this blog, and the pacing of this post). Within a century, scholars introduced the concepts of an infinite universe without any center, nor any other ‘absolute’ location. Copernicus cleverly latched onto this relativistic thinking by showing that the math works just as well, if not better, when the Earth orbits the Sun, rather than vice versa. Galileo claimed there was no way, on the basis of mechanical experiments, to tell whether you were standing still or moving at a uniform speed.

For his part, Descartes disagreed, but did say that the only way one could discuss movement was relative to other objects. Christiaan Huygens took Descartes a step further, showing that there are no ‘privileged’ motions or speeds (that is, there is no intrinsic meaning of a universal ‘at rest’ – only ‘at rest’ relative to other bodies). Isaac Newton knew that it was impossible to measure something’s absolute velocity (rather than velocity relative to an observer), but still, like Descartes, supported the idea that there was an absolute space and absolute velocity – we just couldn’t measure them.

Let’s skip ahead some centuries. The year is 1893; the U.S. Supreme Court declared the tomato was a vegetable, Gandhi campaigned against segregation in South Africa, and the U.S. railroad industry bubble had just popped, forcing the government to bail out AIG for $85 billion. Or something. Also, by this point, most scientists thought light traveled in waves. Given that in order for something to travel in a wave, something has to be waving, scientists posited there was this luminiferous ether that pervaded the universe, allowing light to travel between stars and candles and those fish with the crazy headlights. It makes perfect sense. In order for sound waves to travel, they need air to travel through; in order for light waves to travel, they need the ether.

Ernst Mach, a philosopher read by many contemporaries (including Einstein), said that Newton and Descartes were wrong: absolute space and absolute motion are meaningless. It’s all relative, and only relative motion has any meaning. It is physically impossible to measure an object’s “real” velocity, and philosophically nonsensical besides. The ether, however, was useful. According to Mach and others, we could still measure something kind of like absolute position and velocity by measuring things in relationship to that all-pervasive ether. Presumably, the ether was just sitting still, doing whatever ether does, so we could use its stillness as a reference point and measure how fast things were going relative to it.

Well, in theory. Earth is hurtling through space, orbiting the sun at about 70,000 miles per hour, right? And it’s spinning too, at about a thousand miles an hour. But the ether is staying still. And light, supposedly, always travels at the same speed through the ether no matter what. So in theory, light should look like it’s moving a bit faster if we’re moving toward its source, relative to the ether, and a bit slower if we’re moving away from it, relative to the ether. It’s just like if you’re in a train hurtling toward a baseball pitcher at 100 mph, and the pitcher throws a ball at you, also at 100 mph, in a futile attempt to stop the train. To you, the baseball will look like it’s going twice as fast, because you’re moving toward it.
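
In equation form, that everyday intuition, what physicists call Galilean velocity addition, is just a sum:

$$ u_{\text{observed}} = u_{\text{ball}} + u_{\text{train}} = 100 \text{ mph} + 100 \text{ mph} = 200 \text{ mph} $$

The ether experiments described next were, in effect, a hunt for light’s version of that extra 100 mph.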

The earth moving through the ether. [via]
It turns out measuring the speed of light in relation to the ether was really difficult. A bunch of very clever people made a bunch of very clever instruments which really should have measured the speed of the earth moving through the ether, based on small observed differences in the speed of light going in different directions, but the experiments always showed light moving at the same speed. Scientists figured this must mean the earth was actually exerting a pull on the ether in its vicinity, dragging it along with it as the earth hurtled through space, explaining why light seemed to be constant in both directions when measured on earth. They devised even cleverer experiments that would account for such an ether drag, but even those seemed to come up blank. Their instruments, it was decided, simply were not yet fine-tuned enough to measure such small variations in the speed of light.

Not so fast! shouted Lorentz, except he shouted it in Dutch. Lorentz used the new electromagnetic theory to suggest that the null results of the ether experiments were actually a result, not of the earth dragging the ether along behind it, but of physical objects compressing when they moved against the ether. The experiments weren’t showing the difference in the speed of light they sought because the measuring instruments themselves contracted to just the right length to perfectly offset the difference in the velocity of light when measuring “into” the ether. The ether was literally squeezing the electrons in the meter stick together so it became a little shorter; short enough to inaccurately measure light’s speed. The set of equations used to describe this effect became known as the Lorentz transformations. One property of these transformations was that the physical contractions would, obviously, appear the same to any observer. No matter how fast you were going relative to your measuring device, if it were moving into the ether, you would see it contracting slightly to accommodate the measurement difference.
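
For readers who don’t want to wait for part two, the standard modern form of those transformations, for motion along the x-axis at relative velocity $v$, is:

$$ x' = \gamma (x - vt), \qquad t' = \gamma \left( t - \frac{vx}{c^2} \right), \qquad \gamma = \frac{1}{\sqrt{1 - v^2/c^2}} $$

The contraction Lorentz had in mind falls out as $L = L_0 \sqrt{1 - v^2/c^2}$: a meter stick moving “into” the ether shrinks by exactly the factor needed to hide the difference in light’s measured speed.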

Not so fast! shouted Poincaré, except he shouted it in French. This property of the transformations to always appear the same, relative to the ether, was actually a problem. Remember those 500 years of physics that said there is no way to mechanically determine your absolute speed or absolute location in space? Yeah, so did Poincaré. He said the only way you could measure velocity or location was matter-to-matter, not matter-to-ether, so the Lorentz transformations didn’t fly.

It’s worth taking a brief aside to talk about the underpinnings of the theories of both Lorentz and Poincaré. Their theories were based on experimental evidence, which is to say, they based their reasoning about contraction on apparent experimental evidence of that contraction, and they based their theories of relativity on experimental evidence of motion being relative.

Einstein and Relativity

When Einstein hit the scene in 1905, he approached relativity a bit differently. Instead of trying to fit the apparent contraction of objects from the ether drift experiments to a particular theory, Einstein began with the assumption that light always appeared to move at the same rate, regardless of the relative velocity of the observer. The other assumption he began with was that there was no privileged frame of reference; no absolute space or velocity, only the movement of matter relative to other matter. I’ll work out the math later but, unsurprisingly, following these assumptions to their conclusions led to exactly the same transformation equations Lorentz had arrived at from the experimental evidence.

The math was the same. The difference was in the interpretation of the math. Einstein’s theory required no ether, but what’s more, it did not require any physical explanations at all. Because Einstein’s theory of special relativity rested on two postulates about measurement, the theory’s entire implications lay in its ability to affect how we measure or observe the universe. Thus, the interpretation of objects “contracting,” under Einstein’s theory, was that they were not contracting at all. Instead, objects merely appear as though they contract relative to the movement of the observer. Another result of these transformation equations is that, from the perspective of the observer, time appears to move slower or faster depending on the relative speed of what is being observed. Lorentz’s theory predicted the same time dilation effects, but he just chalked it up to a weird result of the math that didn’t actually manifest itself. In Einstein’s theory, however, weird temporal stretching effects were Actually What Was Going On.
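
The temporal stretching has an equally compact form: a clock moving at velocity $v$ relative to you appears, from your perspective, to tick slowly by that same factor $\gamma$,

$$ \Delta t_{\text{observed}} = \gamma \, \Delta t_{\text{clock}} = \frac{\Delta t_{\text{clock}}}{\sqrt{1 - v^2/c^2}} $$

so at $v = 0.6c$, for instance, $\gamma = 1.25$ and one second on the moving clock takes 1.25 seconds by your watch. For Lorentz this was bookkeeping; for Einstein it was what the universe actually does.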

To reiterate: the math of Lorentz, Einstein, and Poincaré was (at least at this early stage) essentially equivalent. The upshot was that no experiment could favor one theory over another: the observational predictions of each theory were exactly the same.

Relativity’s Supporters in America

I’m focusing on America here because it’s rarely focused on in the historiography, and it’s about time someone did. If I were being scholarly and citing my sources, this might actually be a novel contribution to historiography. Oh well, BLOG! All my primary sources are in that Zotero library I linked to earlier.

In 1910, Daniel Comstock wrote a popular account of the relativity of Lorentz and Einstein, to some extent conflating the two. He suggested that if Einstein’s postulates could be experimentally verified, his special theory of relativity would be true. “If either of these postulates be proved false in the future, then the structure erected can not be true in its present form. The question is, therefore, an experimental one.” Comstock’s statement betrays a misunderstanding of Einstein’s theory, though, because, at the time of that writing, there was no experimental difference between the two theories.

Gilbert Lewis and Richard Tolman presented a paper at the 1908 American Physical Society meeting in New York, where they described themselves as fully behind Einstein over Lorentz. Oddly, they considered Einstein’s theory to be correct, as opposed to Lorentz’s, because his postulates were “established on a pretty firm basis of experimental fact.” Which, to reiterate, couldn’t possibly have been a difference between Lorentz and Einstein. Odder still, they presented the theory not as one of physics or of measurement, but of psychology (a bit like 14th century Oresme). The two went on to separately write a few articles which supposedly experimentally confirmed the postulates of special relativity.

In fact, the few Americans who did seem to engage with the actual differences between Lorentz and Einstein did so primarily in critique. Louis More, a well-respected physicist from Cincinnati, labeled the difference as metaphysical and primarily useless. This American critique was fairly standard.

At the 1909 American Physical Society meeting in Boston, one physicist (Harold Wilson) claimed his experiments showed the difference between Einstein and Lorentz. One of the few truly theoretical American physicists, W.S. Franklin, was in attendance, and the lectures he saw inspired him to write a popular account of relativity in 1911; in it, he found no theoretical difference between Lorentz and Einstein. He tended to side theoretically with Einstein, but assumed Lorentz’s theory implied the same space and time dilation effects, which it did not.

Even this series of misunderstandings should be taken as a set of shining examples in the context of an American approach to theoretical physics that was largely antagonistic, at times decrying theoretical differences entirely. At a symposium on Ether Theories at the 1911 APS meeting, the presidential address by William Magie was largely about the uselessness of relativity because, according to him, physics should be a functional activity based in utility and experimentation. Joining Magie’s “side” in the debate were Michelson, Morley, and Arthur Gordon Webster, the co-founder of the American Physical Society. Of those at the meeting supporting relativity, Lewis was still convinced Einstein differed experimentally from Lorentz, and Franklin and Comstock each felt there was no substantive difference between the two. In 1912, Indiana University’s R.D. Carmichael stated Einstein’s postulates were “a direct generalization from experiment.” In short, the Americans were really focused on experiment.

Of Einstein’s theory, Louis More wrote in 1912:

Professor Einstein’s theory of Relativity [… is] proclaimed somewhat noisily to be the greatest revolution in scientific method since the time of Newton. That [it is] revolutionary there can be no doubt, in so far as [it] substitutes mathematical symbols as the basis of science and denies that any concrete experience underlies these symbols, thus replacing an objective by a subjective universe. The question remains whether this is a step forward or backward […] if there is here any revolution in thought, it is in reality a return to the scholastic methods of the Middle Ages.

More goes on to say how the “Anglo-Saxons” demand practical results, not the unfathomable theories of “the German mind.” Really, that quote about sums it up. By this point, the only Americans who even talked about relativity were the ones who trained in Germany.

I’ll end here, where most histories of the reception of relativity begin: the first Solvay Conference. It’s where this beautiful picture was taken.

First Solvay Conference. [via]
To sum up: in the seven years following Einstein’s publication, the only Americans who agreed with Einstein were ones who didn’t quite understand him. You, however, will understand it much better if you only read the next post [coming this week!].

A Working Definition of Digital Humanities

Hah! I tricked you. I don’t intend to define digital humanities here—too much blood has already been spilled over that subject. I’m sure we all remember the terrible digital humanities / humanities computing wars of 2004, now commemorated yearly under a Big Tent in the U.S., Europe, or in 2015, Australia. Most of us still suffer from ACH or ALLC (edit: I’ve been reminded the more politically correct acronym these days is EADH).

Instead, I’m here to report the findings of an extremely informal survey, with a sample size of 5, inspired by Paige Morgan’s question of what courses an undergraduate interested in digital humanities should take.

The question inspired a long discussion, worth reading through if you’re interested in digital humanities curricula. I suggested, were the undergrad interested in the heavily computational humanities (like Ted Underwood, Ben Schmidt, etc.), they might take linear algebra, statistics for social science, programming 1 & 2, web development, and a social science (like psych) research methods course, along with all their regular humanities courses. Others suggested removing some and including others, and of course all of these are pipe dreams unless our mystery undergrad is in the six-year program.

The Pipe Dream Curriculum. [via]
The discussion got me thinking: how did the digital humanists we know and love get to where they are today? Given that the basic ethos of DH is that if you want to know something, you just have to ask, I decided to ask a few well-respected DHers how someone might go about reaching expertise in their subject matter. This isn’t a question of how to define digital humanities, but of the sorts of things these digital humanists learned to get where they are today. I asked:

Dear all,

Some of you may have seen this tweet by Paige Morgan this morning, asking about what classes an undergraduate student should take hoping to pursue DH. I’ve emailed you, a random and diverse smattering of highly recognizable names associated with DH, in the hopes of getting a broader answer than we were able to generate through twitter alone.

I know you’re all extremely busy, so please excuse my unsolicited semi-mass email and no worries if you don’t get around to replying.

If you do reply, however, I’d love to get a list of undergraduate courses (traditional humanities or otherwise) that you believe was or would be instrumental to the research you do. My list, for example, would include historical methods, philosophy of science, linear algebra, statistics, programming, and web development. I’ll take the list of lists and write up a short blog post about them, because I believe it would be beneficial for many new students who are interested in pursuing DH in all its guises. I’d also welcome suggestions for other people and “schools of DH” I’m sure to have missed.

Many thanks,
Scott

The Replies

And because the people in DH are awesome and forthcoming, I got many replies back. I’ll list them first here, and then attempt some preliminary synthesis below.

Ted Underwood

The first reply was from Ted Underwood, who was afraid my question skirted a bit too close to defining DH, saying:

No matter how heavily I hedge and qualify my response (“this is just a personal list relevant to the particular kind of research I do …”), people will tend to read lists like this as tacit/covert/latent efforts to DEFINE DH — an enterprise from which I never harvest anything but thorns.

Thankfully he came back to me a bit later, saying he’d worked up the nerve to reply to my survey because he’s “coming to the conclusion that this is a vital question we can’t afford to duck, even if it’s controversial [emphasis added]”. Ted continued:

So here goes, with three provisos:

  1. I’m talking only about my own field (literary text mining), and not about the larger entity called “DH,” which may be too deeply diverse to fit into a single curriculum.
  2. A lot of this is not stuff I actually took in the classroom.
  3. I really don’t have strong opinions about how much of this should be taken as an undergrad, and what can wait for grad school. In practice, no undergrad is going to prepare themselves specifically for literary text mining (at least, I hope not). They should be aiming at some broader target.

But at some point, as preparation for literary text-mining, I’d recommend

  • A lot of courses in literary history and critical theory (you probably need a major’s worth of courses in some aspect of literary studies).
  • At least one semester of experience programming. Two semesters is better. But existing CS courses may not be the most efficient delivery system. You probably don’t need big-O notation. You do need data structures. You may not need to sweat the fine points of encapsulation. You probably do need to know about version control. I think there’s room for a “Programming for Humanists” course here.
  • Maybe one semester of linguistics (I took historical linguistics, but corpus linguistics would also work).
  • Statistics — a methods course for social scientists would be great.
  • At least one course in data mining / machine learning. This may presuppose more math than one semester of statistics will provide, so
  • Your recommendation of linear algebra is probably also a good idea.

I doubt all of that will fit in anyone’s undergrad degree. So in practice, any undergrad with courses in literary history plus a semester or two of programming experience, and perhaps statistics, would be doing very well.

So Underwood’s reply was that focusing too much in undergrad is not necessarily ideal, but were an undergraduate interested in literary text mining, they wouldn’t go astray with literary history, critical theory, a programming for humanists course, linguistics, statistics, data mining, and potentially linear algebra.

Johanna Drucker

While Underwood is pretty well known for his computational literary history, Johanna Drucker is probably most well known in our circles for her work in DH criticism. Her reply was concise and helpful:

Look at http://dh101.humanities.ucla.edu

In the best of all possible worlds, this would be followed by specialized classes in database design, scripting for the humanities, GIS/mapping, virtual worlds design, metadata/classification/culture, XML/markup, and data mining (textual corpora, image data mining, network analysis), and complex systems modeling, as well as upper division courses in disciplines (close/distant reading for literary studies, historical methods and mapping etc.).

The site she points to is an online coursebook that provides a broad overview of DH concepts, along with exercises and tutorials, and would make a good basic course on the groundwork of DH. She then lists a familiar set of computer-related and humanities courses that might be useful.

Melissa Terras

The next reply came from Melissa Terras, the director of the DH center (I’m sorry, centre) at UCL. Her response was a bit more general:

My first response is that they must be interested in Humanities research – and make the transition to being taught about Humanities, to doing research in the Humanities, and get the bug for finding out new information about a Humanities topic. It doesn’t matter what the Humanities subject is – but they must understand Humanities research questions, and what it means to undertake new research in the Humanities proper. (Doesn’t matter if their research project has no computing component, it’s about a hunger for new knowledge in this area, rather than digesting prior knowledge).

Like Underwood and Drucker, Terras is stressing that students cannot forget the humanities for the digital.

Then they must become information literate, and IT literate. We have a variety of training courses at our institution, and there is also the “European Driving License in IT” which is basic IT skills. They must get the bug for learning more about computing too. They’ll know after some basic courses whether they are a natural fit to computing.

Without the bug to do research, and the bug to understand more about computing, they are sunk for pursuing DH. These are the two main prerequisites.

Interestingly (but not surprisingly, given general DH trends), Terras frames passion about computing as more important than any particular skill.

Once they get the bug, then taking whatever courses are on offer to them at their institution – either for credit modules, or pure training courses in various IT methods, would stand them in good stead. For example, you are not going to get a degree course in Photoshop, but attending 6 hours of training in that…. plus spreadsheets, plus databases, plus XML, plus web design, would prepare you for pursuing a variety of other courses. Even if the institution doesnt offer taught DH courses, chances are they offer training in IT. They need to get their hands dirty, and to love learning more about computing, and the information environment we inhabit.

Her stress on hyper-focused courses of a few hours each is also interesting, and very much in line with our “workshop and summer school”-focused training mindset in DH.

It’s at that stage I’d be looking for a master’s program in DH, to take the learning of both IT and the humanities to a different level. Your list excludes people who have done “pure” humanities as an undergrad to pursuing DH, and actually, I think DH needs people who are, ya know, obsessed with Byzantine Sculpture in the first instance, but aren’t afraid of learning new aspects of computing without having any undergrad credit courses in it.

I’d also say that there is plenty room for people who do it the other way around – undergrads in comp sci, who then learn and get the bug for humanities research.

Terras continued that taking everything as an undergraduate would equate more to liberal arts or information science than a pure humanities degree:

As with all of these things, it depends on the make up of the individual programs. In my undergrad, I did 6 courses in my final year. If I had taken all of the ones you suggest: (historical methods, philosophy of science, linear algebra, statistics, programming, and web development) then I wouldn’t have been able to take any humanities courses! which would mean I was doing liberal arts, or information science, rather than a pure humanities degree. This will be a problem for many – just sayin’. 🙂

But yes, I think the key thing really is the *interest* and the *passion*. If your institution doesnt allow that type of courses as part of a humanities degree, you haven’t shot yourself in the foot, you just need to learn computing some other way…

Self-teaching is something that I think most people reading this blog can get behind (or commiserate with). I’m glad Terras shifted my focus away from undergraduate courses and toward how to get a DH education more generally.

John Walsh

John Walsh is known in the DH world for his work on TEI, XML, and other formal data models of humanities media. He replied:

I started undergrad as a fine arts major (graphic design) at Ohio University, before switching to English literary studies. As an art major, I was required during my freshman year to take “Comparative Arts I & II,” in which we studied mostly the formal aspects of literature, visual arts, music, and architecture. Each of the two classes occupied a ten-week “quarter” (fall winter spring summer), rather than a semester. At the time OU had a department of comparative arts, which has since become the School of Interdisciplinary Arts.

In any case, they were fascinating classes, and until you asked the question, I hadn’t really considered those courses in the context of DH, but they were definitely relevant and influential to my own work. I took these courses in the 80s, but I imagine an updated version that took into account digital media and digital representations of non-digital media would be especially useful. The study of the formal aspects of these different art forms and media and shared issues of composition and construction gave me a solid foundation for my own work constructing things to model and represent these formal characteristics and relationships.

Walsh was the first one to single out a specific humanities course as particularly beneficial to the DH agenda. It makes sense: the course appears to have crossed many boundaries, focusing particularly on formal similarities. I’d hazard that this approach is at the heart of many of the more computational and formal areas of digital humanities (but perhaps less so for those areas more aligned with new media or critical theory).

I agree web development should be in the mix somewhere, along with something like Ryan Cordell’s “Text Technologies” that would cover various representations of text/documents and a look at their production, digital and otherwise, as well as tools (text analysis, topic modeling, visualization) for doing interesting things with those texts/documents.

Otherwise, Walsh’s courses aligned with those of Underwood and Drucker.

Matt Jockers

Matt Jockers’ expertise, like Underwood’s, tends toward computational literary history and criticism. His reply was short and to the point:

The thing I see missing here are courses Linguistics and Machine Learning. Specifically courses in computational linguistics, corpus linguistics, and NLP. The later are sometimes found in the CS depts. and sometimes in linguistics, it depends. Likewise, courses in Machine Learning are sometimes found in Statistics (as at Stanford) and sometimes in CS (as at UNL).

Jockers, like Underwood, mentioned that I was missing linguistics. In the twitter conversation, Heather Froehlich pointed out the same deficiency. He and Underwood also pointed out machine learning, which is particularly useful for the sort of research they both do.

Wrapping Up

I was initially surprised by how homogeneous the answers were, given the much-touted diversity of the digital humanities. I had also asked a few others, situated more closely in the new media, alt-ac, and library camps, who for various reasons couldn’t reply in time, but even the similarity among those I did hear from was a bit surprising. Is it that DH is slowly canonizing around particular axes and methods, or are my selection criteria just woefully biased? I wouldn’t be too surprised if it were the latter.

In the end, it seems (at least according to life-paths of these particular digital humanists), the modern digital humanist should be a passionate generalist, well-versed in their particular field of humanistic inquiry, and decently-versed in a dizzying array of subjects and methods that are tied to computers in some way or another. The path is not necessarily one an undergraduate curriculum is well-suited for, but the self-motivated have many potential sources for education.

I was initially hoping to turn this short survey into a list of potential undergraduate curricula for different DH paths (much like my list of DH syllabi), but it seems we’re either not yet at that stage, or DH is particularly ill-suited to undergraduate-style curricula. I’m hoping some of you will leave comments on the areas of DH I’ve clearly missed, but from the view thus-far, there seem to be more similarities than differences.

Breaking the Ph.D. model using pretty pictures

Earlier today, Heather Froehlich shared what’s at this point become a canonical illustration among Ph.D. students: “The Illustrated guide to a Ph.D.” The illustrator, Matt Might, describes the sum of human knowledge as a circle. As a child, you sit at the center of the circle, looking out in all directions.

Eventually, he describes, you get various layers of education, until by the end of your bachelor’s degree you’ve begun focusing on a specialty, concentrating your knowledge in one direction.

A master’s degree further deepens your focus, extending you toward an edge, and the process of pursuing a Ph.D., with all the requisite reading, brings you to a tiny portion of the boundary of human knowledge.

You push and push at the boundary until one day you finally poke through, pushing that tiny portion of the circle of knowledge just a wee bit further than it was. That act of pushing through is a Ph.D.

It’s an uplifting way of looking at the Ph.D. process, inspiring that dual feeling of insignificance and importance that staring at the Hubble Ultra-Deep Field tends to bring about. It also exemplifies, in my mind, one of the broken aspects of the modern Ph.D. But while we’re on the subject of the Hubble Ultra-Deep Field, let me digress momentarily about stars.

Quite a while before you or I were born, Great Thinkers with Big Beards (I hear even the Great Women had them back then) also suggested we sat at the center of a giant circle, looking outwards. The entire universe, or in those days, the cosmos (Greek: κόσμος, “order”), was a series of perfect layered spheres, with us in the middle, and the stars embedded in the very top. The stars were either gems fixed to the last sphere, or they were little holes poked through it that let the light from heaven shine through.

As I see it, if we connect the celestial spheres theory to “The Illustrated Guide to a Ph.D.”, we’d arrive at the inescapable conclusion that every star in the sky is another dissertation, another hole poked letting the light of heaven shine through. And yeah, it takes a very prescriptive view of knowledge and the universe that either you or I could argue with, but for this post we can let it slide because it’s beautiful, isn’t it? If you’re a Ph.D. student, don’t you want to be able to do this?

The problem is I don’t actually want to do this, and I imagine a lot of other people don’t want to do this, because there are already so many goddamn stars. Stars are nice. They’re pretty, how they twinkle up there in space, trillions of miles away from one another. That’s how being a Ph.D. student feels sometimes, too: there’s your research, my research, and a gap between us that can reach from Alpha Centauri and back again. Really, just astronomically far away.

It shouldn’t have to be this way. Right now a Ph.D. is about finding or doing something that’s new, in a really deep and narrow way. It’s about pricking the fabric of the spheres to make a new star. In the end, you’ll know more about less than anyone else in the world. But there’s something deeply unsettling about students being trained to ignore the forest for the trees. In an increasingly connected world, the universe of knowledge about it seems to be ever-fracturing. Very few are being trained to stand back a bit and try to find patterns in the stars. To draw constellations.

I should know. I’ve been trying to write a dissertation on something huge, and the advice I’ve gotten from almost every professor I’ve encountered is that I’ve got to scale it down. Focus more. I can’t come up with something new about everything, so I’ve got to do it about one thing, and do it well. And that’s good advice, I know! If a lot of people weren’t doing that a lot of the time, we’d all just be running around in circles and not doing cool things like going to the moon or watching animated pictures of cats on the internet.

But we also need to stand back and take stock, to connect things, and right now there are institutional barriers in place making that really difficult. My advisor, who stands back and connects things for a living (like the map of science below), gives me the same prudent advice as everyone else: focus more. It’s practical advice. For all that universities celebrate interdisciplinarity, in the end you still need to get hired by a department, and if you don’t fit neatly into their disciplinary niche, you’re not likely to make it.
A map of science.
My request is simple. If you’re responsible for hiring researchers, or promoting them, or in charge of a department or (!) a university, make it easier to be interdisciplinary. Continue hiring people who make new stars, but also welcome the sort of people who want to connect them. There certainly are a lot of stars out there, and it’s getting harder and harder to see what they have in common, and to connect them to what we do every day. New things are great, but connecting old things in new ways is also great. Sometimes we need to think wider, not deeper.

Improving the Journal of Digital Humanities

Twitter and the digital humanities blogosphere has been abuzz recently over an ill-fated special issue of the Journal of Digital Humanities (JDH) on Postcolonial Digital Humanities. I won’t get too much into what happened and why, not because I don’t think it’s important, but because I respect both parties too much and feel I am too close to the story to provide an unbiased opinion. Summarizing, the guest editors felt they were treated poorly, in part because of the nature of their content, and in part because of the way the JDH handles its publications.

I wrote earlier on twitter that I no longer want to be involved in the conversation, by which I meant, I no longer want to be involved in the conversation about what happened and why. I do want to be involved in a discussion on how to get the JDH to move beyond the issues of bias, poor communication, poor planning, and microaggression, whether or not any or all of those existed in this most recent issue. As James O’Sullivan wrote in a comment, “as long as there is doubt, this will be an unfortunate consequence.”

Journal of Digital Humanities

The JDH is an interesting publication, operating in part under the catch-the-good model of seeing what’s already out there and getting discussed, and aggregating it all into a quarterly journal. In some cases, that means re-purposing pre-existing videos and blog posts and social media conversations into journal “articles.” In others, it means soliciting original reviews or works that fit with the theme of a current important issue in DH. Some articles are reviewed barely at all – especially the videos – and some are heavily reviewed. The structure of the journal itself, over its five issues thus-far, has changed drastically to fit the topic and the experimental whims of editors and guest editors.

The issue that Elijah Meeks and I guest edited changed in format at least three times in the month or so we had to solidify it. It’s fast-paced, not always organized, and generally churns out good scholarship that seems to be cited heavily on blogs and in DH syllabi, but not yet so much in traditional press articles or books. The flexibility, I think, is part of its charm and experimental nature, but as this recent set of problems shows, it is not without its major downsides. The editors, guest editors, and invited authors are rarely certain of what the end product will look like, and if there is the slightest miscommunication, this uncertainty can lead to disaster. The variable nature of the editing process also opens the door for bias of various sorts, and because there is not a clear plan from the beginning, that bias (and the fear of bias) is hard to guard against. These are issues that need to be solved.

Roopika Risam, Matt Burton, and I, among others, have all weighed in on the best way to move forward, and I’m drawing on these previous comments for this plan. It’s not without its holes and problems, and I am hoping there will be comments to improve the proposed process, but hopefully something like what I’m about to propose can let the JDH retain its flexibility while preventing further controversies of this particular variety.

  • Create a definitive set of guidelines and mission statement that is distributed to guest editors and authors before the process of publication begins. These guidelines do not need to set the publication process in stone, but can elucidate the roles of each individual and make clear the experimental nature of the JDH. This document cannot be deviated from within an issue publication cycle, but can be amended yearly. Perhaps, as with the open intent of the journal, part of this process can be crowdsourced from the previous year’s editors-at-large of DHNow.
  • Have a week at the beginning of each issue planning phase where authors (if they’ve been chosen yet), guest editors, and editors discuss what particular format the forthcoming issue will take, how it will be reviewed, and so forth. This is formalized into a binding document and will not be changed. The editorial staff has final say, but if the guest editors or authors do not like the final document, they have ample opportunity to leave.
  • Change the publication rate from quarterly to thrice-yearly. DH changes quickly, and the journal shouldn’t be any slower than that, but quarterly seems to be a bit too tight for this process to work smoothly – especially with the proposed week-long committee session to figure out how the issue will be run.
  • Make the process of picking special issue topics more open. I know the special issue I worked on came about by Elijah asking the JDH editors if they’d be interested in a topic modeling issue, and after (I imagine) some internal discussion, they agreed. The dhpoco special issue may have had a similar history. Even a public statement of “these people came to us, and this is why we thought the topic was relevant” would likely go a long way in fostering trust in the community.
  • Make the process of picking articles and authors more open; this might be the job of special issue guest editors, as Elijah and I were the ones who picked most of the content. Everyone has their part to play. What’s clear is there is a lot of confusion right now about how it works; some on Twitter have pointed out that, until recently, they’d assumed all articles came from the DHNow filter. Making content choice more clear in an introductory editorial would be useful.

Obviously this is not a cure for all ills, but hopefully it’s good ground to start on the path forward. If the JDH takes this opportunity to reform some of their policies, my hope is that it will be seen as an olive branch to the community, ensuring to the best of their ability that there will be no question of whether bias is taking place, implicit or otherwise. Further suggestions in the comments are welcome.

Addendum: In private communication with Matt Burton, he and I realized that the ‘special issue’ and ‘guest editor’ role is not actually one that seems to be aligned with the initial intent of the JDH, which seemed instead to be about reflecting the DH discourse from the previous quarter. Perhaps a movement away from special issues, or having a separate associated entity for special issues with its own set of rules, would be another potential path forward.

On MOOCs

Nobody has said so to my face, but sometimes I’m scared that some of my readers secretly think I’m single-handedly assisting in the downfall of academia as we know it. You see, I was the associate instructor of an information visualization MOOC this past semester, and next Spring I’ll be putting together my own MOOC on information visualization for the digital humanities. It’s an odd position to be in, when much of the anti-DH rhetoric is against MOOCs while so few DHers actually support them (and it seems most vehemently denounce them). I’ve occasionally wondered if I’m the mostly-fictional strawman the anti-DH crowd is actually railing against. I don’t think I am, and I think it’s well past time I posted my rationale for why.

This post itself was prompted by two others: one by Adam Crymble asking if The Programming Historian is a MOOC, and the other by Jonathan Rees on why, even if you say your MOOC is going to be different, it probably won’t be. That last post was referenced by Andrew Goldstone.

With that in mind, let me preface by saying I’m a well-meaning MOOCer, and I think that if you match good intentions with reasonable actions, MOOCs can actually be used for good. How you build, deploy, and use a MOOC can go a long way, and it seems a lot of the fear behind them assumes there is one and only one inevitable path for them to go down which would ultimately result in the loss of academic jobs and a decrease in education standards.

Let’s begin with the oft-quoted Cathy Davidson: “If we can be replaced by a computer screen, we should be.” I don’t believe Davidson is advocating for what Rees accuses her of in the above blog post. One prevailing argument against MOOCs is that the professors are distant, the interactivity is minimal-to-non-existent, and the overall student experience is one of weak detachment. I wonder, though, how many thousand-large undergraduate lectures offer better experiences; many do, I’m sure, but many also do not. In those cases, it seems a more engaging lecturer, at least, might be warranted. I doubt many-if-any MOOC teachers believe there are any other situations which could warrant the replacement of a university course with a MOOC beyond those where the student experience is already so abysmal that anything might help.

The question then arises: in those few situations where MOOCs might be better for enrolled students, what havoc might they wreak on already-worsening faculty job opportunities? The toll on teaching in the face of automation might match the toll on skilled craftsmen in the face of the industrial and eventually mechanical revolution. If you are angered by replacing laborers with machines when the latter are just (or nearly) as good as the former, at a fraction of the cost, then you’ll likely also believe it is unethical for MOOCs to replace giant undergrad lectures (which, let’s face it, are often closer to unskilled than to skilled labor in this metaphor).

Rees echoes this fear of automation on the student’s end, suggesting “forcing students into MOOCs as a last resort is like automating your wedding or the birth of your first child. You’re taking something that ought to depend upon the glorious unpredictability of human interaction and turning it into mass-produced, impersonal, disposable schlock.” The fear is echoed as well by Adam Crymble in his Programming Historian piece, when he says “what sets a MOOC apart from a classroom-based course is a belief that the tutor-tutee relationship can be depersonalized and made redundant. MOOCs replace this relationship with a series of steps. If you learn the steps in the right order and engage actively with the material you learn what you need to know and who needs teacher?”

MOOCs can happen to you! via cogdogblog.

The problem is that this entire dialogue rests on an assumption shared by Crymble and others: that those who support MOOCs do so because, deep down, they believe “If you learn the steps in the right order and engage actively with the material you learn what you need to know and who needs teacher?” It is this set of assumptions I would like to push against: the idea that all MOOCs must inevitably lead to automated teaching, regardless of intentions, and that they exist as classroom replacements. I argue that, designed and used correctly, MOOCs can augment classrooms, and can be built in such a way that they can no more replace classrooms than massively distributed textbooks can.

When Katy Börner and our team designed and built the Information Visualization MOOC, we did so using Google’s open source Course Builder, with the intention of opening knowledge to whoever wanted to learn it, regardless of whether they could afford to enroll in one of the few universities that offer this sort of course. Each lecture was a recording of a usual lecture for that class, cut up into bite-sized chunks, and included tutorials on how to use the software. We ran the MOOC concurrently with our graduate course on the same topic, using the MOOC as a sort of video textbook that (I hope) was more concise and engaging than most information visualization textbooks out there, and (importantly) completely free to the students. Students watched pre-recorded lectures at home and then we discussed the lessons and did hands-on training during class time, in the style of flipped teaching.
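
To give a concrete sense of how a flipped unit fit together, here’s a rough sketch in Python of how a week’s material might be laid out: a handful of short video chunks watched at home, followed by a hands-on session in class. Everything in it, from the topic names to the URLs, is invented for illustration; this is not Course Builder’s actual configuration format, just the shape of the idea.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class VideoChunk:
        title: str        # short, bite-sized lecture segment
        url: str          # hypothetical link to the recording
        minutes: int      # roughly 5-15 minutes each

    @dataclass
    class Unit:
        week: int
        topic: str
        at_home: List[VideoChunk] = field(default_factory=list)  # watched before class
        in_class: str = ""                                        # hands-on session plan

    # One invented week, purely for illustration
    unit = Unit(
        week=3,
        topic="Visualizing networks",
        at_home=[
            VideoChunk("Theory: network data and layouts", "https://example.org/ivmooc/3-1", 10),
            VideoChunk("Hands-on: loading and plotting a network", "https://example.org/ivmooc/3-2", 12),
        ],
        in_class="Discuss the recorded lessons, then work through a small network analysis together.",
    )

    print(f"Week {unit.week}: {unit.topic} ({len(unit.at_home)} videos to watch at home)")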

For those not enrolled in the physical course, we opened up the lectures to anyone who wanted to join in, and created a series of tests and assignments that required students to work together in small teams of 4-5 on real-world client data if they wanted to get credit for the course. Many just wanted to learn and didn’t care about credit. Some still took the exams because they wanted to know how well they’d learned the material, even if they weren’t going for the credit. Some did only the client projects, because they thought it would give them good real-world experience, but didn’t take the tests or go for the credit. The “credit,” by the way, was a badge from Mozilla Open Badges, and we designed the badges to be particularly difficult to achieve because we wanted them to mean something. We also hand-graded the client projects.
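
For the technically curious: a hosted Open Badges assertion of that era was just a small JSON document pointing at a badge class and a verification URL, with the recipient’s email optionally hashed. Here’s a rough Python sketch of the general shape; the identifiers, URLs, salt, and dates below are invented for illustration and are not the real IVMOOC badge’s details.

    import hashlib
    import json

    def hashed_identity(email: str, salt: str) -> str:
        # Open Badges lets issuers hash the recipient's email so the public
        # assertion doesn't expose it directly.
        return "sha256$" + hashlib.sha256((email + salt).encode("utf-8")).hexdigest()

    # Hypothetical hosted assertion, roughly in the shape of the Open Badges 1.x spec
    assertion = {
        "uid": "ivmooc-2013-0001",
        "recipient": {
            "type": "email",
            "hashed": True,
            "salt": "example-salt",
            "identity": hashed_identity("student@example.com", "example-salt"),
        },
        "badge": "https://example.org/ivmooc/badge-class.json",        # the BadgeClass description
        "verify": {
            "type": "hosted",
            "url": "https://example.org/ivmooc/assertions/0001.json",  # where this assertion lives
        },
        "issuedOn": "2013-05-10",
        "evidence": "https://example.org/ivmooc/projects/team-07",     # e.g. the hand-graded client project
    }

    print(json.dumps(assertion, indent=2))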

The IVMOOC Badge.

The thing is, at no time did we ever equate the MOOC with a graduate course, or suggest that it could be taken for replacement credit instead of a real course. And, by building the course in Google’s Course Builder and hosting it ourselves, we retain complete control over it; universities can’t take it and change it around as they see fit in order to offer it for credit. I suppose it’s possible that some university out there might allow students to waive a methodology credit if they earn our badge, but I fail to see how that would be any different from universities offering course waivers to students who read a textbook on their own and take some standardized test afterward; it’s done, but not often.

In short, we offer the MOOC as a free and open textbook, not as a classroom replacement. Within the classroom, we use it as a tool for augmenting instruction. For those who choose to do the assignments, and perform well on them with their student teams, we acknowledge their good work with a badge rather than a university credit. The fear that MOOCs will necessarily automate teachers away is no better founded than the idea that textbooks and standardized tests would; further, if administrators choose to use MOOCs for that purpose, they are no more justified in doing so than they would be in replacing teachers with textbooks. That they still might is of course a frightening prospect, and something we need to guard against, but it should no more be blamed on MOOC instructors than it would be on textbook authors in the alternative scenario. It doesn’t seem we’re any different from what Adam Crymble described The Programming Historian to be (recall: definitely not a MOOC).

We’re making it easier for people to teach themselves interesting and useful things. Whether administrators use that for good or ill is a separate issue. Whether more open and free training trumps our need to employ all the wandering academics out there is a separate issue – as is whether or not that dichotomy is even a valid one. As it stands now, though, I’m proud of the work we’ve done on the IVMOOC, I’m proud of the students of the physical course, and I’m especially proud of all the amazing students around the world who came out of the MOOC producing beautiful visualization projects, and are better prepared for life in a data-rich world.