Do historians need scientists?

[edit: I’m realizing I didn’t make it clear in this post that I’m aware many historians consider themselves scientists, and that there’s plenty of scientific historical archaeology and anthropology. That’s exactly what I’m advocating there be more of, and more varied.]

Short Answer: Yes.

Less Snarky Answer: Historians need to be open to fresh methods, fresh perspectives, and fresh blood. Maybe not that last one, I guess, as it might invite vampires. Okay, I suppose this answer wasn’t actually less snarky.

Long Answer

The long answer is that historians don’t necessarily need scientists, but that we do need fresh scientific methods. Perhaps as an accident of our association with the ill-defined “humanities”, or as a result of our being placed in an entirely different culture (see: C.P. Snow), most historians seem fairly content with methods rooted in thinking about text and other archival evidence. This isn’t true of all historians, of course – there are economic historians who use statistics, historians of science who recreate old scientific experiments, classical historians who augment their research with archaeological findings, archival historians who use advanced ink analysis,  and so forth. But it wouldn’t be stretching the truth to say that, for the most part, historiography is the practice of thinking cleverly about words to make more words.

I’ll argue here that our reliance on traditional methods (or maybe more accurately, our odd habit of rarely discussing method) is crippling historiography, and is making it increasingly likely that the most interesting and innovative historical work will come from non-historians. Sometimes these studies are ill-informed, especially when the authors decide not to collaborate with historians who know the subject, but claiming that a few ignorant statements about history negate the impact of these new insights is an exercise in pedantry.

In defending the humanities, we like to say that scientists and technologists with liberal arts backgrounds are more well-rounded, better citizens of the world, more able to contextualize their work. Non-humanists benefit from a liberal arts education in pretty much all the ways that are impossible to quantify (and thus, extremely difficult to defend against budget cuts). We argue this in the interest of rounding a person’s knowledge, to make them aware of their past, of their place in a society with staggering power imbalances and systemic biases.

Humanities departments should take a page from their own books. Sure, a few general ed requirements force some basic science and math… but I got an undergraduate history degree at a nice university, and I’m well aware how little STEM I actually needed to get through it. Our departments are just as guilty of narrowness as those of our STEM colleagues, and often because of it, we rely on applied mathematicians, statistical physicists, chemists, or computer scientists to do our innovative work for (or sometimes, thankfully, with) us.

Of course, there’s still lots of innovative work to be done from a textual perspective. I’m not downplaying that. Not everyone needs to use crazy physics/chemistry/computer science/etc. methods. But there’s a lot of low hanging fruit at the intersection of historiography and the natural sciences, and we’re not doing a great job of plucking it.

The story below is illustrative.

Gutenberg

Last night, Blaise Agüera y Arcas presented his research on Gutenberg to a packed house at our rare books library. He’s responsible for a lot of the cool things that have come out of Microsoft in the last few years, and just got a job at Google, where presumably he will continue to make cool things. Blaise has degrees in physics and applied mathematics. And, a decade ago, Blaise and historian/librarian Paul Needham sent ripples through the History of the Book community by showing that Gutenberg’s press did not work at all the way people expected.

It was generally assumed that Gutenberg employed a method called punchcutting in order to create a standard font. A letter carved into a metal rod (a “punch”) would be driven into a softer metal (a “matrix”) in order to create a mold. The mold would be filled with liquid metal which hardened to form a small block of a single letter (a “type”), which would then be loaded onto the press next to other letters, inked, and then impressed onto a page. Because the mold was metal, many duplicate “types” could be made of the same letter, thus allowing many uses of the same letter to appear identical on a single pressed page.

Punch matrix system. [via]
Type to be pressed. [via]
This process is what allowed all the duplicate letters to appear identical in Gutenberg’s published books. Except, of course, careful historians of early print noticed that letters weren’t, in fact, identical. In the 1980s, Paul Needham and a colleague attempted to produce an inventory of all the different versions of letters Gutenberg used, but they stopped after frequently finding 10 or more obviously distinct versions of the same letter.

Needham’s inventory of Gutenberg type. [via]
This was perplexing, but the subject was bracketed away for a while, until Blaise Agüera y Arcas came to Princeton and decided to work with Needham on the problem. Using extremely high-resolution imaging techniques, Blaise noted that there were in fact hundreds of versions of every letter. Not only that, there were actually variations and regularities in the smaller elements that made up letters. For example, an “n” was formed by two adjacent vertical lines, but occasionally the two vertical lines seem to have flipped places entirely. The extremely basic letter “i” itself had many variations, but within those variations, many odd self-similarities.

Variations in the letter “i” in Gutenberg’s type. [via]
Historians had, until this analysis, assumed most letter variations were due to wear of the type blocks. This analysis blew that hypothesis out of the water. These “i”s were clearly not all made in the same mold; but then, how had they been made? To answer this, they looked even closer at the individual letters.

 

Close up of Gutenberg letters, with light shining through page. [via]
It’s difficult to see at first glance, but they found something a bit surprising. The letters appeared to be formed of overlapping smaller parts: a vertical line, a diagonal box, and so forth. The figure below shows a good example of this. The glyphs on the bottom have a stem dipping below the bottom horizontal line, while the glyphs at the top do not.

Abbreviation of ‘per’. [via]
The conclusion Needham and Agüera y Arcas drew, eventually, was that the punchcutting method must not have been used for Gutenberg’s early material. Instead, a set of carved “strokes” were pushed into hard sand or soft clay, configured such that the strokes would align to form various letters, not unlike the formation of cuneiform. This mold would then be used to cast letters, creating the blocks we recognize from movable type. The catch is that this soft clay could only cast letters a few times before it became unusable and would need to be recreated. As Gutenberg needed multiple instances of individual letters per page, many of those letters would be cast from slightly different soft molds.

Low-Hanging Fruit

At the end of his talk, Blaise made an offhand comment: how is it that historians/bibliographers/librarians have been looking at these Gutenbergs for so long, discussing the triumph of their identical characters, and not noticed that the characters are anything but uniform? Or, of those who had noticed it, why hadn’t they raised any red flags?

The insights they produced weren’t staggering feats of technology. He used a nice camera, a light shining through the pages of an old manuscript, and a few simple image recognition and clustering algorithms. The clustering part could even have been done by hand, and actually had been, by Paul Needham. And yes, it’s true, everything is obvious in hindsight, but there were a lot of eyes on these bibles, and odds are if some of them had been historians who were trained in these techniques, this insight could have come sooner. Every year students do final projects and theses and dissertations, but what percent of those use techniques from outside historiography?
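To give a concrete (and entirely hypothetical) flavor of the kind of analysis involved: the sketch below clusters images of a single character into groups of near-identical shapes, which is roughly the sort of grouping Needham did by hand. It is not Agüera y Arcas’s actual pipeline; the folder of same-sized grayscale glyph crops and the choice of ten clusters are assumptions for illustration.

```python
# Hypothetical sketch: group same-sized grayscale crops of one letter into
# clusters of near-identical shapes. Not the original analysis pipeline.
from pathlib import Path

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

paths = sorted(Path("glyphs").glob("i_*.png"))   # assumed folder of crops of "i"
X = np.array([np.asarray(Image.open(p).convert("L"), dtype=float).ravel()
              for p in paths])                   # one flattened image per row

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
for path, label in zip(paths, labels):
    print(path.name, label)                      # which cluster each glyph landed in
```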

In short, there are a lot of very basic assumptions we make about the past that could probably be updated significantly if we had the right skillset, or knew how to collaborate with those who did. I think people like William Newman, who performs Newton’s alchemical experiments, are on the right track, as are Shawn Graham, who reanimates the trade networks of ancient Rome using agent-based simulations, and Devon Elliott, who creates computational and physical models of objects from the history of stage magic. Elliott’s models have shown that certain magic tricks couldn’t possibly have worked as they were described.

The challenge is how to encourage this willingness to reach outside traditional historiographic methods to learn about the past. Changing curricula to be more flexible is one way, but that is a slow and institutionally difficult process. Perhaps faculty could assign group projects to students taking their gen-ed history courses, encouraging disciplinary mixes and non-traditional methods. It’s an open question, and not an easy one, but it’s one we need to tackle.

Bridging Token and Type

There’s an oft-spoken and somewhat strawman tale of how the digital humanities is bridging C.P. Snow’s “Two Cultures” divide between the sciences and the humanities. This story is sometimes true (it’s fun putting together Ocean’s Eleven-esque teams comprising every discipline needed to get the job done) and sometimes false (plenty of people on either side still view the other with skepticism), but as a historian of science, I don’t find the divide all that interesting. As Snow’s title suggests, this divide is first and foremost cultural. There’s another overlapping divide, a bit more epistemological, methodological, and ontological, which I’ll explore here. It’s the nomothetic (type) / idiographic (token) divide, and I’ll argue that not only are its barriers falling, but also that the distinction itself is becoming less relevant.

Nomothetic (Greek for “establishing general laws”-ish) and Idiographic (Greek for “pertaining to the individual thing”-ish) approaches to knowledge have often split the sciences and the humanities. I’ll offload the hard work onto Wikipedia:

Nomothetic is based on what Kant described as a tendency to generalize, and is typical for the natural sciences. It describes the effort to derive laws that explain objective phenomena in general.

Idiographic is based on what Kant described as a tendency to specify, and is typical for the humanities. It describes the effort to understand the meaning of contingent, unique, and often subjective phenomena.

These words are long and annoying to keep retyping, and so in the longstanding humanistic tradition of using new words for words which already exist, henceforth I shall refer to nomothetic as type and idiographic as token. 1 I use these because a lot of my digital humanities readers will be familiar with their use in text mining. If you counted the number of unique words in a text, you’d be counting the number of types. If you counted the number of total words in a text, you’d be counting the number of tokens, because each token (word) is an individual instance of a type. You can think of a type as the platonic ideal of the word (notice the word typical?), floating out there in the ether, and every time it’s actually used, it’s one specific token of that general type.

The Token/Type Distinction
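For readers who haven’t run into this in text mining, the distinction is a one-liner in code:

```python
# Tokens are every word occurrence; types are the distinct words.
text = "the cat sat on the mat because the mat was warm"
tokens = text.split()
types = set(tokens)

print(len(tokens))  # 11 tokens
print(len(types))   # 8 types: the, cat, sat, on, mat, because, was, warm
```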

Usually the natural and social sciences look for general principles or causal laws, of which the phenomena they observe are specific instances. A social scientist might note that every time a student buys a $500 textbook, they actively seek a publisher to punch, but when they purchase $20 textbooks, no such punching occurs. This leads to the discovery of a new law linking student violence with textbook prices. It’s worth noting that these laws can be, and often are, nuanced and carefully crafted, with an awareness that they are neither wholly deterministic nor ironclad.

[via]
The humanities (or at least history, which I’m more familiar with) are more interested in what happened than in what tends to happen. Without a doubt there are general theories involved, just as in the social sciences there are specific instances, but the intent is most often to flesh out details and create a particular, internally consistent narrative. They look for tokens where the social scientists look for types. Another way to look at it is that the humanist wants to know what makes a thing unique, and the social scientist wants to know what makes a thing comparable.

It’s been noted that these are fundamentally different goals. Indeed, how can you, in the same research, articulate the subjective contingency of an event while simultaneously using it to formulate some general law, applicable in all such cases? Rather than answer that question, it’s worth taking time to survey some recent research.

A recent digital humanities panel at MLA elicited responses from Ted Underwood and Haun Saussy, to which this post is in part itself a response. One of the papers at the panel, by Long and So, explored the extent to which haiku-esque poetry preceded what is commonly considered the beginning of haiku in America by about 20 years. They do this by teaching the computer the form of the haiku, and having it algorithmically explore earlier poetry looking for similarities. Saussy comments on this work:

[…] macroanalysis leads us to reconceive one of our founding distinctions, that between the individual work and the generality to which it belongs, the nation, context, period or movement. We differentiate ourselves from our social-science colleagues in that we are primarily interested in individual cases, not general trends. But given enough data, the individual appears as a correlation among multiple generalities.

One of the significant difficulties faced by digital humanists, and a driving force behind critics like Johanna Drucker, is the fundamental opposition between the traditional humanistic value of stressing subjectivity, uniqueness, and contingency, and the formal computational necessity of filling a database with hard decisions. A database, after all, requires you to make a series of binary choices in well-defined categories: is it or isn’t it an example of haiku? Is the author a man or a woman? Is there an author or isn’t there an author?

Underwood addresses this difficulty in his response:

Though we aspire to subtlety, in practice it’s hard to move from individual instances to groups without constructing something like the sovereign in the frontispiece for Hobbes’ Leviathan – a homogenous collection of instances composing a giant body with clear edges.

But he goes on to suggest that the initial constraint of the digital media may not be as difficult to overcome as it appears. Computers may even offer us a way to move beyond the categories we humanists use, like genre or period.

Aren’t computers all about “binary logic”? If I tell my computer that this poem both is and is not a haiku, won’t it probably start to sputter and emit smoke?

Well, maybe not. And actually I think this is a point that should be obvious but just happens to fall in a cultural blind spot right now. The whole point of quantification is to get beyond binary categories — to grapple with questions of degree that aren’t well-represented as yes-or-no questions. Classification algorithms, for instance, are actually very good at shades of gray; they can express predictions as degrees of probability and assign the same text different degrees of membership in as many overlapping categories as you like.
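Underwood’s point about degrees of membership is concrete enough to sketch. The example below is a toy, with an invented corpus and invented labels (and it is not Long and So’s model); it just shows a classifier reporting a degree of haiku-ness for a new poem rather than a yes-or-no verdict.

```python
# Toy illustration (invented corpus and labels): a classifier assigns a poem
# a probability of "haiku-ness" rather than a binary yes or no.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

poems = [
    "old pond / a frog leaps in / the sound of water",
    "two roads diverged in a yellow wood",
    "an aging willow / its image unsteady / in the flowing stream",
    "shall I compare thee to a summer's day",
]
is_haiku = [1, 0, 1, 0]   # toy labels for illustration only

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(poems)
model = LogisticRegression().fit(X, is_haiku)

new_poem = vectorizer.transform(
    ["the light of a candle / moved to another candle / spring twilight"])
print(model.predict_proba(new_poem))   # [P(not haiku), P(haiku)]: a degree, not a verdict
```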

Here we begin to see how the questions asked of digital humanists (on the one side; computational social scientists are tackling these same problems) are forcing us to reconsider the divide between the general and the specific, as well as the meanings of categories and typologies we have traditionally taken for granted. However, this does not yet cut across the token/type divide: this has gotten us to the macro scale, but it does not address general principles or laws that might govern specific instances. Historical laws are a murky subject, prone to inducing fits of anti-deterministic rage. Complex Systems Science and the lessons we learn from Agent-Based Modeling, I think, offer us a way past that dilemma, but more on that later.

For now, let’s talk about influence. Or diffusion. Or intertextuality. 2 Matthew Jockers has been exploring these concepts, most recently in his book Macroanalysis. The undercurrent of his research (I think I’ve heard him call it his “dangerous idea”) is a thread of almost-determinism. It is the simple idea that an author’s environment influences her writing in profound and easy to measure ways. On its surface it seems fairly innocuous, but it’s tied into a decades-long argument about the role of choice, subjectivity, creativity, contingency, and determinism. One word that people have used to get around the debate is affordances, and it’s as good a word as any to invoke here. What Jockers has found is a set of environmental conditions which afford certain writing styles and subject matters to an author. It’s not that authors are predetermined to write certain things at certain times, but that a series of factors combine to make the conditions ripe for certain writing styles, genres, etc., and not for others. The history of science analog would be the idea that, had Einstein never existed, relativity and quantum physics would still have come about; perhaps not as quickly, and perhaps not from the same person or in the same form, but they were ideas whose time had come. The environment was primed for their eventual existence. 3

An example of shape affording certain actions by constraining possibilities and influencing people. [via]
It is here we see the digital humanities battling with the token/type distinction, and finding that distinction less relevant to its self-identification. It is no longer a question of whether one can impose or generalize laws on specific instances, because the axes of interest have changed. More and more, especially under the influence of new macroanalytic methodologies, we find that the specific and the general contextualize and augment each other.

The computational social sciences are converging on a similar shift. Jon Kleinberg likes to compare some old work by Stanley Milgram 4, where he had people draw maps of cities from memory, with digital city reconstruction projects which attempt to bridge the subjective and objective experiences of cities. The result in both cases is an attempt at something new: not quite objective, not quite subjective, and not quite intersubjective. It is a representation of collective individual experiences which in its whole has meaning, but also can be used to contextualize the specific. That these types of observations can often lead to shockingly accurate predictive “laws” isn’t really the point; they’re accidental results of an attempt to understand unique and contingent experiences at a grand scale. 5

Manhattan. Dots represent where people have taken pictures; blue dots are by locals, red by tourists, and yellow are uncertain. [via Eric Fischer]
It is no surprise that the token/type divide is woven into the subjective/objective divide. However, as Daston and Galison have pointed out, objectivity is not an ahistorical category. 6 It has a history, is only positively defined in relation to subjectivity, and neither were particularly useful concepts before the 19th century.

I would argue, as well, that the nomothetic and idiographic divide is one which is outliving its historical usefulness. Work from both the digital humanities and the computational social sciences is converging to a point where the objective and the subjective can peaceably coexist, where contingent experiences can be placed alongside general predictive principles without any cognitive dissonance, under a framework that allows both deterministic and creative elements. It is not that purely nomothetic or purely idiographic research will no longer exist, but that they no longer represent a binary category which can usefully differentiate research agendas. We still have Snow’s primary cultural distinctions, of course, and a bevy of disciplinary differences, but it will be interesting to see where this shift in axes takes us.

Notes:

  1. I am not the first to do this. Aviezer Tucker (2012) has a great chapter in The Oxford Handbook of Philosophy of Social Science, “Sciences of Historical Tokens and Theoretical Types: History and the Social Sciences” which introduces and historicizes the vocabulary nicely.
  2. Underwood’s post raises these points, as well.
  3. This has sometimes been referred to as environmental possibilism.
  4. Milgram, Stanley. 1976. “Psychological Maps of Paris.” In Environmental Psychology: People and Their Physical Settings, edited by Proshansky, Ittelson, and Rivlin, 104–124. New York.

    ———. 1982. “Cities as Social Representations.” In Social Representations, edited by R. Farr and S. Moscovici, 289–309.

  5. If you’re interested in more thoughts on this subject specifically, I wrote a bit about it in relation to single-authorship in the humanities here.
  6. Daston, Lorraine, and Peter Galison. 2007. Objectivity. New York, NY: Zone Books.

Submissions to Digital Humanities 2014

Submissions for the 2014 Digital Humanities conference just closed. It’ll be in Switzerland this time around, which unfortunately means I won’t be able to make it, but I’ll be eagerly following along from afar. Like last year, reviewers are allowed to preview the submitted abstracts. Also like last year, I’m going to be a reviewer, which means I’ll have the opportunity to revisit the submissions to DH2013 to see how the submissions differed this time around. No doubt when the reviews are in and the accepted articles are revealed, I’ll also revisit my analysis of DH conference acceptances.

To start with, the conference organizers received a record number of submissions this year: 589. Last year’s Nebraska conference only received 348 submissions. The general scope of the submissions hasn’t changed much; authors were still supposed to tag their submissions using a controlled vocabulary of 95 topics, and were also allowed to submit keywords of their own making. Like last year, authors could submit long papers, short papers, panels, or posters, but unlike last year, multilingual submissions were encouraged (English, French, German, Italian, or Spanish). [edit: Bethany Nowviskie, patient awesome person that she is, has noticed yet another mistake I’ve made in this series of posts. Apparently last year they also welcomed multilingual submissions, and it is standard practice.]

Digital Humanities is known for its collaborative nature, and not much has changed in that respect between 2013 and 2014 (Figure 1). Submissions had, on average, between two and three authors, with 60% of submissions in both years having at least two authors. This year, a few fewer papers have single authors, and a few more have two authors, but the difference is too small to be attributable to anything but noise.

Figure 1. Number of authors per paper.

The distribution of topics being written about has changed mildly, though rarely in extreme ways. Any changes visible should also be taken with a grain of salt, because a trend over a single year is hardly statistically robust to small changes, say, in the location of the event.

The grey bars in Figure 2 show what percentage of DH2014 submissions are tagged with a certain topic, and the red dotted outlines show what the percentages were in 2013. The upward trends to note this year are text analysis, historical studies, cultural studies, semantic analysis, and corpora and corpus activities. Text analysis was tagged to 15% of submissions in 2013 and is now tagged to 20% of submissions, or one out of every five. Corpus analysis similarly bumped from 9% to 13%. Clearly this is an important pillar of modern DH.

Figure 2. Topics from DH2014 ordered by the percent of submissions which fall in that category. The red dotted outlines represent the percentage from DH2013.

I’ve pointed out before that History is secondary compared to Literary Studies in DH (although Ted Underwood has convincingly argued, using Ben Schmidt’s data, that the numbers may merely be due to fewer people studying history). This year, however, historical studies nearly doubled in presence, from 10% to 17%. I haven’t yet collected enough years of DH conference data to see if this is a trend in the discipline at large, or more of a difference between European and North American DH. Semantic analysis jumped from 1% to 7% of the submissions, cultural studies went from 10% to 14%, and literary studies stayed roughly equivalent. Visualization, one of the hottest topics of DH2013, has become even hotter in 2014 (14% to 16%).

The most visible drops in coverage came in pedagogy, scholarly editions, user interfaces, and research involving social media and the web. At DH2013, submissions on pedagogy had a surprisingly low acceptance rate, which, combined with the drop in pedagogy submissions this year (11% to 8% in “Digital Humanities – Pedagogy and Curriculum” and 7% to 4% in “Teaching and Pedagogy”), might suggest a general decline of interest in pedagogy in the DH world. “Scholarly Editing” went from 11% to 7% of the submissions, and “Interface and User Experience Design” from 13% to 8%, which is yet more evidence for the lack of research going into the creation of scholarly editions compared to several years ago. The most surprising drops for me were those in “Internet / World Wide Web” (12% to 8%) and “Social Media” (8.5% to 5%), which I would have guessed would be growing rather than shrinking.
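For the curious, a comparison like the one in Figure 2 boils down to a little data munging. The sketch below uses invented file and column names (the underlying submission data isn’t public), but the shape of the computation is the same: explode each submission’s topic tags and express their counts as a percentage of all submissions.

```python
# Hypothetical sketch: compare topic coverage between two years of submissions.
# File names and the 'topics' column (semicolon-separated tags) are assumptions.
import pandas as pd

def topic_percentages(df):
    tags = df["topics"].str.split(";").explode().str.strip()
    return (tags.value_counts() / len(df) * 100).round(1)

df13 = pd.read_csv("dh2013_submissions.csv")
df14 = pd.read_csv("dh2014_submissions.csv")

comparison = pd.concat({"2013": topic_percentages(df13),
                        "2014": topic_percentages(df14)}, axis=1)
print(comparison.sort_values("2014", ascending=False).head(10))
```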

The last thing I’ll cover in this post is the author-chosen keywords. While authors needed to tag their submissions from a list of 95 controlled vocabulary words, they were also encouraged to tag their entries with keywords they could choose themselves. In all they chose nearly 1,700 keywords to describe their 589 submissions. In last year’s analysis of these keywords, I showed that visualization seemed to be the glue that held the DH world together; whether discussing TEI, history, network analysis, or archiving, all the disparate communities seemed to share visualization as a primary method. The 2014 keyword map (Figure 3) reveals the same trend: visualization is squarely in the middle. In this graph, two keywords are linked if they appear together on the same submission, thus creating a network of keywords as they co-occur with one another. Words appear bigger when they span communities.

Figure 3. Co-occurrence of DH2014 author-submitted keywords.
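The construction behind Figure 3 is simple enough to sketch with made-up data: every pair of keywords that appears on the same submission gets a link, and the link gets heavier each time the pair reappears.

```python
# Minimal sketch of a keyword co-occurrence network, with invented submissions.
from collections import Counter
from itertools import combinations

submissions = [
    {"visualization", "network analysis", "history"},
    {"TEI", "XML", "scholarly editing"},
    {"visualization", "TEI", "digital humanities"},
]   # each set: one submission's author-chosen keywords

edge_weights = Counter()
for keywords in submissions:
    edge_weights.update(combinations(sorted(keywords), 2))

for (a, b), weight in edge_weights.most_common():
    print(a, "--", b, weight)   # these weighted pairs become the network's edges
```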

Despite the multilingual conference, the largest component of the graph is still English. We can see some fairly predictable patterns: TEI is coupled quite closely with XML; collaboration is another keyword that binds the community together, as is (obviously) “Digital Humanities.” Linguistics and literature are tightly coupled, much more so than, say, linguistics and history. It appears the distant reading of poetry is becoming popular, which I’d guess is a relatively new phenomenon, although I haven’t gone back and checked.

This work has been supported by an ACH microgrant to analyze DH conferences and the trends of DH through them, so keep an eye out for more forthcoming posts that look through the last 15 years. Though I usually share all my data, I’ll be keeping these datasets to myself, as the submitters to the conference submitted under an expectation of privacy if their proposals were not accepted.

[edit: there was some interest on twitter last night for a raw frequency of keywords. Because keywords are author-chosen and I’m trying to maintain some privacy on the data, I’m only going to list those keywords used at least twice. Here you go (Figure 4)!]

Figure 4. Keywords used in DH2014 submissions ordered by frequency.

Networks Demystified 7: Doing Co-Citation Analyses

So this is awkward. I’ve published Networks Demystified 7: Doing Co-Citation Analyses before Networks Demystified 6: Organizing Your Twitter Lists. What depraved lunatic would do such a thing? The kind of depraved lunatic that is teaching this very subject twice in the next two weeks: deal with it, you’ll get your twitterstructions soon, internet. In the meantime, enjoy the irregular nature of the scottbot irregular.

And this is part 7 of my increasingly inaccurately named trilogy of instructional network analysis posts (1 network basics, 2 degree, 3 power laws, 4 co-citation analysis, 5 communities and PageRank, 6 this space left intentionally blank). I’m covering how to actually do citation analyses, so it’s a continuation of part 4 of the series. If you want to know what citation analysis is and why to do it, as well as a laundry list of previous examples in the humanities and social sciences, go read that post. If you want to just finally be able to analyze citations, like you’ve always dreamed, read on. 1

You’re going to need two things for these instructions: The Sci2 Tool, and either a subscription to the multi-gazillion dollar ISI Web of Science database, or this sample dataset. The Sci2 (Science of Science) Tool is a fairly buggy program (I’m allowed to say that because I’m kinda off-and-on the development team and I wrote half the user manual) that specializes in ingesting data of various formats and turning them into networks for analysis and visualization. It’s a good tool to use before you run to Gephi to make your networks pretty, and has a growing list of available plugins. If you already have the Sci2 Tool, download it again, because there’s a new version and it doesn’t auto-update. Go download it. It’s 80mb, I’ll wait.

Once you’ve registered for (not my decision, don’t blame me!) and downloaded the tool, extract the zip folder wherever you want, no install necessary. The first thing to do is increase the amount of memory available to the program, assuming you have at least a gig of RAM on your computer. We’re going to be doing some intensive analysis, so you’ll need the extra space. Edit sci2.ini; on Windows, that can be done by right-clicking on the file and selecting ‘edit’; on Mac, I dunno, elbow-click and press ‘CHANGO’? I have no idea how things work on Macs. (Sorry Mac-folk! We’ve actually documented in more detail how to increase memory – on both Windows and Mac – here)

Once you’re editing the file, you’ll see a nigh-unintelligible string of letters and numbers that ends in “-Xmx350m”. Assuming you have more than a gig of RAM on your computer, change that to “-Xmx1000m”. If you don’t have more RAM, really, you should go get some. Or use only a quarter of the dataset provided. Save it and close the text editor.

Run Sci2.exe. We didn’t pay Microsoft to register the app, so if you’re on Windows, you may get an OHMYGODWARNING sign. Click ‘run anyway’ and safely let my team’s software hack your computer and use it to send pictures of cats to famous network scientists. (No, we’ll be good, promise). You’ll get to a screen remarkably like Figure 7. Leave it open, and if you’re at an institution that pays ISI Web of Science the big bucks, head there now. Otherwise ignore this and just download the sample dataset.

Downloading Data

I’m a historian of science, so let’s look for history of science articles. Search for ‘Isis’ as a ‘Publication Name’ from the drop-down menu (see Figure 1) and notice that, as of 9/23/2013, there are 14,858 results (see Figure 2).

Figure 1: Searching for Isis as the name of a publication.
Figure 2: Isis periodical search results.

This is a list of every publication in the journal ISIS. Each individual record includes bibliographic material, abstract, and the list of references that are cited in the article. To get a reasonable dataset to work with, we’re going to download every article ever published in ISIS, of which there are 1,189. The rest of the records are book reviews, notes, etc. Select only the articles by clicking the checkbox next to ‘articles’ on the left side of the results screen and clicking ‘refine’.

The next step is to download all the records. This web service limits you to 500 records per download, so you’re going to need to download 3 separate files (records 1-500, 501-1000, and 1001-1189) and combine them together, which is a fairly complicated step, so pay close attention. There’s a little “Send to:” drop-down menu at the top of the search results (Figure 3). Click it, and click ‘Other File Formats’.

Figure 3: Saving Web of Science records.

At the pop-up box, check the radio box for records 1 to 500 and enter those numbers, change the record content to ‘Full Record and Cited References’, and change the file format to ‘Plain Text’ (Figure 4). Save the file somewhere you’ll be able to find it. Do this twice more, changing the numbers to 501-1000 and 1001-1189, saving these files as well.

Figure 4: Parameters for downloading Web of Science files.

You’ll end up with three files, possibly named: savedrecs.txt, savedrecs(1).txt, and savedrecs(2).txt. If you open one up (Figure 5), you’ll see that each individual article gets its own several-dozen lines, and includes information like author, title, keywords, abstract, and (importantly in our case) cited references.

Figure 5: An example ISIS record.
Figure 6: The end of an ISIS record file.

You’ll also notice (Figures 5 & 6) that the first two lines and the last line of every file are special header and footer lines. If we want to merge the three files so that the Sci2 Tool can understand them, we have to delete the footer of the first file, the header and footer of the second file, and the header of the last file, so that the new text file has only one header at the beginning, one footer at the end, and none in between. Those of you who are familiar enough with a text editor (and let’s be honest, it should be everyone reading this), go ahead and copy the three files into one huge file with only one header and footer. If you’re feeling lazy, just download it here.
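If you’d rather script the merge than fiddle in a text editor, a few lines of Python will do it. This is a sketch under the assumptions described above (two header lines and one footer line per file, the default savedrecs filenames); adjust the names and the encoding to whatever your downloads actually look like.

```python
# Merge the three Web of Science exports, keeping one header and one footer.
# Filenames, the two-line header, and the one-line footer are assumptions.
files = ["savedrecs.txt", "savedrecs(1).txt", "savedrecs(2).txt"]

with open("isis_all.txt", "w", encoding="utf-8") as out:
    for i, name in enumerate(files):
        with open(name, encoding="utf-8") as f:   # encoding may need adjusting
            lines = f.readlines()
        if i > 0:
            lines = lines[2:]                     # drop header on all but the first file
        if i < len(files) - 1:
            lines = lines[:-1]                    # drop footer on all but the last file
        out.writelines(lines)
```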

Creating a Citation Network

Now open the Sci2 Tool (Figure 7) and go to File->Load in the drop-down menu. Find your super file with all of ISIS and open it, loading it as an ‘ISI flat format’ file (Figure 8).

Figure 7: The Sci2 Tool.
Figure 8: Loading a file as an ISI flat format file.

If all goes correctly, two new files should appear in the Data Manager, the pane on the right-hand side of the software. I’ll take a bit of a detour here to explain the Sci2 Tool.

The main ‘Console’ pane on the top-left will include a complete log of your workflow, including all the various algorithms you use, what settings and parameters you use with them, and how to cite the various ones you use. When you close the program, a copy of the text in the ‘Console’ pane will save itself as a log file in the program directory so you can go back to it later and see what exactly you did.

The ‘Scheduler’ pane on the bottom is just that: it shows you what algorithms are currently running and what already ran. You can safely ignore it.

Along with the drop-down menus at the top, the already-mentioned ‘Data Manager’ pane on the right is where you’ll be spending most of your time. Every time you load a file, it will appear in the data manager. Every time you run an algorithm on or manipulate that file in some way, a copy of it with the new changes will appear hierarchically nested below the original file. This is so, if you make a mistake, want to use an earlier version of the file, or want to run a different set of analyses, you can still do so. You can right-click on files in the data manager to view or save them in various file formats. It is important to remember to make sure that the appropriate file is selected in the data manager when you run an analysis, as it’s easy to accidentally run an algorithm on some other random data file.

With that in mind, once your file is loaded, make sure to select (by left-clicking) the ‘1189 Unique ISI Records’ data file in the data manager. If you right-click and view the file, it should open up in Excel (Figure 9) or whatever your default *.csv viewer is, and you’ll see that the previous text file has been converted to a spreadsheet. You can look through it to see what the data look like.

Figure 9: All of the ISIS History of Science journal articles as a csv.

When you’re done ogling at all the pretty data, close the spreadsheet and go back to the tool. Making sure the ‘1189 Unique ISI Records’ file is selected, go to ‘Data Preparation -> Extract Paper Citation Network’ in the drop-down menu.

Voilà! You now have a history of science citation network. The algorithm spits out two files: ‘Extracted paper-citation network’, which is the network file itself, and ‘Paper information’, which is a spreadsheet that includes all the nodes in the network (in this case, articles that either were published in ISIS or are cited by them). It includes a ‘localCitationCount’ column, which tells you how frequently a work is cited within the dataset (Shapin’s Leviathan and the Air-Pump is cited 16 times, you’ll see if you open up the file), and a ‘globalCitationCount’ column, which is how many times ISI Web of Science thinks the article has been cited overall, not just within the dataset (Merton’s “The Matthew effect in science II” is cited 183 times overall). ‘globalCitationCount’ statistics are of course only available for the records you downloaded, so you have them for the articles published in ISIS, but not for any of the other records.

Select ‘Extracted paper-citation network’ in the data manager. From the drop-down menu, run ‘Analysis -> Networks -> Network Analysis Toolkit (NAT)’. It’s a good idea to run this on any network you have, just to see the basic statistics of what you’re working with. The details will appear in the console window (Figure 10).

Figure 10: Network analysis toolkit output on the ISIS citation network.

There are a few things worth noting right away. The first is that there are 52,479 nodes; that means that our adorable little dataset of 1,189 articles actually referenced over 50,000 other works between them, about 50 refs/article. The second fact worth noting is that there are 54,915 directed edges, which is the total number of direct citations in the dataset. One directed edge is a citation from a citing node (an ISIS article) to a cited node (either an ISIS article, or a book, or whatever the author decides to reference).

The last bit worth pointing out is the number of weakly connected components, and the size of the largest connected component. Each weakly connected component is a chunk of the network connected by citation chains: if articles A and B are the only articles which cite article C, if article C cites nothing else, and if A and B are uncited by any other articles, they together make a weakly connected component. As soon as another citation link comes from or to any of them, that article becomes part of the component as well. In our case, the biggest component is 46,971 nodes, which means that most of the nodes in the network are connected to each other. That’s important: it means history of science as represented by ISIS is relatively cohesive. There are 215 weakly connected components in all, small islands that are disconnected from the mainland.
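If you ever want to check these numbers outside of Sci2, the same statistics are a few lines of networkx. The toy graph below isn’t the ISIS data, just a miniature illustration of what the Network Analysis Toolkit is counting:

```python
# Toy directed citation graph: count nodes, edges, and weakly connected components.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("Article A", "Article C"),   # A cites C
    ("Article B", "Article C"),   # B cites C
    ("Article D", "Article E"),   # a separate little island
])

print(G.number_of_nodes(), G.number_of_edges())       # 5 nodes, 3 directed edges
components = list(nx.weakly_connected_components(G))  # edge direction is ignored here
print(len(components))                                # 2 weakly connected components
print(max(components, key=len))                       # the largest: A, B, and C
```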

If you have Gephi installed, you can visualize the network by selecting ‘Extracted paper-citation network’ in the data manager and clicking ‘Visualization -> Networks -> Gephi’, though what you do from there is beyond the scope of these instructions. It also probably won’t make a heck of a lot of sense: there aren’t many situations where visualizing a citation network is actually useful. It’s what’s called a Directed Acyclic Graph, and those are generally the most visually boring graphs around (don’t cite me on this).

I do have a very important warning. You can tell it’s important because it’s bold. The Sci2 Tool was made by my advisor Katy Börner as a tool for people with research similar to her own, whose interests lie in modeling and predicting the spread of information on a network. As such, the direction of citation edges created by the tool is opposite what many expect. They go from the cited source to the citing source, because the idea is that’s the direction that information flows, rather than from the citing source to the cited source. As a historian, I’m more interested in considering the network in the reverse direction: citing to cited, as that gives more agency to the author. More details in the footnote. 2
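The footnote suggests flipping the direction by hand or with regular expressions; if your network ends up in networkx at any point, one alternative (not part of the Sci2 workflow itself) is to reverse every edge in a single call:

```python
# Flip a directed network so edges run from citing paper to cited paper.
import networkx as nx

sci2_style = nx.DiGraph([("cited paper", "citing paper")])  # Sci2's direction
citing_to_cited = sci2_style.reverse()                      # now citing -> cited
print(list(citing_to_cited.edges()))
```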

Great, now that that’s out of the way, let’s get to the more interesting analyses. Select ‘Extracted paper-citation network’ in the data manager and run ‘Data Preparation -> Extract Document Co-Citation Network’. And then wait. Have you waited for a while? Good, wait some more. This is a process. And 50,000 articles is a lot of articles. While you’re waiting, re-read Networks Demystified 4: Co-Citation Analysis to get an idea of what it is you’re doing and why you want to do it.

Okay, we’re done (assuming you increased the allotted memory to the tool like we discussed earlier). You’re now presented with the ‘Co-citation Similarity Network’ in the data manager, and you should, once again, run ‘Analysis -> Networks -> Network Analysis Toolkit (NAT)’ on it. This will take some time as well, and you’ll see why shortly.

Figure 11: Network analysis toolkit of the ISIS co-citation network.

Notice that while there are the same number of nodes (citing or cited articles) as before, 52,479, the number of edges went from 54,915 to 2,160,275, a 40x increase. Why? Because every time two articles are cited together, they get an edge between them and, according to the ‘Average degree’ in the console pane, each article or book is cited alongside an average of 82 other works.
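Conceptually, the co-citation extraction works like the sketch below (with invented reference lists): any two works appearing in the same bibliography get an edge, and the edge’s weight counts how many bibliographies they share. This isn’t the Sci2 implementation, just the idea.

```python
# Toy document co-citation network built from invented bibliographies.
from itertools import combinations
import networkx as nx

bibliographies = {
    "ISIS article 1": ["Shapin 1985", "Merton 1968", "Kuhn 1962"],
    "ISIS article 2": ["Shapin 1985", "Kuhn 1962"],
    "ISIS article 3": ["Merton 1968", "Latour 1987"],
}

cocite = nx.Graph()
for refs in bibliographies.values():
    for a, b in combinations(sorted(set(refs)), 2):
        if cocite.has_edge(a, b):
            cocite[a][b]["weight"] += 1      # co-cited again: heavier edge
        else:
            cocite.add_edge(a, b, weight=1)

# The next step in the walkthrough ("Extract Edges Above or Below Value")
# simply keeps the heavier edges, e.g. pairs co-cited more than once:
for a, b, d in cocite.edges(data=True):
    if d["weight"] > 1:
        print(a, "--", b, d["weight"])
```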

In order to make the analysis and visualization of this network easier, we’re going to significantly cut its size. Recall that document co-citation networks connect documents that are cited alongside each other, and that the weight of that connection increases the more often the two documents appear together in a bibliography. What we’re going to do here is drastically reduce the network’s size by deleting any edge between documents unless they’ve been cited together more than once. Select ‘Co-citation Similarity Network’ and run ‘Preprocessing -> Networks -> Extract Edges Above or Below Value’. Use the default settings (Figure 12).

Note that when you’re doing a scholarly citation analysis, cutting all the edges below a certain value (called ‘thresholding’) is usually a bad idea unless you know exactly how it will affect your study. We’re doing it here to make the walkthrough easier.

Figure 12: Extracting edges to reduce the size of the network.

Run ‘Analysis -> Networks -> Network Analysis Toolkit (NAT)’ on the new ‘Edges above 1 by weight’ dataset, and note that the network has been reduced from two million edges to three thousand edges, a much more manageable number for our purposes. You’ll also see that there are 51,313 isolated nodes: nodes that are no longer connected to the network because we cut so many edges in our mindless rampage. Who cares about them? Let’s delete them too! Select ‘Edges above 1 by weight’ and run ‘Preprocessing -> Networks -> Delete Isolates’, and watch as fifty thousand precious history of science citations vanish in a puff of metadata. Gone.
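For reference, the networkx equivalent of that ‘Delete Isolates’ step is essentially a one-liner, shown here on a throwaway graph:

```python
# Remove every node left without edges after thresholding.
import networkx as nx

G = nx.Graph([("a", "b")])
G.add_node("c")                            # an isolate
G.remove_nodes_from(list(nx.isolates(G)))
print(list(G.nodes()))                     # only 'a' and 'b' remain
```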

If you run the Network Analysis Toolkit on the new network, you’ll see that we’re left with a small co-citation net of 1,166 documents and 3,344 co-citations between them. The average degree tells us that each document is connected to, on average, 6 other documents, and that the largest connected component contains 476 documents.

So now’s the moment of truth, the time to visualize all your hard work. If you know how to use Gephi, and have it installed, select ‘With isolates removed’ in the data manager and run ‘Visualization -> Networks -> Gephi’. If you don’t, run ‘Visualization -> Networks -> GUESS’ instead, and give it a minute to load. You will be presented with this stunning work of art vaguely reminiscent of last night’s spaghetti and meatball dinner (Figure 13).

Figure 13: GUESS in all its glory.

Fear not! The first step to prettifying the network is to run ‘Layout -> GEM’ and then ‘Layout -> Bin Pack’. Better already, right? Then you can make edits using the graph modifier below (or using python commands in the interpreter), but the friendly folks at my lab have put together a script for you that will do that automatically. Run ‘Script -> Run Script’.

When you do, you will be presented with a godawful java applet that automatically sticks you in some horrible temp directory that you have to find your way out of. In the ‘Look In:’ navigation drop-down, find your way back to your desktop or your documents directory and then find wherever you installed the Sci2 Tool. In the Sci2 directory, there’s a folder called ‘scripts’, and in the ‘scripts’ folder, there’s a ‘GUESS’ folder, and in the ‘GUESS’ folder you will find the holy grail. Select ‘reference-co-occurrence-nw.py’ and press ‘open’.

Magic! Your document co-citation network is now all green and pretty, and you can zoom in and out using either the +/- button on the left, or using your mouse wheel and clicking and dragging on the network itself. It’ll look a bit like Figure 14.

Figure 14: Co-Citation network in GUESS.

If you feel more dangerous and cool, you can try visualizing the same network in Gephi, and it might come out something like Figure 15.

Figure 15: Gephi’s document co-citation network, with nodes sized by how frequently they’re cited in ISIS.

That’s it! You’ve co-cited a dataset. I hope you feel proud of yourself, because you should. And all without breaking a sweat. If you want (and you should want), you can save your results by right clicking the various files in the data manager you want to save. I’d recommend saving the most recent file, ‘With isolates removed’, and saving it as an NWB file, which is fairly easy to read and is the Sci2 Tool’s native format.

Stay tuned for the paradoxically earlier-numbered Networks Demystified 6, on organizing your Twitter lists.

Notes:

  1. Part 4 also links to a few great tutorials on how to do this with programming, but if you don’t know the first thing about programming, start here instead.
  2. Those of you who know network basics, keep this in mind when running your analyses: PageRank, In & Out Degree, etc., may be opposite of what you expect, with the papers that cite the most sources as those with the highest In-Degree and PageRank. If this is opposite your workflow, you can fairly easily change the data by hand in a spreadsheet editor or with regular expressions.

Breaking the Ph.D. model using pretty pictures

Earlier today, Heather Froehlich shared what’s at this point become a canonical illustration among Ph.D. students: “The Illustrated guide to a Ph.D.” The illustrator, Matt Might, describes the sum of human knowledge as a circle. As a child, you sit at the center of the circle, looking out in all directions.

Eventually, he describes, you get various layers of education, until by the end of your bachelor’s degree you’ve begun to specialize, focusing your knowledge in one direction.

A master’s degree further deepens your focus, extending you toward an edge, and the process of pursuing a Ph.D., with all the requisite reading, brings you to a tiny portion of the boundary of human knowledge.


You push and push at the boundary until one day you finally poke through, pushing that tiny portion of the circle of knowledge just a wee bit further than it was. That act of pushing through is a Ph.D.


It’s an uplifting way of looking at the Ph.D. process, inspiring that dual feeling of insignificance and importance that staring at the Hubble Ultra-Deep Field tends to bring about. It also exemplifies, in my mind, one of the broken aspects of the modern Ph.D. But while we’re on the subject of the Hubble Ultra-Deep Field, let me digress momentarily about stars.

Quite a while before you or I were born, Great Thinkers with Big Beards (I hear even the Great Women had them back then) also suggested we sat at the center of a giant circle, looking outwards. The entire universe, or in those days, the cosmos (Greek: κόσμος, “order”), was a series of perfect layered spheres, with us in the middle, and the stars embedded in the very top. The stars were either gems fixed to the last sphere, or they were little holes poked through it that let the light from heaven shine through.


As I see it, if we connect the celestial spheres theory to “The Illustrated Guide to a Ph.D.”, we’d arrive at the inescapable conclusion that every star in the sky is another dissertation, another hole poked letting the light of heaven shine through. And yeah, it takes a very prescriptive view of knowledge and the universe that either you or I could argue with, but for this post we can let it slide because it’s beautiful, isn’t it? If you’re a Ph.D. student, don’t you want to be able to do this?

The problem is I don’t actually want to do this, and I imagine a lot of other people don’t want to do this, because there are already so many goddamn stars. Stars are nice. They’re pretty, how they twinkle up there in space, trillions of miles away from one another. That’s how being a Ph.D. student feels sometimes, too: there’s your research, my research, and a gap between us that could stretch to Alpha Centauri and back. Really, just astronomically far away.


It shouldn’t have to be this way. Right now a Ph.D. is about finding or doing something that’s new, in a really deep and narrow way. It’s about pricking the fabric of the spheres to make a new star. In the end, you’ll know more about less than anyone else in the world. But there’s something deeply unsettling about students being trained to ignore the forest for the trees. In an increasingly connected world, the universe of knowledge about it seems to be ever-fracturing. Very few are being trained to stand back a bit and try to find patterns in the stars. To draw constellations.

I should know. I’ve been trying to write a dissertation on something huge, and the advice I’ve gotten from almost every professor I’ve encountered is that I’ve got to scale it down. Focus more. I can’t come up with something new about everything, so I’ve got to do it about one thing, and do it well. And that’s good advice, I know! If a lot of people weren’t doing that a lot of the time, we’d all just be running around in circles and not doing cool things like going to the moon or watching animated pictures of cats on the internet.

But we also need to stand back and take stock, to connect things, and right now there are institutional barriers in place making that really difficult. My advisor, who stands back and connects things for a living (like the map of science below), gives me the same prudent advice as everyone else: focus more. It’s practical advice. For all that universities celebrate interdisciplinarity, in the end you still need to get hired by a department, and if you don’t fit neatly into their disciplinary niche, you’re not likely to make it.
My request is simple. If you’re responsible for hiring researchers, or promoting them, or in charge of a department or (!) a university, make it easier to be interdisciplinary. Continue hiring people who make new stars, but also welcome the sort of people who want to connect them. There certainly are a lot of stars out there, and it’s getting harder and harder to see what they have in common, and to connect them to what we do every day. New things are great, but connecting old things in new ways is also great. Sometimes we need to think wider, not deeper.


From Trees to Webs: Uprooting Knowledge through Visualization

[update: here are some of the pretty pictures I will be showing off in The Hague]

The blog’s been quiet lately; my attention has been occupied by various journal submissions and a new book in the works, but I figured my readers would be interested in one of those forthcoming publications. This is an article [preprint] I’m presenting at the Universal Decimal Classification Seminar in The Hague this October, on the history of how we’ve illustrated the interconnections of knowledge and scholarly domains. It’s basically two stories: one of how we shifted from understanding the world hierarchically to understanding it as a flat web of interconnected parts, and the other of how the thing itself and knowledge of that thing became separated.

Porphyrian Tree: tree of Aristotle’s categories originally dating from the 6th century. [via some random website about trees]
A few caveats worth noting: first, because I didn’t want to deal with the copyright issues, there are no actual illustrations in the paper. For the presentation, I’m going to compile a powerpoint with all the necessary attributions and post it alongside this paper so you can all see the relevant pretty pictures. For your viewing pleasure, though, I’ve included some of the illustrations in this blog post.

An interpretation of the classification of knowledge from Hobbes’ Leviathan. [via e-ducation]
Second, because this is a presentation directed at information scientists, the paper is organized linearly and with a sense of inevitability; or, as my fellow historians would say, it’s very whiggish. I regret not having the space to explore the nuances of the historical narrative, but it would distract from the point and context of this presentation. I plan on writing a more thorough article to submit to a history journal at a later date, hopefully fitting more squarely in the historiographic rhetorical tradition.

H.G. Wells’ idea of how students should be taught. [via H.G. Wells, 1938. World Brain. Doubleday, Doran & Co., Inc]
In the meantime, if you’re interested in reading the pre-print draft, here it is! All comments are welcome; as I said, I’d like to make this into a fuller scholarly article beyond the published conference proceedings. I was excited to put this up now, but I’ll probably have a new version with full citation information within the week, if you’re looking to enter this into Zotero/Mendeley/etc. Also, hey! I think this is the first post on the Irregular that has absolutely nothing to do with data analysis.

Recent map of science by Kevin Boyack, Dick Klavans, W. Bradford Paley, and Katy Börner. [via SEED magazine]

An experiment in communal editing: Finding the history & philosophy of science.

After my last post about co-citation analysis, the author of one of the papers I was responding to, K. Brad Wray, generously commented and suggested I write up and publish the results and send them off to Erkenntnis, the same journal in which he published his own results. That sounded like a great idea, so I am.

Because so many good ideas have come from comments on this blog, I’d like to try opening my first draft to communal commenting. For those who aren’t familiar with google docs (anyone? Bueller?), you can comment by selecting text and either hitting ctrl-alt-m, or going to the Insert menu and clicking ‘Comment’.

The paper is about the relationship between history of science and philosophy of science, and draws both from the blog post and from this page with additional visualizations. There is also an appendix (pdf, sorry) with details of data collection and some more interesting results for the HPS buffs. If you like history of science, philosophy of science, or citation analysis, I’d love to see your comments! If you have any general comments that don’t refer to a specific part of the text, just post them in the blog comments below.

This is a bit longer form than the usual blog, so who knows if it will inspire much interaction, but it’s worth a shot. Anyone who is signed in so I can see their name will get credit in the acknowledgements.

Finding the History and Philosophy of Science (earlier draft)  ← draft 1, thanks for your comments.

Finding the History and Philosophy of Science (current draft) ← comment here!

 

Networks Demystified 4: Co-Citation Analysis

This installment of Networks Demystified is the first one that’s actually applied. A few days ago, a discussion arose over twitter involving citation networks, and this post fills the dual purpose of continuing that discussion, and teaching a bit about basic citation analysis. If you’re looking for the very basics of networks, see part 1 and part 2. Part 3 is a warning for anyone who feels the urge to say “power law.” To recap: nodes are the dots/points in the network, edges are the lines/arrows/connections.

Understanding Sociology, Philosophy, and Literary Theory using One Easy Method™!

The growing availability of humanities and social science (HSS) citation data in databases like ISI’s Web of Science (warning: GIANT paywall. Good luck getting access if your university doesn’t subscribe.) has led to a groundswell of recent blog activity in the area, mostly by the humanists and social scientists themselves. Which is a good thing, because citation analyses of HSS will happen whether we’re involved in doing them or not, so if humanists start becoming familiar with the methods, at least we can begin getting humanistically informed citation analyses of our own data.

The size of ISI’s Web of Science paywall. You shall not pass. [via]
This is a sort of weird post. It’s about history and philosophy of science, by way of social history, by way of literary theory, by way of philosophy, by way of sociology. About this time last year, Dan Wang asked the question Is There a Canon in Economic Sociology (pdf)? Wang was searching for a set of core texts for economic sociology, using a set of 52 syllabi on the subject. It’s a reasonable first pass at the question, counting how often each article appears in the syllabi (plus some more complex measurements) as well as how often individual authors appear. Those numbers are used to support the hypothesis that there is a strongly present canon, both of authors and individual articles, in economic sociology. This is an example of an extremely simple bimodal network analysis, in which there are two varieties of node: syllabi and articles. Each syllabus cites multiple articles, and several of those articles are cited by multiple syllabi. The top part of Figure 1 is what this would look like in a basic network representation.

Figure 1: basic bimodal network (top) and the resulting co-citation network (bottom). [via Mark Newman, PNAS]
Wang was also curious how instructors felt these articles fit together, so he used a common method called co-citation analysis to answer the question. The idea is that if two articles are cited in the same syllabus, they are probably related, so they get an edge drawn between them. He further restricted his analysis so that articles had to appear together in the same class session, rather than merely in the same syllabus, to be considered related to each other. What results is a new network (Figure 1, bottom) of article similarity based on how frequently they appear together (how frequently they are cited by the same source). In Figure 1, you can see that because article H and article F are both cited in class session 3, they get an edge drawn between them.

A further restriction was then placed on the network, what’s called a threshold. Two articles would only get an edge drawn between them if they were cited by at least 2 different class sessions (threshold = 2). The resulting economic sociology syllabus co-citation network looked like Figure 2, pulled from the original article. From this picture, one can begin to develop a clear sense of the demarcations of subjects and areas within economic sociology, thus splitting the canon into its constituent parts.
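To make the mechanics concrete, here’s a minimal sketch of that projection in Python. The session data is made up and this is not Wang’s actual code, but the logic is the same: count how often each pair of articles appears in the same class session, then keep the pairs above a threshold.

```python
from collections import Counter
from itertools import combinations

# Hypothetical syllabus data: each class session lists the articles it cites.
sessions = {
    "session 1": ["A", "B", "C"],
    "session 2": ["B", "C", "D"],
    "session 3": ["F", "H"],
    "session 4": ["B", "C", "H"],
}

# Count how often each pair of articles appears together in a session.
pair_counts = Counter()
for articles in sessions.values():
    for a, b in combinations(sorted(set(articles)), 2):
        pair_counts[(a, b)] += 1

# Keep only pairs co-cited in at least `threshold` class sessions (Wang used 2).
threshold = 2
edges = {pair: count for pair, count in pair_counts.items() if count >= threshold}
print(edges)  # {('B', 'C'): 3}
```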

Figure 2: Co-citation network in economic sociology. Edge thickness represents how often articles appear together in syllabi, and node size is based on a measure of centrality. [via]
In short order, Kieran Healy blogged a reply to this study, providing his own interpretations of the graph and what the various clusters represented. Remember Healy’s name, as it’s important later in the story. Two days after Healy’s blog post, Neal Caren took inspiration and created a co-citation analysis of sociology more broadly–not just economic sociology–using data he downloaded from ISI’s Web of Science (remember the giant paywall from before?). Instead of using syllabi, Caren looked at articles found in American Journal of Sociology, American Sociological Review, Social Forces and Social Problems since 2008. Web of Science gave him a list of every citation from every article in those journals, and he performed the same sort of co-citation analysis as Dan Wang did with syllabi, but at a much larger scale.

Because the dataset Caren used was so much larger, he had to enforce much stricter thresholds to keep the visualization manageable. Whereas Wang’s graph showed all articles, connecting them if they appeared together in at least 2 class sessions, Caren’s graph only connected articles which were cited together more than 4 times (threshold = 4). Further, a cited article wouldn’t even appear on the network visualization unless the article itself had been cited 8 or more times, thus reducing the number of articles appearing on the visualization overall. The final network had 397 nodes (articles) and 1,597 edges (connections between articles). He also used a popular community detection algorithm to color the different article nodes based on which other articles they were most related to. Figure 3 shows the resulting network, and clicking on it will lead to an interactive version.
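For a rough sense of how the stricter thresholds and the community coloring work together, here’s a hedged sketch. The numbers are invented, and I don’t actually know which community detection algorithm Caren used; I’ve substituted networkx’s greedy modularity routine as a stand-in.

```python
import networkx as nx
from networkx.algorithms import community

# Invented co-citation counts and per-article citation totals.
pair_counts = {
    ("A", "B"): 6, ("A", "C"): 5, ("B", "C"): 7,
    ("D", "E"): 9, ("D", "F"): 5, ("E", "F"): 6,
    ("C", "D"): 2,  # a weak tie between the two clusters
}
citation_totals = {"A": 12, "B": 10, "C": 9, "D": 15, "E": 11, "F": 8}

G = nx.Graph()
for (a, b), weight in pair_counts.items():
    if weight <= 4:          # co-cited together more than 4 times...
        continue
    if citation_totals[a] < 8 or citation_totals[b] < 8:
        continue             # ...and each article itself cited 8 or more times
    G.add_edge(a, b, weight=weight)

# Group nodes by community, roughly what the coloring in the visualization shows.
clusters = community.greedy_modularity_communities(G, weight="weight")
for i, cluster in enumerate(clusters):
    print(f"community {i}: {sorted(cluster)}")
```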

Figure 3: Neal Caren’s sociology co-citation analysis. Click the picture to see the interactive version. [via]
Caren adds a bit of contextual description in his blog post, explaining what the various clusters represent and why this visualization is a valid and useful one for the field of sociology. Notably, at the end of the post, he shares his raw data, a python script for analyzing it, and all the code for visualizing the network and making it interactive and pretty.

Jump forward a year. Kieran Healy, the one who wrote the original post inspiring Neal Caren’s, decides to try his own hand at a citation analysis using some of the code and methods that Neal Caren had posted about. Healy’s blog post, created just a few days ago, looks at the field of philosophy through the now familiar co-citation analysis. Healy’s analysis covers 20 years of four major philosophy journals, consisting of 2,200 articles. Together these articles contain over 34,000 citations, although many of those citations point to the same cited articles. Healy writes:

The more often any single paper is cited, the more important it’s likely to be. But the more often any two papers are cited together, the more likely they are to be part of some research question or ongoing problem or conversation topic within the discipline.

With a dataset this large, the resulting co-citation network wound up having over a million edges, or connections between co-cited articles. Healy decides to focus only on the 500 most highly-cited items in the journals (not the best practice for a co-citation analysis, but I’ll address that in a later post), meaning only articles cited more than 10 times within the four-journal dataset appear in the network. Figure 4 shows the resulting network, which, like Figure 3, can be clicked on to reach the interactive version.

Figure 4: Kieran Healy’s co-citation analysis of four philosophy journals. Click for interactivity. [via]
The post goes on to provide a fairly thorough and interesting analysis of the various communities formed by article clusters, thus giving a description of the general philosophy landscape as it currently stands. The next day, Healy posted a follow-up delving further into citations of philosopher David Lewis, and citation frequencies by gender. Going through the most highly cited 500 or so philosophy articles by hand, Healy finds that 3.6% of the articles are written by women; 6.3% are written by David Lewis; the overwhelming majority are written by white men. It’s not lost on me that the overwhelming majority of people doing these citation analyses are also white men – someone please help change that? Healy posted a second follow-up a few days later, worth reading, on his reasoning behind which journals he used and why he looked at citations in general. He concludes “The 1990s were not the 1950s. And yet essentially none of the women from this cohort are cited in the conversation with anything close to the same frequency, despite working in comparable areas, publishing in comparable venues, and even in many cases having jobs at comparable departments.”

Mere days after Healy’s posts, Jonathan Goodwin was inspired to use the same code Healy and Caren had used to perform a co-citation analysis of literary theory journals. He began by concluding that these co-citation analyses were much more useful (better) than his previous attempts at direct citation analysis. About four decades of bibliometric research backs up Goodwin’s claim. Figure 5 shows Goodwin’s literary theory co-citation network, drawn from five journals and clickable for the interactive version; he adds a bit of code so the user can decide for herself at what threshold to cut off co-citation weights. Goodwin describes the code to create the effect on his github account. In a follow-up post, directly inspired by Healy’s, Goodwin looks at citations to women in literary theory. His results? When a feminist theory journal is included, 8 of the top 30 authors are women (27%); when that journal is not included, only 2 of the top 30 authors are women (7%).

Figure 5: Goodwin’s literary theory co-citation network. [via]

At the Speed of Blog

Just after these blog posts were published, a quick twitter exchange between Jonathan Goodwin, John Theibault, and myself (part of it readable here) spurred Goodwin, in the space of 20 minutes, to download, prepare, and visualize the co-citation data of four social history journals over 40 years. He used ISI Web of Science data, Neal Caren’s code, a bit of his own, and a few other bits of open script which he generously cites and links to. All of this is to highlight not only the phenomenal speed of research when unencumbered by the traditional research process, but also the ease with which these sorts of analyses can be accomplished. Most of this is done using some (fairly simple) programming, but there are equally easy solutions if you don’t know how or don’t care to code–one of which, the Sci2 Tool, I’ll mention later. From data to visualization can take a matter of minutes; a first pass at interpretation won’t take much longer. These are fast analyses, pretty useful for getting a general overview of some discipline, and they can provide quite a bit of material for deeper analysis.
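For those wondering what that “fairly simple programming” might look like, here’s a rough sketch of the first step. I’m assuming a tab-delimited Web of Science export whose CR column holds each article’s cited references separated by semicolons (export formats vary, so check your own file), and the file name is made up.

```python
import csv
from collections import Counter
from itertools import combinations

cocitations = Counter()

# Hypothetical file: a tab-delimited Web of Science export with a "CR" column.
with open("savedrecs.txt", encoding="utf-8-sig", newline="") as f:
    for record in csv.DictReader(f, delimiter="\t"):
        refs = {r.strip() for r in record.get("CR", "").split(";") if r.strip()}
        for a, b in combinations(sorted(refs), 2):
            cocitations[(a, b)] += 1

# The most frequently co-cited pairs of references across the journal run.
for pair, count in cocitations.most_common(10):
    print(count, pair)
```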

The social history dataset is now sitting on Goodwin’s blog just waiting to be interpreted by the right expert. If you or anyone you know is familiar with social history, take a stab at figuring out what the analysis reveals, and then let us all know in a blog post of your own. I’ll be posting a little more about it soon as well, though I’m no expert in the discipline. Also, if you’re interested in citation analysis in the humanities, and you’ll be at DH2013 in Nebraska, I’ll be chairing a session all about citations in the humanities featuring an impressive lineup of scholars. Come join us and bring questions, July 17th at 10:30am.

Discovering History and Philosophy of Science

Before I wrap up, it’s worth mentioning that in one of Kieran Healy’s blog posts, he thanks Brad Wray for pointing out some corrections in the dataset. Brad Wray is one of the few people to have published a recent philosophy citation analysis in a philosophy journal. Wray is a top-notch philosopher, but his citation analysis (Philosophy of Science: What are the Key Journals in the Field?, Erkenntnis, May 2010 72:3, paywalled) falls a bit short of the mark, and as this is an instructional piece on co-citation analysis, it’s worth taking some time here to explore why.

Wray’s article’s thesis is that “there is little evidence that there is such a field as the history and philosophy of science (HPS). Rather, philosophy of science is most properly conceived of as a sub-field of philosophy.” He arrives at this conclusion via a citation analysis of three well-respected monographs: A Companion to the Philosophy of Science, The Routledge Companion to Philosophy of Science, and The Philosophy of Science edited by David Papineau, in total comprising 149 articles. Wray then counts how many times major journals are cited within each article, and shows that in most cases, the most frequently cited journals across the board are strict philosophy of science journals.

The data used to support Wray’s thesis–that there is no such field as history & philosophy of science (HPS)–is this coarse-level journal citation data. No history of science journal is listed in the top 10-15 journals cited by the three monographs, and HPS journals appear, but very infrequently. Of the evidence, Wray writes “if there were such a field as history and philosophy of science, one would expect scholars in that field to be citing publications in the leading history of science journal. But, it appears that philosophy of science is largely independent of the history of science.”

It is curious that Wray would suggest that total citation counts from strictly philosophy of science companions can be used as evidence of whether a related but distinct field, HPS, actually exists; the low rate of citation from philosophy of science to history of science is, for him, that evidence. A more nuanced approach to the problem would look like the approach above: co-citation analysis. Perhaps HPS can be found by analyzing citations from journals which are ostensibly HPS, rather than analyzing three focused philosophy of science monographs. If a cluster of articles were to appear in such a co-citation analysis, this would be strong evidence that the discipline currently exists among citing articles. If such a cluster did not appear, this would not be evidence of the non-existence of HPS (absence of evidence ≠ evidence of absence), only that the dataset or the analysis type is not suited to finding whatever HPS might be. A more thorough analysis would be required to actually disprove the existence of HPS, although one imagines it would be difficult explaining that disproof to the people who consider themselves part of it.

With this in mind, I decided to perform the same sort of co-citation analysis as Dan Wang, Kieran Healy, Neal Caren, and Jonathan Goodwin, and see what could be found. I drew from 15 journals classified in ISI’s Web of Science as “History & Philosophy of Science” (British Journal for the Philosophy of Science, Journal of Philosophy, Synthese, Philosophy of Science, Studies in History and Philosophy of Science, Annals of Science, Archive for History of Exact Sciences, British Journal for the History of Science, Historical Studies in the Natural Sciences, History and Philosophy of the Life Sciences, History of Science, Isis, Journal for the History of Astronomy, Osiris, Social Studies of Science, Studies in History and Philosophy of Modern Physics, and Technology and Culture). In all I collected 12,510 articles dating from 1956, with over 300,000 citations between them. In the interest of not overheating my laptop, I restricted my analysis to citations within the dataset; that is, if any article from any of those journals cited any other article from one of those journals, it was included in the analysis.

I also changed my unit of analysis from the article to the author. I didn’t want to see how often two articles were cited by some third article–I wanted to see how often two authors were cited together within some article. The resulting co-citation analysis gives author-author pairs rather than article-article pairs, like the examples above. In all, there were 7,449 authors in the dataset, and 10,775 connections between author pairs; I did not threshold edges, so some author pairs in the network were cited together only once, and some as many as 60 times. To perform the analysis I used the Science of Science (Sci2) Tool, which requires no programming (full advertisement disclosure: I’m on the development team), and some co-authors and I have written up how to do a similar analysis in the documentation tutorials.
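Sci2 does all of this through menus, but for the script-inclined, the shift from article-level to author-level co-citation looks roughly like the sketch below. The reference strings are invented, and the author extraction (grabbing everything before the first comma) is far cruder than what a careful analysis would want.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: for each citing article, its list of cited reference
# strings (Web of Science-style strings typically lead with the first author).
citing_articles = {
    "article 1": ["KUHN TS, 1962, STRUCTURE SCI REVOL",
                  "POPPER K, 1959, LOGIC SCI DISCOVERY"],
    "article 2": ["KUHN TS, 1970, CRITICISM GROWTH KNO",
                  "LAKATOS I, 1970, CRITICISM GROWTH KNO"],
}

def first_author(reference):
    # Crude: treat everything before the first comma as the author name.
    return reference.split(",")[0].strip()

author_pairs = Counter()
for references in citing_articles.values():
    authors = {first_author(r) for r in references}
    for a, b in combinations(sorted(authors), 2):
        author_pairs[(a, b)] += 1  # no threshold, as in the analysis above

print(author_pairs.most_common())
```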

The resulting author co-citation network, in Figure 6, reveals two fairly distinct clusters of authors. You can click the image to enlarge, but I’ve zoomed in on the two communities, one primarily history of science, the other primarily philosophy of science. At first glance, Wray’s hypothesis appears to be corroborated by the visualization; there’s not much in the way of a central cluster between the two. That said, a closer look at the middle, Figure 7, highlights a group of people who either consider themselves within HPS, or whom others have considered HPS.

Figure 6: Author co-citation network of 15 history & philosophy of science journals. Two authors are connected if they are cited together in some article, and connected more strongly if they are cited together frequently. Click to enlarge. [via me!]
Figure 7: Author co-citation analysis of history and philosophy of science journals, zoomed in on the area between history and philosophy, with authors highlighted who might be considered HPS. Click to enlarge.

Figures 6 & 7 don’t prove anything, but they do suggest that, within citation patterns, history of science and philosophy of science are each clearly more cohesive than some combined HPS might be. Figure 7 suggests there might be more to the story, and the next step in trying to pin down HPS–if indeed it exists as some sort of cohesive unit–is to find articles that specifically self-identify as HPS and, through their citation and language patterns, see what they have in common with and what separates them from the larger community. A more thorough set of analytics, visualizations, and tables, which I’ll explain further at some point, can be found here (apologies for the pdf; this was originally made in preparation for another project).

The reason I bring up this example is not to disparage Wray, whose work did a good job of finding the key journals in philosophy of science, but to argue that we as humanists need to make sure the methods we borrow match the questions we ask. Co-citation analysis happens to be a pretty good method for exploring the question Wray asked in his thesis, but there are many more situations where it wouldn’t be particularly useful. The recent influx of blog posts on the subject, along with the upcoming DH2013 session, is exciting, because it means humanists are beginning to take citation analysis seriously and are exploring the various situations in which its methods are appropriate. I look forward to seeing what comes out of the Social History data analysis, as well as future directions this research will take.

On the importance of a single historical author

I have a dirty admission to make: I think yesterday happened. Actually. Objectively. Stuff happened. I hear that’s still a controversial statement in some corners of the humanities, but I can’t say for sure; I generally avoid those corners. And I think descriptions of the historical evidence can vary in degrees of accuracy, separating logically coherent but historically implausible conspiracy theories from more likely narratives.

At the same time, what we all think of as the past is a construct. A bunch of people – historians, cosmologists, evolutionary biologists, your grandmother who loves to tell stories, you – have all worked together to construct and reconstruct the past. Lots of pasts, actually, because no two people can ever wholly agree; everybody sees the evidence through the lens of their own historical baggage.

I’d like to preface this post with the cautious claim that I am an outsider explaining something I know less about than I should. The hats I wear are information/data scientist and historian of science, and through some accident of the past, historians and historians of science have followed largely separate cultural paths. Which is to say, neither the historian of science in me nor the information scientist has a legitimate claim to the narrative of general history and the general historical process, but I’m attempting to write one anyway. I welcome any corrections or admonishments.

The Narrativist Individual

I use in this post (and in life in general) the vocabulary definitions of Aviezer Tucker, who is doing groundbreaking work on simple stuff like defining “history” and asking questions about what we can know about the past. 1 “History,” Tucker defines, is simply stuff that happened: the past itself. “Historians” are anybody who inquires about the past, from cosmologists to historical linguists. A “historiography” is a knowledge of the past, or more concretely, something a historian has written about the past. “Historiographic research” is what we historians do when we try to find out about the past, and a “historiographic narrative” is generally the result of a lot of that research strung together. 2

Narratives are important. In the 1970s, a bunch of historians began realizing that historians create narratives when they collect their historiographic research 3; that is, people tell stories about the past, using the same sorts of literary and rhetorical devices used in many other places. History itself is a giant jumble of events and causal connections, and representing it as it actually happened would be completely unintelligible and philosophically impossible, without recreating the universe from scratch. Historians look at evidence of the past and then impose an order, a pattern, in reconstructing the events to create their own unique historiographic narratives.

The narratives historians write are inescapably linked to their own historical baggage. Historians are biased and imperfect, and they all read history through the filter of themselves. Historiographic reconstructions, then, are as much windows into the historians themselves as they are windows into the past. The narrativist turn in historiography did a lot to situate the historian herself as a primary figure in her narrative, and it became widely accepted that instead of getting closer to some ground truth of history, historians were in the business of building consistent and legible narratives, their own readings of the past, so long as they were also consistent with the evidence. Those narratives became king, both epistemologically and in practice; historical knowledge is narrative knowledge.

Because narrative knowledge is a knowledge derived from lived experience – the historian sees the past in his own unique light – this emphasized the importance of the individual in historiographic research. Because historians could not (and by and large were not) attempting to reach an objective ground truth about the past, any claim to knowledge rested in the lone historian, how he read the past, and how he presented his narrative. What resulted was a (fairly justified, given their conceptualization of historiographic knowledge) fetishization of the individual, the autonomous historian.

When multiple authors write a historiographic narrative, something almost ineffable is lost: the individual perspective which drives the narrative argument, part of the essential claim-to-knowledge. In a recent discussion with Ben Schmidt about autonomous humanities work vs. collaboration (the original post; my post; Ben’s reply), Ben pointed out “all the Stanley Fishes out there have reason to be discomfited that DHers revel so much in doing away with not only the printed monograph, traditional peer review, and close reading, but also the very institution of autonomous, individual scholarship. Erasmus could have read the biblical translations out there or hired a translator, but he went out and learned Greek [emphasis added].” I think a large part of that drive for autonomy (beyond the same institutional that’s-how-we’ve-always-done-it inertia that lone natural scientists felt at the turn of the last century) is the situatedness-as-a-way-of-knowing that imbues historiographic research, and humanistic research in general.

I’m inclined to believe that historians need to move away from an almost purely narrative epistemology; keeping in sight that all historiographic knowledge is individually constructed, remaining aware that our overarching cultural knowledge of the past is socio-technically constructed, but not letting that cripple our efforts at coordinating research, at reaching for some larger consilience with the other historical research programs out there, like paleontology and cosmology and geology. Computational methodologies will pave the way for collaborative research both because they allow it, and because they require it.

Collaboratively Constructing Paris

This is a map of Paris.

Map of Paris with dots representing photos taken and posted on Flickr. Red dots are pictures taken by tourists, blue are by locals, and yellow are unknown. via Eric Fischer.

On top of this map of Paris are red, blue, and yellow dots. The red dots are the locations of pictures taken and posted to Flickr by tourists to Paris; blue dots are where locals took pictures; yellow dots are unknown. The resulting image maps and differentiates touristic and local space by popularity, at least among Flickr users. It is a representation that would have been staggeringly difficult for an outsider to create without this sort of data-analytic approach, and yet someone intimately familiar with the city could look at this map and not be surprised. Perhaps they could even recreate it themselves.
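For what it’s worth, once each photo has already been labeled tourist or local, drawing a map like this is a short plotting exercise. The toy sketch below uses made-up coordinates and says nothing about how Fischer actually assigned the labels, which is the genuinely clever part.

```python
import random
import matplotlib.pyplot as plt

# Made-up geotagged photos, each already labeled "tourist" or "local".
random.seed(0)
photos = [(random.uniform(2.25, 2.42),    # longitude, roughly Paris
           random.uniform(48.81, 48.90),  # latitude
           random.choice(["tourist", "local"]))
          for _ in range(2000)]

colors = {"tourist": "red", "local": "blue"}
for label, color in colors.items():
    xs = [lon for lon, lat, who in photos if who == label]
    ys = [lat for lon, lat, who in photos if who == label]
    plt.scatter(xs, ys, s=1, c=color, label=label)

plt.legend()
plt.title("Toy tourist/local photo map")
plt.show()
```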

What is this knowledge of Paris? It’s surely not a subjective representation of the city, not unless we stretch the limits of the word beyond the internally experienced and toward the collective. Neither is it an objective map 4 of the city, external to the whims of the people milling about within. The map represents an aggregate of individual experiences, a kind of hazy middle ground within the usual boundaries we draw between subjective and objective truth. This is an epistemological and ontological problem I’ve been wondering about for some time, without being able to come up with a good word for it until a conversation with a colleague last year.

“This is my problem,” I told Charles van den Heuvel, explaining my difficulties in placing these maps and other similar projects on the -jectivity scale. “They’re not quite intersubjective, not in the way the word is usually used,” I said, and Charles just looked at me like I was missing something excruciatingly obvious. “What is it when a group of people believe or do or think common things in aggregate?”—Charles asked—”isn’t that just called culture?” I initially disagreed, but mostly because it was so obvious that I couldn’t believe I’d just passed it over entirely.

In 1976, the infamous Stanley Milgram and co-author Denise Jodelet 5 responded to Durkheim’s emphasis on “the objectivity of social facts” by suggesting “that we understand things from the actor’s point of view.” To drive this point home, Milgram decides to map Paris. People “have a map of the city [they live in] in their minds,” Milgram suggests, and their individual memories and attitudes flavor those internal representations.

This is a good example to use, for Milgram, because cities themselves are socially constructed entities; what is a city without its people who live in and build it? Milgram goes on to suggest that people’s internal representations of cities are similarly socially constructed, that “such representations are themselves the products of social interaction with the physical environment.” In the ensuing study, Milgram asks 218 subjects to draw a non-tourist map of Paris as it seems to them, including whatever features they feel relevant. “Through selection, emphasis and distortion, the maps became projections of life styles.”

Milgram then compares all the maps together, seeking what unifies them: first and foremost, the city limits and the Seine. The river is distorted in a very particular way in nearly all maps, bypassing two districts entirely and suggesting they are of little importance to those who drew the maps. The center of the city, Notre Dame and the Île de la Cité, also remains constant. Milgram opposes this to a city like New York, the subject of a later similar study, whose center shifts slowly northward as the years roll by. Many who drew maps of either New York or Paris included elements they were not intimately familiar with, but they knew were socially popular, or were frequent spots of those in their social circles. Milgram concludes “the social representations of the city are more than disembodied maps; they are mechanisms whereby the bricks, streets, and physical geography of a place are endowed with social meaning.”

It’s worth posting a large chunk of Milgram’s earlier article on the matter:

A city is a social fact. We would all agree to that. But we need to add an important corollary: the perception of a city is also a social fact, and as such needs to be studied in its collective as well as its individual aspect. It is not only what exists but what is highlighted by the community that acquires salience in the mind of the person. A city is as much a collective representation as it is an assemblage of streets, squares, and buildings. We discern the major ingredients of that representation by studying not only the mental map in a specific individual, but by seeing what is shared among individuals.

Collaboratively Constructing History

Which brings us back to the past. 6 Can collaborating historians create legitimate narratives if they are not well-founded in personal experience? What sort of historical knowledge is collective historical knowledge? To this question, I turn to blogger Alice Bell, who wrote a delightfully short post discussing the social construction of science.  She writes about scientific knowledge, simply, “Saying science is a social construction does not amount to saying science is make believe.” Alice compares knowledge not to a city, like Paris, but to a building like St. Paul’s Cathedral or a scientific compound like CERN; socially constructed, but physically there. Real. Scientific ideas are part of a similar complex.

The social construction of historiographic narratives is painfully clear even without co-authorships, in our endless circles of acknowledgements and references. Still, there seems to be a good deal of push-back against explicit collaboration, where the entire academic edifice no longer lies solely in the mind of one historian (if indeed it ever did). In some cases, this push-back is against the epistemological infrastructure that requires the person in personal narrative. In others, it is because without full knowledge of each of the moving parts in a work of scholarship, that work is more prone to failure due to theories or methodologies not adequately aligning.

Building historiography together. via the Smithsonian.

I fear this is a dangerous viewpoint, one that will likely harm both our historiographic research and our cultural relevancy, as other areas of academia become more comfortable with large-scale collaboration. Single authorship for its own sake is as dangerous as collaboration for its own sake, but it has the advantage of being a tradition. We must become comfortable with the hazy middle ground between an unattainable absolute objectivity and an unscalable personal subjectivity, willing to learn how to construct our knowledge as Parisians construct their city. The individual experiences of Parisians are without a doubt interesting and poignant, but it is the combined experiences of the locals and the tourists that makes the city what it is. Moving beyond the small and individual isn’t just getting past the rut of microhistories that historiography is still trying to escape—it is also getting past the rut of individually driven narratives and toward unified collective historiographies. We have to work together.

 

 

Notes:

  1. Tucker, Aviezer. 2004. Our Knowledge of the Past: A Philosophy of Historiography. Cambridge University Press.
  2. Tucker, Aviezer, ed. 2009. A Companion to the Philosophy of History and Historiography. http://www.wiley.com/WileyCDA/WileyTitle/productCd-1405149086.html.
  3. Kuukkanen, Jouni-Matti. 2012. “The Missing Narrativist Turn in the Historiography of Science.” History and Theory 51 (3): 340–363. doi:10.1111/j.1468-2303.2012.00632.x.
  4. of anything besides the geolocations of Flickr pictures, in and of itself not particularly interesting
  5. Milgram, Stanley. 1976. “Psychological Maps of Paris.” In Environmental Psychology: People and Their Physical Settings, ed. Proshansky, Ittelson, and Rivlin, 104–124. New York.
    Milgram, Stanley. 1982. “Cities as Social Representations.” In Social Representations, ed. R. Farr and S. Moscovici, 289–309.
  6. As opposed to bringing us Back to the Future, which would probably be more fun.

Predicting victors in an attention and feedback economy

This post is about computer models and how they relate to historical research, even though it might not seem like it at first. Or at second. Or third. But I encourage anyone who likes history and models to stick with it, because it gets to a distinction of model use that isn’t made frequently enough.

Music in a vacuum

Imagine yourself uninfluenced by the tastes of others: your friends, their friends, and everyone else. It’s an effort in absurdity, but try it, if only to pin down how their interests affect yours. Start with something simple, like music. If you want to find music you like, you might devise a program that downloads random songs from the internet and plays them back without revealing their genre or other relevant metadata, so you can select from that group to get an unbiased sample of songs you like. It’s a good first step, given that you generally find music by word-of-mouth, seeing your friends’ last.fm playlists, listening to what your local radio host thinks is good, and so forth. The music that hits your radar is determined by your social and technological environment, so the best way to break free from this stifling musical determinism is complete randomization.

So you listen to the songs for a while and rank them as best you can by quality, the best songs (Stairway to Heaven, Shine On You Crazy Diamond, I Need A Dollar) at the very top and the worst (Ice Ice Baby, Can’t Touch This, that Korean song that’s been all over the internet recently) down at the bottom of the list. You realize that your list may not be an objective measurement of quality, but it definitely represents a hierarchy of quality to you, which is real enough, and you’re sure that if your best friends from primary school tried the same exercise they’d come up with a fairly comparable order.

Friends don’t let friends share music. via.

Of course, the fact that your best friends would come up with a similar list (but school buddies today or a hundred years ago wouldn’t) reveals another social aspect of musical tastes; there is no ground truth of objectively good or bad music. Musical tastes are (largely) socially constructed 1, which isn’t to say that there isn’t any real difference between good and bad music; it’s just that the evaluative criteria (what aspects of the music are important, and the definitions of ‘good’ and ‘bad’) are continuously being defined and redefined by your social environment. Alice Bell wrote the best short explanation I’ve read in a while on how something can be both real and socially constructed.

There you have it: other people influence what songs we listen to out of the set of good music that’s been recorded, and other people influence our criteria for defining good and bad music to begin with. This little thought experiment goes a surprisingly long way in explaining why computational models are pretty bad at predicting Nobel laureates, best-selling authors, box office winners, pop stars, and so forth. Each category is ostensibly a mark of quality, but is really more like a game of musical chairs masquerading as a meritocracy. 2

Sure, you (usually) need to pass a certain threshold of quality to enter the game, but once you’re there, whether or not you win is anybody’s guess. Winning is a game of chance with your generally equally-qualified peers competing for the same limited resource: membership in the elite. Merton (1968) compared this phenomenon to the French Academy’s “Forty-First Chair,” because while the Academy was limited to only forty members (‘chairs’), there were many more who were also worthy of a seat but didn’t get one when the music stopped: Descartes, Diderot, Pascal, Proust, and others. It was almost literally a game of musical chairs between great thinkers, much in the same way it is today in so many other elite groups.

Musical Chair. via.

Merton’s same 1968 paper described the mechanism that tends to pick the winners and losers, which he called the ‘Matthew Effect,’ but is also known as ‘Preferential Attachment,’ ‘Rich-Get-Richer,’ and all sorts of other names besides. The idea is that you need money to make money, and the more you’ve got the more you’ll get. In the music world, this manifests when a garage band gets a lucky break on some local radio station, which leads to their being heard by a big record label company who releases the band nationally, where they’re heard by even more people who tell their friends, who in turn tell their friends, and so on and so on until the record company gets rich, the band hits the top 40 charts, and the musicians find themselves desperate for a fix and asking for only blue skittles in their show riders. Okay, maybe they don’t all turn out that way, but if it sounds like a slippery slope it’s because it is one. In complex systems science, this is an example of a positive feedback loop, where what happens in the future is reliant upon and tends to compound what happens just before it. If you get a little fame, you’re more likely to get more, and with that you’re more likely to get even more, and so on until Lady Gaga and Mick Jagger.

Rishidev Chaudhuri does a great job explaining this with bunnies, showing that if 10% of rabbits reproduce a year, starting with a hundred, in a year there’d be 110, in two there’d be 121, in twenty-five there’d be a thousand, and in a hundred years there’d be over a million rabbits. Feedback systems (so-named because the past results feed back on themselves to the future) multiply rather than add, with effects increasing exponentially quickly. When books or articles are read, each new citation increases its chances of being read and cited again, until a few scholarly publications end up with thousands or hundreds of thousands of citations when most have only a handful.
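A quick sanity check of those rabbit numbers, assuming simple 10% compound growth per year:

```python
# Compound growth at 10% per year, starting from 100 rabbits.
for years in (1, 2, 25, 100):
    print(years, round(100 * 1.1 ** years))
# 1 110
# 2 121
# 25 1083
# 100 1378061
```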

This effect holds true in Nobel prize-winning science, box office hits, music stars, and many other areas where it is hard to discern between popularity and quality, and the former tends to compound while exponentially increasing the perception of the latter. It’s why a group of musicians who are every bit as skilled as Pink Floyd wind up never selling outside their own city if they don’t get a lucky break, and why two equally impressive books might have such disproportionate citations. Add to that the limited quantity of ‘elite seats’ (Merton’s 40 chairs) and you get a situation where only a fraction of the deserving get the rewards, and sometimes the most deserving go unnoticed entirely.

Different musical worlds

But I promised to talk  about computational models, contingency, and sensitivity to initial conditions, and I’ve covered none of that so far. And before I get to it, I’d like to talk about music a bit more, this time somewhat more empirically. Salganik, Dodds, and Watts (2006; 10.1126/science.1121066) recently performed a study on about 15,000 individuals that mapped pretty closely to the social aspects of musical taste I described above. They bring up some literature suggesting popularity doesn’t directly and deterministically map on to musical proficiency; instead, while quality does play a role, much of the deciding force behind who gets fame is a stochastic (random) process driven by social interactivity. Unfortunately, because history only happened once, there’s no reliable way to replay time to see if the same musicians would reach fame the second time around.

Remember Napster? via.

Luckily Salganik, Dodds, and Watts are pretty clever, so they figured out how to make history happen a few times. They designed a music streaming site for teens which, unbeknownst to the teens but knownst to us, was not actually the same website for everyone who visited. The site asked users to listen to previously unknown songs and rate them, and then gave them an option to download the music.  Some users who went to the site were only given these options, and the music was presented to them in no particular order; this was the control group. Other users, however, were presented with a different view. Besides the control group, there were eight other versions of the site that were each identical at the outset, but could change depending on the actions of its members. Users were randomly assigned to reside in one of these eight ‘worlds,’ which they would come back to every time they logged in, and each of these worlds presented a list of most downloaded songs within that world. That is, Betty listened to a song in world 3, rated it five stars, and downloaded it. Everyone in world 3 would now see that the song had been downloaded once, and if other users downloaded it within that world, the download count would iterate up as expected.
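A bare-bones simulation in the spirit of that design (not the authors’ actual setup, and the numbers are arbitrary) makes the point: every world starts identical, each listener weighs a song’s intrinsic appeal against its current download count, and the chart-toppers still come out different in each world.

```python
import random

random.seed(42)

n_songs, n_worlds, n_listeners = 50, 8, 5000
appeal = [random.random() for _ in range(n_songs)]  # stand-in for 'quality'

def run_world():
    downloads = [0] * n_songs
    for _ in range(n_listeners):
        # Choice weighted by intrinsic appeal plus the social signal
        # of existing downloads: a positive feedback loop.
        weights = [appeal[s] + downloads[s] for s in range(n_songs)]
        song = random.choices(range(n_songs), weights=weights, k=1)[0]
        downloads[song] += 1
    return downloads

for world in range(n_worlds):
    downloads = run_world()
    top = max(range(n_songs), key=lambda s: downloads[s])
    print(f"world {world}: top song = {top} with {downloads[top]} downloads")
```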

The ratings assigned to each song in the control world, where download counts were not visible, were taken to be the independent measure of each song’s quality. As expected, in the eight social influence worlds the most popular songs were downloaded a lot more than the most popular songs in the control world, because of the positive feedback effect of people seeing highly downloaded songs and then listening to and downloading them as well, which in turn increased their popularity even more. It should also come as no surprise that the ‘best’ songs, according to their rating in the independent world, rarely did badly in their download/rating counts in the social worlds, and the ‘worst’ songs under the same criteria rarely did well. But the top songs differed from one social world to the next: the hugely popular hits, with orders of magnitude more downloads than everything else, were completely different in each social world. Their study concludes:

We conjecture, therefore, that experts fail to predict success not because they are incompetent judges or misinformed about the preferences of others, but because when individual decisions are subject to social influence, markets do not simply aggregate pre-existing individual preferences. In such a world, there are inherent limits on the predictability of outcomes, irrespective of how much skill or information one has.

Contingency and sensitivity to initial conditions

In the complex systems terminology, the above is an example of a system that is highly sensitive to initial conditions and contingent (chance) events. It’s similar to that popular chaos theory claim that a butterfly flapping its wings in China can cause a hurricane years later over Florida. It’s not that one inevitably leads to the other; rather, positive feedback loops make it so that very small changes can quickly become huge causal factors in the system as their effects exponentially increase. The nearly-arbitrary decision of a famous author to cite one paper on computational linguistics over another equally qualified one might be the impetus the first paper needs to shoot into its own stardom. The first songs randomly picked and downloaded in each social world of the above music sharing site greatly influenced the eventual winners of the popularity contest disguised as a quality rank.

Some systems are fairly inevitable in their outcomes. If you drop a two-ton stone from five hundred feet, it’s pretty easy to predict where it’ll fall, regardless of butterflies flapping their wings in China or birds or branches or really anything else that might get in the way. The weight and density of the stone are overriding causal forces that pretty much cancel out the little jitters that push it one direction or another. Not so with a leaf: dropped from the same height, we can probably predict it won’t float into space or fall somewhere a few thousand miles away, but beyond that, prediction is really hard because the system is so sensitive to contingent events and initial conditions.

There does exist, however, a set of systems right at the sweet spot between those two extremes; stochastic enough that predicting exactly how it will turn out is impossible, but ordered enough that useful predictions and explanations can still be made. Thankfully for us, a lot of human activity falls in this class.

Tracking Hurricane Ike with models. Notice how short-term predictions are pretty accurate. (Click image to watch this model animated). via.

Nate Silver, the expert behind the political prediction blog fivethirtyeight, published a book a few weeks ago called The Signal and the Noise: why so many predictions fail – but some don’t. Silver has an excellent track record of accurately predicting what large groups of people will do, although I bring him up here to discuss what his new book has to say about the weather. Weather predictions, according to Silver, are “highly vulnerable to inaccuracies in our data.” We understand physics and meteorology well enough that, if we had a powerful enough computer and precise data on environmental conditions all over the world, we could predict the weather with astounding precision. And indeed we do; the National Hurricane Center has become 350% more accurate in the last 25 years alone, giving people two or three day warnings for fairly exact locations with regard to storms. However, our data aren’t perfect, and slightly inaccurate or imprecise measurements abound. These small imprecisions can have huge repercussions in weather prediction models, with a few false measurements sometimes being enough to predict a storm tens or hundreds of miles off course.

To account for this, meteorologists introduce stochasticity into the models themselves. They run the same models tens, hundreds, or thousands of times, but each time they change the data slightly, accounting for where their measurements might be wrong. Run the model once pretending the wind was measured at one particular speed in one particular direction; run the model again with the wind at a slightly different speed and direction. Do this enough times, and you wind up with a multitude of predictions guessing the storm will go in different directions. “These small changes, introduced intentionally in order to represent the inherent uncertainty in the quality of the observational data, turn the deterministic forecast into a probabilistic one.” The most extreme predictions show the furthest a hurricane is likely to travel, but if most runs of the model have the hurricane staying within some small path, it’s a good bet that this is the path the storm will travel.
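In miniature, that ensemble procedure looks something like the sketch below. The “model” is a toy stand-in (nothing a meteorologist would recognize), but the loop is the idea: perturb the uncertain observation, re-run the same deterministic model many times, and read off the spread of outcomes.

```python
import random

random.seed(0)

def toy_model(initial_wind_speed):
    # Deterministic stand-in for a forecast model: small differences in the
    # initial conditions compound a little more every simulated hour.
    position = initial_wind_speed
    for _ in range(72):
        position = position * 1.03 + 0.5
    return position

measured_speed = 20.0  # the single, imperfect observation
assumed_error = 1.5    # assumed standard deviation of that observation

# Re-run the same deterministic model many times with perturbed inputs.
outcomes = sorted(toy_model(random.gauss(measured_speed, assumed_error))
                  for _ in range(1000))

print("median outcome:", outcomes[500])
print("roughly 90% of runs fall between", outcomes[50], "and", outcomes[950])
```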

Silver uses a similar technique when predicting American elections. Various polls show different results from different places, so his models take this into account by running many times and then revealing the spread of possible outcomes; those outcomes which reveal themselves most often might be considered the most likely, but Silver also is careful to use the rest of the outcomes to show the uncertainty in his models and the spread of other plausible occurrences.

Going back to the music sharing site, while the sensitivity of the system would prevent us from exactly predicting the most-popular hits, the musical evaluations of the control world still give us a powerful predictive capacity. We can use those rankings to predict the set of most likely candidates to become hits in each of the worlds, and if we’re careful, all or most of the most-downloaded songs will have appeared in our list of possible candidates.

The payoff: simulating history

Simulating the plague in 19th century Canada. via.

So what do hurricanes, elections, and musical hits have to do with computer models and the humanities, specifically history? The fact of the matter is that a lot of models are abject failures when it comes to their intended use: predicting winners and losers. The best we can do in moderately sensitive systems that have difficult-to-predict positive feedback loops and limited winner space (the French Academy, Nobel laureates, etc.) is to find a large set of possible winners. We might be able to reduce that set so it has fairly high recall and moderate precision (out of a thousand candidates to win 10 awards, we can pick 50, and 9 of the 10 actual winners were in our list of 50). This might not be great betting odds, but it opens the door for a type of history research that’s generally been consigned to the distant and somewhat distasteful realm of speculation. It is closely related to the (too-often scorned) realm of counterfactual history (What if the Battle of Gettysburg had been won by the other side? What if Hitler had never been born?), and is in fact driven by the ability to ask counterfactual questions.
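In information-retrieval terms, that hypothetical works out to high recall but low precision:

```python
# The hypothetical from the text: 1,000 candidates, 10 eventual winners,
# we flag 50 candidates, and 9 of the 10 winners are among them.
flagged, winners, caught = 50, 10, 9

recall = caught / winners     # 0.9  -- we found nearly every winner
precision = caught / flagged  # 0.18 -- but most of our picks didn't win
print(recall, precision)
```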

The type of historiography of which I speak is the question of evolution vs. revolution; is history driven by individual, world-changing events and Great People, or is the steady flow of history predetermined, marching inevitably in some direction with the players just replaceable cogs in the machine? The dichotomy is certainly a false one, but it’s one that has bubbled underneath a great many historiographic debates for some time now. The beauty of historical stochastic models 3 is exactly their propensity to yield likely and unlikely paths, like the examples above. A well-modeled historical simulation 4 can be run many times; if only one or a few runs of the model reveal what we take as the historical past, then it’s likely that set of events was more akin to the ‘revolutionary’ take on historical changes. If the simulation takes the same course every time, regardless of the little jitters in preconditions, contingent occurrences, and exogenous events, then that bit of historical narrative is likely much closer to what we take as ‘inevitable.’
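Schematically, the test is simple, even if building a defensible simulation is anything but (see note 4). Both functions below are hypothetical stand-ins: one for a well-modeled historical simulation, the other for a comparison against the historical evidence.

```python
import random

def simulate_history(seed):
    # Stand-in for a historical simulation: jittered initial conditions and
    # contingent events go in, one possible course of events comes out.
    rng = random.Random(seed)
    return "outcome A" if rng.random() < 0.7 else "outcome B"

def matches_the_record(outcome):
    # Stand-in for checking a simulated outcome against the evidence.
    return outcome == "outcome A"

runs = 10000
matches = sum(matches_the_record(simulate_history(seed)) for seed in range(runs))

print(f"{matches / runs:.0%} of runs resemble the historical record")
# Near 100%: that stretch of history looks 'inevitable' under the model.
# Only a few percent: it looks contingent, closer to the 'revolutionary' take.
```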

Models have many uses, and though many human systems might not be terribly amenable to predictive modeling, it doesn’t mean there aren’t many other useful questions a model can help us answer. The balance between inevitability and contingency, evolution and revolution, is just one facet of history that computational models might help us explore.

Notes:

  1. Music has a biological aspect as well. Most cultures with music tend towards discrete pitches, discernible (discrete) rhythm, ‘octave’-type systems with relatively few notes looping back around, and so forth. This suggests we’re hard-wired to appreciate music within a certain set of constraints, much in the same way we’re hard-wired to see only certain wavelengths of light or to like the taste of certain foods over others (Peretz 2006; doi:10.1016/j.cognition.2005.11.004). These tendencies can certainly be overcome, but to suggest the pre-defined structure of our wet thought-machine plays no role in our musical preferences is about as far-fetched as suggesting it plays the only role.
  2. I must thank Miriam Posner for this wonderful turn of phrase.
  3. presuming the historical data and model specifications are even accurate, which is a whole different can of worms to be opened in a later post
  4. Seriously, see the last note, this is really hard to do. Maybe impossible. But this argument is just assuming it isn’t, for now.