More heavy-handed culturomics

A few days ago, Gao, Hu, Mao, and Perc posted a preprint of their forthcoming article comparing social and natural phenomena. The authors, apparently all engineers and physicists, use the Google Ngrams data to come to the conclusion that “social and natural phenomena are governed by fundamentally different processes.” The take-home message is that words describing natural phenomena increase in frequency at regular, predictable rates, whereas the use of certain socially-oriented words changes in unpredictable ways. Unfortunately, the paper doesn’t necessarily differentiate between words and what they describe.

Specifically, the authors invoke random fractal theory (sort of a descendant of chaos theory) to find regular patterns in 1-grams. A 1-gram is just a single word, and this study looks at how the frequency of certain words grows or shrinks over time. A “Hurst parameter” is found for 24 words, a dozen pertaining to nature (earthquake, fire, etc.), and another dozen “social” words (war, unemployment, etc.). The Hurst parameter (H) is a number which, essentially, reveals whether or not a time series of data is correlated with itself. That is, given a set of observations over the last hundred years, autocorrelated data means the observation for this year will very likely follow a predictable trend from the past.

If H is between 0.5 and 1, that means the dataset has “long-term positive correlation,” which is roughly equivalent to saying that data quite some time in the past will still positively and noticeably affect data today. If H is under 0.5, data are negatively correlated with their past, suggesting that a high value in the past implies a low value in the future, and if H = 0.5, the data likely describe Brownian motion (they are random). H can exceed 1 as well, a point which I’ll get to momentarily.
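To make those ranges a little more concrete, here is a minimal sketch of one common way to estimate a Hurst-like exponent from a yearly series: detrended fluctuation analysis (DFA). This is a simpler estimator swapped in for illustration and is not necessarily the procedure Gao et al. used; the function names, window sizes, and synthetic test series are all assumptions of mine. Under this convention, uncorrelated noise should come out roughly near 0.5, while a running sum of that noise (a nonstationary random walk) should come out well above 1.

```python
import math
import random


def linear_fit(xs, ys):
    """Ordinary least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dfa_exponent(series, window_sizes=(8, 16, 32, 64, 128)):
    """Estimate a Hurst-like scaling exponent with first-order DFA.

    The window sizes are an illustrative assumption, not a recommendation.
    """
    # Step 1: build the "profile", the running sum of deviations from the mean.
    mean = sum(series) / len(series)
    profile = []
    total = 0.0
    for value in series:
        total += value - mean
        profile.append(total)

    log_sizes, log_fluctuations = [], []
    for size in window_sizes:
        times = list(range(size))
        squared_residuals = []
        # Step 2: cut the profile into non-overlapping windows and remove
        # a straight-line trend from each one.
        for start in range(0, len(profile) - size + 1, size):
            window = profile[start:start + size]
            slope, intercept = linear_fit(times, window)
            squared_residuals.append(
                sum((y - (slope * t + intercept)) ** 2
                    for t, y in zip(times, window)) / size
            )
        # Step 3: record the root-mean-square fluctuation at this window size.
        mean_square = sum(squared_residuals) / len(squared_residuals)
        log_sizes.append(math.log(size))
        log_fluctuations.append(0.5 * math.log(mean_square))

    # Step 4: the exponent is the slope of log(fluctuation) against log(size).
    exponent, _ = linear_fit(log_sizes, log_fluctuations)
    return exponent


# Synthetic series only; no n-gram data is used here.
rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(2048)]

walk = []
level = 0.0
for step in noise:
    level += step
    walk.append(level)

print("uncorrelated noise:", round(dfa_exponent(noise), 2))
print("random walk:", round(dfa_exponent(walk), 2))
```

Note that the n-gram series discussed here span only a couple of hundred yearly observations, so any estimate from real data of that length would be considerably noisier than this synthetic example suggests.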

The authors first looked at the frequency of 12 words describing natural phenomena between 1770 and 2007. In each case, H was between 0.5 and 1, suggesting a long-term positive trend in the use of the terms. That is, the use of the term “earthquake” does not fluctuate terribly wildly from year to year; looking at how frequently it was used in the past can reasonably predict how frequently it will be used in the future. The data have a long “memory.”

Natural 1-grams from Gao et al. (2012)

The paper then analyzed 12 words describing social phenomena, with very different results. According to the authors, “social phenomena, apart from rare exceptions, cannot be classified solely as processes with persistent-long range correlations.” For example, the use of the word “war” bursts around World War I and World War II; these are unpredictable moments in the discussion of social phenomena. The way “war” was used in the past was not a good predictor of how “war” would be used around 1915 and 1940, for obvious reasons.

Social 1-grams from Gao et al. (2012)

You may notice that, for many of the social terms, H is actually greater than 1, “which indicates that social phenomena are most likely to be either nonstationary, on-off intermittent, or Levy walk-like process.” Basically, the H parameter alone is not sufficient to describe what’s going on with the data. Nonstationary processes are, essentially, unpredictable. A stationary process can be random, but at least certain statistical properties of that randomness remain persistent. Nonstationary processes don’t have those persistent statistical properties. The authors point out that not all social phenomena will have H >1, citing famine, because it might relate to natural phenomena. They also point out that “the more the social phenomena can be considered recent (unemployment, recession, democracy), the higher their Hurst parameter is likely to be.”

In sum, they found that “The prevalence of long-term memory in natural phenomena [compels them] to conjecture that the long-range correlations in the usage frequency of the corresponding terms is predominantly driven by occurrences in nature of those phenomena,” whereas “it is clear that all these processes [describing social phenomena] are fundamentally different from those describing natural phenomena.” That the social phenomena follow different laws is not unexpected, they say, because they themselves are more complex; they rely on political, economic, and social forces, as well as natural phenomena.

While this paper is exceptionally interesting, and shows a very clever use of fairly basic data (24 one-dimensional variables, just looking at word use per year), it lacks the same sort of nuance that was missing from the original culturomics paper. Namely, in this case, it lacks the awareness that social and natural phenomena are not directly coupled with the words used to describe them, nor with the frequency with which those words are used. The paper suggests that natural and social phenomena are governed by different scaling laws when, realistically, it is the way they are discussed, and how those discussions are published, that are governed by the varying scaling laws. Further, although they used words exemplifying the difference between “nature” and “society,” the two are not always so easily disentangled, either in language or the underlying phenomena.

Perhaps the sort of words used to describe social events change differently than the sort used to describe natural events. Perhaps, because natural phenomena are often immediately felt across vast distances, whereas news of social phenomena can take some time to diffuse, how rapidly some words are discussed may take very different forms. Discussions and word-usage are always embedded in a larger network. Also needing to be taken into account is who is discussing social vs. natural phenomena, and which is more likely to get published and preserved to eventually be scanned by Google Books.

Without a doubt the authors have noticed a very interesting trend, but rather than matching the phenomena directly to words, as they did, we should be using this sort of study to look at how language changes, how people change, and ultimately what relationship people have with the things they discuss and publish. At this point, the engineers and physicists still have a greater comfort with the statistical tools needed to fully utilize the Google Books corpus, but there are some humanists out there already doing absolutely fantastic quantitative work with similar data.

This paper, while impressive, is further proof that the quantitative study of culture should not be left to those with (apparently) little background in the subject. While it is not unlikely that different factors do, in fact, determine the course of natural disasters versus that of human interaction, this paper does not convincingly tease those apart. It may very well be that the language use is indicative of differences in underlying factors in the phenomena described; however, no study is cited suggesting this to be the case. Claims like “social and natural phenomena are governed by fundamentally different processes,” given the above language data, could easily have been avoided, I think, with a short discussion between the authors and a humanist.


Flow and Empty Space

Thirty spokes unite in one nave and on that which is non-existent [on the hole in the nave] depends the wheel’s utility. Clay is moulded into a vessel and on that which is non-existent [on its hollowness] depends the vessel’s utility. By cutting out doors and windows we build a house and on that which is non-existent [on the empty space within] depends the house’s utility. Therefore, existence renders actual but non-existence renders useful.

-Laozi, Tao Te Ching, Suzuki Translation

(NOTE 1: Although it may not seem it from the introduction, this post is actually about humanities research, eventually. Stick with it and it may pay off!)

(NOTE 2: I’ve warned in the past about invoking concepts you know little about; let me be the first to say I know next to nothing about Eastern philosophy or t’ai chi ch’uan, though I do know a bit about emergence and a bit about juggling. This post uses the above concepts as helpful metaphors, fully apologizing to those who know a bit more about the concepts for the butchering of them that will likely ensue.)

The astute reader may have noticed that, besides being a sometimes-historian and a sometimes-data-scientist, the third role I often take on is that of a circus artist. Juggling and prop manipulation have been part of my life for over a decade now, and though I don’t perform as much as I used to, the feeling I get from practicing is still fairly essential in keeping me sane. What juggling provides me that I cannot get elsewhere is what prop manipulators generally call a state of “flow.”

Look! It's me in a candy store!

The concept draws from a positive psychology term developed by Mihály Csíkszentmihályi, and is roughly equivalent to being in “the zone.” Although I haven’t quite experienced it, this feeling apparently comes to programmers working late at night trying to solve a problem. It’s also been described by dancers, puzzle solvers, and pretty much anyone else who gets so into something they feel, if only for a short time, they have totally lost themselves in their activity. A fellow contact juggler, Richard Hartnell, recently filmed a fantastic video describing what flow means to him as a performer. I make no claims here to any meaning behind the flow state. The human brain is complex beyond my understanding, and though I do not ascribe any mystical properties to the experience, having felt “flow” so deeply, I can certainly see why some do treat it as a religious experience.

The most important contribution to my ability to experience this state while juggling was, oddly enough, a t’ai chi ch’uan course. Really, it was one concept from the course, called song kua, “relax the hips,” that truly opened up flow for me. It’s a complex concept, but the part I’d like to highlight here is the relationship between exertion and relaxation, between a push and a pull. When you move your body, that movement generally starts with an intention. I want my hand to move to the right, so I move it to the right. There is, however, another way to move parts of the body, and this is via relaxation. If I’m standing in a certain way, and I relax my hip in one direction, my body will naturally shift in the opposite direction. My body naturally gets pulled one way, rather than me pushing it to go there. In the circus arts, I can now quickly reach a flow state by creating a system between myself and whatever prop I’m using, and allowing the state of that system to pull me to the next state, rather than deliberately pushing myself and my prop in some particular direction. It was, for me, a mind-blowing shift in perspective, and one that had absolutely nothing to do with my academic pursuits until last night, on a short plane ride back from Chicago APA.

In the past two weeks, I’ve been finishing up the first draft of a humanities paper that uses concepts from complex systems and network analysis. In it, I argue (among other things) that there are statistical regularities in human behavior, and that we as historians can use that backdrop as a context against which we can study history, finding actions and events which deviate from the norm. Much recent research has gone into showing that people, on average, behave in certain ways, generally due to constraints placed on us by physics, biology, and society. This is not to say humans are inherently predictable – merely that there are boundaries beyond which certain actions are unlikely or even impossible given the constraints of our system. In the paper, I further go on to suggest that the way we develop our social networks also exhibits regularities across history, and the differences against those regularities, and the mechanisms by which they occur, are historically interesting.

Fast-forward to last night: on the plane ride home, I’m reading a fantastic essay by anthropologist Terrence W. Deacon about the emergence of self-organizing biological systems.[1] In the essay, Deacon attempts to explain why entropy seems to decrease enough to allow, well, Life, The Universe, and Everything, given the second law of thermodynamics. His answer is that there are basins of attraction in the dynamics of most processes which inherently and inevitably produce order. That is, as a chaotic system interacts with itself, there are dynamical states which the system can inhabit which are inherently self-sustaining. After a chaotic system shuffles around for long enough, it will eventually and randomly reach a state that “attracts” toward a self-sustaining dynamical state, and once it falls into that basin of attraction, the system will feed back on itself, remaining in its state, creating apparent order from chaos for a sustained period of time.

Deacon invokes a section of the Tao Te Ching similar to the one quoted above, suggesting that empty or negative space, if constrained properly and possessing the correct qualities, acts as a kind of potential energy. The existence of the walls of a clay pot is what allows it to be a clay pot, but its function rests in the constrained negative space bounded by those walls. In the universe, Deacon suggests, constraints are implicit and temporally sensitive; if only a few state structures are self-sustaining, those states, if reached, will naturally persist. Similar to that basic tenet of natural selection, that which can persist tends to.

The example Deacon first uses is that of a whirlpool forming in the empty space behind a rock in a flowing river.

Consider a whirlpool, stably spinning behind a boulder in a stream. As moving water enters this location it is compensated for by a corresponding outflow. The presence of an obstruction imparts a lateral momentum to the molecules in the flow. The previous momentum is replaced by introducing a reverse momentum imparted to the water as it flows past the obstruction and rushes to fill the comparatively vacated region behind the rock. So not only must excess water move out of the local vicinity at a constant rate; these vectors of perturbed momentum must also be dissipated locally so that energy and water doesn’t build up. The spontaneous instabilities that result when an obstruction is introduced will effectively induce irregular patterns of build-up and dissipation of flow that ‘explore’ new possibilities, and the resulting dynamics tends toward the minimization of the constantly building instabilities. This ‘exploration’ is essentially the result of chaotic dynamics that are constantly self-undermining. To the extent that characteristics of component interactions or boundary conditions allow any degree of regularity to develop (e.g. circulation within a trailing eddy), these will come to dominate, because there are only a few causal architectures that are not self-undermining. This is also the case for semi-regular patterns (e.g. patterns of eddies that repeatedly form and disappear over time), which are just less self-undermining than other configurations.

The flow is not forced to form a whirlpool. This dynamical geometry is not ‘pushed’ into existence, so to speak, by specially designed barriers and guides to the flow. Rather, the system as a whole will tend to spend more time in this semi-regular behaviour because the dynamical geometry of the whirlpool affords one of the few ways that the constant instabilities can most consistently compensate for one another. [Deacon, 2009, emphasis added]

Self-Organizing System

Essentially, when lots of things interact at random, there are some self-organized constraints to their interactions which allow order to arise from chaos. This order may be fleeting or persistent. Rather than using the designed constraint of a clay pot, walls of a room, or spokes around a hub, the constraints to the system arise from the potential in the context of the interactions, and in the properties of the interacting objects themselves.

So what in the world does this have to do with the humanities?

My argument in the above paper was that people naturally interact in certain ways; there are certain basins of attraction, properties of societies that tend to self-organize and persist. These are stochastic regularities; people do not always interact in the same way, and societies do not come to the same end, nor meet their ends in the same fashion. However, there are properties which make social organization more likely, and knowing how societies tend to form, historians can use that knowledge to frame questions and focus studies.

Explicit, data-driven models of the various mechanisms of human development and interaction will allow a more nuanced backdrop against which the actualities of the historical narrative can be studied. Elijah Meeks recently posted, about models,

[T]he beauty of a model is that all of these [historical] assumptions are formalized and embedded in the larger argument…  That formalization can be challenged, extended, enhanced and amended [by more historical research]… Rather than a linear text narrative, the model itself is an argument.

It is striking how seemingly unrelated strands of my life came together last night: the pull and flow of juggling, the bounded ordering of emergent behaviors, and the regularities in human activities. Perhaps this is indicative of the consilience of human endeavors; perhaps it is simply the overactive pattern-recognition circuits in my brain doing what they do best. In any case, even if the relationships are merely loose metaphors, it seems clear that a richer understanding of complexity theory, modeling, and data-driven humanities leading to a more nuanced, humanistic understanding of human dynamics would benefit all. This understanding can help ground the study of history in the Age of Abundance. A balance can be drawn between the uniquely human and individual, on one side, and the statistically regular ordering of systems, on the other; each side needs to be framed in terms of the other. Unfortunately, the dialogue on this topic in the public eye has thus far been dominated by applied mathematicians and statistical physicists who tend not to take into account the insights gained from centuries of qualitative humanistic inquiry. That probably means it’s our job to learn from them, because it seems unlikely that they will try to learn from us.


  1. in The Re-Emergence of Emergence, 2009, edited by Philip Clayton & Paul Davies.

Early Modern Letters Online

Early modern history! Science! Letters! Data! Four of my favoritest things have been combined in this brand new beta release of Early Modern Letters Online from Oxford University.



EMLO (what an adorable acronym, I kind of want to tickle it) is Oxford’s answer to a metadata database (metadatabase?) of, you guessed it, early modern letters. This is pretty much a gold standard metadata project. It’s still in beta, so there are some interface kinks and desirable features not-yet-implemented, but it has all the right ingredients for a great project:

  • Information is free and open; I’m even told it will be downloadable at some point.
  • Developed by a combination of historians (via Cultures of Knowledge) and librarians (via the Bodleian Library) working in tandem.
  • The interface is fast, easy, and includes faceted browsing.
  • Has a fantastic interface for adding your own data.
  • Actually includes citation guidelines thank you so much.
  • Visualizations for at-a-glance understanding of data.
  • Links to full transcripts, abstracts, and hard-copies where available.
  • Lots of other fantastic things.

Sorry if I go on about how fantastic this catalog is – like I said, I love letters so much. The index itself includes roughly 12,000 people, 4,000 locations, 60,000 letters, 9,000 images, and 26,000 additional comments. It is without a doubt the largest public letters database currently available. Between the data being compiled by this group and that of the CKCC in the Netherlands, the Electronic Enlightenment Project at Oxford, Stanford’s Mapping the Republic of Letters project, and R.A. Hatch’s research collection, there will without a doubt soon be hundreds of thousands of letters which can be tracked, read, and analyzed with absolute ease. The mind boggles.

Bodleian Card Catalogue Summaries

Without a doubt, the coolest and most unique feature this project brings to the table is the digitization of the Bodleian Card Catalogue, a fifty-two drawer index-card cabinet filled with summaries of nearly 50,000 letters held in the library, all compiled by the Bodleian staff many years ago. In lieu of full transcriptions, digitizations, or translations, these summary cards are an amazing resource by themselves. Many of the letters in the EMLO collection include these summaries as full-text abstracts.

One of the Bodleian summaries showing Heinsius looking far and wide for primary sources, much like we’re doing right now…

The collection also includes the correspondences of John Aubrey (1,037 letters), Comenius (526), Hartlib (4,589, many including transcripts), Edward Lhwyd (2,139, many including transcripts), Martin Lister (1,141), John Selden (355), and John Wallis (2,002). The advanced search allows you to look for only letters with full transcripts or abstracts available. As someone who’s worked with a lot of letters catalogs of varying qualities, it is refreshing to see this one being upfront about unknown/uncertain values. It would, however, be nice if they included the editor’s best guess of dates and locations, or perhaps inferred locations/dates from the other information available. (For example, if birth and death dates are known, it is likely a letter was not written by someone before or after those dates.)
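As a rough illustration of that last suggestion, here is a minimal sketch of how a plausible date window for an undated letter might be inferred from lifespans. Nothing here reflects EMLO's actual data model: the `Person` class, its field names, and the example people and years are all hypothetical. Using the recipient's lifespan as well as the author's is an extra assumption beyond what the paragraph above proposes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Person:
    """Hypothetical catalog record; unknown years are stored as None."""
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None


def plausible_year_range(author: Person,
                         recipient: Person) -> Tuple[Optional[int], Optional[int]]:
    """Return (earliest, latest) years in which the letter was plausibly written.

    Assumption: a letter is unlikely to predate either party's birth or
    postdate either party's death. Either bound is None when nothing is known.
    """
    births = [p.birth_year for p in (author, recipient) if p.birth_year is not None]
    deaths = [p.death_year for p in (author, recipient) if p.death_year is not None]
    earliest = max(births) if births else None
    latest = min(deaths) if deaths else None
    return earliest, latest


# Invented people and dates, purely for illustration.
writer = Person("Hypothetical Author", birth_year=1600, death_year=1660)
reader = Person("Hypothetical Recipient", birth_year=1620)

print(plausible_year_range(writer, reader))
```

A window like this is only a guess, so a catalog would presumably want to store it separately from dates taken from the letters themselves.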


In the interest of full disclosure, I should note that, much like with the CKCC letters interface, I spent some time working with the Cultures of Knowledge team on visualizations for EMLO. Their group was absolutely fantastic to work with, with impressive resources and outstanding expertise. The result of the collaboration was the integration of visualizations in metadata summaries, the first of which is a simple bar chart showing, for any given individual in the catalog, the number of letters they wrote, received, and were mentioned in per year. Besides being useful for getting an at-a-glance idea of the data, these charts actually proved really useful for data cleaning.

Sir Robert Crane (1604-1643)

In the above screenshot from previous versions of the data, Robert Crane is shown to have been addressed letters in the mid-1650s, several years after his reported death. While these could also have been spotted automatically, there are many instances where a few letters are dated very close to a birth or death date, and they often turn out to be misreported. Visualizations can be great tools for data cleaning as a form of sanity test. This is the new, corrected version of Robert Crane’s page. They are using d3.js, a fantastic JavaScript library for building visualizations.
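For the cases that can be caught automatically, a check along the following lines might do. This is a sketch under stated assumptions rather than anything EMLO runs: the function name, the two-year `margin`, and the letter identifiers and years are invented, and only the 1604 and 1643 lifespan dates come from the chart title above.

```python
def flag_suspect_letters(letters, birth_year, death_year, margin=2):
    """Flag letters dated outside a lifespan, or within `margin` years of its ends.

    `letters` is a list of (letter_id, year) pairs. The margin is an arbitrary
    illustrative threshold for "very close to a birth or death date".
    """
    flags = []
    for letter_id, year in letters:
        if year < birth_year or year > death_year:
            flags.append((letter_id, year, "outside lifespan"))
        elif year - birth_year <= margin or death_year - year <= margin:
            flags.append((letter_id, year, "near boundary"))
    return flags


# Invented letter records checked against the lifespan shown in the chart title.
sample_letters = [("letter-a", 1630), ("letter-b", 1655), ("letter-c", 1642)]

for flagged in flag_suspect_letters(sample_letters, birth_year=1604, death_year=1643):
    print(flagged)
```

The "near boundary" flags are only prompts for a human to look again, which is roughly the job the bar charts do visually.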

Because I can’t do anything with letters without looking at them as a network, I decided to put together some visualizations using Sci2 and Gephi. In both cases, the Sci2 tool was used for data preparation and analysis, and the final network was visualized in GUESS and Gephi, respectively. The first graph shows the network in detail, with edges drawn and names visible for the most “central” correspondents. The second visualization is without edges, with each correspondent clustered according to their place in the overall network, with the most prominent figures in each cluster visible.

Built with Sci2/Guess
Built with Sci2/Gephi

The graphs show us that this is not a fully connected network. There are many islands of one or two letters or a small handful of letters. These can be indicative of a prestige bias in the data. That is, the collection contains many letters from the most prestigious correspondents, and increasingly fewer as the prestige of the correspondent decreases. Put another way, there are many letters from a few, and few letters from many. This is a characteristic shared with power law and other “long tail” distributions. The jumbled community structure at the center of the second graph is especially interesting, and it would be worth comparing these communities against institutions and informal societies at the time. Knowledge of large-scale patterns in a network can help determine what sort of analyses are best for the data at hand. More on this in particular will be coming in the next few weeks.
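Both observations, the islands and the "many letters from a few" pattern, can be checked without drawing anything. The following is a minimal sketch on a toy edge list; it is not how Sci2 or Gephi compute these things, and the function names and the letters themselves are made up.

```python
from collections import Counter, defaultdict


def connected_components(edges):
    """Group correspondents into islands; returns sets of names, largest first."""
    neighbors = defaultdict(set)
    for sender, recipient in edges:
        neighbors[sender].add(recipient)
        neighbors[recipient].add(sender)

    seen = set()
    components = []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first walk outward from an unvisited correspondent.
        component = set()
        stack = [start]
        while stack:
            person = stack.pop()
            if person in component:
                continue
            component.add(person)
            stack.extend(neighbors[person] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)


def letters_per_person(edges):
    """Count how many letters each correspondent sent or received."""
    counts = Counter()
    for sender, recipient in edges:
        counts[sender] += 1
        counts[recipient] += 1
    return counts


# Toy data: one busy "hub" correspondent plus two isolated pairs.
toy_letters = [
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "a"), ("hub", "d"),
    ("x", "y"),
    ("p", "q"),
]

islands = connected_components(toy_letters)
print("island sizes:", [len(island) for island in islands])

counts = letters_per_person(toy_letters)
# How many people have 1 letter, 2 letters, and so on: a crude long-tail check.
print("people per letter count:", sorted(Counter(counts.values()).items()))
```

On real catalog data, a long tail would show up as a large number of people with one or two letters and a very small number with hundreds.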

It’s also worth pointing out these visualizations as another tool for data-checking. You may notice, on the bottom left-hand corner of the first network visualization, two separate Edward Lhwyds with virtually the same networks of correspondence. This meant there were two distinct entities in their database referring to the same individual – a problem which has since been corrected.
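Duplicate records like this can also be hunted for directly, by looking for pairs of entries whose sets of correspondents overlap almost completely. The sketch below uses Jaccard similarity for that; the function name, the thresholds, and the record names are assumptions for illustration, not a description of how the EMLO team found or fixed the problem.

```python
from collections import defaultdict
from itertools import combinations


def likely_duplicates(edges, threshold=0.8, min_shared=3):
    """Find pairs of records whose correspondents overlap heavily.

    Overlap is measured as Jaccard similarity: shared correspondents divided
    by all correspondents of either record. `min_shared` guards against
    trivial matches between records that each have only one or two contacts.
    Both thresholds are arbitrary illustrative choices.
    """
    neighbors = defaultdict(set)
    for sender, recipient in edges:
        neighbors[sender].add(recipient)
        neighbors[recipient].add(sender)

    pairs = []
    for first, second in combinations(sorted(neighbors), 2):
        # Ignore any direct link between the two records themselves.
        first_contacts = neighbors[first] - {second}
        second_contacts = neighbors[second] - {first}
        shared = first_contacts & second_contacts
        if len(shared) < min_shared:
            continue
        score = len(shared) / len(first_contacts | second_contacts)
        if score >= threshold:
            pairs.append((first, second, score))
    return pairs


# Invented records: two entries that talk to exactly the same three people.
toy_edges = [
    ("record_1", "x"), ("record_1", "y"), ("record_1", "z"),
    ("record_2", "x"), ("record_2", "y"), ("record_2", "z"),
    ("other", "x"),
]

print(likely_duplicates(toy_edges))
```

Anything this turns up is a candidate for manual review only, since two genuinely different people can share most of their correspondents.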

More Letters!

Notice that the EMLO site makes it very clear that they are open to contributions. There are many letters datasets out there, some digitized, some still languishing idly on dead trees, and until they are all combined, we will be limited in the scope of the research possible. We can always use more. If you are in any way responsible for an early-modern letters collection, meta-data or full-text, please help by opening that collection up and making it integrable with the other sets out there. It will do the scholarly world a great service, and get us that much closer to understanding the processes underlying scholarly communication in general. The folks at Oxford are providing a great example, and I look forward to watching this project as it grows and improves.


Alchemy, Text Analysis, and Networks! Oh my!

“Newton wrote and transcribed about a million words on the subject of alchemy.”


Besides bringing us things like calculus, universal gravitation, and perhaps the inspiration for certain Pink Floyd albums, Isaac Newton spent many years researching what was then known as “chymistry,” a multifaceted precursor to, among other things, what we now call chemistry, pharmacology, and alchemy.

Pink Floyd and the Occult: Discuss.

Researchers at Indiana University, notably William R. Newman, John A. Walsh, Dot Porter, and Wallace Hooper, have spent the last several years developing The Chymistry of Isaac Newton, an absolutely wonderful history of science resource which, as of this past month, has digitized all 59 of Newton’s alchemical manuscripts assembled by John Keynes in 1936. Among the site’s features are heavily annotated transcriptions, manuscript images, often scholarly synopses, and examples of alchemical experiments. That you can try at home. That’s right, you can do alchemy with this website. They also managed to introduce alchemical symbols into Unicode (U+1F700 – U+1F77F), which is just indescribably cool.

Alchemical experiments at home!

What I really want to highlight, though, is a brand new feature introduced by Wallace Hooper: automated Latent Semantic Analysis (LSA) of the entire corpus. For those who are not familiar with it, LSA is somewhat similar to LDA, the algorithm driving the increasingly popular Topic Models used in Digital Humanities. They both have their strengths and weaknesses, but essentially what they do is show how documents and terms relate to one another.

Newton Project LSA

In this case, the entire corpus of Newton’s alchemical texts is fed into the LSA implementation (try it for yourself), and then based on the user’s preferences, the algorithm spits out a network of terms, documents, or both together. That is, if the user chooses document-document correlations, a list is produced of the documents that are most similar to one another based on similar word use within them. That list includes weights – how similar are they to one another? – and those weights can be used to create a network of document similarity.
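The general shape of that pipeline, from texts to a weighted list to a network file, can be sketched in a few lines. To be clear about what is swapped in: this uses plain bag-of-words cosine similarity and skips the dimensionality-reduction step that makes LSA what it is, so it is a simplified stand-in rather than the site's implementation. The function names, the `min_weight` cutoff, and the three tiny "documents" are all invented.

```python
import csv
import io
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(first: Counter, second: Counter) -> float:
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(first[term] * second[term] for term in first.keys() & second.keys())
    norm = (math.sqrt(sum(v * v for v in first.values()))
            * math.sqrt(sum(v * v for v in second.values())))
    return dot / norm if norm else 0.0


def similarity_edges(documents, min_weight=0.1):
    """Return (doc_a, doc_b, weight) for every sufficiently similar pair, strongest first."""
    vectors = {name: Counter(text.lower().split()) for name, text in documents.items()}
    edges = []
    for first, second in combinations(sorted(vectors), 2):
        weight = cosine_similarity(vectors[first], vectors[second])
        if weight >= min_weight:
            edges.append((first, second, round(weight, 3)))
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


def edges_to_csv(edges) -> str:
    """Serialize the weighted edge list as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["source", "target", "weight"])
    writer.writerows(edges)
    return buffer.getvalue()


# Made-up stand-in texts, not transcriptions of any manuscript.
toy_documents = {
    "doc_a": "mercury sulphur salt mercury",
    "doc_b": "mercury sulphur furnace",
    "doc_c": "prophecy temple",
}

toy_edges = similarity_edges(toy_documents)
print(edges_to_csv(toy_edges))
```

Each row of the resulting CSV is one weighted edge, which is the form most network tools will accept as an import.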

Similar Documents using LSA

One of the really cool features of this new service is that it can export the network either as CSV for the technical among us, or as an nwb file to be loaded into the Network Workbench or the Sci² Tool. From there, you can analyze or visualize the alchemical networks, or you can export the files into a network format of your choice.

Network of how Newton’s alchemical documents relate to one another, visualized using NWB.

It’s great to see more sophisticated textual analyses being automated and actually used. Amber Welch recently posted on Moving Beyond the Word Cloud using the wonderful TAPoR, and Michael Widner just posted a thought-provoking article on using Voyeur Tools for the process of paper revision. With tools this easy to use, it won’t be long now before the first thing a humanist does when approaching a text (or a million texts) is to glance at all the high-level semantic features and various document visualizations before digging in for the close read.


#humnets paper/review

UCLA’s Networks and Network Analysis for the Humanities this past weekend did not fail to impress. Tim Tangherlini and his mathemagical imps returned in true form, organizing a really impressively realized (and predictably jam-packed) conference that left the participants excited, exhausted, enlightened, and unanimously shouting for more next year (and the year after, and the year after that, and the year after that…) I cannot thank the ODH enough for facilitating this and similar events.

Some particular highlights included Graham Sack’s exceptionally robust comparative analysis of a few hundred early English novels (watch out for him, he’s going to be a Heavy Hitter), Sarah Horowitz’s really convincing use of epistolary network analysis to weave the importance of women (specifically salonières) in holding together the fabric of French high society, Rob Nelson’s further work on the always impressive Mining the Dispatch, Peter Leonard’s thoughtful and important discussion on combining text and network analysis (hint: visuals are the way to go), Jon Kleinberg’s super fantastic wonderful keynote lecture, Glen Worthey’s inspiring talk about not needing All Of It, Russell Horton’s rhymes, Song Chen’s rigorous analysis of early Asian family ties, and, well, everyone else’s everything else.

Especially interesting were the discussions, raised most particularly by Kleinberg and Hoyt Long, about what exactly we were looking at when we constructed these networks. The union of so many subjective experiences surely is not the objective truth, but neither is it a proxy of objective truth – what, then, is it? I’m inclined to say that this Big Data aggregated from individual experiences provides us a baseline subjective reality that provides us local basins of attraction; that is, trends we see are measures of how likely a certain person will experience the world in a certain way when situated in whatever part of the network/world they reside. More thought and research must go into what the global and local meanings of this Big Data are, and that work will definitely reveal very interesting results.


My talk on bias also seemed to stir some discussion. I gave up counting how many participants looked at me during their presentations and said “and of course the data is biased, but this is preliminary, and this is what I came up with and what justifies that conclusion.” And of course the issues I raised were not new; further, everybody in attendance was already aware of them. What I hoped my presentation would inspire, and it seems to have been successful, was the open discussion of data biases, and of the constraints they put on conclusions, within the context of the presentation of those conclusions.

Some of us were joking that the issues of bias mean “you don’t know, you can’t ever know what you don’t know, and you should just give up now.” This is exactly opposite to the point. As long as we’re open and honest about what we do not or cannot know, we can make claims around those gaps, inferring and guessing where we need to, and let the reader decide whether our careful analysis and historical inferences are sufficient to support the conclusions we draw. Honesty is more important than completeness or unshakable proof; indeed, neither of those is yet possible in most of what we study.


There was some twittertalk surrounding my presentation, so here’s my draft/notes for anyone interested (click ‘continue reading’ to view):

#humnets preview

Last year, Tim Tangherlini and his magical crew of folkloric imps and applied mathematicians put together a most fantastic and exhausting workshop on networks and network analysis in the humanities. We called it #humnets for short. The workshop (one of the oh-so-fantastic ODH Summer Institutes) spanned two weeks, bringing together forward-thinking humanists and Big Deals in network science and computer science. Now, a year and a half later, we’re all reuniting (bouncing back?) at UCLA to show off all the fantastic network-y humanist-y projects we’ve come up with in the interim.

As of a few weeks ago, I was all set to present my findings from analyzing and modeling the correspondence networks of early-modern scholars. Unfortunately (for me, but perhaps fortunately for everyone else), some new data came in that Changed Everything and invalidated many of my conclusions. I was faced with a dilemma: present my research as it was before I learned about the new data (after all, it was still a good example of using networks in the humanities), or retool everything to fit the new data.

Unfortunately, there was no time to do the latter, and doing the former felt icky and dishonest. In keeping with Tony Beaver’s statement at UCLA last year (“Everything you can do I can do meta”), I ultimately decided to present a paper on precisely the problem that foiled my presentation: systematic bias. Bias need not be an issue of methodology; you can do everything right methodologically, you can design a perfect experiment, and a systematic bias can still thwart the accuracy of a project. The bias can be due to the available observable data itself (external selection bias), it may be due to how we as researchers decide to collect that data (sample selection bias), or it may be due to how we decide to use the data we’ve collected (confirmation bias).

There is a small but growing body of literature on the effects of bias on network analysis. I’ll refer to it briefly in my talk at UCLA, but below is a list of the best references I’ve found on the matter. Most of them deal with sample selection bias, and none of them deal with the humanities.

For those of you who’ve read this far, congratulations! Here’s a preview of my Friday presentation (I’ll post the notes on Friday).



Effects of bias on network analysis condensed bibliography:

  • Achlioptas, Dimitris, Aaron Clauset, David Kempe, and Cristopher Moore. 2005. On the bias of traceroute sampling. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, 694. ACM Press. doi:10.1145/1060590.1060693.
  • ———. 2009. “On the bias of traceroute sampling.” Journal of the ACM 56 (June 1): 1-28. doi:10.1145/1538902.1538905.
  • Costenbader, Elizabeth, and Thomas W Valente. 2003. “The stability of centrality measures when networks are sampled.” Social Networks 25 (4) (October): 283-307. doi:10.1016/S0378-8733(03)00012-1.
  • Gjoka, M., M. Kurant, C. T Butts, and A. Markopoulou. 2010. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. In 2010 Proceedings IEEE INFOCOM, 1-9. IEEE, March 14. doi:10.1109/INFCOM.2010.5462078.
  • Gjoka, Minas, Maciej Kurant, Carter T Butts, and Athina Markopoulou. 2011. “Practical Recommendations on Crawling Online Social Networks.” IEEE Journal on Selected Areas in Communications 29 (9) (October): 1872-1892. doi:10.1109/JSAC.2011.111011.
  • Golub, B., and M. O. Jackson. 2010. “From the Cover: Using selection bias to explain the observed structure of Internet diffusions.” Proceedings of the National Academy of Sciences 107 (June 3): 10833-10836. doi:10.1073/pnas.1000814107.
  • Henzinger, Monika R., Allan Heydon, Michael Mitzenmacher, and Marc Najork. 2000. “On near-uniform URL sampling.” Computer Networks 33 (1-6) (June): 295-308. doi:10.1016/S1389-1286(00)00055-4.
  • Kim, P.-J., and H. Jeong. 2007. “Reliability of rank order in sampled networks.” The European Physical Journal B 55 (February 7): 109-114. doi:10.1140/epjb/e2007-00033-7.
  • Kurant, Maciej, Athina Markopoulou, and P. Thiran. 2010. On the bias of BFS (Breadth First Search). In Teletraffic Congress (ITC), 2010 22nd International, 1-8. IEEE, September 7. doi:10.1109/ITC.2010.5608727.
  • Lakhina, Anukool, John W. Byers, Mark Crovella, and Peng Xie. 2003. Sampling biases in IP topology measurements. In INFOCOM 2003. Twenty-Second Annual Joint Conference of the IEEE Computer and Communications. IEEE Societies, 1:332-341. IEEE, April 30. doi:10.1109/INFCOM.2003.1208685.
  • Latapy, Matthieu, and Clemence Magnien. 2008. Complex Network Measurements: Estimating the Relevance of Observed Properties. In IEEE INFOCOM 2008. The 27th Conference on Computer Communications, 1660-1668. IEEE, April 13. doi:10.1109/INFOCOM.2008.227.
  • Maiya, Arun S. 2011. Sampling and Inference in Complex Networks. Chicago: University of Illinois at Chicago, April.
  • Pedarsani, Pedram, Daniel R. Figueiredo, and Matthias Grossglauser. 2008. Densification arising from sampling fixed graphs. In Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 205. ACM Press. doi:10.1145/1375457.1375481.
  • Stumpf, Michael P. H., Carsten Wiuf, and Robert M. May. 2005. “Subnets of scale-free networks are not scale-free: Sampling properties of networks.” Proceedings of the National Academy of Sciences of the United States of America 102 (12) (March 22): 4221-4224. doi:10.1073/pnas.0501179102.
  • Stutzbach, Daniel, Reza Rejaie, Nick Duffield, Subhabrata Sen, and Walter Willinger. 2009. “On Unbiased Sampling for Unstructured Peer-to-Peer Networks.” IEEE/ACM Transactions on Networking 17 (2) (April): 377-390. doi:10.1109/TNET.2008.2001730.


Effects of selection bias on historical/sociological research condensed bibliography:

  • Berk, Richard A. 1983. “An Introduction to Sample Selection Bias in Sociological Data.” American Sociological Review 48 (3) (June 1): 386-398. doi:10.2307/2095230.
  • Bryant, Joseph M. 1994. “Evidence and Explanation in History and Sociology: Critical Reflections on Goldthorpe’s Critique of Historical Sociology.” The British Journal of Sociology 45 (1) (March 1): 3-19. doi:10.2307/591521.
  • ———. 2000. “On sources and narratives in historical social science: a realist critique of positivist and postmodernist epistemologies.” The British Journal of Sociology 51 (3) (September 1): 489-523. doi:10.1111/j.1468-4446.2000.00489.x.
  • Duncan Baretta, Silvio R., John Markoff, and Gilbert Shapiro. 1987. “The selective Transmission of Historical Documents: The Case of the Parish Cahiers of 1789.” Histoire & Mesure 2: 115-172. doi:10.3406/hism.1987.1328.
  • Goldthorpe, John H. 1991. “The Uses of History in Sociology: Reflections on Some Recent Tendencies.” The British Journal of Sociology 42 (2) (June 1): 211-230. doi:10.2307/590368.
  • ———. 1994. “The Uses of History in Sociology: A Reply.” The British Journal of Sociology 45 (1) (March 1): 55-77. doi:10.2307/591525.
  • Jensen, Richard. 1984. “Review: Ethnometrics.” Journal of American Ethnic History 3 (2) (April 1): 67-73.
  • Kosso, Peter. 2009. Philosophy of Historiography. In A Companion to the Philosophy of History and Historiography, 7-25.
  • Kreuzer, Marcus. 2010. “Historical Knowledge and Quantitative Analysis: The Case of the Origins of Proportional Representation.” American Political Science Review 104 (02): 369-392. doi:10.1017/S0003055410000122.
  • Lang, Gladys Engel, and Kurt Lang. 1988. “Recognition and Renown: The Survival of Artistic Reputation.” American Journal of Sociology 94 (1) (July 1): 79-109.
  • Lustick, Ian S. 1996. “History, Historiography, and Political Science: Multiple Historical Records and the Problem of Selection Bias.” The American Political Science Review 90 (3): 605-618. doi:10.2307/2082612.
  • Mariampolski, Hyman, and Dana C. Hughes. 1978. “The Use of Personal Documents in Historical Sociology.” The American Sociologist 13 (2) (May 1): 104-113.
  • Murphey, Murray G. 1973. Our Knowledge of the Historical Past. Macmillan Pub Co, January.
  • Murphey, Murray G. 1994. Philosophical foundations of historical knowledge. State Univ of New York Pr, July.
  • Rubin, Ernest. 1943. “The Place of Statistical Methods in Modern Historiography.” American Journal of Economics and Sociology 2 (2) (January 1): 193-210.
  • Schatzki, Theodore. 2006. “On Studying the Past Scientifically.” Inquiry 49 (4) (August): 380-399. doi:10.1080/00201740600831505.
  • Wellman, Barry, and Charles Wetherell. 1996. “Social network analysis of historical communities: Some questions from the present for the past.” The History of the Family 1 (1): 97-121. doi:10.1016/S1081-602X(96)90022-6.