Occasionally, in computer science, the term “halting condition” is thrown around as the point at which the program should stop running.
Say I’ve got a robot that watches my roommate and me play Scrabble, and I want it to count how many pieces we use and tell us who won and what the highest-scoring word was. Unfortunately, let’s say, I’m also Superman, so our Scrabble games frequently end early when I hear cries for help and run off to the nearest phone booth. Our robot has to decide what conditions mean the game is over so it can give us the winner report; in this case, it is either when one player runs out of pieces, or when nobody plays a piece for a significant amount of time (because games often end early). Those are our halting conditions.
When it comes to data collection, humanists have no halting conditions. We don’t even have decent halting heuristics. Lisa Rhody just blogged a fantastically important piece about the difficulties of data collection in the humanities, and her points are worth stressing. “You need to know,” Rhody writes, “when it’s time to cut the rope and release what might be done.” She points out that humanists need to be discerning in what data we do collect, and we need to be comfortable with analyzing and releasing imperfect data. “The decision not to be perfect is the right choice, but it isn’t an easy one.”
Many (but not all!) of the natural sciences have it easy. You design an experiment, you get the data you planned to get, then you analyze and release it. The halting conditions, when to stop collecting and cleaning data, are usually fairly easily pre-determined and stuck to. Psychology and the social sciences are usually similar; they often either use data that already exists, or else collect it themselves under pre-specified conditions.
The humanities, well… we’re used to a tradition that involves very deep and particular reading. The tiniest stones of our studied objects do not go unturned. The idea that a first pass, an incomplete pass, can lead to anything at all, let alone analysis and release, is almost anathema to the traditional humanistic mindset.
Herein lies the problem of humanities big data. We’re trying to measure the length of a coastline by sitting on the beach with a ruler, rather than flying over with a helicopter and a camera. And humanists know that, like the sandy coastline shifting with the tides, our data are constantly changing with each new context or interpretation. Cartographers are aware of this problem, too, but they’re still able to make fairly accurate maps.
While I won’t suggest that humanists should take a more natural-scientific approach to research, beginning with a specific hypothesis and pre-specified data that could either confirm or deny it, we should look to them for inspiration on how to plan research. Thinking about what sort of specific analyses you’d like to perform with the data at the end can reasonably constrain what you try to collect from the beginning. Think about what bits of data are redundant, or would yield diminishing returns on your time and money investment of data collection.
Being Comfortable With Imperfection
In her blog post, Lisa wrote about her experience at MITH. She had a four month fellowship to research 4,500 poems; she could easily have spent the whole time collecting increasingly minute data about each poem. In the end, she settled on only collecting the gender of the poet and whether the poem pertained to a work of art, opting not to include information like when each poem was published, what work of art it referred to, etc. She would then go in later and use other large-scale analytic tools (like text analysis), augmenting those results with the tags she entered about each poem.
A lot of valuable, rich information was lost in this data collection, but the important thing is that Lisa was still able to go in with a specific question, and collect only that which she needed most to explore it. The data may not have been perfect, and they may not have described everything, but they were sufficient and useful.
Her story reminded me a lot of my undergraduate years. I spent all of them collecting data on early modern letters for my old advisor. Letters, of course, generally have various locations and dates attached to them, and this presented us with no end of problems. Sometimes the places mentioned were cities, or houses, or states; granularities differed. Over the course of two hundred years, cities would change names, move, or wink out of or into existence entirely. Sometimes they would be subsumed into new or different empires. Computers, unfortunately, need fairly regularized data to perform comparative analyses, so we had to make a lot of editorial decisions when entering locations that would make answering our questions easier, but would lose some of the nuance otherwise available.
Similarly, my colleague Jeana Jorgensen recently spent several months painstakingly hand-collecting data about the usage of body parts in fairy tales for her dissertation. Of particular interest in her case was the overtly interpretive layer she added to the collection; for example, did a reference somehow embody the “grotesque?” By allowing herself the freedom to use interpretive frameworks, she embraced the subjective nature of data collection, and was able to analyze her data accordingly.
Of course, by allowing this sort of humanistic nuance, the amount of data one could collect for any single sentence is effectively infinite, and so Jeana had to constrain herself to collecting only that which she could eventually use. It still took her months of daily collection, but had she tried to make her data perfect or complete, it would have taken more than a lifetime. She nevertheless managed to produce really interesting and thoughtful results for her dissertation.
Perfect or complete data is impossible in the humanities. The goal is not to collect as much as we can, but as much as we need. There is a point of diminishing return for data collection: that point at which you can’t measure the coastline fast enough before the tides change it. We as humanists have to become comfortable with incompleteness and imperfection, and trust that in aggregate those data can still tell us something, even if they can’t reveal everything.
The trick and art is knowing the right halting conditions. How much is too much? What data will actually be useful? These are not easy questions, and their answers differ for every project. The important thing to remember is to just do it. Too many projects get hung up because they haven’t quite collected enough yet, or because a few more months of cleaning would make the data so much better. There will never be a point when your data are perfect. Do your analysis now, release it, and be comfortable with the fact that you’ve fairly accurately mapped the coastline, even if you haven’t quite worked out the jitters of the tides.
So apparently yesterday was a big day for hypothesis testing and discovery. Stanley Fish’s third post on Digital Humanities also brought up the issue of fishing for correlations, although his post was… slightly more polemical. Rather than going over it on this blog, I’ll let Ted Underwood describe it. Anybody who read my post on Avoiding Traps should also read Underwood’s post; it highlights the role of discovery in the humanities as a continuous process of appraisal and re-appraisal, both on the quantitative and qualitative side.
…the significance of any single test is reduced when it’s run as part of a large battery.
That’s a valid observation, but it’s also a problem that people who do data mining are quite self-conscious about. It’s why I never stop linking to this xkcd comic about “significance.” And it’s why Matt Wilkens (targeted by Fish as an emblem of this interpretive sin) goes through a deliberately iterative process of first framing hypotheses about nineteenth-century geographical imagination and then testing them more stringently. (For instance, after noticing that coastal states initially seem more prominent in American fiction than the Midwest, he tests whether this remains true after you compensate for differences in population size, and then proposes a hypothesis that he suggests will need to be confirmed by additional “test cases.”)
It’s important to keep in mind that Reichenbach’s old distinction between discovery and justification is not so clear-cut as it was originally conceived. How we generate our hypotheses, and how we support them to ourselves and the world at large, is part of the ongoing process of research. In my last post, I suggested people keep clear ideas of what they plan on testing before they begin testing; let me qualify that slightly. One of the amazing benefits of Big Data has been the ability to spot trends we were not looking for; an unexpected trend in the data can lead us to a new hypothesis, one which might be fruitful and interesting. The task, then, is to be clever enough to devise further tests to confirm the hypothesis in a way that isn’t circular, relying on the initial evidence that led you toward it.
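One practical way to keep that confirmation non-circular is to split the data before exploring it: dredge one half as freely as you like, then test whatever hypotheses emerge against the untouched half. Here is a minimal sketch in Python; the dataset and its filename are hypothetical placeholders, not anything from the projects discussed above.

```python
# A minimal sketch of a non-circular workflow: explore one half of the
# data to generate hypotheses, then test them only on the held-out half.
# "letters.csv" and its contents are hypothetical placeholders.
import pandas as pd

letters = pd.read_csv("letters.csv")              # hypothetical dataset
explore = letters.sample(frac=0.5, random_state=42)
confirm = letters.drop(explore.index)             # held-out confirmation set

# Browse, plot, and dredge `explore` as freely as you like; any hypothesis
# it suggests is then tested once, and only once, on `confirm`.
```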
… I like books with pictures. When I started this blog, I promised myself I’d have a picture in every post. I can’t think of one that’s relevant, so here’s an angry cupcake:
We have the advantage of arriving late to the game.
In the cut-throat world of high-tech venture capitalism, the first company with a good idea often finds itself at the mercy of latecomers. The latecomer’s product might be better-thought-out, advertised to a more appropriate market, or simply prettier, but in each case that improvement comes through hindsight. Trailblazers might get there first, but their going is slowest, and their way the most dangerous.
Digital humanities finds itself teetering on the methodological edge of many existing disciplines, boldly going where quite a few have gone before. When I’ve blogged before about the dangers of methodology appropriation, it was in the spirit of guarding against our misunderstanding of foundational aspects of various methodologies. This post is instead about avoiding the monsters already encountered (and occasionally vanquished) by other disciplines.
Everything Old Is New Again
A collective guffaw probably accompanied my defining digital humanities as a “new” discipline. Digital humanities itself has a rich history dating back to big iron computers in the 1950s, and the humanities in general, well… they’re old. Probably older than my grandparents.
The important point, however, is that we find ourselves in a state of re-definition. While this is not the first time, and it certainly will not be the last, this state is exceptionally useful in planning against future problems. Our blogosphere cup overfloweth with definitions of and guides to the digital humanities, many of our journals are still in their infancy, and our curricula are over-ready for massive reconstruction. Generally (from what I’ve seen), everyone involved in these processes is really excited and open to new ideas, which should ease the process of avoiding monsters.
Most of the below examples, and possible solutions, are drawn from the same issues of bias I’ve previously discussed. Also, the majority are meta-difficulties. While some of the listed dangers are avoidable when writing papers and doing research, most are systematic at the level of the discipline. That is, despite any researcher’s best efforts, the aggregate knowledge we gain while reading the newest exciting articles might fundamentally mislead us. While these dangers have never been wholly absent from the humanities, our recent love of big data profoundly increases their effect sizes.
An architect from Florida might not be great at designing earthquake-proof housing, and while earthquakes are still a distant danger, this shouldn’t really affect how he does his job at home. If the same architect moves to California, odds are he’ll need to learn some extra precautions. The same is true for a digital humanist attempting to make inferences from lots of data, or from a bunch of studies which all utilize lots of data. Traditionally, when looking at the concrete and particular, evidence for something is necessary and (with enough evidence) sufficient to believe in that thing. In aggregate, evidence for is necessary but not sufficient to identify a trend, because that trend may be dwarfed by or correlated to some other data that are not available.
The lessons below are not all applicable to DH as it exists today, and of course we need to adapt them to our own research (their meaning changes in light of our different materials of study), but they are still worth pointing out and, perhaps, guarding against. Many traditional sciences still struggle with these issues due to institutional inertia. Their journals have worked a certain way for so long, so why change now? Their tenure processes have worked a certain way for so long, so why change now? We’re already restructuring, and a great many of our rules are still in flux, so we can change it now.
Anyway, I’ve been dancing around the examples for way too long, so here’s the meat:
Sampling and Selection Bias
The problem here is actually two-fold, both for the author of a study, and for the reader of several studies. We’ll start with the author-centric issues.
Sampling and Selection Bias in Experimental Design
People talk about sampling and selection biases in different ways, but for the purpose of this post we’ll use Wikipedia’s definition:
Selection bias is a statistical bias in which there is an error in choosing the individuals or groups to take part in a scientific study.
A distinction, albeit not universally accepted, of sampling bias [from selection bias] is that it undermines the external validity of a test (the ability of its results to be generalized to the rest of the population), while selection bias mainly addresses internal validity for differences or similarities found in the sample at hand. In this sense, errors occurring in the process of gathering the sample or cohort cause sampling bias, while errors in any process thereafter cause selection bias.
In this case, we’ll say a study exhibits a sampling error if the conclusions drawn from the data at hand, while internally valid, do not actually hold true for the world around it. Let’s say I’m analyzing the prevalence of certain grievances in the cahiers de doléances from the French Revolution. One study showed that, of all the lists written, those from urban areas were significantly more likely to survive to today. Any content analysis I perform on those surviving lists will be biased toward the grievances of people from urban areas, because my sample is not representative. Conclusions I draw about grievances in general will be inaccurate, unless I explicitly take into account which sorts of documents I’m missing.
Selection bias can be insidious, and many varieties can be harder to spot than sampling bias. I’ll discuss two related phenomena of selection bias which lead to false positives, those pesky statistical effects which leave us believing we’ve found something exciting when all we really have is hot air.
The first issue, probably the most relevant to big-data digital humanists, is data dredging. When you have a lot of data (and increasingly more of us have just that), it’s very tempting to just try to find correlations between absolutely everything. In fact, as exploratory humanists, that’s what we often do: get a lot of stuff, try to understand it by looking at it from every angle, and then write anything interesting we notice. This is a problem. The more data you have, the more statistically likely it is that it will contain false-positive correlations.
Google has lots of data, let’s use them as an example! We can look at search frequencies over time to try to learn something about the world. For example, people search for “Christmas” around and leading up to December, but that search term declines sharply once January hits. Comparing that search with searches for “Santa”, we see the two results are pretty well correlated, with both spiking around the same time. From that, we might infer that the two are somehow related, and would do some further studies.
Unfortunately, Google has a lot of data, and a lot of searches, and if we just looked for every search term that correlated well with any other over time, well, we’d come up with a lot of nonsense. Apparently searches for “losing weight” and “2 bedroom” are 93.6% correlated over time. Perhaps there is a good reason, perhaps there is not, but this is a good cautionary tale that the more data you have, the more seemingly nonsensical correlations will appear. It is then very easy to cherry pick only the ones that seem interesting to you, or which support your hypothesis, and to publish those.
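To see how easily this happens, here is a small, purely illustrative simulation in Python: generate a few hundred completely random “search histories” and count how many pairs correlate strongly by chance alone. The numbers and threshold are my own invented example, not Google’s data.

```python
# A toy illustration of data dredging: with enough unrelated time series,
# some pairs will correlate strongly by chance alone.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
series = rng.normal(size=(200, 52))   # 200 random "search terms", 52 weeks each

correlations = [abs(np.corrcoef(series[a], series[b])[0, 1])
                for a, b in combinations(range(200), 2)]

print(sum(c > 0.4 for c in correlations), "of", len(correlations),
      "pure-noise pairs correlate above 0.4")
print("strongest spurious correlation:", round(max(correlations), 3))
```

Dozens of those 19,900 noise pairs will clear the bar, and the strongest of them can look downright convincing; that is the cherry picker’s orchard.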
The other type of selection bias leading to false positives I’d like to discuss is cherry picking. This is selective use of evidence, cutting data away until the desired hypothesis appears to be the correct one. The humanities, not really known for their hypothesis testing, are not quite as likely to be bothered by this issue, but it’s still something to watch out for. This is also related to confirmation bias, the tendency for people to only notice evidence for that which they already believe.
Much like data dredging, cherry picking is often done without the knowledge or intent of the researcher. It arises out of what Simmons, Nelson, and Simonsohn (2011) call researcher degrees of freedom. Researchers often make decisions on the fly:
Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?
The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding [that is significant] is [itself necessarily significant]. This exploratory behavior is not the by-product of malicious intent, but rather the result of two factors: (a) ambiguity in how best to make these decisions and (b) the researcher’s desire to find a statistically significant result.
When faced with decisions of how to proceed with analysis, we will almost invariably (and inadvertently) favor the decision that results in our hypothesis seeming more plausible.
If I go into my favorite dataset (The Republic of Letters!) trying to show that Scholar A was very similar to Scholar B in many ways, odds are I could do that no matter who the scholars were, so long as I had enough data. If you take a cookie-cutter to your data, don’t be surprised when cookie-shaped bits come out the other side.
Sampling and Selection Bias in Meta-Analysis
There are copious examples of problems with meta-analysis. Meta-analysis is, essentially, a quantitative review of studies on a particular subject. For example, a medical meta-analysis could review data from hundreds of small studies testing the side-effects of a particular medicine, bringing them all together and drawing new or more certain conclusions via the combination of data. Sometimes these are done to gain a larger sample size, or to show how effects change across different samples, or to provide evidence that one non-conforming study was indeed a statistical anomaly.
A meta-analysis is the quantitative alternative to something every one of us in academia does frequently: read a lot of papers or books, find connections, draw inferences, explore new avenues, and publish novel conclusions. Because quantitative meta-analysis is so similar to what we do, we can use the problems it faces to learn more about the problems we face, but which are more difficult to see. A criticism oft-lobbed at meta-analyses is that of garbage in – garbage out; the data used for the meta-analysis is not representative (or otherwise flawed), so the conclusions as well are flawed.
There are a number of reasons why the data in might be garbage, some of which I’ll cover below. It’s worth pointing out that the issues above (cherry-picking and data dredging) also play a role, because if the majority of studies are biased toward larger effect sizes, then the overall perceived effect across papers will appear systematically larger. This is not only true of quantitative meta-analysis; when every day we read about trends and connections that may not be there, no matter how discerning we are, some of those connections will stick and our impressions of the world will be affected. Correlation might not imply anything.
Before we get into publication bias, I will write a short aside that I was really hoping to avoid, but really needs to be discussed. I’ll dedicate a post to it eventually, when I feel like punishing myself, but for now, here’s my summary of
The Problems with P
Most of you have heard of p-values. A lucky few of you have never heard of them, and so do not need to be untrained and retrained. A majority of you probably hold a view similar to a high-ranking, well-published, and well-learned professor I met recently. “All I know about statistics,” he said, “is that p-value formula you need to show whether or not your hypothesis is correct. It needs to be under .05.” Many of you (more and more these days) are aware of the problems with that statement, and I thank you from the bottom of my heart.
Let’s talk about statistics.
The problems with p-values are innumerable (let me count the ways), and I will not get into most of them here. Essentially, though, a p-value is the likelihood that results at least as extreme as yours would appear by random chance alone, assuming there is no real effect. In many studies which rely on statistics, the process works like this: begin with a hypothesis, run an experiment, analyze the data, calculate the p-value. The researcher then publishes something along the lines of “my hypothesis is correct because p is under 0.05.”
Most people working with p-values know that it has something to do with the null hypothesis (that is, the default position; the position that there is no correlation between the measured phenomena). They work under the assumption that the p-value is the likelihood that the null hypothesis is true. That is, if the p-value is 0.75, it’s 75% likely that the null hypothesis is true, and there is no correlation between the variables being studied. Generally, the cut-off to get published is 0.05; you can only publish your results if it’s less than 5% likely that the null hypothesis is true, or more than 95% likely that your hypothesis is true. That means you’re pretty darn certain of your result.
Unfortunately, most of that isn’t actually how p-values work.
In a nutshell: assuming there is no correlation between two variables, what’s the likelihood that they’ll appear as correlated as you observed in your experiment by chance alone? If your p-value is .05, there’s a 5% chance your variables would look this correlated even if no real correlation existed. That is, when there truly is no relationship, about one in every twenty tests will still turn up an apparent correlation that doesn’t really exist.
To recap: p-values say nothing about your hypothesis. They say, assuming there is no real correlation, what’s the likelihood that your data show one anyway? Also, in the scholarly community, a result is considered “significant” if p is less than or equal to 0.05. Alright, I’m glad that’s out of the way, now we’re all on the same footing.
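If you’d like to see that one-in-twenty in action, here is a toy simulation (my own sketch, not taken from any of the studies mentioned): run a thousand significance tests on data where there is truly no effect, and watch roughly 5% of them come out “significant” anyway.

```python
# A quick simulation of what p < 0.05 actually promises: when there is
# truly no effect, about one test in twenty still comes out "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trials = 1000
false_positives = 0
for _ in range(trials):
    a = rng.normal(size=30)          # two samples from the SAME distribution
    b = rng.normal(size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(false_positives / trials)      # hovers around 0.05
```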
The positive results bias, the first of many interrelated publication biases, simply states that positive results are more likely to get published than negative or inconclusive ones. Authors and editors will be more likely to submit and accept work if the results are significant (p < .05). The file drawer problem is the opposite effect: negative results are more likely to be stuck in somebody’s file drawer, never to see the light of day. HARKing (Hypothesizing After the Results Are Known), much like cherry-picking above, is when, over the course of a study involving many trials and analyses, only the “significant” ones are ever published.
Let’s begin with HARKing. Recall that a p-value is (basically) the likelihood that an apparent effect could occur by chance alone. If one research project consists of 100 different trials and analyses, and only 5 of them yield significant results pointing toward the author’s hypothesis, those 5 analyses likely occurred by chance. They could still be published (often without the researchers even realizing they were cherry-picking, because obviously non-fruitful analyses might be stopped before they’re even finished). Thus, again, more positive results are published than perhaps there ought to be.
Let’s assume some people are perfect in every way, shape, and form. Every single one of their studies is performed with perfect statistical rigor, and all of their results are sound. Again, however, they only publish their positive results – the negative ones are kept in the file drawer. Again, more positive results are being published than being researched.
Who cares? So what that we’re only seeing the good stuff?
The problem is that, using common significance testing of p < 0.05, 5% of published, positive results ought to have occurred by chance alone. However, since we cannot see the studies that went unpublished because their results were negative, the studies that yielded correlations where they should not have are given all the scholarly weight. One hundred small studies are done on the efficacy of some medicine for some disease; only five, by chance, find some correlation, and those five are published. Let’s be liberal, and say another three are published saying there was no correlation between treatment and cure. An outside observer will thus see the evidence stacked in favor of the (ineffectual) medication.
The Decline Effect
A recent much-discussed article by Jonah Lehrer, as well as countless studies by John Ioannidis and others, show two things: (1) a large portion of published findings are false (some of the reasons are shown above), and (2) the effects of scientific findings seem to decline. A study is published showing a very noticeable effect of some medicine curing a disease, and further tests tend to show that very noticeable effect declining sharply. (2) is mostly caused by (1). Much ink (or blood) could be spilled discussing this topic, but this is not the place for it.
So there are a lot of biases in rigorous quantitative studies. Why should humanists care? We’re aware that people are not perfect, that research is contingent, that we each bring our own subjective experiences to the table, and they shape our publications and our outlooks, and none of those are necessarily bad things.
The issues arise when we start using statistics, or algorithms derived using statistics, and other methods used by our quantitative brethren. Make no mistake, our qualitative assessments are often subject to the same biases, but it’s easy to write reflexively on one’s own position when one is only a single person, a single data point. In the age of Big Data, with multiplying uncertainties for any bit of data we collect, it is far easier to lose track of small unknowns in the larger picture. We have the opportunity of learning from past mistakes so we can be free to make mistakes of our own.
Ioannidis’ most famous article is, undoubtedly, the polemic “Why Most Published Research Findings Are False.” With a statement like that, what hope is there? Ioannidis himself has some good suggestions, and there are many floating around out there; as with anything, the first step is becoming cognizant of the problems, and the next step is fixing them. Digital humanities may be able to avoid inheriting these problems entirely, if we’re careful.
We’re already a big step ahead of the game, actually, because of the nearly nonsensical volumes of tweets and blog posts on nascent research. In response to publication bias and the file drawer problem, many people suggest that authors submit their experiments to a registry before they begin their research. That way, it’s completely visible which experiments on a subject have been run but did not yield positive results, regardless of whether they were eventually published. Digital humanists are constantly throwing out ideas and preliminary results, which should help guard against misunderstandings through publication bias. We have to talk about all the effort we put into something, especially when nothing interesting comes out of it. The fact that some scholar felt there should be something interesting, and there wasn’t, is itself interesting.
At this point, “replication study” means very little in the humanities; however, if we head down a road where replication studies become more feasible, our journals will need to be willing to accept them just as they accept novel research. Funding agencies should also be just as willing to fund old, non-risky continuation research as they are the new exciting stuff.
Other institutional changes needed to guard against this sort of thing are open access publication (so everyone draws inferences from the same base set of research), tenure boards that accept negative and exploratory research (again, not as large an issue for the humanities), and restructured curricula that teach quantitative methods and their pitfalls, especially statistics.
On the ground level, a good knowledge of statistics (especially Bayesian statistics, doing away with p-values entirely) will be essential as more data becomes available to us. When running analysis on data, to guard against coming up with results that appear by random chance, we have to design an experiment before running it, stick to the plan, and publish all results, not just ones that fit our hypotheses. The false-positive psychology paper I mentioned above actually has a lot of good suggestions to guard against this effect:
Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.
Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.
Authors must list all variables collected in a study.
Authors must report all experimental conditions, including failed manipulations.
If observations are eliminated, authors must also report what the statistical results are if those observations are included.
If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.
Reviewers should ensure that authors follow the requirements.
Reviewers should be more tolerant of imperfections in results.
Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.
This list of problems and solutions is neither exhaustive nor representative. That is, there are a lot of biases out there unlisted, and not all the ones listed are the most prevalent. Gender and power biases come to mind, however they are well beyond anything I could intelligently argue, and there are issues of peer-review and retraction rates that are an entirely different can of worms.
Also, the humanities are simply different. We don’t exactly test hypotheses, we’re not looking for ground truths, and our publication criteria are very different from those of the natural and social sciences. It seems clear that the issues listed above will map onto our own research in some way going forward, but I make no claims at understanding exactly how or where. My hope in this blog post is to raise awareness of some of the more pressing concerns in quantitative studies that might have bearing on our own work, so we can try to understand how they will be relevant to our own research, and how we might guard against them.
A few months ago, Science published a Thanksgiving article on what scientists can be grateful for. It’s got a lot of good points, like being thankful for family members who accept the crazy hours we work, or for those really useful research projects that make science cool enough for us to get funding for the merely really interesting. It does have one unfortunate reference to humanists:
We are thankful that Ph.D. programs in the sciences, as much as we complain about them, aren’t nearly as horrifying as, say, Ph.D. programs in the humanities. I just heard today from a friend in his ninth year of a comparative literature Ph.D. who thinks he might finish “in a year and a half.” At least the job market for comp lit Ph.D. awardees is thriving, right?
Ouch. I suppose the truth hurts. The particularly interesting point that inspired this post, however, was:
We are thankful for that one colleague who knows statistics. There’s always one.
The State of Things
The above quote about statisticians is so true it hurts, as (we just discovered) the truth is wont to do. It’s even more true in the humanities than it is in the more natural and quantitative sciences. When we talk about a colleague who knows statistics, we generally don’t mean someone down the hall; usually, we mean that one statistician whom we met in the pub that one night and who has a bizarre interest in the humanities. That’s not to say humanist statisticians don’t exist, but I doubt you’re likely to find one in any given humanities department.
This unfortunately is not only true of statistics, but also of GIS, network science, computer science, textual analysis, and many other disciplines we digital humanists love to borrow from. Thankfully, the NEH ODH’s Institutes for Advanced Topics in the Humanities, UVic’s Digital Humanities Summer Institutes, and other programs out there are improving our collective expertise, but a quick look for GIS/Stats/SNA/etc. courses in most humanities departments still produces slim pickings.
One of the best things to come out of the #hacker movement in the Digital Humanities has been the spirit to get our collective hands dirty and learn the techniques ourselves. It’s been a long time coming, and happier days are sure to follow, but one skill still seems underrepresented in the DH purview: statistics.
Why Statistics? Why Bayesian Statistics?
In a recent post, Elijah Meeks called Text Analysis, Spatial Analysis, and Network Analysis the “three pillars” of DH research, with a sneaking suspicion that Image Analysis should fit somewhere in there as well. This seems to be the converging sentiment in most DH circles, and although, when asked, most would say statistics is also important, it still doesn’t seem to be among the first subjects named.
With another round of Digging Into Data winners chosen, and a bevy of panels and presentations dedicating themselves to Big Data in the Humanities, the first direction we should point is statistics. Statistics is a tool uniquely built for understanding lots of data, and it was developed with full knowledge that the data may be incomplete, biased, or otherwise imperfect, and has legitimate work-arounds for most such occasions. Of course, all the caveats in my first Networks Demystified post apply here: don’t use it without fully understanding it, and changing it where necessary.
Many Humanists, even digital ones, frequently seem to have a (justifiably) knee-jerk reaction to statistics. If you’ve been following the Twitter and blog conversations about AHA 2012, you probably caught a flurry of discussion over Google Ngrams. Conversation tended toward horrified screams of the dangers of correlation vs. causation (or at least references to xkcd), and the ease with which one might lie via statistics or omission. These are all valid cautions, especially where ngrams is concerned, but I sometimes fear we get so caught up in bad examples that we spend more time apologizing for them than fixing them. Ted Underwood has a great post about just this, which I will touch on again shortly. (And, to Ted and Allen specifically, I’m guessing you both will enjoy this post.)
In short: statistics is useful. To quote the above-linked xkcd comic:
Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’.
So how do we go about using statistics? In a comment on Ted’s recent post about statistics, Trevor Owens wrote:
if you just start signing up for statistics courses you are going to end up getting a rundown on using t-tests and ANOVAs as tools for hypothesis testing. The entire hypothesis testing idea remains a core part of how a lot of folks in the social sciences think about things and it is deeply at odds with what humanists want to do.
The key is not appropriation but adaptation. We must learn statistics, even the hypothesis testing, so that we might find what methods are useful, what might be changed, and how we can get it to work for us. We’re humanists. We’re really good at methodological critique.
One of the areas of statistics most likely to bear fruit for humanists is Bayesian statistics. Some of us already use it in our text mining algorithms, although the math involved remains occult to most. It basically builds uncertainty and belief directly into statistics. Instead of coming up with one correct answer, Bayesian analysis often yields a range of more or less probable answers depending what seems to be the case from prior evidence, and can update and improve that range as more is learned.
For humanists, this importance is (at least) two-fold. Ted Underwood sums up the first reason nicely:
[Bayesian inference] is amazingly, almost bizarrely willing to incorporate subjective belief into its definition of knowledge. It insists that definitions of probability have to depend not only on observed evidence, but on the “prior probabilities” that we expected before we saw the evidence. If humanists were more familiar with Bayesian statistics, I think it would blow a lot of minds.
The second and more specific reason worth mentioning here deals with the ranges I discussed above. If a historian, for example, is trying to understand how and why some historical event happened, Bayesian analysis could yield which sets of occurrences were more or less likely, and which were so far off as to not be worth considering. By trying to find reasonable boundary conditions rather than exact explanations to answer our questions, humanists can retain that core knowledge that humans and human situations are not wholly deterministic machines that all act the same and reproduce the same results in every situation.
We are intrinsically and inextricably inexact, and until we get computers that see and remember everything, and model it all perfectly, we should avoid looking for exact answers. Bayesian statistics, instead, can help us find a range of reasonable answers, with full awareness and use of the beliefs and evidence we have going in.
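As a concrete (if drastically simplified) illustration, here is a minimal Beta-Binomial sketch in Python: a prior belief about an unknown proportion is updated by observations into a posterior range of plausible values, rather than a single point answer. The prior and the data are invented for illustration; this is my own toy example, not anything from Kruschke or Gelman.

```python
# A minimal sketch of Bayesian updating: a Beta prior over an unknown
# proportion, updated by (hypothetical) observed data into a posterior
# *range* of plausible values rather than a single point estimate.
from scipy import stats

prior_a, prior_b = 2, 2          # a mild prior belief centered on 0.5
successes, failures = 14, 6      # hypothetical observations

posterior = stats.beta(prior_a + successes, prior_b + failures)
low, high = posterior.interval(0.95)
print(f"95% of the posterior belief lies between {low:.2f} and {high:.2f}")
```

The output is a range of reasonable answers, shaped both by what we believed going in and by what the evidence shows; a stronger prior or more data would narrow it.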
A Call to Arms
After I read that post about a scientist’s Thanksgiving, I realized I didn’t want to have to rely on that one colleague who knows statistics. Nobody should. That’s why I decided to enroll in a Bayesian Data Analysis course this semester, taught by John K. Kruschke and using his book. It’s a very readable book, directed toward people with no prior knowledge of statistics or programming, and it takes you through the basics of both. Kruschke’s got a blog worth reading, as does Andrew Gelman, an author of the book Bayesian Data Analysis. I’m sure a basic Google search can point you to video lectures, if that’s your thing. I’ll also try to blog about it over the coming months as I learn more.
There are several (occasionally apocryphal) anecdotes about the great theoretical physicists of the early 20th century needing to go back to school to learn basic statistics. Some still weren’t terribly happy about it (“God does not play dice with the universe”), but in the end, pressures from the changing nature of their theories required a thorough understanding of statistics. As humanists begin to deal with a glut of information we never before had access to, it’s time we adapt in a similar fashion.
The wide angle, the distant reading, the longue durée will all benefit from a deeper understanding of statistics. That knowledge, in tandem with traditional close reading skills, will surely become one of the pillars of humanities research as Big Data becomes ever-more common.
Herein lies Part the Second of n posts introducing various network concepts. Part the First can be found here. From here on out, each post will cover only one topic at a time. I’ll occasionally use math, but will do my best to explain it from the ground up, assuming no previous knowledge of mathematical notation.
Node Degree: An Introduction
Today I’ll cover the deceptively simple concept of node degree. I say “deceptive” because, on the one hand, network degree can tell you quite a lot. On the other hand, degree can often lead one astray, especially as networks become larger and more complicated.
A node’s degree is, simply, how many edges it is connected to. Generally, this also corresponds to how many neighbors a node has, where a node’s neighborhood is those other nodes connected directly to it by an edge. In the network below, each node is labeled by its degree.
If you take a minute to study the network, something might strike you as odd. The bottom-right node, with degree 5, is connected to only four distinct edges, and really only three other nodes (four, including itself). Self-loops, which will be discussed later because they’re annoying and we hates them preciousss, are counted twice. A self-loop is any edge which starts and ends at the same node.
Why are self-loops counted twice? Well, as a rule of thumb you can say that, since the degree is the number of times the node is connected to an edge, and a self-loop connects to its node twice, that’s the reason. There are some more mathy reasons dealing with matrix representation, another topic for a later date. Suffice to say that many network algorithms will not work well if self-loops are only counted once.
The odd node out on the bottom left, with degree zero, is called an isolate. An isolate is any node with no edges.
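For those who want to poke at this themselves, here is a small sketch using the Python library networkx (my choice of tool, not one discussed above); the graph is an invented toy, but it shows the self-loop counting twice and the isolate sitting at degree zero.

```python
# A small sketch of degree counting with networkx.
# The self-loop on node "d" adds 2 to its degree; "e" is an isolate.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("d", "d")])
G.add_node("e")                  # an isolate: no edges at all

print(dict(G.degree()))
# {'a': 1, 'b': 2, 'c': 2, 'd': 3, 'e': 0}
```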
At any rate, the concept is clearly simple enough. Count the number of times a node is connected to an edge, get the degree. If only getting higher education degrees were this easy.
Node degree is occasionally called degree centrality. Centrality is generally used to determine how important nodes are in a network, and lots of clever researchers have come up with lots of clever ways to measure it. “Importance” can mean a lot of things. In social networks, centrality can be the amount of influence or power someone has; in the U.S. electrical grid network, centrality might mean which power station should be removed to cause the most damage to the network.
The simplest way of measuring node importance is to just look at its degree. This centrality measurement at once seems deeply intuitive and extremely silly. If we’re looking at the social network of Facebook, with every person a node connected by an edge to their friends, it’s no surprise that the most well-connected person is probably also the most powerful and influential in the social space. By the same token, though, degree centrality is such a coarse-grained measurement that it’s really anybody’s guess what exactly it’s measuring. It could mean someone has a lot of power; it could also mean that someone tried to become friends with absolutely everybody on Facebook.
Degree Centrality Sampling Warnings
Degree works best as a measure of network centrality when you have full knowledge of the network. That is, a social network exists, and instead of getting some glimpse of it and analyzing just that, you have the entire context of the social network: all the friends, all the friends of friends, and so forth.
When you have an ego-network (a network of one person, like a list of all my friends and who among them are friends with one another), clearly the node with the highest centrality is the ego node itself. This knowledge tells you very little about whether that ego is actually central within the larger network, because you sampled the network such that the ego is necessarily the most central. Sampling strategies – how you pick which nodes and edges to collect – can fundamentally affect centrality scores.
A historian of science might generate a correspondence network from early modern letters currently held in Oxford’s library. In fact, this is currently happening, and the resulting resource will be invaluable. Unfortunately, centrality scores generated from nodes in that early modern letter writing network will more accurately reflect the whims of Oxford editors and collectors over the years, rather than the underlying correspondence network itself. Oxford scholars over the years selected certain collections of letters, be they from Great People or sent to or from Oxford, and that choice of what to hold at Oxford libraries will bias centrality scores toward Oxford-based scholars, Great People, and whatever else was selected for.
Similarly, generating a social network from a literary work will bias centrality toward the recurring characters; characters that occur more frequently are simply statistically more likely to appear with more people, and as such will have the highest degrees. It is likely that degree centrality and frequency of character occurrence are almost exactly correlated.
Of course, if what you’re looking for is the most central character in the novel or the most central figure from Oxford’s perspective, this measurement might be perfectly sufficient. The important thing is to be aware of the limitations of degree centrality, and the possible biasing effects from selection and sampling. Once those biases are explicit, careful and useful inferences can still be drawn.
Things get a bit more complicated when looking at document similarity networks. If you’ve got a network of books with edges connecting them based on whether they share similar topics or keywords, your degree centrality score will mean something very different. In this case, centrality could mean the most general book. Keep in mind that book length might affect these measurements as well; the longer a book is, the more likely (by chance alone) it will cover more topics. Thus, longer books may also appear to be more central, if one is not careful in generating the network.
Degree Centrality in Bi-Modal Networks
Recall that bi-modal networks are ones where there are two different types of nodes (e.g., articles and authors), and edges are relationships that bridge those types (e.g., authorships). In this example, the more articles an author has published, the more central she is. Degree centrality would have nothing to do, in this case, with the number of co-authorships, the position in the social network, etc.
With an even more multi-modal network, having many types of nodes, degree centrality becomes even less well defined. As the sorts of things a node can connect to increases, the utility of simply counting the number of connections a node has decreases.
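A tiny, invented example makes the point concrete: in a bimodal author-article network, an author’s degree is just her article count, nothing more. The names and the networkx library are my own choices for illustration.

```python
# A hedged sketch of degree in a bimodal (bipartite) network of authors
# and articles: an author's degree here is just her article count, and
# says nothing about co-authorship or social position.
import networkx as nx

B = nx.Graph()
B.add_nodes_from(["Author A", "Author B"], kind="author")
B.add_nodes_from(["Article 1", "Article 2", "Article 3"], kind="article")
B.add_edges_from([("Author A", "Article 1"), ("Author A", "Article 2"),
                  ("Author A", "Article 3"), ("Author B", "Article 3")])

print(B.degree("Author A"))   # 3 -- three authorships, nothing more
print(B.degree("Author B"))   # 1
```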
Micro vs. Macro
Looking at the degree of an individual node, and comparing it against others in the network, is useful for finding out about the relative position of that node within the network. Looking at the degree of every node at once turns out to be exceptionally useful for talking about the network as a whole, and comparing it to others. I’ll leave a thorough discussion of degree distributions for a later post, but it’s worth mentioning them in brief here. The degree distribution shows how many nodes have how many edges.
As it happens, many real world networks exhibit something called “power-law properties” in their degree distributions. What that essentially means is that a small number of nodes have an exceptionally high degree, whereas most nodes have very low degrees. By comparing the degree distributions of two networks, it is possible to say whether they are structurally similar. There’s been some fantastic work comparing the degree distribution of social networks in various plays and novels to find if they are written or structured similarly.
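Tabulating a degree distribution takes only a few lines; here is a quick sketch on a randomly generated toy network (a Barabási-Albert graph, one standard way of producing the heavy-tailed distributions just mentioned).

```python
# A quick way to tabulate a degree distribution: how many nodes
# have how many edges, shown on a toy power-law-ish random graph.
from collections import Counter
import networkx as nx

G = nx.barabasi_albert_graph(1000, 2, seed=42)
distribution = Counter(dict(G.degree()).values())

for degree, count in sorted(distribution.items())[:5]:
    print(f"{count} nodes have degree {degree}")
```

Plot that distribution on log-log axes and the handful of extremely high-degree hubs, alongside the mass of low-degree nodes, becomes immediately visible.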
For the entirety of this post, I’ve been talking about networks that were unweighted and undirected. Every edge counted just as much as every other, and they were all symmetric (a connection from A to B implies the same connection from B to A). Degree can be extended to both weighted and directed (asymmetric) networks with relative ease.
Combining degree with edge weights is often called strength. The strength of a node is the sum of the weights of its edges. For example, let’s say Steve is part of a weighted social network. The first time he interacts with someone, an edge is created to connect the two with a weight of 1. Every subsequent interaction incrementally increases the weight by 1, so if he’s interacted with Sally four times, Samantha two times, and Salvador six times, the edge weights between them are 4, 2, and 6 respectively.
In the above example, because Steve is connected to three people, his degree is 3. Because he is connected to one of them four times, another twice, and another six times, his strength is 4+2+6=12.
Combining degree with directed edges is also quite simple. Instead of one degree score, every node now has two different degrees: in-degree and out-degree. The in-degree is the number of edges pointing to a node, and the out-degree is the number of edges pointing away from it. If Steve borrowed money from Sally, and lent money to Samantha and Salvador, his in-degree might be 1 and his out-degree 2.
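Here is the Steve example again as a short networkx sketch (networkx being my tool of choice, not something prescribed above), showing both weighted degree (strength) and in/out-degree.

```python
# The Steve example as code: weighted degree ("strength") on an
# undirected graph, then in-degree and out-degree on a directed one.
import networkx as nx

W = nx.Graph()
W.add_edge("Steve", "Sally", weight=4)
W.add_edge("Steve", "Samantha", weight=2)
W.add_edge("Steve", "Salvador", weight=6)
print(W.degree("Steve"))                    # 3 neighbors
print(W.degree("Steve", weight="weight"))   # strength: 4 + 2 + 6 = 12

D = nx.DiGraph()
D.add_edge("Sally", "Steve")                # Steve borrowed from Sally
D.add_edge("Steve", "Samantha")             # Steve lent to Samantha
D.add_edge("Steve", "Salvador")             # ... and to Salvador
print(D.in_degree("Steve"), D.out_degree("Steve"))   # 1 2
```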
The degree of a node is really very simple: more connections, higher degree. However, this simple metric accounts for quite a great deal in network science. Many algorithms that analyze both node-level properties and network-level properties are closely correlated with degree and degree distribution. This is a Pareto-like effect; a great deal about a network is driven by the degree of its nodes.
While degree-based results are often intuitive, it is worth pointing out that the prime importance of degree is a direct result of the binary network representation of nodes and edges. Interactions either happen or they don’t, and everything that exists is a self-contained node or edge. Thus, how many nodes, how many edges, and which nodes have which edges will be the driving force of any network analysis. This is both a limitation and a strength; basic counts influence so much, yet they are apparently powerful enough to yield intuitive, interesting, and ultimately useful results.
A bunch of my recent posts have mentioned networks. Elijah Meeks not-so-subtly hinted that it might be a good idea to discuss some of the basics of networks on this blog, and I’m happy to oblige. He already introduced network visualizations on his own blog, and did a fantastic job of it, so I’m going to try to stick to more of the conceptual issues here, gearing the discussion generally toward people with little-to-no background in networks or math, and specifically to digital humanists interested in applying network analysis to their own work. This will be part of an ongoing series, so if you have any requests, please feel free to mention them in the comments below (I’ve already been asked to discuss how social networks apply to fictional worlds, which is probably next on the list).
A network is a fantastic tool in the digital humanist’s toolbox – one of many – and it’s no exaggeration to say pretty much any data can be studied via network analysis. With enough stretching and molding, you too could have a network analysis problem! As with many other science-derived methodologies, it’s fairly easy to extend the metaphor of network analysis into any number of domains.
The danger here is two-fold.
When you’re given your first hammer, everything looks like a nail. Networks can be used on any project. Networks should be used on far fewer. Networks in the humanities are experiencing quite the awakening, and this is due in part to until-recently untapped resources. There is a lot of low-hanging fruit out there on the networks+humanities tree, and it ought to be plucked by those brave and willing enough to do so. However, that does not give us an excuse to apply networks to everything. This series will talk a little bit about when hammers are useful, and when you really should be reaching for a screwdriver.
Methodology appropriation is dangerous. Even when the people designing a methodology for some specific purpose get it right – and they rarely do – there is often a score of theoretical and philosophical caveats that get lost when the methodology gets translated. In the more frequent case, when those caveats are not known to begin with, “borrowing” the methodology becomes even more dangerous. Ted Underwood blogs a great example of why literary historians ought to skip a major step in Latent Semantic Analysis, because the purpose of the literary historian is so very different from that of computer scientists who designed the algorithm. This series will attempt to point out some of the theoretical baggage and necessary assumptions of the various network methods it covers.
Nothing worth discovering has ever been found in safe waters. Or rather, everything worth discovering in safe waters has already been discovered, so it’s time to shove off into the dangerous waters of methodology appropriation, cognizant of the warnings but not crippled by them.
Anyone with a lot of time and a vicious interest in networks should stop reading right now, and instead pick up copies of Networks, Crowds, and Markets (Easley & Kleinberg, 2010) and Networks: An Introduction (Newman, 2010). The first is a non-mathy introduction to most of the concepts of network analysis, and the second is a more in depth (and formula-laden) exploration of those concepts. They’re phenomenal, essential, and worth every penny.
To those of you with slightly less time, but somehow enough to read my rambling blog (there are apparently a few of you out there): so good of you to join me. We’ll start with the really basic basics, but stay with me, because by part n of this series we’ll be going over the really cool stuff only ninjas, Gandhi, and The Rolling Stones have worked on.
Generally, network studies are made under the assumption that neither the stuff nor the relationships are the whole story on their own. If you’re studying something with networks, odds are you’re doing so because you think the objects of your study are interdependent rather than independent. Representing information as a network implicitly suggests not only that connections matter, but that they are required to understand whatever’s going on.
Oh, I should mention that people often use the word “graph” when talking about networks. It’s basically the mathy term for a network, and its definition is a bit more formalized and concrete. Think dots connected with lines.
Because networks are studied by lots of different groups, there are lots of different words for pretty much the same concepts. I’ll explain some of them below.
Stuff (presumably) exists. Eggplants, true love, the Mary Celeste, tall people, and Terry Pratchett’s Thief of Time all fall in that category. Network analysis generally deals with one or a small handful of types of stuff, and then a multitude of examples of that type.
Say the type we’re dealing with is a book. While scholars might argue the exact lines of demarcation separating book from non-book, I think we can all agree that most of the stuff on my bookshelf is, in fact, books. They’re the stuff. There are different examples of books: a quotation dictionary, a Poe collection, and so forth.
I’ll call this assortment of stuff nodes. You’ll also hear them called vertices (mostly from the mathematicians and computer scientists), actors (from the sociologists), agents (from the modelers), or points (not really sure where this one comes from).
The type of stuff corresponds to the type of node. The individual examples are the nodes themselves. All of the nodes are books, and each book is a different node.
Nodes can have attributes. Each node, for example, may include the title, the number of pages, and the year of publication.
A list of nodes could look like this:
| Title | # of pages | year of publication |
| --- | --- | --- |
| Graphs, Maps, and Trees | 119 | 2005 |
| How The Other Half Lives | 233 | 1890 |
| Modern Epic | 272 | 1995 |
| Mythology | 352 | 1942 |
| Macroanalysis | unknown | 2011 |
We can get a bit more complicated and add more node types to the network. Authors, for example. Now we’ve got a network with books and authors (but nothing linking them, yet!). Franco Moretti and Graphs, Maps, and Trees are both nodes, although they are of different varieties, and not yet connected. We would have a second list of nodes, part of the same network, that might look like this:
| Author | Birth | Death |
| --- | --- | --- |
| Franco Moretti | ? | n/a |
| Jacob A. Riis | 1849 | 1914 |
| Edith Hamilton | 1867 | 1963 |
| Matthew Jockers | ? | n/a |
A network with two types of nodes is called 2-mode, bimodal, or bipartite. We can add more, making it multimodal. Publishers, topics, you-name-it. We can even add seemingly unrelated node-types, like academic conferences, or colors of the rainbow. The list goes on. We would have a new list for each new variety of node.
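Continuing the networkx sketch above, one common (if simple) way to keep several node types in a single network is to tag every node with its type; the attribute name “kind” below is my own invention.

```python
import networkx as nx

G = nx.Graph()
# Books and authors live in the same network but remain different kinds of stuff.
G.add_node("Graphs, Maps, and Trees", kind="book", year=2005)
G.add_node("Mythology", kind="book", year=1942)
G.add_node("Franco Moretti", kind="author")
G.add_node("Edith Hamilton", kind="author", birth=1867, death=1963)

books = [n for n, data in G.nodes(data=True) if data["kind"] == "book"]
print(books)  # two node types side by side, with nothing linking them yet
```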
Presumably we could continue adding nodes and node-types until we run out of stuff in the universe. This would be a bad idea, and not just because it would take more time, energy, and hard-drives than could ever possibly exist.
As it stands now, network science is ill-equipped to deal with multimodal networks. 2-mode networks are difficult enough to work with, but once you get to three or more varieties of nodes, most algorithms used in network analysis simply do not work. It’s not that they can’t work; it’s just that most algorithms were only created to deal with networks with one variety of node.
This is a trap I see many newcomers to network science falling into, especially in the Digital Humanities. They find themselves with a network dataset of, for example, authors and publishers. Each author is connected with one or several publishers (we’ll get into the connections themselves in the next section), and the up-and-coming network scientist loads the network into their favorite software and visualizes it. Woah! A network!
Then, because the software is easy to use, and has a lot of buttons with words that from a non-technical standpoint seem to make a lot of sense, they press those buttons to see what comes out. Then, they change the visual characteristics of the network based on the buttons they’ve pressed.
Let’s take a concrete example. Popular network software Gephi comes with a button that measures the centrality of nodes. Centrality is a pretty complicated concept that I’ll get into in more detail later, but for now it’s enough to say that it does exactly what it sounds like: it finds how central, or important, each node is in a network. The newcomer to network analysis loads the author-publisher network into Gephi, finds the centrality of every node, and then enlarges the nodes with the highest centrality.
The issue here is that, although the network loads into Gephi perfectly fine, and although the centrality algorithm runs smoothly, the resulting numbers do not mean what they usually mean. Centrality, as it exists in Gephi, was fine-tuned to be used with single mode networks, whereas the author-publisher network is bimodal. Centrality measures have been made for bimodal networks, but those algorithms are not included with Gephi.
Most computer scientists working with networks do so with only one or a few types of nodes. Humanities scholars, on the other hand, are often dealing with the interactions of many types of things, and so the algorithms developed for traditional network studies are insufficient for the networks we often have. There are ways of fitting their algorithms to our networks, or vice-versa, but that requires fairly robust technical knowledge of the task at hand.
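One such fitting, sketched below with networkx (a workaround I’m supplying, not something Gephi does for you): collapse the bimodal author-publisher network into a single-mode author network, where two authors are linked if they share a publisher, and only then run a standard centrality measure. Bimodal-specific centrality measures are the other route.

```python
import networkx as nx
from networkx.algorithms import bipartite

# A toy bimodal (author-publisher) network; the names are invented.
B = nx.Graph()
authors = ["Author A", "Author B", "Author C"]
publishers = ["Publisher X", "Publisher Y"]
B.add_nodes_from(authors, bipartite=0)
B.add_nodes_from(publishers, bipartite=1)
B.add_edges_from([("Author A", "Publisher X"),
                  ("Author B", "Publisher X"),
                  ("Author C", "Publisher Y")])

# Project down to a single-mode author network before running a
# centrality measure that was designed for single-mode networks.
authors_only = bipartite.projected_graph(B, authors)
print(nx.degree_centrality(authors_only))
```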
Besides dealing with the single mode / multimodal issue, humanists also must struggle with fitting square pegs in round holes. Humanistic data are almost by definition uncertain, open to interpretation, flexible, and not easily definable. Node types are concrete; your object either is or is not a book. Every book-type thing shares certain unchanging characteristics.
This reduction of data comes at a price, one that some argue traditionally divided the humanities and social sciences. If humanists care more about the differences than the regularities, more about what makes an object unique rather than what makes it similar, that is the very information they are likely to lose by defining their objects as nodes.
This is not to say it cannot be done, or even that it has not! People are clever, and network science is more flexible than some give it credit for. The important thing is either to be aware of what you are losing when you reduce your objects to one or a few types of nodes, or to change the methods of network science to fit your more complex data.
Relationships (presumably) exist. Friendships, similarities, web links, authorships, and wires all fall into this category. Network analysis generally deals with one or a small handful of types of relationships, and then a multitude of examples of that type.
Say the type we’re dealing with is an authorship. Books (the stuff) and authors (another kind of stuff) are connected to one-another via the authorship relationship, which is formalized in the phrase “X is an author of Y.” The individual relationships themselves are of the form “Franco Moretti is an author of Graphs, Maps, and Trees.”
Much like the stuff (nodes), relationships enjoy a multitude of names. I’ll call them edges. You’ll also hear them called arcs, links, ties, and relations. For simplicity’s sake, although the word “edge” is often used to describe only one variety of relationship, I’ll use it for pretty much everything and just add qualifiers when discussing specific types. The type of relationship corresponds to the type of edge. The individual examples are the edges themselves.
Individual edges are defined, in part, by the nodes that they connect.
A list of edges could look like this:
| Person | Is an author of |
| --- | --- |
| Franco Moretti | Modern Epic |
| Franco Moretti | Graphs, Maps, and Trees |
| Jacob A. Riis | How The Other Half Lives |
| Edith Hamilton | Mythology |
| Matthew Jockers | Macroanalysis |
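An edge list like this maps almost directly onto code; a minimal networkx sketch, with the same caveat that the library choice is mine:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Franco Moretti", "Modern Epic"),
    ("Franco Moretti", "Graphs, Maps, and Trees"),
    ("Jacob A. Riis", "How The Other Half Lives"),
    ("Edith Hamilton", "Mythology"),
    ("Matthew Jockers", "Macroanalysis"),
])
print(G.degree("Franco Moretti"))  # -> 2, one edge per book he authored
```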
Notice how, in this scheme, edges can only link two different types of nodes. That is, a person can be an author of a book, but a book cannot be an author of a book, nor can a person be an author of a person. For a network to be truly bimodal, it must be of this form. Edges can go between types, but not among them.
This constraint may seem artificial, and in some sense it is, but for reasons I’ll get into in a later post, it is a constraint required by most algorithms that deal with bimodal networks. As mentioned above, algorithms are developed for specific purposes. Single mode networks are the ones with the most research done on them, but bimodal networks certainly come in a close second. They are networks with two types of nodes, and edges only going between those types.
Of course, the world humanists care to model is often a good deal more complicated than that, and not only does it have multiple varieties of nodes – it also has multiple varieties of edges. Perhaps, in addition to “X is an author of Y” type relationships, we also want to include “A collaborates with B” type relationships. Because edges, like nodes, can have attributes, an edge list combining both might look like this.
| Node 1 | Node 2 | Edge Type |
| --- | --- | --- |
| Franco Moretti | Modern Epic | is an author of |
| Franco Moretti | Graphs, Maps, and Trees | is an author of |
| Jacob A. Riis | How The Other Half Lives | is an author of |
| Edith Hamilton | Mythology | is an author of |
| Matthew Jockers | Macroanalysis | is an author of |
| Matthew Jockers | Franco Moretti | collaborates with |
Notice that there are now two types of edges: “is an author of” and “collaborates with.” Not only are they two different types of edges; they act in two fundamentally different ways. “X is an author of Y” is an asymmetric relationship; that is, you cannot switch out Node1 for Node2. You cannot say “Modern Epic is an author of Franco Moretti.” We call this type of relationship a directed edge, and we generally represent that visually using an arrow going from one node to another.
“A collaborates with B,” on the other hand, is a symmetric relationship. We can switch out “Matthew Jockers collaborates with Franco Moretti” with “Franco Moretti collaborates with Matthew Jockers,” and the information represented would be exactly the same. This is called an undirected edge, and is usually represented visually by a simple line connecting two nodes.
Most network algorithms and visualizations break down when combining these two flavors of edges. Some algorithms were designed for directed edges, like Google’s PageRank, whereas other algorithms are designed for undirected edges, like many centrality measures. Combining both types is rarely a good idea. Some algorithms will still run when the two are combined, however the results usually make little sense.
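One practical consequence, sketched here in networkx: keep the two flavors in separate structures, a directed graph for authorship and an undirected graph for collaboration, rather than forcing them into one.

```python
import networkx as nx

# Directed: authorship runs one way, from person to book.
authorship = nx.DiGraph()
authorship.add_edge("Franco Moretti", "Modern Epic")

# Undirected: collaboration is symmetric.
collaboration = nx.Graph()
collaboration.add_edge("Matthew Jockers", "Franco Moretti")

print(authorship.has_edge("Modern Epic", "Franco Moretti"))         # False: direction matters
print(collaboration.has_edge("Franco Moretti", "Matthew Jockers"))  # True: order is irrelevant
```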
Both directed and undirected edges can also be weighted. For example, I can try to make a network of books, with those books that are similar to one another sharing an edge between them. The more similar they are, the heavier the weight of that edge. I can say that every book is similar to every other on a scale from 1 to 100, and compare them by whether they use the same words. Two dictionaries would probably connect to one another with an edge weight of 95 or so, whereas Graphs, Maps, and Trees would probably share an edge of weight 5 with How The Other Half Lives. This is often visually represented by the thickness of the line connecting two nodes, although sometimes it is represented as color or length.
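Here’s a toy sketch of how such inferred, weighted edges might be computed; the word-overlap score below is a crude stand-in for whatever similarity measure you actually trust.

```python
import networkx as nx

def similarity(words_a, words_b):
    """Rough word-overlap score scaled to 0-100 (Jaccard overlap)."""
    a, b = set(words_a), set(words_b)
    return round(100 * len(a & b) / len(a | b))

G = nx.Graph()
G.add_edge("Dictionary A", "Dictionary B",
           weight=similarity(["cow", "saucer", "turing"],
                             ["cow", "saucer", "shakespeare"]))
print(G["Dictionary A"]["Dictionary B"]["weight"])  # -> 50 for these toy word lists
```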
It’s also worth pointing out the difference between explicit and inferred edges. If we’re talking about computers connected on a network via wires, the edges connecting each computer actually exist. We can weight them by wire length, and that length, too, actually exists. Similarly, citation linkages, neighbor relationships, and phone calls are explicit edges.
We can begin to move into interpretation when we begin creating edges between books based on similarity (even when using something like word comparisons). The edges are a layer of interpretation not intrinsic in the objects themselves. The humanist might argue that all edges are intrinsic all the way down, or inferred all the way up, but in either case there is a difference in kind between two computers connected via wires, and two books connected because we feel they share similar topics.
As such, algorithms made to work on one may not work on the other; or perhaps they may, but their interpretative framework must change drastically. A very central computer might be one in which, if removed, the computers will no longer be able to interact with one another; a very central book may be something else entirely.
As with nodes, edges come with many theoretical shortcomings for the humanist. Really, everything is probably related to everything else in its light cone. If we’ve managed to make everything in the world a node, realistically we’d also have some sort of edge between pretty much everything, with a lesser or greater weight. A network in which almost everything is connected to almost everything else is called dense, and dense networks are rarely useful. Most network algorithms (especially ones that detect communities of nodes) work better and faster when the network is sparse, when most nodes are connected to only a small percentage of other nodes.
To make our network sparse, we often must artificially cut off which edges to use, especially with humanistic and inferred data. That’s what Shawn Graham showed us how to do when combining topic models with networks. The network was one of authors and topics; which authors wrote about which topics? The data itself connected every author to every topic to a greater or lesser degree, but such a dense network would not be very useful, so Shawn limited the edges to the highest weighted connections between an author and a topic. The resulting network looked like this, when it otherwise would have looked like a big ball of spaghetti and meatballs.
Unfortunately, given that humanistic data are often uncertain and biased to begin with, every arbitrary act of data-cutting has the potential to add further uncertainty and bias to a point where the network no longer provides meaningful results. The ability to cut away just enough data to make the network manageable, but not enough to lose information, is as much an art as it is a science.
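As a sketch of that data-cutting step (assuming networkx and a weighted graph like the ones above), the threshold below is exactly the arbitrary choice being warned about:

```python
import networkx as nx

def sparsify(G, min_weight):
    """Return a copy of G with every edge below the weight threshold removed."""
    H = G.copy()
    weak = [(u, v) for u, v, w in H.edges(data="weight", default=0) if w < min_weight]
    H.remove_edges_from(weak)
    return H
```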
Hypergraphs & Multigraphs
Mathematicians and computer scientists have actually formalized more complex varieties of networks, and they call them hypergraphs and multigraphs. Because humanities data are often so rich and complex, it may be more appropriate to represent them with these structures. Unfortunately, although ample research has been done on both, most out-of-the-box tools support neither. We have to build them for ourselves.
A hypergraph is one in which more than two nodes can be connected by a single edge. A simple example would be an “is a sibling of” relationship, where one edge connects three sisters rather than two. This is a symmetric, undirected edge, but perhaps there can be directed ones as well, of the type “Alex convinced Betty to run away from Carl.”
A multigraph is one in which multiple edges can connect any two nodes. We can have, for example, a transportation graph between cities; an edge exists for every transportation route. Realistically, many routes can exist between any two cities: some by plane, several different highways, trains, etc.
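A minimal networkx sketch of a multigraph; the cities and route types are invented for illustration.

```python
import networkx as nx

routes = nx.MultiGraph()
routes.add_edge("City A", "City B", mode="plane")
routes.add_edge("City A", "City B", mode="highway 1")
routes.add_edge("City A", "City B", mode="highway 2")
routes.add_edge("City A", "City B", mode="train")

print(routes.number_of_edges("City A", "City B"))  # -> 4 parallel edges, one per route
```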
I imagine both of these representations will be important for humanists going forward, but rather than relying on that computer scientist who keeps hanging out in the history department, we ourselves will have to develop algorithms that accurately capture exactly what it is we are looking for. We have a different set of problems, and though the solutions may be similar, they must be adapted to our needs.
Side note: RDF Triples
Digital humanities loves RDF. RDF basically works using something called a triple; a subject, a predicate, and an object. “Moretti is an author of Graphs, Maps, and Trees” is an example of a triple, where “Moretti” is the subject, “is an author of” is the predicate, and “Graphs, Maps, and Trees” is the object. As such, nearly all RDF documents can be represented as a directed network. Whether that representation would actually be useful depends on the situation.
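A toy sketch of that representation, with the triples written as plain Python tuples rather than parsed from an actual RDF document:

```python
import networkx as nx

triples = [
    ("Moretti", "is an author of", "Graphs, Maps, and Trees"),
    ("Moretti", "is an author of", "Modern Epic"),
    ("Jockers", "collaborates with", "Moretti"),
]

# Each triple becomes a directed edge from subject to object, labeled by its predicate.
G = nx.DiGraph()
for subject, predicate, obj in triples:
    G.add_edge(subject, obj, predicate=predicate)

print(G["Moretti"]["Modern Epic"]["predicate"])  # -> "is an author of"
```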
Side note: Perspectives
Context is key, especially in the humanities. One thing the last few decades has taught us is that perspectives are essential, and any model of humanity that does not take into account its multifaceted nature is doomed to be forever incomplete. According to Alex, Betty and Carl are best friends. According to Carl, he can’t actually stand Betty. The structure and nature of a network might change depending on the perspective of a particular node, and I know of no model that captures this complexity. If you’re familiar with something that might capture this, or are working on it yourself, please let me know via comments or e-mail.
The above post discussed the simplest units of networks; the stuff and the relationships that connect them. Any network analysis approach must subscribe to and live with that duality of objects. Humanists face problems from the outset; data that does not fit neatly into one category or the other, complex situations that ought not be reduced, and methods that were developed with different purposes in mind. However, network analysis remains a viable methodology for answering and raising humanistic questions – we simply must be cautious, and must be willing to get our hands dirty editing the algorithms to suit our needs.
In the coming posts of this series, I’ll discuss various introductory topics including data representations, basic metrics like degree, centrality, density, clustering, and path length, as well as ways to link old network analysis concepts with common humanist problems. I’ll also try to highlight examples from the humanities, and raise methodological issues that come with our appropriation of somebody else’s algorithms.
This will probably be the longest of the posts, as some concepts are fairly central and must be discussed all-at-once. Again, if anybody has any particular concepts of network analysis they’d like to see discussed, please don’t hesitate to comment with your request.
Last post, I talked about combining textual and network analysis. Both are becoming standard tools in the methodological toolkit of the digital humanist, sitting next to GIS in what seems to be becoming the Big Three in computational humanities.
Data as Context, Data as Contextualized
Humanists are starkly aware that no particular aspect of a subject sits in a vacuum; context is key. A network on its own is a set of meaningless relationships without a knowledge of what travels through and across it, what entities make it up, and how that network interacts with the larger world. The network must be contextualized by the content. Conversely, the networks in which people and processes are situated deeply affect those entities: medium shapes message and topology shapes influence. The content must be contextualized by the network.
At the risk of the iPhonification of methodologies 1, textual, network, and geographic analysis may be combined with each other and traditional humanities research so that they might all inform one another. That last post on textual and network analysis was missing one key component for digital humanities: the humanities. Combining textual and network analysis with traditional humanities research (rather than merely using the humanities to inform text and network analysis, or vice-versa) promises to transform the sorts of questions asked and projects undertaken in Academia at large.
Just as networks can be used to contextualize text (and vice-versa), the same can be said of networks and maps (or texts and maps for that matter, or all three, but I’ll leave those for later posts). Now, instead of starting with the maps we all know and love, we’ll start by jumping into the deep end by discussing maps as any sort of representative landscape in which a network can be situated. In fact, I’m going to start off by using the network as a map against which certain relational properties can be overlaid. That is, I’m starting by using a map to contextualize a network, rather than the more intuitive other way around.
Using Maps to Contextualize a Network
The base map we’re discussing here is a map of science. They’ve made their rounds, so you’ve probably seen one, but just in case you haven’t, here’s a brief description: some researchers (in this case Kevin Boyack and Richard Klavans) take tons of information from scholarly databases (in this case the Science Citation Index Expanded and the Social Science Citation Index) and create a network diagram from some set of metrics (in this case, citation similarity). They call this network representation a Map of Science.
We can debate about the merits of these maps ’till we’re blue in the face, but let’s avoid that for now. To my mind, the maps are useful, interesting, and incomplete, and the map-makers are generally well-aware of their deficiencies. The point here is that it is a map: a landscape against which one can situate oneself, and with which one may be able to find paths and understand the lay of the land.
In Boyack, Börner 2, and Klavans (2007), the three authors set out to use the map of science to explore the evolution of chemistry research. The purpose of the paper doesn’t really matter here, though; what matters is the idea of overlaying information atop a base network map.
The images above are the funding profiles of the NIH (National Institutes of Health) and NSF (National Science Foundation). The authors collected publication information attached to all the grants funded by the NSF and NIH and looked at how those publications cited one another. The orange edges show connections between disciplines on the map of science that were more prevalent within the context of a particular funding agency than they were across the entire map of science. Boyack, Börner, and Klavans created a map and used it to contextualize certain funding agencies. They and other parties have since used such maps to contextualize universities, authors, disciplines, and other publication groups.
From Network Maps to Geographic Maps
Of course, the Where’s The Beef™ section of this post still has yet to be discussed, with the beef in this case being geography. How can we use existing topography to contextualize network topology? Network space rarely corresponds to geographic place, however neither of them alone can ever fully represent the landscape within which we are situated. A purely geographic map of ancient Rome would not accurately represent the world in which the ancient Romans lived, as it does not take into account the shortening of distances through well-trod trade routes.
Enter Stanford DH ninja Elijah Meeks. In two recent posts, Elijah discussed the topology/topography divide. In the first, he created a network layout algorithm which took a network with nodes originally placed in their geographic coordinates, and then distorted the network visualization to emphasize network distance. The visualization above shows the network laid out geographically. The one below shows the Imperial Roman trade routes with network distances emphasized. As Elijah says, “everything of the same color in the above map is the same network distance from Rome.”
Of course, the savvy reader has probably observed that this does not take everything into account. These are only land routes; what about the sea?
Elijah’s second post addressed just that, impressively applying GIS techniques to determine the likely routes ships took to get from one port to another. This technique drives home the point he was trying to make about transitioning from network topology to network topography. The picture below, incidentally, is Elijah’s re-rendering of the last visualization taking into account both land and sea routes. As you can see, the distance from any city to any other has decreased significantly, even taking into account his network-distance algorithm.
The above network visualization combines geography, two types of transportation routes, and network science to provide a more nuanced at-a-glance view of the Imperial Roman landscape. The work he highlighted in his post transitioning from topology to topography in edge shapes is also of utmost importance, however that topic will need to wait for another post.
The Republic of Letters (A Brief Interlude)
Elijah was also involved in another Stanford-based project, one very dear to my heart: Mapping the Republic of Letters. Much of my own research has dealt with the Republic of Letters, especially during my time spent under Bob Hatch, and Paula Findlen, Dan Edelstein, and Nicole Coleman at Stanford have been heading up an impressive project on that very subject. I’ll go into more details about the Republic in another post (I know, promises promises), but for now the important thing to look at is their interface for navigating the Republic.
The team has gone well beyond the interface that currently faces the public, however even the original map is an important step. Overlaid against a map of Europe are the correspondences of many early modern scholars. The flow of information is apparent temporally, spatially, and through the network topology of the Republic itself. Now as any good explorer knows, no map is any substitute for a thorough knowledge of the land itself; instead, it is to be used for finding unexplored areas and for synthesizing information at a large scale. For contextualizing.
If you’ll allow me a brief diversion, I’d like to talk about tools for making these sorts of maps, now that we’re on the subject of letters. Elijah’s post on visualizing network distance included a plugin for Gephi to emphasize network distance. Gephi’s a great tool for making really pretty network visualizations, and it also comes with a small but potent handful of network analysis algorithms.
I’m on the development team of another program, the Sci² Tool, which shares a lot of Gephi’s functionality, although it has a much wider scope and includes algorithms for textual, geographic, and statistical analysis, as well as a somewhat broader range of network analysis algorithms.
This is by no means a suggestion to use Sci² over Gephi; they both have their strengths and weaknesses. Gephi is dead simple to use, produces the most beautiful graphs on the market, and is all-around fantastic software. They both excel in different areas, and by using them (and other tools!) together, it is possible to create maps combining geographic and network features without ever having to resort to programming.
The above image was generated by combining the Sci² Tool with Gephi. It is the correspondence network of Hugo Grotius, a dataset I worked on while at Huygens ING in The Hague. They are a great group, and another team doing fantastic Republic of Letters research, and they provided this letters dataset. We just developed this particular functionality in Sci² yesterday, so it will take a bit of time before we work out the bugs and release it publicly, however as soon as it is released I’ll be sure to post a full tutorial on how to make maps like the one above.
This ends the public service announcement.
These maps are not without their critics. Especially prevalent were questions along the lines of “But how is this showing me anything I didn’t already know?” or “All of this is just an artefact of population densities and standard trade routes – what are these maps telling us about the Republic of Letters?” These are legitimate critiques, however as mentioned before, these maps are still useful for at-a-glance synthesis of large scales of information, or learning something new about areas one is not yet an expert in. Another problem has been that the lines on the map don’t represent actual travel routes; those sorts of problems are beginning to be addressed by the type of work Elijah Meeks and other GIS researchers are doing.
To tackle the suggestion that these are merely representing population data, I would like to propose what I believe to be a novel idea. I haven’t published on this yet, and I’m not trying to claim scholarly territory here, but I would ask that if this idea inspires research of your own, please cite this blog post or my publication on the subject, whenever it comes out.
We have a lot of data. Of course it doesn’t feel like we have enough, and it never will, but we have a lot of data. We can use what we have, for example by collecting all the correspondences from early modern Europe and placing them on a map like this one. The more data we have, the smaller the time slices in our maps can be. We create a base map that is a combination of geographic properties, statistical location properties, and network properties.
Start with a map of the world. To account for population or related correlations, do something similar to what Elijah did in this post, encoding population information (or average number of publications per city, or whatever else you’d like to account for) into the map. On top of that, place the biggest network of whatever it is that you’re looking at that you can find. Scholarly communication, citations, whatever. It’s your big Map of YourFavoriteThingHere. All of these together are your base map.
Atop that, place whatever or whomever you are studying. The correspondence of Grotius can be put on this map, much as the NIH was overlaid atop the Map of Science, and areas would light up and become larger if they are surprising against the base map. Are there more letters between Paris and The Hague in the Grotius dataset than one would expect if the dataset were just randomly plucked from the whole Republic of Letters? If so, make that line brighter and thicker.
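To make the idea concrete, here is a hypothetical sketch of that comparison in Python; every number in it is invented, and the normalization is only one of many possible choices.

```python
# Letter counts between city pairs: the base map (whole Republic of Letters)
# versus the dataset under study (Grotius). All numbers are made up.
base_counts = {("Paris", "The Hague"): 1200, ("Paris", "London"): 4000}
study_counts = {("Paris", "The Hague"): 45, ("Paris", "London"): 20}

base_total = sum(base_counts.values())
study_total = sum(study_counts.values())

for pair, observed in study_counts.items():
    # What a random sample of the base map, the same size as the study, would give.
    expected = study_total * base_counts[pair] / base_total
    surprise = observed / expected
    print(pair, round(surprise, 2))  # > 1: draw the line brighter and thicker; < 1: dimmer
```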
By combining geography, point statistics, and networks, we can create base maps against which we can contextualize whatever we happen to be studying. This is just one possible combination; base maps can be created from any of a myriad of sources of data. The important thing is that we, as humanists, ought to be able to contextualize our data in the same way that we always have. Now that we’re working with a lot more of it, we’re going to need help in those contextualizations. Base maps are one solution.
It’s worth pointing out one major problem with base maps: bias. Until recently, those Maps of Science making their way around the blogosphere represented the humanities as a small island off the coast of social sciences, if they showed them at all. This is because the primary publication venues of the arts and humanities were not represented in the datasets used to create these science maps. We must watch out for similar biases when constructing our own base maps, however the problem is significantly more difficult for historical datasets because the underrepresented are too dead to speak up. For a brief discussion of historical biases, you can read my UCLA presentation here.
putting every tool imaginable in one box and using them all at once ↩
Full disclosure: she’s my advisor. She’s also awesome. Hi Katy! ↩
According to Google Scholar, David Blei’s first topic modeling paper has received 3,540 citations since 2003. Everybody’s talking about topic models. Seriously, I’m afraid of visiting my parents this Hanukkah and hearing them ask “Scott… what’s this topic modeling I keep hearing all about?” They’re powerful, widely applicable, easy to use, and difficult to understand — a dangerous combination.
Since shortly after Blei’s first publication, researchers have been looking into the interplay between networks and topic models. This post will be about that interplay, looking at how they’ve been combined, what sorts of research those combinations can drive, and a few pitfalls to watch out for. I’ll bracket the big elephant in the room until a later discussion: whether these sorts of models capture the semantic meaning for which they’re often used. This post also attempts to introduce topic modeling to those not yet aware of its potential.
A brief history of topic modeling
In my recent post on IU’s awesome alchemy project, I briefly mentioned Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) during the discussion of topic models. They’re intimately related, though LSA has been around for quite a bit longer. Without getting into too much technical detail, we should start with a brief history of LSA/LDA.
The story starts, more or less, with a tf-idf matrix. Basically, tf-idf ranks words based on how important they are to a document within a larger corpus. Let’s say we want a list of the most important words for each article in an encyclopedia.
Our first pass is obvious. For each article, just attach a list of words sorted by how frequently they’re used. The problem with this is immediately obvious to anyone who has looked at word frequencies; the top words in the entry on the History of Computing would be “the,” “and,” “is,” and so forth, rather than “turing,” “computer,” “machines,” etc. The problem is solved by tf-idf, which scores the words based on how special they are to a particular document within the larger corpus. Turing is rarely used elsewhere, but used exceptionally frequently in our computer history article, so it bubbles up to the top.
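For those who want to see it in code, here’s a minimal sketch using scikit-learn (my choice of library, not one mentioned above); the three tiny documents stand in for encyclopedia articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "turing machines and the history of computing",
    "shakespeare wrote plays and sonnets",
    "the history of the printing press",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # documents x words, mostly zeroes (sparse)
words = vectorizer.get_feature_names_out()  # the vocabulary, one column per word
```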
LSA and pLSA
LSA utilizes these tf-idf scores 1 within a larger term-document matrix. Every word in the corpus is a different row in the matrix, each document has its own column, and the tf-idf score lies at the intersection of every document and word. Our computing history document will probably have a lot of zeroes next to words like “cow,” “shakespeare,” and “saucer,” and high marks next to words like “computation,” “artificial,” and “digital.” This is called a sparse matrix because it’s mostly filled with zeroes; most documents use very few words related to the entire corpus.
With this matrix, LSA uses singular value decomposition to figure out how each word is related to every other word. Basically, the more often words are used together within a document, the more related they are to one another. 2 It’s worth noting that a “document” is defined somewhat flexibly. For example, we can call every paragraph in a book its own “document,” and run LSA over the individual paragraphs.
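Continuing the scikit-learn sketch above, the SVD step might look like the following; treating each word row as a vector and comparing those vectors is the relatedness described here (and in the footnote).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Transpose so that words are rows (as in the term-document matrix described above),
# reduce to a couple of dimensions, then compare every word vector to every other.
word_vectors = TruncatedSVD(n_components=2).fit_transform(tfidf.T)
word_relatedness = cosine_similarity(word_vectors)  # word-to-word relatedness scores
```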
The method was significantly improved by Puzicha and Hofmann (1999), who did away with the linear algebra approach of LSA in favor of a more statistically sound probabilistic model, called probabilistic latent semantic analysis (pLSA). Now is the part of the blog post where I start getting hand-wavy, because explaining the math is more trouble than I care to take on in this introduction.
Essentially, pLSA imagines an additional layer between words and documents: topics. What if every document isn’t just a set of words, but a set of topics? In this model, our encyclopedia article about computing history might be drawn from several topics. It primarily draws from the big platonic computing topic in the sky, but it also draws from the topics of history, cryptography, lambda calculus, and all sorts of other topics to a greater or lesser degree.
Now, these topics don’t actually exist anywhere. Nobody sat down with the encyclopedia, read every entry, and decided to come up with the 200 topics from which every article draws. pLSA infers topics based on what will hereafter be referred to as black magic. Using the dark arts, pLSA “discovers” a bunch of topics, attaches them to a list of words, and classifies the documents based on those topics.
Blei et al. (2003) vastly improved upon this idea by turning it into a generative model of documents, calling the model Latent Dirichlet allocation (LDA). By this time, as well, some sounder assumptions were being made about the distribution of words and document length — but we won’t get into that. What’s important here is the generative model.
Imagine you wanted to write a new encyclopedia entry, let’s say about digital humanities. Well, we now know there are three elements that make up that process, right? Words, topics, and documents. Using these elements, how would you go about writing this new article on digital humanities?
First off, let’s figure out what topics our article will consist of. It probably draws heavily from topics about history, digitization, text analysis, and so forth. It also probably draws more weakly from a slew of other topics, concerning interdisciplinarity, the academy, and all sorts of other subjects. Let’s go a bit further and assign weights to these topics; 22% of the document will be about digitization, 19% about history, 5% about the academy, and so on. Okay, the first step is done!
Now it’s time to pull out the topics and start writing. It’s an easy process; each topic is a bag filled with words. Lots of words. All sorts of words. Let’s look in the “digitization” topic bag. It includes words like “israel” and “cheese” and “favoritism,” but they only appear once or twice, and mostly by accident. More importantly, the bag also contains 157 appearances of the word “TEI,” 210 of “OCR,” and 73 of “scanner.”
So here you are, you’ve dragged out your digitization bag and your history bag and your academy bag and all sorts of other bags as well. You start writing the digital humanities article by reaching into the digitization bag (remember, you’re going to reach into that bag for 22% of your words), and you pull out “OCR.” You put it on the page. You then reach for the academy bag and reach for a word in there (it happens to be “teaching,”) and you throw that on the page as well. Keep doing that. By the end, you’ve got a document that’s all about the digital humanities. It’s beautiful. Send it in for publication.
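The whole generative story fits in a few lines of toy Python; the topics, weights, and word bags below are invented, and a real model’s bags would be vastly larger.

```python
import random

topics = {
    "digitization": ["TEI", "OCR", "scanner", "archive"],
    "history": ["century", "period", "sources"],
    "academy": ["teaching", "tenure", "department"],
}
topic_weights = [0.55, 0.35, 0.10]  # toy proportions for our pretend article

def generate(n_words):
    document = []
    for _ in range(n_words):
        topic = random.choices(list(topics), weights=topic_weights)[0]  # pick a bag
        document.append(random.choice(topics[topic]))                   # pull out a word
    return " ".join(document)

print(generate(12))  # nonsense, but nonsense drawn from the right bags
```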
Alright, what now?
So why is the generative nature of the model so important? One of the key reasons is the ability to work backwards. If I can generate an (admittedly nonsensical) document using this model, I can also reverse the process and infer, given any new document and a topic model I’ve already generated, which topics that new document draws from.
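Here is what that working-backwards step might look like, sketched with scikit-learn’s LDA implementation rather than MALLET or Blei’s original code; the four-document corpus is only there to make the example run.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus; a real run would use hundreds or thousands of documents.
corpus = [
    "turing machines and the history of computing",
    "ocr scanners and the digitization of archives",
    "teaching and tenure in the modern academy",
    "digitization projects and text encoding in the archive",
]
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(doc_term)

# Working backwards: infer topic proportions for a document the model has never seen.
new_doc = vectorizer.transform(["a new article about digitization in the academy"])
print(lda.transform(new_doc))  # one row of topic proportions, summing to 1
```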
Another factor contributing to the success of LDA is the ability to extend the model. In this case, we assume there are only documents, topics, and words, but we could also make a model that assumes authors who like particular topics, or assumes that certain documents are influenced by previous documents, or that topics change over time. The possibilities are endless, as evidenced by the absurd number of topic modeling variations that have appeared in the past decade. David Mimno has compiled a wonderful bibliography of many such models.
While the generative model introduced by Blei might seem simplistic, it has been shown to be extremely powerful. When a newcomer sees the results of LDA for the first time, they are immediately taken by how intuitive they seem. People sometimes ask me “but didn’t it take forever to sit down and make all the topics?” thinking that some of the magic had to be done by hand. It wasn’t. Topic modeling yields intuitive results, generating what really feels like topics as we know them 3, with virtually no effort on the human side. Perhaps it is the intuitive utility that appeals so much to humanists.
Topic Modeling and Networks
Topic models can interact with networks in multiple ways. While a lot of the recent interest in digital humanities has surrounded using networks to visualize how documents or topics relate to one another, the interfacing of networks and topic modeling initially worked in the other direction. Instead of inferring networks from topic models, many early (and recent) papers attempt to infer topic models from networks.
Topic Models from Networks
The first research I’m aware of in this niche was from McCallum et al. (2005). Their model is itself an extension of an earlier LDA-based model called the Author-Topic Model (Steyvers et al., 2004), which assumes topics are formed based on the mixtures of authors writing a paper. McCallum et al. extended that model for directed messages in their Author-Recipient-Topic (ART) Model. In ART, it is assumed that topics of letters, e-mails or direct messages between people can be inferred from knowledge of both the author and the recipient. Thus, ART takes into account the social structure of a communication network in order to generate topics. In a later paper (McCallum et al., 2007), they extend this model to one that infers the roles of authors within the social network.
Dietz et al. (2007) created a model that looks at citation networks, where documents are generated by topical innovation and topical inheritance via citations. Nallapati et al. (2008) similarly created a model that finds topical similarity in citing and cited documents, with the added ability to predict citations that are not present. Blei himself joined the fray in 2009, creating the Relational Topic Model (RTM) with Jonathan Chang, which itself could summarize a network of documents, predict links between them, and predict words within them. Wang et al. (2011) created a model that allows for “the joint analysis of text and links between [people] in a time-evolving social network.” Their model is able to handle situations where links exist even when there is no similarity between the associated texts.
Networks from Topic Models
Some models have been made that infer networks from non-networked text. Broniatowski and Magee (2010 & 2011) extended the Author-Topic Model, building a model that would infer social networks from meeting transcripts. They later added temporal information, which allowed them to infer status hierarchies and individual influence within those social networks.
Many times, however, rather than creating new models, researchers create networks out of topic models that have already been run over a set of data. There are a lot of benefits to this approach, as exemplified by the Newton’s Chymistry project highlighted earlier. Using networks, we can see how documents relate to one another, how they relate to topics, how topics are related to each other, and how all of those are related to words.
Elijah Meeks created a wonderful example combining topic models with networks in Comprehending the Digital Humanities. Using fifty texts that discuss humanities computing, Elijah created a topic model of those documents and used networks to show how documents, topics, and words interacted with one another within the context of the digital humanities.
Jeff Drouin has also created networks of topic models in Proust, as reported by Elijah.
Peter Leonard recently directed me to TopicNets, a project that combines topic modeling and network analysis in order to create an intuitive and informative navigation interface for documents and topics. This is a great example of an interface that turns topic modeling into a useful scholarly tool, even for those who know little-to-nothing about networks or topic models.
If you want to do something like this yourself, Shawn Graham recently posted a great tutorial on how to create networks using MALLET and Gephi quickly and easily. Prepare your corpus of text, get topics with MALLET, prune the CSV, make a network, visualize it! Easy as pie.
Networks can be a great way to represent topic models. Beyond simple uses of navigation and relatedness as were just displayed, combining the two will put the whole battalion of network analysis tools at the researcher’s disposal. We can use them to find communities of similar documents, pinpoint those documents that were most influential to the rest, or perform any of a number of other workflows designed for network analysis.
As with anything, however, there are a few setbacks. Topic models are rich with data. Every document is related to every other document, if some only barely. Similarly, every topic is related to every other topic. By deciding to represent document similarity over a network, you must make the decision of precisely how similar you want a set of documents to be if they are to be linked. Having a network with every document connected to every other document is scarcely useful, so generally we’ll make our decision such that each document is linked to only a handful of others. This allows for easier visualization and analysis, but it also destroys much of the rich data that went into the topic model to begin with. This information can be more fully preserved using other techniques, such as multidimensional scaling.
A somewhat more theoretical complication means these network representations are useful as tools for navigation, discovery, and exploration, but not necessarily as evidentiary support. Creating a network of a topic model of a set of documents piles on abstractions. Each of these systems comes with very different assumptions, and it is unclear what complications arise when combining these methods ad hoc.
Although there may be issues with the process, the combination of topic models and networks is sure to yield much fruitful research in the digital humanities. There are some fantastic tutorials out there for getting started with topic modeling in the humanities, such as Shawn Graham’s post on Getting Started with MALLET and Topic Modeling, as well as on combining them with networks, such as this post from the same blog. Shawn is right to point out MALLET, a great tool for starting out, but you can also find the code used for various models on many of the model-makers’ academic websites. One code package that stands out is Chang’s implementation of LDA and related models in R.
Ted Underwood rightly points out in the comments that other scoring systems are often used in lieu of tf-idf, most frequently log entropy. ↩
Yes yes, this is a simplification of actual LSA, but it’s pretty much how it works. SVD reduces the size of the matrix to filter out noise, and then each word row is treated as a vector shooting off in some direction. The vector of each word is compared to every other word, so that every pair of words has a relatedness score between them. Ted Underwood has a great blog post about why humanists should avoid the SVD step. ↩
They’re not, of course. We’ll worry about that later. ↩