Early modern history! Science! Letters! Data! Four of my favoritest things have been combined in this brand new beta release of Early Modern Letters Online from Oxford University.
EMLO (what an adorable acronym, I kind of want to tickle it) is Oxford’s answer to a metadata database (metadatabase?) of, you guessed it, early modern letters. This is pretty much a gold standard metadata project. It’s still in beta, so there are some interface kinks and desirable features not-yet-implemented, but it has all the right ingredients for a great project:
Information is free and open; I’m even told it will be downloadable at some point.
The interface is fast, easy, and includes faceted browsing.
It has a fantastic interface for adding your own data.
It actually includes citation guidelines, thank you so much.
It offers visualizations for at-a-glance understanding of the data.
It links to full transcripts, abstracts, and hard copies where available.
Lots of other fantastic things.
Sorry if I go on about how fantastic this catalog is – like I said, I love letters so much. The index itself includes roughly 12,000 people, 4,000 locations, 60,000 letters, 9,000 images, and 26,000 additional comments. It is without a doubt the largest public letters database currently available. Between the data being compiled by this group and that of the CKCC in the Netherlands, the Electronic Enlightenment Project at Oxford, Stanford’s Mapping the Republic of Letters project, and R.A. Hatch’s research collection, there will soon be hundreds of thousands of letters which can be tracked, read, and analyzed with absolute ease. The mind boggles.
Bodleian Card Catalogue Summaries
Without a doubt, the coolest and most unique feature this project brings to the table is the digitization of the Bodleian Card Catalogue, a fifty-two-drawer index-card cabinet filled with summaries of nearly 50,000 letters held in the library, all compiled by the Bodleian staff many years ago. In lieu of full transcriptions, digitizations, or translations, these summary cards are an amazing resource by themselves. Many of the letters in the EMLO collection include these summaries as full-text abstracts.
The collection also includes the correspondences of John Aubrey (1,037 letters), Comenius (526), Hartlib (4,589, many including transcripts), Edward Lhwyd (2,139, many including transcripts), Martin Lister (1,141), John Selden (355), and John Wallis (2,002). The advanced search allows you to look for only letters with full transcripts or abstracts available. As someone who’s worked with a lot of letters catalogs of varying quality, it is refreshing to see this one being upfront about unknown/uncertain values. It would, however, be nice if they included the editors’ best guesses of dates and locations, or perhaps inferred locations and dates from the other information available. (For example, if birth and death dates are known, it is likely a letter was not written by someone before or after those dates.)
In the interest of full disclosure, I should note that, much like with the CKCC letters interface, I spent some time working with the Cultures of Knowledge team on visualizations for EMLO. Their group was absolutely fantastic to work with, with impressive resources and outstanding expertise. The result of the collaboration was the integration of visualizations into metadata summaries, the first of which is a simple bar chart showing the number of letters written, received, and mentioned in, per year, for any given individual in the catalog. Besides being useful for getting an at-a-glance idea of the data, these charts actually proved really useful for data cleaning.
Because I can’t do anything with letters without looking at them as a network, I decided to put together some visualizations using Sci2 and Gephi. In both cases, the Sci2 tool was used for data preparation and analysis, and the final networks were visualized in GUESS and Gephi, respectively. The first graph shows the network in detail, with edges, and names visible for the most “central” correspondents. The second visualization omits edges, with each correspondent clustered according to their place in the overall network, and the most prominent figures in each cluster visible.
The graphs show us that this is not a fully connected network. There are many islands of just one, two, or a small handful of letters. These can be indicative of a prestige bias in the data: the collection contains many letters from the most prestigious correspondents, and increasingly fewer as the prestige of the correspondent decreases. Put another way, there are many letters from a few, and few letters from many – a characteristic shared with power-law and other “long tail” distributions. The jumbled community structure at the center of the second graph is especially interesting, and it would be worth comparing these communities against institutions and informal societies of the time. Knowledge of large-scale patterns in a network can help determine what sorts of analyses are best for the data at hand. More on this in particular will be coming in the next few weeks.
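For those who want to poke at this sort of structure themselves, the checks only take a few lines of code. Here’s a minimal Python sketch on a made-up toy edge list – some of the names are real correspondents from the catalog, but these particular links and counts are invented for illustration, not EMLO’s actual data:

```python
# A toy correspondence network: who wrote to whom. Names are real early
# modern figures, but these particular links are invented for illustration.
from collections import Counter, defaultdict, deque

edges = [
    ("Hartlib", "Comenius"), ("Hartlib", "Boyle"), ("Hartlib", "Worthington"),
    ("Hartlib", "Dury"), ("Lhwyd", "Lister"), ("Lhwyd", "Aubrey"),
    ("Minor1", "Minor2"),  # an "island": two correspondents, one letter
]

adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

def component_sizes(adj):
    """Breadth-first search to measure each disconnected 'island'."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        queue, size = deque([start]), 0
        seen.add(start)
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        sizes.append(size)
    return sorted(sizes, reverse=True)

print(component_sizes(adjacency))  # [5, 3, 2] -- not a fully connected network

# "Many letters from a few, few letters from many": the degree distribution.
degree_distribution = Counter(len(neighbors) for neighbors in adjacency.values())
print(sorted(degree_distribution.items()))  # [(1, 8), (2, 1), (4, 1)]
```

With real data you’d swap in the full edge list (and likely reach for a proper library), but the shape of the check is the same: count the islands, then see whether a few hubs hold most of the letters.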
It’s also worth pointing out these visualizations as another tool for data-checking. You may notice, on the bottom left-hand corner of the first network visualization, two separate Edward Lhwyds with virtually the same networks of correspondence. This meant there were two distinct entities in their database referring to the same individual – a problem which has since been corrected.
Notice that the EMLO site makes it very clear that they are open to contributions. There are many letters datasets out there, some digitized, some still languishing idly on dead trees, and until they are all combined, we will be limited in the scope of the research possible. We can always use more. If you are in any way responsible for an early-modern letters collection, meta-data or full-text, please help by opening that collection up and making it integrable with the other sets out there. It will do the scholarly world a great service, and get us that much closer to understanding the processes underlying scholarly communication in general. The folks at Oxford are providing a great example, and I look forward to watching this project as it grows and improves.
So apparently yesterday was a big day for hypothesis testing and discovery. Stanley Fish’s third post on Digital Humanities also brought up the issue of fishing for correlations, although his post was… slightly more polemical. Rather than going over it on this blog, I’ll let Ted Underwood describe it. Anybody who read my post on Avoiding Traps should also read Underwood’s post; it highlights the role of discovery in the humanities as a continuous process of appraisal and re-appraisal, on both the quantitative and the qualitative side.
…the significance of any single test is reduced when it’s run as part of a large battery.
That’s a valid observation, but it’s also a problem that people who do data mining are quite self-conscious about. It’s why I never stop linking to this xkcd comic about “significance.” And it’s why Matt Wilkens (targeted by Fish as an emblem of this interpretive sin) goes through a deliberately iterative process of first framing hypotheses about nineteenth-century geographical imagination and then testing them more stringently. (For instance, after noticing that coastal states initially seem more prominent in American fiction than the Midwest, he tests whether this remains true after you compensate for differences in population size, and then proposes a hypothesis that he suggests will need to be confirmed by additional “test cases.”)
It’s important to keep in mind that Reichenbach’s old distinction between discovery and justification is not so clear-cut as it was originally conceived. How we generate our hypotheses, and how we support them to ourselves and the world at large, is part of the ongoing process of research. In my last post, I suggested people keep clear ideas of what they plan on testing before they begin testing; let me qualify that slightly. One of the amazing benefits of Big Data has been the ability to spot trends we were not looking for; an unexpected trend in the data can lead us to a new hypothesis, one which might be fruitful and interesting. The task, then, is to be clever enough to devise further tests to confirm the hypothesis in a way that isn’t circular, relying on the initial evidence that led you toward it.
… I like books with pictures. When I started this blog, I promised myself I’d have a picture in every post. I can’t think of one that’s relevant, so here’s an angry cupcake:
We have the advantage of arriving late to the game.
In the cut-throat world of high-tech venture capitalism, the first company with a good idea often finds itself at the mercy of latecomers. The latecomer’s product might be better-thought-out, advertised to a more appropriate market, or simply prettier, but in each case that improvement comes through hindsight. Trailblazers might get there first, but their going is slowest, and their way the most dangerous.
Digital humanities finds itself teetering on the methodological edge of many existing disciplines, boldly going where quite a few have gone before. When I’ve blogged before about the dangers of methodology appropriation, it was in the spirit of guarding against our misunderstanding of foundational aspects of various methodologies. This post is instead about avoiding the monsters already encountered (and occasionally vanquished) by other disciplines.
Everything Old Is New Again
A collective guffaw probably accompanied my defining digital humanities as a “new” discipline. Digital humanities itself has a rich history dating back to big iron computers in the 1950s, and the humanities in general, well… they’re old. Probably older than my grandparents.
The important point, however, is that we find ourselves in a state of re-definition. While this is not the first time, and it certainly will not be the last, this state is exceptionally useful in planning against future problems. Our blogosphere cup overfloweth with definitions of and guides to the digital humanities, many of our journals are still in their infancy, and our curricula are over-ready for massive reconstruction. Generally (from what I’ve seen), everyone involved in these processes is really excited and open to new ideas, which should ease the process of avoiding monsters.
Most of the below examples, and possible solutions, are drawn from the same issues of bias I’ve previously discussed. Also, the majority are meta-difficulties. While some of the listed dangers are avoidable when writing papers and doing research, most are discipline-level and systematic. That is, despite any researcher’s best efforts, the aggregate knowledge we gain while reading the newest exciting articles might fundamentally mislead us. While these dangers have never been wholly absent from the humanities, our recent love of big data profoundly increases their effect sizes.
An architect from Florida might not be great at designing earthquake-proof housing, and while earthquakes are still a distant danger, this shouldn’t really affect how he does his job at home. If the same architect moves to California, odds are he’ll need to learn some extra precautions. The same is true for a digital humanist attempting to make inferences from lots of data, or from a bunch of studies which all utilize lots of data. Traditionally, when looking at the concrete and particular, evidence for something is necessary and (with enough evidence) sufficient to believe in that thing. In aggregate, evidence for is necessary but not sufficient to identify a trend, because that trend may be dwarfed by or correlated to some other data that are not available.
The below lessons are not all applicable to DH as it exists today, and of course we need to adapt them to our own research (their meaning changes in light of our different material of study); still, they’re worth pointing out and, perhaps, guarding against. Many traditional sciences still struggle with these issues due to institutional inertia. Their journals have acted in such a way for so long, so why change it now? Their tenure has acted in such a way for so long, so why change it now? We’re already restructuring, and we have a great many rules that are still in flux, so we can change it now.
Anyway, I’ve been dancing around the examples for way too long, so here’s the meat:
Sampling and Selection Bias
The problem here is actually two-fold, both for the author of a study, and for the reader of several studies. We’ll start with the author-centric issues.
Sampling and Selection Bias in Experimental Design
People talk about sampling and selection biases in different ways, but for the purpose of this post we’ll use Wikipedia’s definitions:
Selection bias is a statistical bias in which there is an error in choosing the individuals or groups to take part in a scientific study.
A distinction, albeit not universally accepted, of sampling bias [from selection bias] is that it undermines the external validity of a test (the ability of its results to be generalized to the rest of the population), while selection bias mainly addresses internal validity for differences or similarities found in the sample at hand. In this sense, errors occurring in the process of gathering the sample or cohort cause sampling bias, while errors in any process thereafter cause selection bias.
In this case, we’ll say a study exhibits a sampling error if the conclusions drawn from the data at hand, while internally valid, do not actually hold true for the world around it. Let’s say I’m analyzing the prevalence of certain grievances in the cahiers de doléances from the French Revolution. One study showed that, of all the lists written, those from urban areas were significantly more likely to survive to today. Any content analysis I perform on those lists will be biased toward the grievances of people from urban areas, because my sample is not representative. Conclusions I draw about grievances in general will be inaccurate, unless I explicitly take into account which sorts of documents I’m missing.
Selection bias can be insidious, and many varieties can be harder to spot than sampling bias. I’ll discuss two related phenomena of selection bias which lead to false positives, those pesky statistical effects which leave us believing we’ve found something exciting when all we really have is hot air.
The first issue, probably the most relevant to big-data digital humanists, is data dredging. When you have a lot of data (and increasingly more of us have just that), it’s very tempting to just try to find correlations between absolutely everything. In fact, as exploratory humanists, that’s what we often do: get a lot of stuff, try to understand it by looking at it from every angle, and then write anything interesting we notice. This is a problem. The more data you have, the more statistically likely it is that it will contain false-positive correlations.
Google has lots of data, let’s use them as an example! We can look at search frequencies over time to try to learn something about the world. For example, people search for “Christmas” around and leading up to December, but that search term declines sharply once January hits. Comparing that search with searches for “Santa”, we see the two results are pretty well correlated, with both spiking around the same time. From that, we might infer that the two are somehow related, and would do some further studies.
Unfortunately, Google has a lot of data, and a lot of searches, and if we just looked for every search term that correlated well with any other over time, well, we’d come up with a lot of nonsense. Apparently searches for “losing weight” and “2 bedroom” are 93.6% correlated over time. Perhaps there is a good reason, perhaps there is not, but this is a good cautionary tale that the more data you have, the more seemingly nonsensical correlations will appear. It is then very easy to cherry pick only the ones that seem interesting to you, or which support your hypothesis, and to publish those.
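If you want to watch dredging manufacture nonsense for yourself, a toy simulation suffices. None of these “search terms” exist; every series below is pure random noise, yet testing all the pairs still turns up plenty of “strong” correlations:

```python
# Every "search term" here is pure random noise -- no real relationships at
# all -- yet dredging every pair still surfaces "strong" correlations.
import random
from itertools import combinations

random.seed(42)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# 50 fake search terms, each a random "weekly" series 20 points long.
series = {f"term_{i}": [random.random() for _ in range(20)] for i in range(50)}

# Dredge: test all 50 * 49 / 2 = 1,225 pairs, keep anything that looks strong.
strong = [
    (a, b, r)
    for a, b in combinations(series, 2)
    if abs(r := pearson(series[a], series[b])) > 0.5
]
print(f"{len(strong)} 'strong' correlations found in pure noise")
```

Run it and a pile of meaningless series come out looking impressively correlated – the “2 bedroom”/“losing weight” effect, manufactured on demand.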
The other type of selection bias leading to false positives I’d like to discuss is cherry picking. This is selective use of evidence, cutting data away until the desired hypothesis appears to be the correct one. The humanities, not really known for their hypothesis testing, are not quite as likely to be bothered by this issue, but it’s still something to watch out for. This is also related to confirmation bias, the tendency for people to only notice evidence for that which they already believe.
Much like data dredging, cherry picking is often done without the knowledge or intent of the researcher. It arises out of what Simmons, Nelson, and Simonsohn (2011) call researcher degrees of freedom. Researchers often make decisions on the fly:
Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?
The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding [that is significant] is [itself necessarily significant]. This exploratory behavior is not the by-product of malicious intent, but rather the result of two factors: (a) ambiguity in how best to make these decisions and (b) the researcher’s desire to find a statistically significant result.
When faced with decisions of how to proceed with analysis, we will almost invariably (and inadvertently) favor the decision that results in our hypothesis seeming more plausible.
If I go into my favorite dataset (The Republic of Letters!) trying to show that Scholar A was very similar to Scholar B in many ways, odds are I could do that no matter who the scholars were, so long as I had enough data. If you take a cookie-cutter to your data, don’t be surprised when cookie-shaped bits come out the other side.
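Here’s a deliberately silly sketch of that cookie-cutter at work. Two “scholars” who are nothing but random numbers will still look strikingly alike on whichever measure you keep after trying several – all the names, measures, and numbers here are invented:

```python
# Two "scholars" who are nothing but random numbers (hypothetical letters
# sent per year, over 25 years). Try several similarity measures, then
# report only the one on which they look most alike.
import random

random.seed(1)
scholar_a = [random.randint(0, 30) for _ in range(25)]
scholar_b = [random.randint(0, 30) for _ in range(25)]

measures = {
    "total output": lambda a, b: abs(sum(a) - sum(b)),
    "peak-year volume": lambda a, b: abs(max(a) - max(b)),
    "quietest year": lambda a, b: abs(min(a) - min(b)),
    "median year": lambda a, b: abs(sorted(a)[12] - sorted(b)[12]),
}

# Cherry-pick: keep whichever measure makes them look most similar.
best = min(measures, key=lambda name: measures[name](scholar_a, scholar_b))
print(f"Scholars A and B are strikingly similar in {best}!")
```

With four measures and no pre-commitment, some “similarity” is nearly guaranteed; with a real dataset’s hundreds of possible measures, it is guaranteed.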
Sampling and Selection Bias in Meta-Analysis
There are copious examples of problems with meta-analysis. Meta-analysis is, essentially, a quantitative review of studies on a particular subject. For example, a medical meta-analysis could review data from hundreds of small studies testing the side-effects of a particular medicine, bringing them all together and drawing new or more certain conclusions via the combination of data. Sometimes these are done to gain a larger sample size, or to show how effects change across different samples, or to provide evidence that one non-conforming study was indeed a statistical anomaly.
A meta-analysis is the quantitative alternative to something every one of us in academia does frequently: read a lot of papers or books, find connections, draw inferences, explore new avenues, and publish novel conclusions. Because quantitative meta-analysis is so similar to what we do, we can use the problems it faces to learn more about the problems we face, which are otherwise more difficult to see. A criticism oft-lobbed at meta-analyses is that of garbage in – garbage out: the data used for the meta-analysis are not representative (or otherwise flawed), so the conclusions are flawed as well.
There are a number of reasons why the data going in might be garbage, some of which I’ll cover below. It’s worth pointing out that the issues above (cherry picking and data dredging) also play a role: if the majority of studies are biased toward larger effect sizes, then the overall perceived effect across papers will appear systematically larger. This is not only true of quantitative meta-analysis; when every day we read about trends and connections that may not be there, no matter how discerning we are, some of those connections will stick and our impressions of the world will be affected. Correlation might not imply anything.
Before we get into publication bias, I will write a short aside that I was really hoping to avoid, but which really needs to be discussed. I’ll dedicate a post to it eventually, when I feel like punishing myself, but for now, here’s my summary of
The Problems with P
Most of you have heard of p-values. A lucky few of you have never heard of them, and so do not need to be untrained and retrained. A majority of you probably hold a view similar to a high-ranking, well-published, and well-learned professor I met recently. “All I know about statistics,” he said, “is that p-value formula you need to show whether or not your hypothesis is correct. It needs to be under .05.” Many of you (more and more these days) are aware of the problems with that statement, and I thank you from the bottom of my heart.
Let’s talk about statistics.
The problems with p-values are innumerable (let me count the ways), and I will not get into most of them here. Essentially, though, a p-value is the likelihood that results like yours would appear by random chance alone. In many studies which rely on statistics, the process works like this: begin with a hypothesis, run an experiment, analyze the data, calculate the p-value. The researcher then publishes something along the lines of “my hypothesis is correct because p is under 0.05.”
Most people working with p-values know that it has something to do with the null hypothesis (that is, the default position; the position that there is no correlation between the measured phenomena). They work under the assumption that the p-value is the likelihood that the null hypothesis is true. That is, if the p-value is 0.75, it’s 75% likely that the null hypothesis is true, and there is no correlation between the variables being studied. Generally, the cut-off to get published is 0.05; you can only publish your results if it’s less than 5% likely that the null hypothesis is true, or more than 95% likely that your hypothesis is true. That means you’re pretty darn certain of your result.
Unfortunately, most of that isn’t actually how p-values work.
In a nutshell: assuming there is no correlation between two variables, what’s the likelihood that they’ll appear as correlated as you observed in your experiment by chance alone? If your p-value is .05, that means there’s a 5% chance that random noise alone would produce a correlation as strong as yours. That is, roughly one in every twenty tests of truly unrelated variables will turn up a “correlation” that doesn’t really exist.
To recap: p-values say nothing about your hypothesis. They say, assuming there is no real correlation, what’s the likelihood that your data show one anyway? Also, in the scholarly community, a result is considered “significant” if p is less than or equal to 0.05. Alright, I’m glad that’s out of the way, now we’re all on the same footing.
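A quick simulation shows what that 5% means in practice. Every “experiment” below flips a fair coin, so the null hypothesis is true in every single one by construction – and roughly one in twenty still clears the p ≤ 0.05 bar (slightly fewer here, because coin-flip counts are discrete):

```python
# 2,000 "experiments", each flipping a fair coin 100 times. The null
# hypothesis (the coin is fair) is true in every single one -- yet some
# experiments will look "significant" anyway.
import random
from math import comb

random.seed(0)

def two_sided_p(heads, flips):
    """Exact two-sided binomial p-value against a fair coin."""
    k = max(heads, flips - heads)  # distance from the expected 50/50 split
    tail = sum(comb(flips, i) for i in range(k, flips + 1)) / 2 ** flips
    return min(1.0, 2 * tail)

experiments = 2000
false_positives = sum(
    two_sided_p(sum(random.random() < 0.5 for _ in range(100)), 100) <= 0.05
    for _ in range(experiments)
)
print(false_positives / experiments)  # a few percent: pure noise, "significant" anyway
```

None of those coins was biased; the “significant” ones are exactly the false positives the threshold promises to let through.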
The positive results bias, the first of many interrelated publication biases, simply states that positive results are more likely to get published than negative or inconclusive ones. Authors and editors will be more likely to submit and accept work if the results are significant (p < .05). The file drawer problem is the opposite effect: negative results are more likely to be stuck in somebody’s file drawer, never to see the light of day. HARKing (Hypothesizing After the Results Are Known), much like cherry-picking above, is when, during the course of a study with many trials and analyses, only the “significant” ones are ever published.
Let’s begin with HARKing. Recall that a p-value is (basically) the likelihood that an effect occurred by chance alone. If one research project consisted of 100 different trials and analyses, and only 5 of them yielded significant results pointing toward the author’s hypothesis, those 5 analyses likely occurred by chance. They could still be published (often without the researcher even realizing they were cherry-picking, because obviously non-fruitful analyses might be stopped before they’re even finished). Thus, again, more positive results are published than perhaps there ought to be.
Let’s assume some people are perfect in every way, shape, and form. Every single one of their studies is performed with perfect statistical rigor, and all of their results are sound. Again, however, they only publish their positive results – the negative ones are kept in the file drawer. Again, more positive results are being published than being researched.
Who cares? So what that we’re only seeing the good stuff?
The problem is that, using common significance testing of p < 0.05, 5% of published positive results ought to have occurred by chance alone. However, since we cannot see the studies that went unpublished because their results were negative, the studies that yielded correlations where they shouldn’t have are given all the scholarly weight. One hundred small studies are done on the efficacy of some medicine for some disease; only five, by chance, find some correlation – they are published. Let’s be liberal, and say another three are published saying there was no correlation between treatment and cure. An outside observer will thus see the evidence stacked in favor of the (ineffectual) medication.
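That thought experiment is easy to run in code. In this sketch, 200 studies each measure a true effect of exactly zero; only those clearing a one-sided 5% “significance” bar get published, plus three honest negatives, and the published record shows a healthy effect anyway. The counts and thresholds are illustrative, not drawn from any real literature:

```python
# 200 studies of a treatment whose TRUE effect is exactly zero; each study
# observes only noise around that zero. Studies clearing a one-sided 5%
# "significance" bar (z > 1.645) get published, along with three honest
# negatives; the file drawer keeps the rest.
import random

random.seed(7)
observed = [random.gauss(0.0, 1.0) for _ in range(200)]  # true effect = 0

positives = [e for e in observed if e > 1.645]  # the lucky few
negatives = [e for e in observed if e <= 1.645]
published = positives + negatives[:3]

true_mean = sum(observed) / len(observed)
published_mean = sum(published) / len(published)
print(f"mean effect across all 200 studies: {true_mean:+.2f}")
print(f"mean effect across published studies: {published_mean:+.2f}")
```

The mean across everything that was actually measured hovers near zero; the mean across what made it into print does not.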
The Decline Effect
A recent much-discussed article by Jonah Lehrer, as well as countless studies by John Ioannidis and others, shows two things: (1) a large portion of published findings are false (some of the reasons are shown above), and (2) the effects of scientific findings seem to decline. A study is published showing a very noticeable effect of some medicine curing a disease, and further tests tend to show that very noticeable effect declining sharply. (2) is mostly caused by (1). Much ink (or blood) could be spilled discussing this topic, but this is not the place for it.
So there are a lot of biases in rigorous quantitative studies. Why should humanists care? We’re aware that people are not perfect, that research is contingent, and that we each bring our own subjective experiences to the table, shaping our publications and our outlooks – and none of those are necessarily bad things.
The issues arise when we start using statistics, or algorithms derived using statistics, and other methods used by our quantitative brethren. Make no mistake, our qualitative assessments are often subject to the same biases, but it’s easy to write reflexively on one’s own position when you are only one person, one data point. In the age of Big Data, with multiplying uncertainties for any bit of data we collect, it is far easier to lose track of small unknowns in the larger picture. We have the opportunity of learning from past mistakes so we can be free to make mistakes of our own.
Ioannidis’ most famous article is, undoubtedly, the polemic “Why Most Published Research Findings Are False.” With a statement like that, what hope is there? Ioannidis himself has some good suggestions, and there are many floating around out there; as with anything, the first step is becoming cognizant of the problems, and the next step is fixing them. Digital humanities may be able to avoid inheriting these problems entirely, if we’re careful.
We’re already a big step ahead of the game, actually, because of the nearly nonsensical volumes of tweets and blog posts on nascent research. In response to publication bias and the file drawer problem, many people suggest that authors submit their experiments to a registry before they begin their research. That way, it’s completely visible what experiments on a subject have been run that did not yield positive results, regardless of whether they eventually became published. Digital humanists are constantly throwing out ideas and preliminary results, which should help guard against misunderstandings through publication bias. We have to talk about all the effort we put into something, especially when nothing interesting comes out of it. The fact that some scholar felt there should be something interesting, and there wasn’t, is itself interesting.
At this point, “replication studies” mean very little in the humanities; however, if we begin heading down a road where replication studies become more feasible, our journals will need to be willing to accept them just as they accept novel research. Funding agencies should also be just as willing to fund old, non-risky continuation research as they are the new, exciting stuff.
Other institutional changes needed to guard against this sort of thing include open-access publication (so everyone draws inferences from the same base set of research), tenure boards that accept negative and exploratory research (again, not as large an issue for the humanities), and restructured curricula that teach quantitative methods and their pitfalls, especially statistics.
On the ground level, a good knowledge of statistics (especially Bayesian statistics, doing away with p-values entirely) will be essential as more data becomes available to us. When running analysis on data, to guard against coming up with results that appear by random chance, we have to design an experiment before running it, stick to the plan, and publish all results, not just ones that fit our hypotheses. The false-positive psychology paper I mentioned above actually has a lot of good suggestions to guard against this effect:
Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.
Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.
Authors must list all variables collected in a study.
Authors must report all experimental conditions, including failed manipulations.
If observations are eliminated, authors must also report what the statistical results are if those observations are included.
If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.
Reviewers should ensure that authors follow the requirements.
Reviewers should be more tolerant of imperfections in results.
Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.
This list of problems and solutions is neither exhaustive nor representative. That is, there are a lot of biases out there unlisted, and not all the ones listed are the most prevalent. Gender and power biases come to mind; however, they are well beyond anything I could intelligently argue, and there are issues of peer review and retraction rates that are an entirely different can of worms.
Also, the humanities are simply different. We don’t exactly test hypotheses, we’re not looking for ground truths, and our publication criteria are very different from those of the natural and social sciences. It seems clear that the issues listed above will have some mapping onto our own research going forward, but I make no claims at understanding exactly how or where. My hope in this blog post is to raise awareness of some of the more pressing concerns in quantitative studies that might have bearing on our own studies, so we can try to understand how they will be relevant to our own research, and how we might guard against them.
A bunch of my recent posts have mentioned networks. Elijah Meeks not-so-subtly hinted that it might be a good idea to discuss some of the basics of networks on this blog, and I’m happy to oblige. He already introduced network visualizations on his own blog, and did a fantastic job of it, so I’m going to try to stick to more of the conceptual issues here, gearing the discussion generally toward people with little-to-no background in networks or math, and specifically to digital humanists interested in applying network analysis to their own work. This will be part of an ongoing series, so if you have any requests, please feel free to mention them in the comments below (I’ve already been asked to discuss how social networks apply to fictional worlds, which is probably next on the list).
A network is a fantastic tool in the digital humanist’s toolbox – one of many – and it’s no exaggeration to say pretty much any data can be studied via network analysis. With enough stretching and molding, you too could have a network analysis problem! As with many other science-derived methodologies, it’s fairly easy to extend the metaphor of network analysis into any number of domains.
The danger here is two-fold.
When you’re given your first hammer, everything looks like a nail. Networks can be used on any project. Networks should be used on far fewer. Networks in the humanities are experiencing quite the awakening, and this is due in part to until-recently untapped resources. There is a lot of low-hanging fruit out there on the networks+humanities tree, and it ought to be plucked by those brave and willing enough to do so. However, that does not give us an excuse to apply networks to everything. This series will talk a little bit about when hammers are useful, and when you really should be reaching for a screwdriver.
Methodology appropriation is dangerous. Even when the people designing a methodology for some specific purpose get it right – and they rarely do – there are often scores of theoretical and philosophical caveats that get lost when the methodology gets translated. In the more frequent case, when those caveats are not known to begin with, “borrowing” the methodology becomes even more dangerous. Ted Underwood blogs a great example of why literary historians ought to skip a major step in Latent Semantic Analysis, because the purpose of the literary historian is so very different from that of the computer scientists who designed the algorithm. This series will attempt to point out some of the theoretical baggage and necessary assumptions of the various network methods it covers.
Nothing worth discovering has ever been found in safe waters. Or rather, everything worth discovering in safe waters has already been discovered, so it’s time to shove off into the dangerous waters of methodology appropriation, cognizant of the warnings but not crippled by them.
Anyone with a lot of time and a vicious interest in networks should stop reading right now, and instead pick up copies of Networks, Crowds, and Markets (Easley & Kleinberg, 2010) and Networks: An Introduction (Newman, 2010). The first is a non-mathy introduction to most of the concepts of network analysis, and the second is a more in-depth (and formula-laden) exploration of those concepts. They’re phenomenal, essential, and worth every penny.
Those of you with slightly less time, but somehow enough to read my rambling blog (there are apparently a few of you out there) – so good of you to join me. We’ll start with the really basic basics, but stay with me, because by part n of this series, we’ll be going over the really cool stuff only ninjas, Gandhi, and The Rolling Stones have worked on.
Generally, network studies are made under the assumption that neither the stuff nor the relationships are the whole story on their own. If you’re studying something with networks, odds are you’re doing so because you think the objects of your study are interdependent rather than independent. Representing information as a network implicitly suggests not only that connections matter, but that they are required to understand whatever’s going on.
Oh, I should mention that people often use the word “graph” when talking about networks. It’s basically the mathy term for a network, and its definition is a bit more formalized and concrete. Think dots connected with lines.
Because networks are studied by lots of different groups, there are lots of different words for pretty much the same concepts. I’ll explain some of them below.
Stuff (presumably) exists. Eggplants, true love, the Mary Celeste, tall people, and Terry Pratchett’s Thief of Time all fall into that category. Network analysis generally deals with one or a small handful of types of stuff, and then a multitude of examples of that type.
Say the type we’re dealing with is a book. While scholars might argue the exact lines of demarcation separating book from non-book, I think we can all agree that most of the stuff on my bookshelf is, in fact, books. They’re the stuff. There are different examples of books: a quotation dictionary, a Poe collection, and so forth.
I’ll call this assortment of stuff nodes. You’ll also hear them called vertices (mostly from the mathematicians and computer scientists), actors (from the sociologists), agents (from the modelers), or points (not really sure where this one comes from).
The type of stuff corresponds to the type of node. The individual examples are the nodes themselves. All of the nodes are books, and each book is a different node.
Nodes can have attributes. Each node, for example, may include the title, the number of pages, and the year of publication.
A list of nodes could look like this:
| Title | # of pages | Year of publication |
| --- | --- | --- |
| Graphs, Maps, and Trees | 119 | 2005 |
| How The Other Half Lives | 233 | 1890 |
| Modern Epic | 272 | 1995 |
| Mythology | 352 | 1942 |
| Macroanalysis | unknown | 2011 |
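For the programmatically inclined, a node list like the one above can be sketched in a few lines of plain Python. The attribute names here are my own, purely for illustration; no particular tool expects this format:

```python
# Each node is a dict of attributes, mirroring the table above.
# "pages" for Macroanalysis is None because it is listed as unknown.
nodes = [
    {"title": "Graphs, Maps, and Trees", "pages": 119, "year": 2005},
    {"title": "How The Other Half Lives", "pages": 233, "year": 1890},
    {"title": "Modern Epic", "pages": 272, "year": 1995},
    {"title": "Mythology", "pages": 352, "year": 1942},
    {"title": "Macroanalysis", "pages": None, "year": 2011},
]

# Attributes can then be queried like any other data:
# e.g., the books published before 1950.
older = [n["title"] for n in nodes if n["year"] < 1950]
```

The point is simply that a node list is data like any other; what makes it a network is the edges we add later.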
We can get a bit more complicated and add more node types to the network. Authors, for example. Now we’ve got a network with books and authors (but nothing linking them, yet!). Franco Moretti and Graphs, Maps, and Trees are both nodes, although they are of different varieties, and not yet connected. We would have a second list of nodes, part of the same network, that might look like this:
| Author | Birth | Death |
| --- | --- | --- |
| Franco Moretti | ? | n/a |
| Jacob A. Riis | 1849 | 1914 |
| Edith Hamilton | 1867 | 1963 |
| Matthew Jockers | ? | n/a |
A network with two types of nodes is called 2-mode, bimodal, or bipartite. We can add more, making it multimodal. Publishers, topics, you-name-it. We can even add seemingly unrelated node-types, like academic conferences, or colors of the rainbow. The list goes on. We would have a new list for each new variety of node.
Presumably we could continue adding nodes and node-types until we run out of stuff in the universe. This would be a bad idea, and not just because it would take more time, energy, and hard-drives than could ever possibly exist.
As it stands now, network science is ill-equipped to deal with multimodal networks. 2-mode networks are difficult enough to work with, but once you get to three or more varieties of nodes, most algorithms used in network analysis simply do not work. It’s not that they can’t work; it’s just that most algorithms were only created to deal with networks with one variety of node.
This is a trap I see many newcomers to network science falling into, especially in the Digital Humanities. They find themselves with a network dataset of, for example, authors and publishers. Each author is connected with one or several publishers (we’ll get into the connections themselves in the next section), and the up-and-coming network scientist loads the network into their favorite software and visualizes it. Woah! A network!
Then, because the software is easy to use, and has a lot of buttons with words that from a non-technical standpoint seem to make a lot of sense, they press those buttons to see what comes out. Then, they change the visual characteristics of the network based on the buttons they’ve pressed.
Let’s take a concrete example. Popular network software Gephi comes with a button that measures the centrality of nodes. Centrality is a pretty complicated concept that I’ll get into more detail later, but for now it’s enough to say that it does exactly what it sounds like; it finds how central, or important, each node is in a network. The newcomer to network analysis loads the author-publisher network into Gephi, finds the centrality of every node, and then makes the nodes bigger that have the highest centrality.
The issue here is that, although the network loads into Gephi perfectly fine, and although the centrality algorithm runs smoothly, the resulting numbers do not mean what they usually mean. Centrality, as it exists in Gephi, was fine-tuned to be used with single mode networks, whereas the author-publisher network is bimodal. Centrality measures have been made for bimodal networks, but those algorithms are not included with Gephi.
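To make the problem concrete, here’s a toy sketch in plain Python, with invented authors and publishers, showing how the usual single-mode normalization of degree centrality differs from one built for bimodal networks:

```python
from collections import Counter

# Toy author-publisher network; all names invented for illustration.
edges = [
    ("Author A", "Publisher X"),
    ("Author B", "Publisher X"),
    ("Author C", "Publisher X"),
    ("Author C", "Publisher Y"),
]
authors = {a for a, p in edges}
publishers = {p for a, p in edges}

degree = Counter()
for a, p in edges:
    degree[a] += 1
    degree[p] += 1

n = len(authors) + len(publishers)  # 5 nodes in total

# Single-mode degree centrality: degree / (n - 1). This is the
# normalization a tool built for one node type would apply.
single_mode = {v: degree[v] / (n - 1) for v in degree}

# Bimodal degree centrality: divide by the size of the *opposite*
# node set, since a node can only connect to nodes of the other type.
bipartite = {a: degree[a] / len(publishers) for a in authors}
bipartite.update({p: degree[p] / len(authors) for p in publishers})
```

Publisher X connects to every author, so the bimodal measure rightly calls it maximally central (1.0), while the single-mode formula tops out at 0.75, because it wrongly assumes Publisher X could also have connected to the other publisher. The numbers run, but one of them answers the wrong question.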
Most computer scientists working with networks do so with only one or a few types of nodes. Humanities scholars, on the other hand, are often dealing with the interactions of many types of things, and so the algorithms developed for traditional network studies are insufficient for the networks we often have. There are ways of fitting their algorithms to our networks, or vice-versa, but that requires fairly robust technical knowledge of the task at hand.
Besides dealing with the single mode / multimodal issue, humanists also must struggle with fitting square pegs in round holes. Humanistic data are almost by definition uncertain, open to interpretation, flexible, and not easily definable. Node types are concrete; your object either is or is not a book. Every book-type thing shares certain unchanging characteristics.
This reduction of data comes at a price, one that some argue traditionally divided the humanities and social sciences. If humanists care more about the differences than the regularities, more about what makes an object unique rather than what makes it similar, that is the very information they are likely to lose by defining their objects as nodes.
This is not to say it cannot be done, or even that it has not! People are clever, and network science is more flexible than some give it credit for. The important thing is either to be aware of what you are losing when you reduce your objects to one or a few types of nodes, or to change the methods of network science to fit your more complex data.
Relationships (presumably) exist. Friendships, similarities, web links, authorships, and wires all fall into this category. Network analysis generally deals with one or a small handful of types of relationships, and then a multitude of examples of that type.
Say the type we’re dealing with is an authorship. Books (the stuff) and authors (another kind of stuff) are connected to one-another via the authorship relationship, which is formalized in the phrase “X is an author of Y.” The individual relationships themselves are of the form “Franco Moretti is an author of Graphs, Maps, and Trees.”
Much like the stuff (nodes), relationships enjoy a multitude of names. I’ll call them edges. You’ll also hear them called arcs, links, ties, and relations. For simplicity’s sake, although “edges” is often used to describe only one variety of relationship, I’ll use it for pretty much everything and just add qualifiers when discussing specific types. The type of relationship corresponds to the type of edge. The individual examples are the edges themselves.
Individual edges are defined, in part, by the nodes that they connect.
A list of edges could look like this:
| Person | Is an author of |
| --- | --- |
| Franco Moretti | Modern Epic |
| Franco Moretti | Graphs, Maps, and Trees |
| Jacob A. Riis | How The Other Half Lives |
| Edith Hamilton | Mythology |
| Matthew Jockers | Macroanalysis |
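As a sketch, the edge list above maps directly onto a simple data structure; the plain-Python form here is my own, not any particular tool’s format:

```python
# The authorship edge list above, as (person, book) pairs.
edges = [
    ("Franco Moretti", "Modern Epic"),
    ("Franco Moretti", "Graphs, Maps, and Trees"),
    ("Jacob A. Riis", "How The Other Half Lives"),
    ("Edith Hamilton", "Mythology"),
    ("Matthew Jockers", "Macroanalysis"),
]

# Build an adjacency mapping: each person -> the books they authored.
books_by_author = {}
for person, book in edges:
    books_by_author.setdefault(person, []).append(book)
```

Nearly every network tool, whatever its file format, boils down to a list like this under the hood.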
Notice how, in this scheme, edges can only link two different types of nodes. That is, a person can be an author of a book, but a book cannot be an author of a book, nor can a person be an author of a person. For a network to be truly bimodal, it must be of this form. Edges can go between types, but not among them.
This constraint may seem artificial, and in some sense it is, but for reasons I’ll get into in a later post, it is a constraint required by most algorithms that deal with bimodal networks. As mentioned above, algorithms are developed for specific purposes. Single mode networks are the ones with the most research done on them, but bimodal networks certainly come in a close second. They are networks with two types of nodes, and edges only going between those types.
Of course, the world humanists care to model is often a good deal more complicated than that, and not only does it have multiple varieties of nodes – it also has multiple varieties of edges. Perhaps, in addition to “X is an author of Y” type relationships, we also want to include “A collaborates with B” type relationships. Because edges, like nodes, can have attributes, an edge list combining both might look like this:

| Node 1 | Node 2 | Edge type |
| --- | --- | --- |
| Franco Moretti | Modern Epic | is an author of |
| Franco Moretti | Graphs, Maps, and Trees | is an author of |
| Jacob A. Riis | How The Other Half Lives | is an author of |
| Edith Hamilton | Mythology | is an author of |
| Matthew Jockers | Macroanalysis | is an author of |
| Matthew Jockers | Franco Moretti | collaborates with |
Notice that there are now two types of edges: “is an author of” and “collaborates with.” Not only are they two different types of edges; they act in two fundamentally different ways. “X is an author of Y” is an asymmetric relationship; that is, you cannot switch out Node1 for Node2. You cannot say “Modern Epic is an author of Franco Moretti.” We call this type of relationship a directed edge, and we generally represent that visually using an arrow going from one node to another.
“A collaborates with B,” on the other hand, is a symmetric relationship. We can switch out “Matthew Jockers collaborates with Franco Moretti” with “Franco Moretti collaborates with Matthew Jockers,” and the information represented would be exactly the same. This is called an undirected edge, and is usually represented visually by a simple line connecting two nodes.
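The directed/undirected distinction shows up naturally in how you might store the edges; here’s a plain-Python sketch of my own:

```python
# Directed edges: ordered tuples, where (source, target) matters.
authored = {("Franco Moretti", "Modern Epic")}
# Reversing the pair produces a different (and here, nonsensical) edge:
reversed_exists = ("Modern Epic", "Franco Moretti") in authored  # False

# Undirected edges: frozensets, where order is irrelevant.
collaborates = {frozenset({"Matthew Jockers", "Franco Moretti"})}
# The same edge, written the other way around, is still found:
found = frozenset({"Franco Moretti", "Matthew Jockers"}) in collaborates
```

An ordered pair encodes the arrow; an unordered set encodes the simple line.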
Most network algorithms and visualizations break down when combining these two flavors of edges. Some algorithms were designed for directed edges, like Google’s PageRank, whereas other algorithms are designed for undirected edges, like many centrality measures. Combining both types is rarely a good idea. Some algorithms will still run when the two are combined, however the results usually make little sense.
Both directed and undirected edges can also be weighted. For example, I can try to make a network of books, with those books that are similar to one another sharing an edge between them. The more similar they are, the heavier the weight of that edge. I can say that every book is similar to every other on a scale from 1 to 100, and compare them by whether they use the same words. Two dictionaries would probably connect to one another with an edge weight of 95 or so, whereas Graphs, Maps, and Trees would probably share an edge of weight 5 with How The Other Half Lives. This is often visually represented by the thickness of the line connecting two nodes, although sometimes it is represented as color or length.
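Here’s a crude plain-Python sketch of the idea – the word-overlap measure and the 0–100 scaling are simplifications of my own, not a real similarity algorithm:

```python
# A toy edge weight between two "books" (here just short strings
# standing in for full texts): the share of words they have in
# common, scaled to 0-100. Real similarity measures are far more
# sophisticated; this only illustrates deriving a weight.
def similarity(text_a, text_b):
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    shared = len(words_a & words_b)
    total = len(words_a | words_b)
    return round(100 * shared / total)

w = similarity("maps graphs and trees", "graphs maps and networks")
```

However the weight is computed, it ends up as just another attribute on the edge, ready to be visualized as line thickness, color, or length.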
It’s also worth pointing out the difference between explicit and inferred edges. If we’re talking about computers connected on a network via wires, the edges connecting each computer actually exist. We can weight them by wire length, and that length, too, actually exists. Similarly, citation linkages, neighbor relationships, and phone calls are explicit edges.
We can begin to move into interpretation when we begin creating edges between books based on similarity (even when using something like word comparisons). The edges are a layer of interpretation not intrinsic in the objects themselves. The humanist might argue that all edges are intrinsic all the way down, or inferred all the way up, but in either case there is a difference in kind between two computers connected via wires, and two books connected because we feel they share similar topics.
As such, algorithms made to work on one may not work on the other; or perhaps they may, but their interpretative framework must change drastically. A very central computer might be one in which, if removed, the computers will no longer be able to interact with one another; a very central book may be something else entirely.
As with nodes, edges come with many theoretical shortcomings for the humanist. Really, everything is probably related to everything else in its light cone. If we’ve managed to make everything in the world a node, realistically we’d also have some sort of edge between pretty much everything, with a lesser or greater weight. A network of nodes where almost everything is connected to almost everything else is called dense, and dense networks are rarely useful. Most network algorithms (especially ones that detect communities of nodes) work better and faster when the network is sparse, when most nodes are only connected to a small percentage of other nodes.
To make our network sparse, we often must artificially cut off which edges to use, especially with humanistic and inferred data. That’s what Shawn Graham showed us how to do when combining topic models with networks. The network was one of authors and topics; which authors wrote about which topics? The data itself connected every author to every topic to a greater or lesser degree, but such a dense network would not be very useful, so Shawn limited the edges to the highest weighted connections between an author and a topic. The resulting network looked like this, when it otherwise would have looked like a big ball of spaghetti and meatballs.
Unfortunately, given that humanistic data are often uncertain and biased to begin with, every arbitrary act of data-cutting has the potential to add further uncertainty and bias to a point where the network no longer provides meaningful results. The ability to cut away just enough data to make the network manageable, but not enough to lose information, is as much an art as it is a science.
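One simple (and admittedly arbitrary) cutoff, roughly in the spirit of what Shawn did, can be sketched in plain Python; the authors, topics, and weights below are all invented:

```python
# A dense weighted edge list: every author connects to every topic
# to some degree. Sparsify it by keeping, for each author, only the
# heaviest edge -- one of many possible cutoff rules.
dense = [
    ("Author A", "Topic 1", 0.70), ("Author A", "Topic 2", 0.20),
    ("Author A", "Topic 3", 0.10), ("Author B", "Topic 2", 0.55),
    ("Author B", "Topic 1", 0.45),
]

best = {}
for source, target, weight in dense:
    if source not in best or weight > best[source][1]:
        best[source] = (target, weight)

sparse = [(s, t, w) for s, (t, w) in best.items()]
```

A different threshold (keep edges above 0.5, keep the top three per node, and so on) would yield a different network, which is precisely why the cutoff is an interpretive act and not a neutral one.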
Hypergraphs & Multigraphs
Mathematicians and computer scientists have actually formalized more complex varieties of networks, called hypergraphs and multigraphs. Because humanities data are often so rich and complex, it may be more appropriate to represent them using these structures. Unfortunately, although ample research has been done on both, most out-of-the-box tools support neither. We have to build them for ourselves.
A hypergraph is one in which more than two nodes can be connected by one edge. A simple example would be an “is a sibling of” relationship, where one edge connects three sisters rather than two. This is a symmetric, undirected edge, but perhaps there can be directed hyperedges as well, of the type “Alex convinced Betty to run away from Carl.”
A multigraph is one in which multiple edges can connect any two nodes. We can have, for example, a transportation graph between cities, with an edge for every transportation route. Realistically, many routes can exist between any two cities: some by plane, several different highways, trains, etc.
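Neither structure requires anything exotic to represent; here’s a plain-Python sketch with invented names:

```python
# A hyperedge is just an edge over a *set* of more than two nodes.
siblings = frozenset({"Anna", "Beth", "Cora"})  # one edge, three nodes

# A multigraph keeps a *list* of parallel edges for each node pair.
routes = {}

def add_route(city_a, city_b, mode):
    # The frozenset key makes the pair order-independent (undirected).
    routes.setdefault(frozenset({city_a, city_b}), []).append(mode)

add_route("Rome", "Ostia", "road")
add_route("Rome", "Ostia", "sea")
n_routes = len(routes[frozenset({"Rome", "Ostia"})])
```

The hard part, as the post notes, isn’t storing such structures – it’s that the standard algorithms don’t know what to do with them.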
I imagine both of these representations will be important for humanists going forward, but rather than relying on that computer scientist who keeps hanging out in the history department, we ourselves will have to develop algorithms that accurately capture exactly what it is we are looking for. We have a different set of problems, and though the solutions may be similar, they must be adapted to our needs.
Side note: RDF Triples
Digital humanities loves RDF. RDF basically works using something called a triple; a subject, a predicate, and an object. “Moretti is an author of Graphs, Maps, and Trees” is an example of a triple, where “Moretti” is the subject, “is an author of” is the predicate, and “Graphs, Maps, and Trees” is the object. As such, nearly all RDF documents can be represented as a directed network. Whether that representation would actually be useful depends on the situation.
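That translation can be sketched in plain Python – each triple becomes a directed edge whose label is the predicate:

```python
# RDF-style triples: (subject, predicate, object).
triples = [
    ("Moretti", "is an author of", "Graphs, Maps, and Trees"),
    ("Moretti", "is an author of", "Modern Epic"),
]

# A directed, labeled edge list: subject -> object, with the
# predicate carried along as an edge attribute.
edges = [(subj, obj, {"label": pred}) for subj, pred, obj in triples]
```

Whether the resulting directed network is worth analyzing depends, as the post says, on the situation; the mapping itself is mechanical.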
Side note: Perspectives
Context is key, especially in the humanities. One thing the last few decades have taught us is that perspectives are essential, and any model of humanity that does not take into account its multifaceted nature is doomed to be forever incomplete. According to Alex, Betty and Carl are best friends. According to Carl, he can’t actually stand Betty. The structure and nature of a network might change depending on the perspective of a particular node, and I know of no model that captures this complexity. If you’re familiar with something that might capture this, or are working on it yourself, please let me know via comments or e-mail.
The above post discussed the simplest units of networks: the stuff and the relationships that connect it. Any network analysis approach must subscribe to and live with that duality of objects. Humanists face problems from the outset: data that do not fit neatly into one category or the other, complex situations that ought not be reduced, and methods that were developed with different purposes in mind. However, network analysis remains a viable methodology for answering and raising humanistic questions – we simply must be cautious, and must be willing to get our hands dirty editing the algorithms to suit our needs.
In the coming posts of this series, I’ll discuss various introductory topics including data representations, basic metrics like degree, centrality, density, clustering, and path length, as well as ways to link old network analysis concepts with common humanist problems. I’ll also try to highlight examples from the humanities, and raise methodological issues that come with our appropriation of somebody else’s algorithms.
This will probably be the longest of the posts, as some concepts are fairly central and must be discussed all-at-once. Again, if anybody has any particular concepts of network analysis they’d like to see discussed, please don’t hesitate to comment with your request.
Last post, I talked about combining textual and network analysis. Both are becoming standard tools in the methodological toolkit of the digital humanist, sitting next to GIS in what seems to be becoming the Big Three in computational humanities.
Data as Context, Data as Contextualized
Humanists are starkly aware that no particular aspect of a subject sits in a vacuum; context is key. A network on its own is a set of meaningless relationships without a knowledge of what travels through and across it, what entities make it up, and how that network interacts with the larger world. The network must be contextualized by the content. Conversely, the networks in which people and processes are situated deeply affect those entities: medium shapes message and topology shapes influence. The content must be contextualized by the network.
At the risk of the iPhonification of methodologies, textual, network, and geographic analysis may be combined with each other and with traditional humanities research so that they might all inform one another. That last post on textual and network analysis was missing one key component for digital humanities: the humanities. Combining textual and network analysis with traditional humanities research (rather than merely using the humanities to inform text and network analysis, or vice-versa) promises to transform the sorts of questions asked and projects undertaken in academia at large.
Just as networks can be used to contextualize text (and vice-versa), the same can be said of networks and maps (or texts and maps for that matter, or all three, but I’ll leave those for later posts). Now, instead of starting with the maps we all know and love, we’ll start by jumping into the deep end by discussing maps as any sort of representative landscape in which a network can be situated. In fact, I’m going to start off by using the network as a map against which certain relational properties can be overlaid. That is, I’m starting by using a map to contextualize a network, rather than the more intuitive other way around.
Using Maps to Contextualize a Network
The base map we’re discussing here is a map of science. They’ve made their rounds, so you’ve probably seen one, but just in case you haven’t, here’s a brief description: some researchers (in this case Kevin Boyack and Richard Klavans) take tons of information from scholarly databases (in this case the Science Citation Index Expanded and the Social Science Citation Index) and create a network diagram from some set of metrics (in this case, citation similarity). They call this network representation a Map of Science.
We can debate the merits of these maps till we’re blue in the face, but let’s avoid that for now. To my mind, the maps are useful, interesting, and incomplete, and the map-makers are generally well-aware of their deficiencies. The point here is that it is a map: a landscape against which one can situate oneself, and with which one may be able to find paths and understand the lay of the land.
In Boyack, Börner, and Klavans (2007), the three authors set out to use the map of science to explore the evolution of chemistry research. The purpose of the paper doesn’t really matter here, though; what matters is the idea of overlaying information atop a base network map.
The images above are the funding profiles of the NIH (National Institutes of Health) and NSF (National Science Foundation). The authors collected publication information attached to all the grants funded by the NSF and NIH and looked at how those publications cited one another. The orange edges show connections between disciplines on the map of science that were more prevalent within the context of a particular funding agency than they were across the entire map of science. Boyack, Börner, and Klavans created a map and used it to contextualize certain funding agencies. They and other parties have since used such maps to contextualize universities, authors, disciplines, and other publication groups.
From Network Maps to Geographic Maps
Of course, the Where’s The Beef™ section of this post still has yet to be discussed, with the beef in this case being geography. How can we use existing topography to contextualize network topology? Network space rarely corresponds to geographic place, but neither of them alone can ever fully represent the landscape within which we are situated. A purely geographic map of ancient Rome would not accurately represent the world in which the ancient Romans lived, as it does not take into account the shortening of distances through well-trod trade routes.
Enter Stanford DH ninja Elijah Meeks. In two recent posts, Elijah discussed the topology/topography divide. In the first, he created a network layout algorithm which took a network with nodes originally placed in their geographic coordinates, and then distorted the network visualization to emphasize network distance. The visualization above shows the network laid out geographically. The one below shows the Imperial Roman trade routes with network distances emphasized. As Elijah says, “everything of the same color in the above map is the same network distance from Rome.”
Of course, the savvy reader has probably observed that this does not take everything into account. These are only land routes; what about the sea?
Elijah’s second post addressed just that, impressively applying GIS techniques to determine the likely routes ships took to get from one port to another. This technique drives home the point he was trying to make about transitioning from network topology to network topography. The picture below, incidentally, is Elijah’s re-rendering of the last visualization taking into account both land and sea routes. As you can see, the distance from any city to any other has decreased significantly, even taking into account his network-distance algorithm.
The above network visualization combines geography, two types of transportation routes, and network science to provide a more nuanced at-a-glance view of the Imperial Roman landscape. The work he highlighted in his post transitioning from topology to topography in edge shapes is also of utmost importance, but that topic will need to wait for another post.
The Republic of Letters (A Brief Interlude)
Elijah was also involved in another Stanford-based project, one very dear to my heart, Mapping the Republic of Letters. Much of my own research has dealt with the Republic of Letters, especially during my time under Bob Hatch, and Paula Findlen, Dan Edelstein, and Nicole Coleman at Stanford have been heading up an impressive project on that very subject. I’ll go into more detail about the Republic in another post (I know, promises promises), but for now the important thing to look at is their interface for navigating the Republic.
The team has gone well beyond the interface that currently faces the public, however even the original map is an important step. Overlaid against a map of Europe are the correspondences of many early modern scholars. The flow of information is apparent temporally, spatially, and through the network topology of the Republic itself. Now as any good explorer knows, no map is any substitute for a thorough knowledge of the land itself; instead, it is to be used for finding unexplored areas and for synthesizing information at a large scale. For contextualizing.
If you’ll allow me a brief diversion, I’d like to talk about tools for making these sorts of maps, now that we’re on the subject of letters. Elijah’s post on visualizing network distance included a plugin for Gephi to emphasize network distance. Gephi’s a great tool for making really pretty network visualizations, and it also comes with a small but potent handful of network analysis algorithms.
I’m on the development team of another program, the Sci² Tool, which shares a lot of Gephi’s functionality, although it has a much wider scope and includes algorithms for textual, geographic, and statistical analysis, as well as a somewhat broader range of network analysis algorithms.
This is by no means a suggestion to use Sci² over Gephi; they both have their strengths and weaknesses. Gephi is dead simple to use, produces the most beautiful graphs on the market, and is all-around fantastic software. They both excel in different areas, and by using them (and other tools!) together, it is possible to create maps combining geographic and network features without ever having to resort to programming.
The above image was generated by combining the Sci² Tool with Gephi. It is the correspondence network of Hugo Grotius, a dataset I worked on while at Huygens ING in The Hague. They are a great group, and another team doing fantastic Republic of Letters research, and they provided this letters dataset. We just developed this particular functionality in Sci² yesterday, so it will take a bit of time before we work out the bugs and release it publicly, however as soon as it is released I’ll be sure to post a full tutorial on how to make maps like the one above.
This ends the public service announcement.
These maps are not without their critics. Especially prevalent were questions along the lines of “But how is this showing me anything I didn’t already know?” or “All of this is just an artefact of population densities and standard trade routes – what are these maps telling us about the Republic of Letters?” These are legitimate critiques; however, as mentioned before, these maps are still useful for at-a-glance synthesis of information at large scales, or for learning something new about areas in which one is not yet an expert. Another problem has been that the lines on the map don’t represent actual travel routes; those sorts of problems are beginning to be addressed by the type of work Elijah Meeks and other GIS researchers are doing.
To tackle the suggestion that these are merely representing population data, I would like to propose what I believe to be a novel idea. I haven’t published on this yet, and I’m not trying to claim scholarly territory here, but I would ask that if this idea inspires research of your own, please cite this blog post or my publication on the subject, whenever it comes out.
We have a lot of data. Of course it doesn’t feel like we have enough, and it never will, but we have a lot of data. We can use what we have, for example by collecting all the correspondences from early modern Europe, and place them on a map like this one. The more data we have, the smaller the time slices our maps can show. We create a base map that combines geographic properties, statistical location properties, and network properties.
Start with a map of the world. To account for population or related correlations, do something similar to what Elijah did in this post, encoding population information (or average number of publications per city, or whatever else you’d like to account for) into the map. On top of that, place the biggest network you can find of whatever it is you’re studying. Scholarly communication, citations, whatever. It’s your big Map of YourFavoriteThingHere. All of these together form your base map.
Atop that, place whatever or whomever you are studying. The correspondence of Grotius can be put on this map, much as the NIH was overlaid atop the Map of Science, and areas would light up and become larger if they are surprising against the base map. Are there more letters between Paris and The Hague in the Grotius dataset than one would expect if the dataset were just randomly plucked from the whole Republic of Letters? If so, make that line brighter and thicker.
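One hedged sketch of what “surprising against the base map” might mean computationally: for each route, compare the observed letter count in the subset to the count we would expect if the subset were a random draw from the base corpus. All the numbers, city pairs, and totals below are invented for illustration; routes are treated as undirected.

```python
# Hypothetical letter counts: the full "base map" corpus and one scholar's subset.
base_counts = {
    ("Paris", "The Hague"): 400,
    ("Paris", "London"): 800,
    ("London", "The Hague"): 300,
}
grotius_counts = {
    ("Paris", "The Hague"): 60,
    ("Paris", "London"): 30,
    ("London", "The Hague"): 10,
}

base_total = sum(base_counts.values())   # 1500 letters in the base corpus
sub_total = sum(grotius_counts.values()) # 100 letters in the subset

# For each route: how many letters would a random 100-letter draw from the
# base corpus contain, and how does the observed count compare?
surprise = {}
for route, observed in grotius_counts.items():
    expected = sub_total * base_counts[route] / base_total
    surprise[route] = observed / expected  # ratio > 1: draw the line brighter/thicker
```

A ratio above 1 flags a route that is over-represented relative to the base map (here, Paris–The Hague), which is exactly the kind of line one would brighten and thicken.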
By combining geography, point statistics, and networks, we can create base maps against which we can contextualize whatever we happen to be studying. This is just one possible combination; base maps can be created from any of a myriad of sources of data. The important thing is that we, as humanists, ought to be able to contextualize our data in the same way that we always have. Now that we’re working with a lot more of it, we’re going to need help in those contextualizations. Base maps are one solution.
It’s worth pointing out one major problem with base maps: bias. Until recently, those Maps of Science making their way around the blogosphere represented the humanities as a small island off the coast of the social sciences, if they showed them at all. This is because the primary publication venues of the arts and humanities were not represented in the datasets used to create these science maps. We must watch out for similar biases when constructing our own base maps; the problem is significantly more difficult for historical datasets, however, because the underrepresented are too dead to speak up. For a brief discussion of historical biases, you can read my UCLA presentation here.
putting every tool imaginable in one box and using them all at once ↩
Full disclosure: she’s my advisor. She’s also awesome. Hi Katy! ↩
According to Google Scholar, David Blei’s first topic modeling paper has received 3,540 citations since 2003. Everybody’s talking about topic models. Seriously, I’m afraid of visiting my parents this Hanukkah and hearing them ask “Scott… what’s this topic modeling I keep hearing all about?” They’re powerful, widely applicable, easy to use, and difficult to understand — a dangerous combination.
Since shortly after Blei’s first publication, researchers have been looking into the interplay between networks and topic models. This post is about that interplay: how the two have been combined, what sorts of research those combinations can drive, and a few pitfalls to watch out for. I’ll bracket the big elephant in the room, whether these sorts of models actually capture the semantic meaning for which they’re often used, until a later discussion. This post also attempts to introduce topic modeling to those not yet converted to its potential.
A brief history of topic modeling
In my recent post on IU’s awesome alchemy project, I briefly mentioned Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) during the discussion of topic models. They’re intimately related, though LSA has been around for quite a bit longer. Without getting into too much technical detail, we should start with a brief history of LSA/LDA.
The story starts, more or less, with a tf-idf matrix. Basically, tf-idf ranks words based on how important they are to a document within a larger corpus. Let’s say we want a list of the most important words for each article in an encyclopedia.
Our first pass is obvious: for each article, just attach a list of words sorted by how frequently they’re used. The problem will be immediately apparent to anyone who has looked at word frequencies; the top words in the entry on the History of Computing would be “the,” “and,” “is,” and so forth, rather than “turing,” “computer,” “machines,” etc. Tf-idf solves this by scoring words based on how special they are to a particular document within the larger corpus. “Turing” is rarely used elsewhere, but used exceptionally frequently in our computer history article, so it bubbles up to the top.
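The scoring can be sketched in a few lines of Python. This is a minimal, toy version of tf-idf (real implementations vary in their exact weighting and smoothing), run over a made-up three-article “encyclopedia”:

```python
import math
from collections import Counter

# A made-up three-article "encyclopedia"; real corpora need tokenization first.
corpus = {
    "computing": ["the", "turing", "machine", "the", "computer", "turing"],
    "cheese":    ["the", "cheese", "is", "aged", "the", "dairy"],
    "drama":     ["the", "play", "is", "the", "tragedy", "stage"],
}

def tf_idf(corpus):
    n_docs = len(corpus)
    # Document frequency: how many articles does each word appear in?
    df = Counter()
    for words in corpus.values():
        df.update(set(words))
    scores = {}
    for doc, words in corpus.items():
        tf = Counter(words)  # term frequency within this one article
        scores[doc] = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return scores

scores = tf_idf(corpus)
top = max(scores["computing"], key=scores["computing"].get)
# "the" appears in every article, so its idf is log(3/3) = 0 and it sinks;
# "turing" is frequent here and absent elsewhere, so it bubbles to the top.
```

Running this, `top` comes out as `"turing"`, while “the” scores exactly zero in every article.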
LSA and pLSA
LSA utilizes these tf-idf scores 1 within a larger term-document matrix. Every word in the corpus is a different row in the matrix, each document has its own column, and the tf-idf score lies at the intersection of every document and word. Our computing history document will probably have a lot of zeroes next to words like “cow,” “shakespeare,” and “saucer,” and high marks next to words like “computation,” “artificial,” and “digital.” This is called a sparse matrix because it’s mostly filled with zeroes; most documents use very few words related to the entire corpus.
With this matrix, LSA uses singular value decomposition to figure out how each word is related to every other word. Basically, the more often words are used together within a document, the more related they are to one another. 2 It’s worth noting that a “document” is defined somewhat flexibly. For example, we can call every paragraph in a book its own “document,” and run LSA over the individual paragraphs.
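For the curious, here is roughly what that looks like in code: a toy term-document matrix (every number below is invented for illustration), reduced with SVD, with word relatedness measured as the cosine between the resulting word vectors.

```python
import numpy as np

# Toy term-document matrix: rows are words, columns are documents,
# and each cell holds something like a tf-idf score (values invented).
words = ["computer", "turing", "digital", "cow", "pasture", "milk"]
X = np.array([
    [2.1, 1.8, 0.0, 0.0],   # computer
    [1.5, 2.2, 0.0, 0.0],   # turing
    [1.0, 0.9, 0.1, 0.0],   # digital
    [0.0, 0.0, 1.9, 1.4],   # cow
    [0.0, 0.0, 1.2, 2.0],   # pasture
    [0.0, 0.1, 1.7, 1.1],   # milk
])

# Singular value decomposition; keeping only the k largest singular
# values filters out noise, as in the footnote below.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]  # each row: a word's position in latent space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

idx = {w: i for i, w in enumerate(words)}
# Words used together in the same documents point in similar directions.
sim_computer_turing = cosine(word_vecs[idx["computer"]], word_vecs[idx["turing"]])
sim_computer_cow = cosine(word_vecs[idx["computer"]], word_vecs[idx["cow"]])
```

With this toy data, “computer” and “turing” end up nearly colinear, while “computer” and “cow” are close to orthogonal, which is the whole point of the exercise.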
The method was significantly improved by Puzicha and Hofmann (1999), who did away with the linear algebra approach of LSA in favor of a more statistically sound probabilistic model, called probabilistic latent semantic analysis (pLSA). Now is the part of the blog post where I start getting hand-wavy, because explaining the math is more trouble than I care to take on in this introduction.
Essentially, pLSA imagines an additional layer between words and documents: topics. What if every document isn’t just a set of words, but a set of topics? In this model, our encyclopedia article about computing history might be drawn from several topics. It primarily draws from the big platonic computing topic in the sky, but it also draws from the topics of history, cryptography, lambda calculus, and all sorts of other topics to a greater or lesser degree.
Now, these topics don’t actually exist anywhere. Nobody sat down with the encyclopedia, read every entry, and decided to come up with the 200 topics from which every article draws. pLSA infers topics based on what will hereafter be referred to as black magic. Using the dark arts, pLSA “discovers” a bunch of topics, attaches them to a list of words, and classifies the documents based on those topics.
Blei et al. (2003) vastly improved upon this idea by turning it into a generative model of documents, calling the model Latent Dirichlet allocation (LDA). By this time, as well, some sounder assumptions were being made about the distribution of words and document length — but we won’t get into that. What’s important here is the generative model.
Imagine you wanted to write a new encyclopedia entry, let’s say about digital humanities. Well, we now know there are three elements that make up that process, right? Words, topics, and documents. Using these elements, how would you go about writing this new article on digital humanities?
First off, let’s figure out what topics our article will consist of. It probably draws heavily from topics about history, digitization, text analysis, and so forth. It also probably draws more weakly from a slew of other topics, concerning interdisciplinarity, the academy, and all sorts of other subjects. Let’s go a bit further and assign weights to these topics; 22% of the document will be about digitization, 19% about history, 5% about the academy, and so on. Okay, the first step is done!
Now it’s time to pull out the topics and start writing. It’s an easy process; each topic is a bag filled with words. Lots of words. All sorts of words. Let’s look in the “digitization” topic bag. It includes words like “israel” and “cheese” and “favoritism,” but they only appear once or twice, and mostly by accident. More importantly, the bag also contains 157 appearances of the word “TEI,” 210 of “OCR,” and 73 of “scanner.”
So here you are: you’ve dragged out your digitization bag, your history bag, your academy bag, and all sorts of other bags as well. You start writing the digital humanities article by reaching into the digitization bag (remember, you’re going to reach into that bag for 22% of your words), and you pull out “OCR.” You put it on the page. You then reach into the academy bag and pull out a word (it happens to be “teaching”), and you throw that on the page as well. Keep doing that. By the end, you’ve got a document that’s all about the digital humanities. It’s beautiful. Send it in for publication.
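That whole bag-pulling procedure is easy to simulate. The sketch below does exactly what the story describes: pick a bag in proportion to the topic weights, then pick a word in proportion to its count in that bag. The topic names, words, counts, and weights are all hypothetical, as above.

```python
import random

random.seed(42)

# Hypothetical topic "bags": each maps a word to how many times it sits in the bag.
topics = {
    "digitization": {"TEI": 157, "OCR": 210, "scanner": 73, "cheese": 2},
    "history":      {"archive": 120, "century": 95, "sources": 60},
    "academy":      {"teaching": 80, "tenure": 40, "department": 55},
}

# The article's topic mixture: 22% digitization, 19% history, 5% academy.
# (The remaining mass would go to other topics; renormalize for this sketch.)
mixture = {"digitization": 0.22, "history": 0.19, "academy": 0.05}
total = sum(mixture.values())
mixture = {t: w / total for t, w in mixture.items()}

def generate(n_words):
    """Write a 'document' by repeatedly picking a bag, then a word from it."""
    doc = []
    for _ in range(n_words):
        (topic,) = random.choices(list(mixture), weights=mixture.values())
        bag = topics[topic]
        (word,) = random.choices(list(bag), weights=bag.values())
        doc.append(word)
    return doc

article = generate(50)  # an admittedly nonsensical 50-word "article"
```

The output is word salad, of course; the point is that this is precisely the generative process LDA assumes, and the process it runs in reverse when inferring topics.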
Alright, what now?
So why is the generative nature of the model so important? One of the key reasons is the ability to work backwards. If I can generate an (admittedly nonsensical) document using this model, I can also reverse the process and infer, given any new document and a topic model I’ve already generated, which topics the new document draws from.
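In practice, that working-backwards step is what topic modeling libraries do for you. As a hedged sketch, here is how it might look with scikit-learn’s LatentDirichletAllocation, one common implementation (MALLET and gensim are others); the toy four-document corpus is invented, and a real model would need far more text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus with two obvious themes (real corpora: thousands of documents).
docs = [
    "turing machine computer digital computation",
    "computer digital algorithm turing code",
    "cow milk cheese pasture dairy",
    "milk cheese dairy cow butter",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)  # document-term count matrix

# Fit a two-topic model over the corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Working backwards: given an unseen document, infer its topic mixture.
new_doc = vec.transform(["digital turing computation"])
mixture = lda.transform(new_doc)[0]  # proportions over the 2 topics, summing to 1
```

The same `transform` call works on any document, seen or unseen, which is what makes a trained model reusable for classifying new material.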
Another factor contributing to the success of LDA is the ability to extend the model. In this case, we assume there are only documents, topics, and words, but we could also make a model that assumes authors who like particular topics, or assumes that certain documents are influenced by previous documents, or that topics change over time. The possibilities are endless, as evidenced by the absurd number of topic modeling variations that have appeared in the past decade. David Mimno has compiled a wonderful bibliography of many such models.
While the generative model introduced by Blei might seem simplistic, it has been shown to be extremely powerful. Newcomers seeing the results of LDA for the first time are immediately taken by how intuitive the topics seem. People sometimes ask me “but didn’t it take forever to sit down and make all the topics?”, thinking that some of the magic had to be done by hand. It didn’t. Topic modeling yields intuitive results, generating what really feels like topics as we know them 3, with virtually no effort on the human side. Perhaps it is this intuitive utility that appeals so much to humanists.
Topic Modeling and Networks
Topic models can interact with networks in multiple ways. While a lot of the recent interest in digital humanities has surrounded using networks to visualize how documents or topics relate to one another, the interfacing of networks and topic modeling initially worked in the other direction. Instead of inferring networks from topic models, many early (and recent) papers attempt to infer topic models from networks.
Topic Models from Networks
The first research I’m aware of in this niche was from McCallum et al. (2005). Their model is itself an extension of an earlier LDA-based model called the Author-Topic Model (Steyvers et al., 2004), which assumes topics are formed based on the mixtures of authors writing a paper. McCallum et al. extended that model for directed messages in their Author-Recipient-Topic (ART) Model. In ART, it is assumed that topics of letters, e-mails or direct messages between people can be inferred from knowledge of both the author and the recipient. Thus, ART takes into account the social structure of a communication network in order to generate topics. In a later paper (McCallum et al., 2007), they extend this model to one that infers the roles of authors within the social network.
Dietz et al. (2007) created a model that looks at citation networks, in which documents are generated by topical innovation and topical inheritance via citations. Nallapati et al. (2008) similarly created a model that finds topical similarity in citing and cited documents, with the added ability to predict citations that are not present. Blei himself joined the fray in 2009, creating the Relational Topic Model (RTM) with Jonathan Chang, which can summarize a network of documents, predict links between them, and predict words within them. Wang et al. (2011) created a model that allows for “the joint analysis of text and links between [people] in a time-evolving social network.” Their model can handle situations where links exist even when there is no similarity between the associated texts.
Networks from Topic Models
Some models have been made that infer networks from non-networked text. Broniatowski and Magee (2010 & 2011) extended the Author-Topic Model, building a model that would infer social networks from meeting transcripts. They later added temporal information, which allowed them to infer status hierarchies and individual influence within those social networks.
Many times, however, rather than creating new models, researchers create networks out of topic models that have already been run over a set of data. There are a lot of benefits to this approach, as exemplified by the Newton’s Chymistry project highlighted earlier. Using networks, we can see how documents relate to one another, how they relate to topics, how topics are related to each other, and how all of those are related to words.
Elijah Meeks created a wonderful example combining topic models with networks in Comprehending the Digital Humanities. Using fifty texts that discuss humanities computing, Elijah created a topic model of those documents and used networks to show how documents, topics, and words interacted with one another within the context of the digital humanities.
Jeff Drouin has also created networks of topic models in Proust, as reported by Elijah.
Peter Leonard recently directed me to TopicNets, a project that combines topic modeling and network analysis in order to create an intuitive and informative navigation interface for documents and topics. This is a great example of an interface that turns topic modeling into a useful scholarly tool, even for those who know little-to-nothing about networks or topic models.
If you want to do something like this yourself, Shawn Graham recently posted a great tutorial on how to create networks using MALLET and Gephi quickly and easily. Prepare your corpus of text, get topics with MALLET, prune the CSV, make a network, visualize it! Easy as pie.
Networks can be a great way to represent topic models. Beyond simple uses of navigation and relatedness as were just displayed, combining the two will put the whole battalion of network analysis tools at the researcher’s disposal. We can use them to find communities of similar documents, pinpoint those documents that were most influential to the rest, or perform any of a number of other workflows designed for network analysis.
As with anything, however, there are a few setbacks. Topic models are rich with data. Every document is related to every other document, even if only barely, and every topic is related to every other topic. By deciding to represent document similarity as a network, you must decide precisely how similar a pair of documents should be before they are linked. A network with every document connected to every other document is scarcely useful, so generally we make that decision such that each document is linked to only a handful of others. This allows for easier visualization and analysis, but it also destroys much of the rich data that went into the topic model to begin with. That information can be more fully preserved using other techniques, such as multidimensional scaling.
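To make the thresholding decision concrete, here is a small sketch in plain Python. The document names and topic proportions are hypothetical, standing in for the output of an already-run topic model; the similarity measure is cosine similarity between topic mixtures, and the cutoff is, necessarily, arbitrary.

```python
import math
from itertools import combinations

# Hypothetical document-topic proportions from an already-run topic model.
doc_topics = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.65, 0.25, 0.10],
    "doc_c": [0.05, 0.15, 0.80],
    "doc_d": [0.10, 0.10, 0.80],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Every pair of documents has *some* similarity; the network keeps only
# pairs above an arbitrary threshold, discarding the rest of the rich data.
THRESHOLD = 0.9
edges = [
    (a, b, cosine(ta, tb))
    for (a, ta), (b, tb) in combinations(doc_topics.items(), 2)
    if cosine(ta, tb) >= THRESHOLD
]
```

With these numbers, only the doc_a/doc_b and doc_c/doc_d pairs survive; lower the threshold and the network fills in, until at 0 it is a useless complete graph. That tradeoff is exactly the information loss described above.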
A somewhat more theoretical complication leaves these network representations useful as tools for navigation, discovery, and exploration, but not necessarily as evidentiary support. Creating a network of a topic model of a set of documents piles abstraction on abstraction. Each of these systems comes with very different assumptions, and it is unclear what complications arise when the methods are combined ad hoc.
Although there may be issues with the process, the combination of topic models and networks is sure to yield much fruitful research in the digital humanities. There are some fantastic tutorials out there for getting started with topic modeling in the humanities, such as Shawn Graham’s post on Getting Started with MALLET and Topic Modeling, as well as on combining them with networks, such as this post from the same blog. Shawn is right to point out MALLET, a great tool for starting out, but you can also find the code used for various models on many of the model-makers’ academic websites. One code package that stands out is Chang’s implementation of LDA and related models in R.
Ted Underwood rightly points out in the comments that other scoring systems are often used in lieu of tf-idf, most frequently log entropy. ↩
Yes yes, this is a simplification of actual LSA, but it’s pretty much how it works. SVD reduces the size of the matrix to filter out noise, and then each word row is treated as a vector shooting off in some direction. The vector of each word is compared to every other word, so that every pair of words has a relatedness score between them. Ted Underwood has a great blog post about why humanists should avoid the SVD step. ↩
They’re not, of course. We’ll worry about that later. ↩
UCLA’s Networks and Network Analysis for the Humanities this past weekend did not fail to impress. Tim Tangherlini and his mathemagical imps returned in true form, organizing a really impressively realized (and predictably jam-packed) conference that left the participants excited, exhausted, enlightened, and unanimously shouting for more next year (and the year after, and the year after that, and the year after that…) I cannot thank the ODH enough for facilitating this and similar events.
Some particular highlights included Graham Sack’s exceptionally robust comparative analysis of a few hundred early English novels (watch out for him, he’s going to be a Heavy Hitter), Sarah Horowitz‘s really convincing use of epistolary network analysis to weave the importance of women (specifically salonières) in holding together the fabric of French high society, Rob Nelson’s further work on the always impressive Mining the Dispatch, Peter Leonard‘s thoughtful and important discussion on combining text and network analysis (hint: visuals are the way to go), Jon Kleinberg‘s super fantastic wonderful keynote lecture, Glen Worthey‘s inspiring talk about not needing All Of It, Russell Horton’s rhymes, Song Chen‘s rigorous analysis of early Asian family ties, and, well, everyone else’s everything else.
Especially interesting were the discussions, raised most pointedly by Kleinberg and Hoyt Long, about what exactly we are looking at when we construct these networks. The union of so many subjective experiences surely is not the objective truth, but neither is it a proxy of objective truth – what, then, is it? I’m inclined to say that this Big Data aggregated from individual experiences provides us a baseline subjective reality with local basins of attraction; that is, the trends we see measure how likely a certain person is to experience the world in a certain way when situated in whatever part of the network/world they reside. More thought and research must go into what this Big Data means globally and locally, and it will definitely reveal very interesting results.
My talk on bias also seemed to stir some discussion. I gave up counting how many participants looked at me during their presentations and said “and of course the data is biased, but this is preliminary, and this is what I came up with and what justifies that conclusion.” The issues I raised were not new, of course; everybody in attendance was already aware of them. What I hoped my presentation would inspire, and it seems to have been successful, was the open discussion of data biases, and of the constraints they place on conclusions, within the context of presenting those conclusions.
Some of us were joking that the issue of bias means “you don’t know, you can’t ever know what you don’t know, and you should just give up now.” That is exactly opposite to the point. As long as we’re open and honest about what we do not or cannot know, we can make claims around those gaps, inferring and guessing where we need to, and let the reader decide whether our careful analysis and historical inferences are sufficient to support the conclusions we draw. Honesty is more important than completeness or unshakable proof; indeed, neither of those is yet possible in most of what we study.
There was some twittertalk surrounding my presentation, so here’s my draft/notes for anyone interested (click ‘continue reading’ to view):
Last year, Tim Tangherlini and his magical crew of folkloric imps and applied mathematicians put together a most fantastic and exhausting workshop on networks and network analysis in the humanities. We called it #humnets for short. The workshop (one of the oh-so-fantastic ODH Summer Institutes) spanned two weeks, bringing together forward-thinking humanists and Big Deals in network science and computer science. Now, a year and a half later, we’re all reuniting (bouncing back?) at UCLA to show off all the fantastic network-y humanist-y projects we’ve come up with in the interim.
As of a few weeks ago, I was all set to present my findings from analyzing and modeling the correspondence networks of early-modern scholars. Unfortunately (for me, but perhaps fortunately for everyone else), some new data came in that Changed Everything and invalidated many of my conclusions. I was faced with a dilemma; present my research as it was before I learned about the new data (after all, it was still a good example of using networks in the humanities), or retool everything to fit the new data.
Unfortunately, there was no time to do the latter, and doing the former felt icky and dishonest. In keeping with Tony Beaver’s statement at UCLA last year (“Everything you can do I can do meta,”) I ultimately decided to present a paper on precisely the problem that foiled my presentation: systematic bias. Biases need not be an issue of methodology; you can do everything right methodologically, you can design a perfect experiment, and a systematic bias can still thwart the accuracy of a project. The bias can be due to the available observable data itself (external selection bias), it may be due to how we as researchers decide to collect that data (sample selection bias), or it may be how we decide to use the data we’ve collected (confirmation bias).
There is a small-but-growing precedent of literature on the effects of bias on network analysis. I’ll refer to it briefly in my talk at UCLA, but below is a list of the best references I’ve found on the matter. Most of them deal with sample selection bias, and none of them deal with the humanities.
For those of you who’ve read this far, congratulations! Here’s a preview of my Friday presentation (I’ll post the notes on Friday).
Effects of bias on network analysis condensed bibliography:
Achlioptas, Dimitris, Aaron Clauset, David Kempe, and Cristopher Moore. 2005. On the bias of traceroute sampling. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, 694. ACM Press. doi:10.1145/1060590.1060693. http://dl.acm.org/citation.cfm?id=1060693.
———. 2009. “On the bias of traceroute sampling.” Journal of the ACM 56 (June 1): 1-28. doi:10.1145/1538902.1538905.
Costenbader, Elizabeth, and Thomas W Valente. 2003. “The stability of centrality measures when networks are sampled.” Social Networks 25 (4) (October): 283-307. doi:10.1016/S0378-8733(03)00012-1.
Gjoka, M., M. Kurant, C. T Butts, and A. Markopoulou. 2010. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. In 2010 Proceedings IEEE INFOCOM, 1-9. IEEE, March 14. doi:10.1109/INFCOM.2010.5462078.
Gjoka, Minas, Maciej Kurant, Carter T Butts, and Athina Markopoulou. 2011. “Practical Recommendations on Crawling Online Social Networks.” IEEE Journal on Selected Areas in Communications 29 (9) (October): 1872-1892. doi:10.1109/JSAC.2011.111011.
Golub, B., and M. O. Jackson. 2010. “From the Cover: Using selection bias to explain the observed structure of Internet diffusions.” Proceedings of the National Academy of Sciences 107 (June 3): 10833-10836. doi:10.1073/pnas.1000814107.
Henzinger, Monika R., Allan Heydon, Michael Mitzenmacher, and Marc Najork. 2000. “On near-uniform URL sampling.” Computer Networks 33 (1-6) (June): 295-308. doi:10.1016/S1389-1286(00)00055-4.
Kim, P.-J., and H. Jeong. 2007. “Reliability of rank order in sampled networks.” The European Physical Journal B 55 (February 7): 109-114. doi:10.1140/epjb/e2007-00033-7.
Kurant, Maciej, Athina Markopoulou, and P. Thiran. 2010. On the bias of BFS (Breadth First Search). In Teletraffic Congress (ITC), 2010 22nd International, 1-8. IEEE, September 7. doi:10.1109/ITC.2010.5608727.
Lakhina, Anukool, John W. Byers, Mark Crovella, and Peng Xie. 2003. Sampling biases in IP topology measurements. In INFOCOM 2003. Twenty-Second Annual Joint Conference of the IEEE Computer and Communications. IEEE Societies, 1:332-341. IEEE, April 30. doi:10.1109/INFCOM.2003.1208685.
Latapy, Matthieu, and Clemence Magnien. 2008. Complex Network Measurements: Estimating the Relevance of Observed Properties. In IEEE INFOCOM 2008. The 27th Conference on Computer Communications, 1660-1668. IEEE, April 13. doi:10.1109/INFOCOM.2008.227.
Maiya, Arun S. 2011. Sampling and Inference in Complex Networks. Chicago: University of Illinois at Chicago, April. http://arun.maiya.net/papers/asmthesis.pdf.
Pedarsani, Pedram, Daniel R. Figueiredo, and Matthias Grossglauser. 2008. Densification arising from sampling fixed graphs. In Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 205. ACM Press. doi:10.1145/1375457.1375481. http://portal.acm.org/citation.cfm?doid=1375457.1375481.
Stumpf, Michael P. H., Carsten Wiuf, and Robert M. May. 2005. “Subnets of scale-free networks are not scale-free: Sampling properties of networks.” Proceedings of the National Academy of Sciences of the United States of America 102 (12) (March 22): 4221-4224. doi:10.1073/pnas.0501179102.
Stutzbach, Daniel, Reza Rejaie, Nick Duffield, Subhabrata Sen, and Walter Willinger. 2009. “On Unbiased Sampling for Unstructured Peer-to-Peer Networks.” IEEE/ACM Transactions on Networking 17 (2) (April): 377-390. doi:10.1109/TNET.2008.2001730.
Effects of selection bias on historical/sociological research condensed bibliography:
Berk, Richard A. 1983. “An Introduction to Sample Selection Bias in Sociological Data.” American Sociological Review 48 (3) (June 1): 386-398. doi:10.2307/2095230.
Bryant, Joseph M. 1994. “Evidence and Explanation in History and Sociology: Critical Reflections on Goldthorpe’s Critique of Historical Sociology.” The British Journal of Sociology 45 (1) (March 1): 3-19. doi:10.2307/591521.
———. 2000. “On sources and narratives in historical social science: a realist critique of positivist and postmodernist epistemologies.” The British Journal of Sociology 51 (3) (September 1): 489-523. doi:10.1111/j.1468-4446.2000.00489.x.
Duncan Baretta, Silvio R., John Markoff, and Gilbert Shapiro. 1987. “The selective Transmission of Historical Documents: The Case of the Parish Cahiers of 1789.” Histoire & Mesure 2: 115-172. doi:10.3406/hism.1987.1328.
Goldthorpe, John H. 1991. “The Uses of History in Sociology: Reflections on Some Recent Tendencies.” The British Journal of Sociology 42 (2) (June 1): 211-230. doi:10.2307/590368.
———. 1994. “The Uses of History in Sociology: A Reply.” The British Journal of Sociology 45 (1) (March 1): 55-77. doi:10.2307/591525.
Jensen, Richard. 1984. “Review: Ethnometrics.” Journal of American Ethnic History 3 (2) (April 1): 67-73.
Kosso, Peter. 2009. Philosophy of Historiography. In A Companion to the Philosophy of History and Historiography, 7-25. http://onlinelibrary.wiley.com/doi/10.1002/9781444304916.ch2/summary.
Kreuzer, Marcus. 2010. “Historical Knowledge and Quantitative Analysis: The Case of the Origins of Proportional Representation.” American Political Science Review 104 (02): 369-392. doi:10.1017/S0003055410000122.
Lang, Gladys Engel, and Kurt Lang. 1988. “Recognition and Renown: The Survival of Artistic Reputation.” American Journal of Sociology 94 (1) (July 1): 79-109.
Lustick, Ian S. 1996. “History, Historiography, and Political Science: Multiple Historical Records and the Problem of Selection Bias.” The American Political Science Review 90 (3): 605-618. doi:10.2307/2082612.
Mariampolski, Hyman, and Dana C. Hughes. 1978. “The Use of Personal Documents in Historical Sociology.” The American Sociologist 13 (2) (May 1): 104-113.
Murphey, Murray G. 1973. Our Knowledge of the Historical Past. Macmillan Pub Co, January.
Murphey, Murray G. 1994. Philosophical foundations of historical knowledge. State Univ of New York Pr, July.
Rubin, Ernest. 1943. “The Place of Statistical Methods in Modern Historiography.” American Journal of Economics and Sociology 2 (2) (January 1): 193-210.
Schatzki, Theodore. 2006. “On Studying the Past Scientifically.” Inquiry 49 (4) (August): 380-399. doi:10.1080/00201740600831505.
Wellman, Barry, and Charles Wetherell. 1996. “Social network analysis of historical communities: Some questions from the present for the past.” The History of the Family 1 (1): 97-121. doi:10.1016/S1081-602X(96)90022-6.