Herein lies Part the Second of n posts introducing various network concepts. Part the First can be found here. From here on out, the posts will be cover only one topic at a time. I’ll occasionally use math, but will do my best to explain it from the ground up, assuming no previous knowledge of mathematical notation.
Node Degree: An Introduction
Today I’ll cover the deceptively simple concept of node degree. I say “deceptive” because, on the one hand, network degree can tell you quite a lot. On the other hand, degree can often lead one astray, especially as networks become larger and more complicated.
A node’s degreeis, simply, how many edges it is connected to. Generally, this also correlates to how many neighbors a node has, where a node’s neighborhood is those other nodes connected directly to it by an edge. In the network below, each node is labeled by its degree.
If you take a minute to study the network, something might strike you as odd. The bottom-right node, with degree 5, is connected to only four distinct edges, and really only three other nodes (four, including itself). Self-loops, which will be discussed later because they’re annoying and we hates them preciousss, are counted twice. A self-loop is any edge which starts and ends at the same node.
Why are self-loops counted twice? Well, as a rule of thumb you can say that, since the degree is the number of times the node is connected to an edge, and a self-loop connected to a node twice, that’s the reason. There are some more mathy reasons dealing with matrix representation, another topic for a later date. Suffice to say that many network algorithms will not work well if self-loops are only counted once.
The odd node out on the bottom left, with degree zero, is called an isolate. An isolate is any node with no edges.
At any rate, the concept is clearly simple enough. Count the number of times a node is connected to an edge, get the degree. If only getting higher education degrees were this easy.
Node degree is occasionally called degree centrality. Centrality is generally used to determine how important nodes are in a network, and lots of clever researchers have come up with lots of clever ways to measure it. “Importance” can mean a lot of things. In social networks, centrality can be the amount of influence or power someone has; in the U.S. electrical grid network, centrality might mean which power station should be removed to cause the most damage to the network.
The simplest way of measuring node importance is to just look at its degree. This centrality measurement at once seems deeply intuitive and extremely silly. If we’re looking at the social network of facebook, with every person a node connected by an edge to their friends, it’s no surprise that the most well-connected person is probably also the most powerful and influential in the social space. On the same token, though, degree centrality is such a coarse-grained measurement that it’s really anybody’s guess what exactly it’s measuring. It could mean someone has a lot of power; it could also mean that someone tried to become friends with absolutely everybody on facebook.
Degree Centrality Sampling Warnings
Degree works best as a measure of network centrality when you have full knowledge of the network. That is, a social network exists, and instead of getting some glimpse of it and analyzing just that, you have the entire context of the social network: all the friends, all the friends of friends, and so forth.
When you have an ego-network (a network of one person, like a list of all my friends and who among them are friends with one another), clearly the node with the highest centrality is the ego node itself. This knowledge tells you very little about whether that ego is actually central within the larger network, because you sampled the network such that the ego is necessarily the most central. Sampling strategies – how you pick which nodes and edges to collect – can fundamentally affect centrality scores.
A historian of science might generate a correspondence network from early modern letters currently held in Oxford’s library. In fact, this is currently happening, and the resulting resource will be invaluable. Unfortunately, centrality scores generated from nodes in that early modern letter writing network will more accurately reflect the whims of Oxford editors and collectors over the years, rather than the underlying correspondence network itself. Oxford scholars over the years selected certain collections of letters, be they from Great People or sent to or from Oxford, and that choice of what to hold at Oxford libraries will bias centrality scores toward Oxford-based scholars, Great People, and whatever else was selected for.
Similarly, the generation of a social network from a literary work will bias the recurring characters; characters that occur more frequently are simply statistically more likely to appear with more people, and as such will have the highest degrees. It is likely that the degree centrality and frequency of character occurrence are almost exactly correlated.
Of course, if what you’re looking for is the most central character in the novel or the most central figure from Oxford’s perspective, this measurement might be perfectly sufficient. The important thing is to be aware of the limitations of degree centrality, and the possible biasing effects from selection and sampling. Once those biases are explicit, careful and useful inferences can still be drawn.
Things get a bit more complicated when looking at document similarity networks. If you’ve got a network of books with edges connecting them based on whether they share similar topics or keywords, your degree centrality score will mean something very different. In this case, centrality could mean the most general book. Keep in mind that book length might affect these measurements as well; the longer a book is, the more likely (by chance alone) it will cover more topics. Thus, longer books may also appear to be more central, if one is not careful in generating the network.
Degree Centrality in Bi-Modal Networks
Recall that bi-modal networks are ones where there are two different types of nodes (e.g., articles and authors), and edges are relationships that bridge those types (e.g., authorships). In this example, the more articles an author has published, the more central she is. Degree centrality would have nothing to do, in this case, with the number of co-authorships, the position in the social network, etc.
With an even more multi-modal network, having many types of nodes, degree centrality becomes even less well defined. As the sorts of things a node can connect to increases, the utility of simply counting the number of connections a node has decreases.
Micro vs. Macro
Looking at the degree of an individual node, and comparing it against others in the network, is useful for finding out about the relative position of that node within the network. Looking at the degree of every node at once turns out to be exceptionally useful for talking about the network as a whole, and comparing it to others. I’ll leave a thorough discussion of degree distributions for a later post, but it’s worth mentioning them in brief here. The degree distribution shows how many nodes have how many edges.
As it happens, many real world networks exhibit something called “power-law properties” in their degree distributions. What that essentially means is that a small number of nodes have an exceptionally high degree, whereas most nodes have very low degrees. By comparing the degree distributions of two networks, it is possible to say whether they are structurally similar. There’s been some fantastic work comparing the degree distribution of social networks in various plays and novels to find if they are written or structured similarly.
For the entirety of this post, I’ve been talking about networks that were unweighted and undirected. Every edge counted just as much as every other, and they were all symmetric (a connection from A to B implies the same connection from B to A). Degree can be extended to both weighted and directed (asymmetric) networks with relative ease.
Combining degree with edge weights is often called strength. The strength of a node is the sum of the weights of its edges. For example, let’s say Steve is part of a weighted social network. The first time he interacts with someone, an edge is created to connect the two with a weight of 1. Every subsequent interaction incrementally increases the weight by 1, so if he’s interacted with Sally four times, Samantha two times, and Salvador six times, the edge weights between them are 4, 2, and 6 respectively.
In the above example, because Steve is connected to three people, his degree is 3. Because he is connected to one of them four times, another twice, and another six times, his weight is 4+2+6=8.
Combining degree with directed edges is also quite simple. Instead of one degree score, every node now has two different degrees: in-degree and out-degree. The in-degree is the number of edges pointing to a node, and the out-degree is the number of edges pointing away from it. If Steve borrowed money from Sally, and lent money to Samantha and Salvador, his in-degree might be 1 and his out-degree 2.
The degree of a node is really very simple: more connections, higher degree. However, this simple metric accounts for quite a great deal in network science. Many algorithms that analyze both node-level properties and network-level properties are closely correlated with degree and degree distribution. This is a pareto-like effect; a great deal about a network is driven by the degree of its nodes.
While degree-based results are often intuitive, it is worth pointing out that the prime importance of degree is a direct result of the binary network representation of nodes and edges. Interactions either happen or they don’t, and everything that is is a self-contained node or edge. Thus, how many nodes, how many edges, and which nodes have which edges will be the driving force of any network analysis. This is both a limitation and a strength; basic counts influence so much, yet they are apparently powerful enough to yield intuitive, interesting, and ultimately useful results.
A bunchofmyrecentposts have mentioned networks. Elijah Meeks not-so-subtly hinted that it might be a good idea to discuss some of the basics of networks on this blog, and I’m happy to oblige. He already introduced network visualizations on his own blog, and did a fantastic job of it, so I’m going to try to stick to more of the conceptual issues here, gearing the discussion generally toward people with little-to-no background in networks or math, and specifically to digital humanists interested in applying network analysis to their own work. This will be part of an ongoing series, so if you have any requests, please feel free to mention them in the comments below (I’ve already been asked to discuss how social networks apply to fictional worlds, which is probably next on the list).
A network is a fantastic tool in the digital humanist’s toolbox – one of many – and it’s no exaggeration to say pretty much any data can be studied via network analysis. With enough stretching and molding, you too could have a network analysis problem! As with many other science-derived methodologies, it’s fairly easy to extend the metaphor of network analysis into any number of domains.
The danger here is two-fold.
When you’re given your first hammer, everything looks like a nail. Networks can be used on any project. Networks should be used on far fewer. Networks in the humanities are experiencing quite the awakening, and this is due in part to until-recently untapped resources. There is a lot of low-hanging fruit out there on the networks+humanities tree, and they ought to be plucked by those brave and willing enough to do so. However, that does not give us an excuse to apply networks to everything. This series will talk a little bit about when hammers are useful, and when you really should be reaching for a screwdriver.
Methodology appropriation is dangerous. Even when the people designing a methodology for some specific purpose get it right – and they rarely do – there is often a score of theoretical and philosophical caveats that get lost when the methodology gets translated. In the more frequent case, when those caveats are not known to begin with, “borrowing” the methodology becomes even more dangerous. Ted Underwood blogs a great example of why literary historians ought to skip a major step in Latent Semantic Analysis, because the purpose of the literary historian is so very different from that of computer scientists who designed the algorithm. This series will attempt to point out some of the theoretical baggage and necessary assumptions of the various network methods it covers.
Nothing worth discovering has ever been found in safe waters. Or rather, everything worth discovering in safe waters has already been discovered, so it’s time to shove off into the dangerous waters of methodology appropriation, cognizant of the warnings but not crippled by them.
Anyone with a lot of time and a vicious interest in networks should stop reading right now, and instead pick up copies of Networks, Crowds, and Markets (Easley & Kleinberg, 2010) and Networks: An Introduction (Newman, 2010). The first is a non-mathy introduction to most of the concepts of network analysis, and the second is a more in depth (and formula-laden) exploration of those concepts. They’re phenomenal, essential, and worth every penny.
Those of you with slightly less time, but somehow enough to read my rambling blog (there are apparently a few of you out there), so good of you to join me. We’ll start with the really basic basics, but stay with me, because by part n of this series, we’ll be going over the really cool stuff only ninjas, Gandhi, and The Rolling Stones have worked on.
Generally, network studies are made under the assumption that neither the stuff nor the relationships are the whole story on their own. If you’re studying something with networks, odds are you’re doing so because you think the objects of your study are interdependent rather than independent. Representing information as a network implicitly suggests not only that connections matter, but that they are required to understand whatever’s going on.
Oh, I should mention that people often use the word “graph” when talking about networks. It’s basically the mathy term for a network, and its definition is a bit more formalized and concrete. Think dots connected with lines.
Because networks are studied by lots of different groups, there are lots of different words for pretty much the same concepts. I’ll explain some of them below.
Stuff (presumably) exists. Eggplants, true love, the Mary Celeste, tall people, and Terry Pratchett’s Thief of Time all fall in that category. Network analysis generally deals with one or a small handful of types of stuff, and then a multitude of examples of that type.
Say the type we’re dealing with is a book. While scholars might argue the exact lines of demarcation separating book from non-book, I think we can all agree that most of the stuff in my bookshelf are, in fact, books. They’re the stuff. There are different examples of books; a quotation dictionary, a Poe collection, and so forth.
I’ll call this assortment of stuff nodes. You’ll also hear them called vertices (mostly from the mathematicians and computer scientists), actors (from the sociologists), agents (from the modelers), or points (not really sure where this one comes from).
The type of stuff corresponds to the type of node. The individual examples are the nodes themselves. All of the nodes are books, and each book is a different node.
Nodes can have attributes. Each node, for example, may include the title, the number of pages, and the year of publication.
A list of nodes could look like this:
| Title | # of pages | year of publication |
| ----------------------------------------------------------- |
| Graphs, Maps, and Trees | 119 | 2005 |
| How The Other Half Lives | 233 | 1890 |
| Modern Epic | 272 | 1995 |
| Mythology | 352 | 1942 |
| Macroanalysis | unknown | 2011 |
We can get a bit more complicated and add more node types to the network. Authors, for example. Now we’ve got a network with books and authors (but nothing linking them, yet!). Franco Moretti and Graphs, Maps, and Trees are both nodes, although they are of different varieties, and not yet connected. We would have a second list of nodes, part of the same network, that might look like this:
| Author | Birth | Death |
| --------------------------------- |
| Franco Moretti | ? | n/a |
| Jacob A. Riis | 1849 | 1914 |
| Edith Hamilton | 1867 | 1963 |
| Matthew Jockers | ? | n/a |
A network with two types of nodes is called 2-mode, bimodal, or bipartite. We can add more, making it multimodal. Publishers, topics, you-name-it. We can even add seemingly unrelated node-types, like academic conferences, or colors of the rainbow. The list goes on. We would have a new list for each new variety of node.
Presumably we could continue adding nodes and node-types until we run out of stuff in the universe. This would be a bad idea, and not just because it would take more time, energy, and hard-drives than could ever possibly exist.
As it stands now, network science is ill-equipped to deal with multimodal networks. 2-mode networks are difficult enough to work with, but once you get to three or more varieties of nodes, most algorithms used in network analysis simply do not work. It’s not that they can’t work; it’s just that most algorithms were only created to deal with networks with one variety of node.
This is a trap I see many newcomers to network science falling into, especially in the Digital Humanities. They find themselves with a network dataset of, for example, authors and publishers. Each author is connected with one or several publishers (we’ll get into the connections themselves in the next section), and the up-and-coming network scientist loads the network into their favorite software and visualizes it. Woah! A network!
Then, because the software is easy to use, and has a lot of buttons with words that from a non-technical standpoint seem to make a lot of sense, they press those buttons to see what comes out. Then, they change the visual characteristics of the network based on the buttons they’ve pressed.
Let’s take a concrete example. Popular network software Gephi comes with a button that measures the centrality of nodes. Centrality is a pretty complicated concept that I’ll get into more detail later, but for now it’s enough to say that it does exactly what it sounds like; it finds how central, or important, each node is in a network. The newcomer to network analysis loads the author-publisher network into Gephi, finds the centrality of every node, and then makes the nodes bigger that have the highest centrality.
The issue here is that, although the network loads into Gephi perfectly fine, and although the centrality algorithm runs smoothly, the resulting numbers do not mean what they usually mean. Centrality, as it exists in Gephi, was fine-tuned to be used with single mode networks, whereas the author-publisher network is bimodal. Centrality measures have been made for bimodal networks, but those algorithms are not included with Gephi.
Most computer scientists working with networks do so with only one or a few types of nodes. Humanities scholars, on the other hand, are often dealing with the interactions of many types of things, and so the algorithms developed for traditional network studies are insufficient for the networks we often have. There are ways of fitting their algorithms to our networks, or vice-versa, but that requires fairly robust technical knowledge of the task at hand.
Besides dealing with the single mode / multimodal issue, humanists also must struggle with fitting square pegs in round holes. Humanistic data are almost by definition uncertain, open to interpretation, flexible, and not easily definable. Node types are concrete; your object either is or is not a book. Every book-type thing shares certain unchanging characteristics.
This reduction of data comes at a price, one that some argue traditionally divided the humanities and social sciences. If humanists care more about the differences than the regularities, more about what makes an object unique rather than what makes it similar, that is the very information they are likely to lose by defining their objects as nodes.
This is not to say it cannot be done, or even that it has not! People are clever, and network science is more flexible than some give it credit for. The important thing is either to be aware of what you are losing when you reduce your objects to one or a few types of nodes, or to change the methods of network science to fit your more complex data.
Relationships (presumably) exist. Friendships, similarities, web links, authorships, and wires all fall into this category. Network analysis generally deals with one or a small handful of types of relationships, and then a multitude of examples of that type.
Say the type we’re dealing with is an authorship. Books (the stuff) and authors (another kind of stuff) are connected to one-another via the authorship relationship, which is formalized in the phrase “X is an author of Y.” The individual relationships themselves are of the form “Franco Moretti is an author of Graphs, Maps, and Trees.”
Much like the stuff (nodes), relationships enjoy a multitude of names. I’ll call them edges. You’ll also hear them called arcs, links, ties, and relations. For simplicity sake, although edges are often used to describe only one variety of relationship, I’ll use it for pretty much everything and just add qualifiers when discussing specific types. The type of relationship corresponds to the type of edge. The individual examples are the edges themselves.
Individual edges are defined, in part, by the nodes that they connect.
A list of edges could look like this:
| Person | Is an author of |
| ----------------------------------------------------- |
| Franco Moretti | Modern Epic |
| Franco Moretti | Graphs, Maps, and Trees |
| Jacob A. Riis | How The Other Half Lives |
| Edith Hamilton | Mythology |
| Matthew Jockers | Macroanalysis |
Notice how, in this scheme, edges can only link two different types of nodes. That is, a person can be an author of a book, but a book cannot be an author of a book, nor can a person an author of a person. For a network to be truly bimodal, it must be of this form. Edges can go between types, but not among them.
This constraint may seem artificial, and in some sense it is, but for reasons I’ll get into in a later post, it is a constraint required by most algorithms that deal with bimodal networks. As mentioned above, algorithms are developed for specific purposes. Single mode networks are the ones with the most research done on them, but bimodal networks certainly come in a close second. They are networks with two types of nodes, and edges only going between those types.
Of course, the world humanists care to model is often a good deal more complicated than that, and not only does it have multiple varieties of nodes – it also has multiple varieties of edges. Perhaps, in addition to “X is an author of Y” type relationships, we also want to include “A collaborates with B” type relationships. Because edges, like nodes, can have attributes, an edge list combining both might look like this.
| Node1 | Node 2 | Edge Type |
| ----------------------------------------------------- | ----------------- |
| Franco Moretti | Modern Epic | is an author of |
| Franco Moretti | Graphs, Maps, and Trees | is an author of |
| Jacob A. Riis | How The Other Half Lives | is an author of |
| Edith Hamilton | Mythology | is an author of |
| Matthew Jockers | Macroanalysis | is an author of |
| Matthew Jockers | Franco Moretti | collaborates with |
Notice that there are now two types of edges: “is an author of” and “collaborates with.” Not only are they two different types of edges; they act in two fundamentally different ways. “X is an author of Y” is an asymmetric relationship; that is, you cannot switch out Node1 for Node2. You cannot say “Modern Epic is an author of Franco Moretti.” We call this type of relationship a directed edge, and we generally represent that visually using an arrow going from one node to another.
“A collaborates with B,” on the other hand, is a symmetric relationship. We can switch out “Matthew Jockers collaborates with Franco Moretti” with “Franco Moretti collaborates with Matthew Jockers,” and the information represented would be exactly the same. This is called an undirected edge, and is usually represented visually by a simple line connecting two nodes.
Most network algorithms and visualizations break down when combining these two flavors of edges. Some algorithms were designed for directed edges, like Google’s PageRank, whereas other algorithms are designed for undirected edges, like many centrality measures. Combining both types is rarely a good idea. Some algorithms will still run when the two are combined, however the results usually make little sense.
Both directed and undirected edges can also be weighted. For example, I can try to make a network of books, with those books that are similar to one another sharing an edge between them. The more similar they are, the heavier the weight of that edge. I can say that every book is similar to every other on a scale from 1 to 100, and compare them by whether they use the same words. Two dictionaries would probably connect to one another with an edge weight of 95 or so, whereas Graphs, Maps, and Trees would probably share an edge of weight 5 with How The Other Half Lives. This is often visually represented by the thickness of the line connecting two nodes, although sometimes it is represented as color or length.
It’s also worth pointing out the difference between explicit and inferred edges. If we’re talking about computers connected on a network via wires, the edges connecting each computer actually exist. We can weight them by wire length, and that length, too, actually exists. Similarly, citation linkages, neighbor relationships, and phone calls are explicit edges.
We can begin to move into interpretation when we begin creating edges between books based on similarity (even when using something like word comparisons). The edges are a layer of interpretation not intrinsic in the objects themselves. The humanist might argue that all edges are intrinsic all the way down, or inferred all the way up, but in either case there is a difference in kind between two computers connected via wires, and two books connected because we feel they share similar topics.
As such, algorithms made to work on one may not work on the other; or perhaps they may, but their interpretative framework must change drastically. A very central computer might be one in which, if removed, the computers will no longer be able to interact with one another; a very central book may be something else entirely.
As with nodes, edges come with many theoretical shortcomings for the humanist. Really, everything is probably related to everything else in its light cone. If we’ve managed to make everything in the world a node, realistically we’d also have some sort of edge between pretty much everything, with a lesser or greater weight. A network of nodes where almost everything is connected to almost everything else is called dense, and dense networks networks are rarely useful. Most network algorithms (especially ones that detect communities of nodes) work better and faster when the network is sparse, when most nodes are only connected to a small percentage of other nodes.
To make our network sparse, we often must artificially cut off which edges to use, especially with humanistic and inferred data. That’s what Shawn Graham showed us how to do when combining topic models with networks. The network was one of authors and topics; which authors wrote about which topics? The data itself connected every author to every topic to a greater or lesser degree, but such a dense network would not be very useful, so Shawn limited the edges to the highest weighted connections between an author and a topic. The resulting network looked like this, when it otherwise would have looked like a big ball of spaghetti and meatballs.
Unfortunately, given that humanistic data are often uncertain and biased to begin with, every arbitrary act of data-cutting has the potential to add further uncertainty and bias to a point where the network no longer provides meaningful results. The ability to cut away just enough data to make the network manageable, but not enough to lose information, is as much an art as it is a science.
Hypergraphs & Multigraphs
Mathematicians and computer scientists have actually formalized more complex varieties of networks, and they call them hypergraphs and multigraphs. Because humanities data are often so rich and complex, it may be more appropriate to represent it using these representations. Unfortunately, although ample research has been done on both, most out-of-the-box tools support neither. We have to build them for ourselves.
A hypergraph is one in which more than two nodes can be connected by one edge. A simple example would be a “is a sibling of” relationship, where the edge connected three sisters rather than two. This is a symmetric, undirected edge, but perhaps there can be directed edges as well, of the type “Alex convinced Betty to run away from Carl.”
A multigraph is one in which multiple edges can connect any two nodes. We can have, for example, a transportation graph between cities. A edge exists for every transportation route. Realistically, many routes can exist between any two cities; some by plane, several different highways, trains, etc.
I imagine both of these representations will be important for humanists going forward, but rather than relying on that computer scientist who keeps hanging out in the history department, we ourselves will have to develop algorithms that accurately capture exactly what it is we are looking for. We have a different set of problems, and though the solutions may be similar, they must be adapted to our needs.
Side note: RDF Triples
Digital humanities loves RDF. RDF basically works using something called a triple; a subject, a predicate, and an object. “Moretti is an author of Graphs, Maps, and Trees” is an example of a triple, where “Moretti” is the subject, “is an author of” is the predicate, and “Graphs, Maps, and Trees” is the object. As such, nearly all RDF documents can be represented as a directed network. Whether that representation would actually be useful depends on the situation.
Side note: Perspectives
Context is key, especially in the humanities. One thing the last few decades has taught us is that perspectives are essential, and any model of humanity that does not take into account its multifaceted nature is doomed to be forever incomplete. According to Alex, Betty and Carl are best friends. According to Carl, he can’t actually stand Betty. The structure and nature of a network might change depending on the perspective of a particular node, and I know of no model that captures this complexity. If you’re familiar with something that might capture this, or are working on it yourself, please let me know via comments or e-mail.
The above post discussed the simplest units of networks; the stuff and the relationships that connect them. Any network analysis approach must subscribe to and live with that duality of objects. Humanists face problems from the outset; data that does not fit neatly into one category or the other, complex situations that ought not be reduced, and methods that were developed with different purposes in mind. However, network analysis remains a viable methodology for answering and raising humanistic questions – we simply must be cautious, and must be willing to get our hands dirty editing the algorithms to suit our needs.
In the coming posts of this series, I’ll discuss various introductory topics including data representations, basic metrics like degree, centrality, density, clustering, and path length, as well as ways to link old network analysis concepts with common humanist problems. I’ll also try to highlight examples from the humanities, and raise methodological issues that come with our appropriation of somebody else’s algorithms.
This will probably be the longest of the posts, as some concepts are fairly central and must be discussed all-at-once. Again, if anybody has any particular concepts of network analysis they’d like to see discussed, please don’t hesitate to comment with your request.
Last post, I talked about combining textual and network analysis. Both are becoming standard tools in the methodological toolkit of the digital humanist, sitting next to GIS in what seems to be becoming the Big Three in computational humanities.
Data as Context, Data as Contextualized
Humanists are starkly aware that no particular aspect of a subject sits in a vacuum; context is key. A network on its own is a set of meaningless relationships without a knowledge of what travels through and across it, what entities make it up, and how that network interacts with the larger world. The network must be contextualized by the content. Conversely, the networks in which people and processes are situated deeply affect those entities: medium shapes message and topology shapes influence. The content must be contextualized by the network.
At the risk of the iPhonification of methodologies 1, textual, network, and geographic analysis may be combined with each other and traditional humanities research so that they might all inform one another. That last post on textual and network analysis was missing one key component for digital humanities: the humanities. Combining textual and network analysis with traditional humanities research (rather than merely using the humanities to inform text and network analysis, or vice-versa) promises to transform the sorts of questions asked and projects undertaken in Academia at large.
Just as networks can be used to contextualize text (and vice-versa), the same can be said of networks and maps (or texts and maps for that matter, or all three, but I’ll leave those for later posts). Now, instead of starting with the maps we all know and love, we’ll start by jumping into the deep end by discussing maps as any sort of representative landscape in which a network can be situated. In fact, I’m going to start off by using the network as a map against which certain relational properties can be overlaid. That is, I’m starting by using a map to contextualize a network, rather than the more intuitive other way around.
Using Maps to Contextualize a Network
The base map we’re discussing here is a map of science. They’ve made their rounds, so you’ve probably seen one, but just in case you haven’t here’s a brief description: some researchers (in this case Kevin Boyack and Richard Klavans) take tons on information from scholarly databases (in this case the Science Citation Index Expanded and the Social Science Citation Index) and create a network diagram from some set of metrics (in this case, citation similarity). They call this network representation a Map of Science.
We can debate about the merits of these maps ’till we’re blue in the face, but let’s avoid that for now. To my mind, the maps are useful, interesting, and incomplete, and the map-makers are generally well-aware of their deficiencies. The point here is that it is a map: a landscape against which one can situate oneself, and with which one may be able to find paths and understand the lay of the land.
In Boyack, Börner 2, and Klavans (2007), the three authors set out to use the map of science to explore the evolution of chemistry research. The purpose of the paper doesn’t really matter here, though; what matters is the idea of overlaying information atop a base network map.
The images above are the funding profiles of the NIH (National Institutes of Health) and NSF (National Science Foundation). The authors collected publication information attached to all the grants funded by the NSF and NIH and looked at how those publications cited one another. The orange edges show connections between disciplines on the map of science that were more prevalent within the context a particular funding agency than they were compared to the entire map of science. Boyack, Börner 3, and Klavans created a map and used it to contextualize certain funding agencies. They and other parties have since used such maps to contextualize universities, authors, disciplines, and other publication groups.
From Network Maps to Geographic Maps
Of course, the Where’s The Beef™ section of this post still has yet to be discussed, with the beef in this case being geography. How can we use existing topography to contextualize network topology? Network space rarely corresponds to geographic place, however neither of them alone can ever fully represent the landscape within which we are situated. A purely geographic map of ancient Rome would not accurately represent the world in which the ancient Romans lived, as it does not take into account the shortening of distances through well-trod trade routes.
Enter Stanford DH ninja Elijah Meeks. In two recent posts, Elijah discussed the topology/topography divide. In the first, he created a network layout algorithm which took a network with nodes originally placed in their geographic coordinates, and then distorted the network visualization to emphasize network distance. The visualization above shows the network laid out geographically. The one below shows the Imperial Roman trade routes with network distances emphasized. As Elijah says, “everything of the same color in the above map is the same network distance from Rome.”
Of course, the savvy reader has probably observed that this does not take everything into account. These are only land routes; what about the sea?
Elijah’s second post addressed just that, impressively applying GIS techniques to determine the likely route ships took to get from one port to another. This technique drives home the point he was trying to make about transitioning from network topology to network topography. The picture below, incidentally, is Elijah’s re-rendering of the last visualization taking into account both land and see routes. As you can see, the distance from any city to any other has decreased significantly, even taking into account his network-distance algorithm.
The above network visualization combines geography, two types of transportation routes, and network science to provide a more nuanced at-a-glance view of the Imperial Roman landscape. The work he highlighted in his post transitioning from topology to topography in edge shapes is also of utmost importance, however that topic will need to wait for another post.
The Republic of Letters (A Brief Interlude)
Elijah was also involved in another Stanford-based project, one very dear to my heart, Mapping the Republic of Letters. Much of my own research has dealt with the Republic of Letters, especially my time spent under Bob Hatch, and Paula Findlen, Dan Edelstein, and Nicole Coleman at Stanford have been heading up an impressive project on that very subject. I’ll go into more details about the Republic in another post (I know, promises promises), but for now the important thing to look at is their interface for navigating the Republic.
The team has gone well beyond the interface that currently faces the public, however even the original map is an important step. Overlaid against a map of Europe are the correspondences of many early modern scholars. The flow of information is apparent temporally, spatially, and through the network topology of the Republic itself. Now as any good explorer knows, no map is any substitute for a thorough knowledge of the land itself; instead, it is to be used for finding unexplored areas and for synthesizing information at a large scale. For contextualizing.
If you’ll allow me a brief diversion, I’d like to talk about tools for making these sorts of maps, now that we’re on the subject of letters. Elijah’s post on visualizing network distance included a plugin for Gephi to emphasize network distance. Gephi’s a great tool for making really pretty network visualizations, and it also comes with a small but potent handful of network analysis algorithms.
I’m on the development team of another program, the Sci² Tool, which shares a lot of Gephi’s functionality, although it has a much wider scope and includes algorithms for textual, geographic, and statistical analysis, as well as a somewhat broader range of network analysis algorithms.
This is by no means a suggestion to use Sci² over Gephi; they both have their strengths and weaknesses. Gephi is dead simple to use, produces the most beautiful graphs on the market, and is all-around fantastic software. They both excel in different areas, and by using them (and other tools!) together, it is possible to create maps combining geographic and network features without ever having to resort to programming.
The above image was generated by combining the Sci² Tool with Gephi. It is the correspondence network of Hugo Grotius, a dataset I worked on while at Huygens ING in The Hague. They are a great group, and another team doing fantastic Republic of Letters research, and they provided this letters dataset. We just developed this particular functionality in Sci² yesterday, so it will take a bit of time before we work out the bugs and release it publicly, however as soon as it is released I’ll be sure to post a full tutorial on how to make maps like the one above.
This ends the public service announcement.
These maps are not without their critics. Especially prevalent were questions along the lines of “But how is this showing me anything I didn’t already know?” or “All of this is just an artefact of population densities and standard trade routes – what are these maps telling us about the Republic of Letters?” These are legitimate critiques, however as mentioned before, these maps are still useful for at-a-glance synthesis of large scales of information, or learning something new about areas one is not yet an expert in. Another problem has been that the lines on the map don’t represent actual travel routes; those sorts of problems are beginning to be addressed by the type of work Elijah Meeks and other GIS researchers are doing.
To tackle the suggestion that these are merely representing population data, I would like to propose what I believe to be a novel idea. I haven’t published on this yet, and I’m not trying to claim scholarly territory here, but I would ask that if this idea inspires research of your own, please cite this blog post or my publication on the subject, whenever it comes out.
We have a lot of data. Of course it doesn’t feel like we have enough, and it never will, but we have a lot of data. We can use what we have, for example collecting all the correspondences from early modern Europe, and place them on a map like this one. The more data we have, the smaller time slices we can have in our maps. We create a base map that is a combination of geographic properties, statistical location properties, and network properties.
Start with a map of the world. To account for population or related correlations, do something similar to what Elijah did in this post, encoding population information (or average number of publications per city, or whatever else you’d like to account for) into the map. On top of that, place the biggest network of whatever it is that you’re looking at that you can find. Scholarly communication, citations, whatever. It’s your big Map of YourFavoriteThingHere. All of these together are your base map.
Atop that, place whatever or whomever you are studying. The correspondence of Grotius can be put on this map, like the NIH was overlaid atop the Map of Science, and areas would light up and become larger if they are surprising against the base map. Are there more letters between Paris and The Hague in the Grotius dataset then one would expect if the dataset was just randomly plucked from the whole Republic of Letters? If so, make that line brighter and thicker.
By combining geography, point statistics, and networks, we can create base maps against which we can contextualize whatever we happen to be studying. This is just one possible combination; base maps can be created from any of a myriad of sources of data. The important thing is that we, as humanists, ought to be able to contextualize our data in the same way that we always have. Now that we’re working with a lot more of it, we’re going to need help in those contextualizations. Base maps are one solution.
It’s worth pointing out one major problem with base maps: bias. Until recently, those Maps of Science making their way around the blogosphere represented the humanities as a small island off the coast of social sciences, if they showed them at all. This is because the primary publication venues of the arts and humanities were not represented in the datasets used to create these science maps. We must watch out for similar biases when constructing our own base maps, however the problem is significantly more difficult for historical datasets because the underrepresented are too dead to speak up. For a brief discussion of historical biases, you can read my UCLA presentation here.
putting every tool imaginable in one box and using them all at once ↩
Full disclosure: she’s my advisor. She’s also awesome. Hi Katy! ↩
According to Google Scholar, David Blei’s first topic modeling paper has received 3,540 citations since 2003. Everybody’s talking about topic models. Seriously, I’m afraid of visiting my parents this Hanukkah and hearing them ask “Scott… what’s this topic modeling I keep hearing all about?” They’re powerful, widely applicable, easy to use, and difficult to understand — a dangerous combination.
Since shortly after Blei’s first publication, researchers have been looking into the interplay between networks and topic models. This post will be about that interplay, looking at how they’ve been combined, what sorts of research those combinations can drive, and a few pitfalls to watch out for. I’ll bracket the big elephant in the room until a later discussion, whether these sorts of models capture the semantic meaning for which they’re often used. This post also attempts to introduce topic modeling to those not yet fully converted aware of its potential.
A brief history of topic modeling
In my recent post on IU’s awesome alchemy project, I briefly mentioned Latent Semantic Analysis (LSA) and Latent Dirichlit Allocation (LDA) during the discussion of topic models. They’re intimately related, though LSA has been around for quite a bit longer. Without getting into too much technical detail, we should start with a brief history of LSA/LDA.
The story starts, more or less, with a tf-idf matrix. Basically, tf-idf ranks words based on how important they are to a document within a larger corpus. Let’s say we want a list of the most important words for each article in an encyclopedia.
Our first pass is obvious. For each article, just attach a list of words sorted by how frequently they’re used. The problem with this is immediately obvious to anyone who has looked at word frequencies; the top words in the entry on the History of Computing would be “the,” “and,” “is,” and so forth, rather than “turing,” “computer,” “machines,” etc. The problem is solved by tf-idf, which scores the words based on how special they are to a particular document within the larger corpus. Turing is rarely used elsewhere, but used exceptionally frequently in our computer history article, so it bubbles up to the top.
LSA and pLSA
LSA utilizes these tf-idf scores 1 within a larger term-document matrix. Every word in the corpus is a different row in the matrix, each document has its own column, and the tf-idf score lies at the intersection of every document and word. Our computing history document will probably have a lot of zeroes next to words like “cow,” “shakespeare,” and “saucer,” and high marks next to words like “computation,” “artificial,” and “digital.” This is called a sparse matrix because it’s mostly filled with zeroes; most documents use very few words related to the entire corpus.
With this matrix, LSA uses singular value decomposition to figure out how each word is related to every other word. Basically, the more often words are used together within a document, the more related they are to one another. 2 It’s worth noting that a “document” is defined somewhat flexibly. For example, we can call every paragraph in a book its own “document,” and run LSA over the individual paragraphs.
The method was significantly improved by Puzicha and Hofmann (1999), who did away with the linear algebra approach of LSA in favor of a more statistically sound probabilistic model, called probabilistic latent semantic analysis (pLSA). Now is the part of the blog post where I start getting hand-wavy, because explaining the math is more trouble than I care to take on in this introduction.
Essentially, pLSA imagines an additional layer between words and documents: topics. What if every document isn’t just a set of words, but a set of topics? In this model, our encyclopedia article about computing history might be drawn from several topics. It primarily draws from the big platonic computing topic in the sky, but it also draws from the topics of history, cryptography, lambda calculus, and all sorts of other topics to a greater or lesser degree.
Now, these topics don’t actually exist anywhere. Nobody sat down with the encyclopedia, read every entry, and decided to come up with the 200 topics from which every article draws. pLSA infers topics based on what will hereafter be referred to as black magic. Using the dark arts, pLSA “discovers” a bunch of topics, attaches them to a list of words, and classifies the documents based on those topics.
Blei et al. (2003) vastly improved upon this idea by turning it into a generative model of documents, calling the model Latent Dirichlet allocation (LDA). By this time, as well, some sounder assumptions were being made about the distribution of words and document length — but we won’t get into that. What’s important here is the generative model.
Imagine you wanted to write a new encyclopedia entry, let’s say about digital humanities. Well, we now know there are three elements that make up that process, right? Words, topics, and documents. Using these elements, how would you go about writing this new article on digital humanities?
First off, let’s figure out what topics our article will consist of. It probably draws heavily from topics about history, digitization, text analysis, and so forth. It also probably draws more weakly from a slew of other topics, concerning interdisciplinarity, the academy, and all sorts of other subjects. Let’s go a bit further and assign weights to these topics; 22% of the document will be about digitization, 19% about history, 5% about the academy, and so on. Okay, the first step is done!
Now it’s time to pull out the topics and start writing. It’s an easy process; each topic is a bag filled with words. Lots of words. All sorts of words. Let’s look in the “digitization” topic bag. It includes words like “israel” and “cheese” and “favoritism,” but they only appear once or twice, and mostly by accident. More importantly, the bag also contains 157 appearances of the word “TEI,” 210 of “OCR,” and 73 of “scanner.”
So here you are, you’ve dragged out your digitization bag and your history bag and your academy bag and all sorts of other bags as well. You start writing the digital humanities article by reaching into the digitization bag (remember, you’re going to reach into that bag for 22% of your words), and you pull out “OCR.” You put it on the page. You then reach for the academy bag and reach for a word in there (it happens to be “teaching,”) and you throw that on the page as well. Keep doing that. By the end, you’ve got a document that’s all about the digital humanities. It’s beautiful. Send it in for publication.
Alright, what now?
So why is the generative nature of the model so important? One of the key reasons is the ability to work backwards. If I can generate an (admittedly nonsensical) document using this model, I can also reverse the process an infer, given any new document and a topic model I’ve already generated, what the topics are that the new document draws from.
Another factor contributing to the success of LDA is the ability to extend the model. In this case, we assume there are only documents, topics, and words, but we could also make a model that assumes authors who like particular topics, or assumes that certain documents are influenced by previous documents, or that topics change over time. The possibilities are endless, as evidenced by the absurd number of topic modeling variations that have appeared in the past decade. David Mimno has compiled a wonderful bibliography of many such models.
While the generative model introduced by Blei might seem simplistic, it has been shown to be extremely powerful. When a newcomer sees the results of LDA for the first time, they are immediately taken by how intuitive they seem. People sometimes ask me “but didn’t it take forever to sit down and make all the topics?” thinking that some of the magic had to be done by hand. It wasn’t. Topic modeling yields intuitive results, generating what really feels like topics as we know them 3, with virtually no effort on the human side. Perhaps it is the intuitive utility that appeals so much to humanists.
Topic Modeling and Networks
Topic models can interact with networks in multiple ways. While a lot of the recent interest in digital humanities has surrounded using networks to visualize how documents or topics relate to one another, the interfacing of networks and topic modeling initially worked in the other direction. Instead of inferring networks from topic models, many early (and recent) papers attempt to infer topic models from networks.
Topic Models from Networks
The first research I’m aware of in this niche was from McCallum et al. (2005). Their model is itself an extension of an earlier LDA-based model called the Author-Topic Model (Steyvers et al., 2004), which assumes topics are formed based on the mixtures of authors writing a paper. McCallum et al. extended that model for directed messages in their Author-Recipient-Topic (ART) Model. In ART, it is assumed that topics of letters, e-mails or direct messages between people can be inferred from knowledge of both the author and the recipient. Thus, ART takes into account the social structure of a communication network in order to generate topics. In a later paper (McCallum et al., 2007), they extend this model to one that infers the roles of authors within the social network.
Dietz et al. (2007) created a model that looks at citation networks, where documents are generated by topical innovation and topical inheritance via citations. Nallapati et al. (2008) similarly creates a model that finds topical similarity in citing and cited documents, with the added ability of being able to predict citations that are not present. Blei himself joined the fray in 2009, creating the Relational Topic Model (RTM) with Jonathan Chang, which itself could summarize a network of documents, predict links between them, and predict words within them. Wang et al. (2011) created a model that allows for “the joint analysis of text and links between [people] in a time-evolving social network.” Their model is able to handle situations where links exist even when there is no similarity between the associated texts.
Networks from Topic Models
Some models have been made that infer networks from non-networked text. Broniatowski and Magee (2010 & 2011) extended the Author-Topic Model, building a model that would infer social networks from meeting transcripts. They later added temporal information, which allowed them to infer status hierarchies and individual influence within those social networks.
Many times, however, rather than creating new models, researchers create networks out of topic models that have already been run over a set of data. There are a lot of benefits to this approach, as exemplified by the Newton’s Chymistry project highlighted earlier. Using networks, we can see how documents relate to one another, how they relate to topics, how topics are related to each other, and how all of those are related to words.
Elijah Meeks created a wonderful example combining topic models with networks in Comprehending the Digital Humanities. Using fifty texts that discuss humanities computing, Elijah created a topic model of those documents and used networks to show how documents, topics, and words interacted with one another within the context of the digital humanities.
Elijah Jeff Drouin has also created networks of topic models in Proust, as reported by Elijah.
Peter Leonard recently directed me to TopicNets, a project that combines topic modeling and network analysis in order to create an intuitive and informative navigation interface for documents and topics. This is a great example of an interface that turns topic modeling into a useful scholarly tool, even for those who know little-to-nothing about networks or topic models.
If you want to do something like this yourself, Shawn Graham recently posted a great tutorial on how to create networks using MALLET and Gephi quickly and easily. Prepare your corpus of text, get topics with MALLET, prune the CSV, make a network, visualize it! Easy as pie.
Networks can be a great way to represent topic models. Beyond simple uses of navigation and relatedness as were just displayed, combining the two will put the whole battalion of network analysis tools at the researcher’s disposal. We can use them to find communities of similar documents, pinpoint those documents that were most influential to the rest, or perform any of a number of other workflows designed for network analysis.
As with anything, however, there are a few setbacks. Topic models are rich with data. Every document is related to every other document, if some only barely. Similarly, every topic is related to every other topic. By deciding to represent document similarity over a network, you must make the decision of precisely how similar you want a set of documents to be if they are to be linked. Having a network with every document connected to every other document is scarcely useful, so generally we’ll make our decision such that each document is linked to only a handful of others. This allows for easier visualization and analysis, but it also destroys much of the rich data that went into the topic model to begin with. This information can be more fully preserved using other techniques, such as multidimensional scaling.
A somewhat more theoretical complication makes these network representations useful as a tool for navigation, discovery, and exploration, but not necessarily as evidentiary support. Creating a network of a topic model of a set of documents piles on abstractions. Each of these systems comes with very different assumptions, and it is unclear what complications arise when combining these methods ad hoc.
Although there may be issues with the process, the combination of topic models and networks is sure to yield much fruitful research in the digital humanities. There are some fantastic tutorials out there for getting started with topic modeling in the humanities, such as Shawn Graham’s post on Getting Started with MALLET and Topic Modeling, as well as on combining them with networks, such as this post from the same blog. Shawn is right to point out MALLET, a great tool for starting out, but you can also find the code used for various models on many of the model-makers’ academic websites. One code package that stands out is Chang’s implementation of LDA and related models in R.
Ted Underwood rightly points out in the comments that other scoring systems are often used in lieu of tf-idf, most frequently log entropy. ↩
Yes yes, this is a simplification of actual LSA, but it’s pretty much how it works. SVD reduces the size of the matrix to filter out noise, and then each word row is treated as a vector shooting off in some direction. The vector of each word is compared to every other word, so that every pair of words has a relatedness score between them. Ted Underwood has a great blog post about why humanists should avoid the SVD step. ↩
They’re not, of course. We’ll worry about that later. ↩
“Newton wrote and transcribed about a million words on the subject of alchemy.” —chymistry.org
Beside bringing us things like calculus, universal gravitation, and perhaps the inspiration for certain Pink Floyd albums, Isaac Newton spent many years researching what was then known as “chymistry,” a multifaceted precursor to, among other things, what we now call chemistry, pharmacology, and alchemy.
Researchers at Indiana University, notably William R. Newman, John A. Walsh, Dot Porter, and Wallace Hooper, have spent the last several years developing The Chymistry of Isaac Newton, an absolutely wonderful history of science resource which, as of this past month, has digitized all 59 of Newton’s alchemical manuscripts assembled by John Keynes in 1936. Among the sites features are heavily annotated transcriptions, manuscript images, often scholarly synopses, and examples of alchemical experiments. That you can try at home. That’s right, you can do alchemy with this website. They also managed to introduce alchemical symbols into unicode (U+1F700 – U+1F77F), which is just indescribably cool.
What I really want to highlight, though, is a brand new feature introduced by Wallace Hooper: automated Latent Semantic Analysis (LSA) of the entire corpus. For those who are not familiar with it, LSA is somewhat similar LDA, the algorithm driving the increasingly popular Topic Models used in Digital Humanities. They both have their strengths and weaknesses, but essentially what they do is show how documents and terms relate to one another.
In this case, the entire corpus of Newton’s alchemical texts is fed into the LSA implementation (try it for yourself), and then based on the user’s preferences, the algorithm spits out a network of terms, documents, or both together. That is, if the user chooses document-document correlations, a list is produced of the documents that are most similar to one another based on similar word use within them. That list includes weights – how similar are they to one another? – and those weights can be used to create a network of document similarity.
One of the really cool features of this new service is that it can export the network either as CSV for the technical among us, or as an nwb file to be loaded into the Network Workbench or the Sci² Tool. From there, you can analyze or visualize the alchemical networks, or you can export the files into a network format of your choice.
It’s great to see more sophisticated textual analyses being automated and actually used. Amber Welch recently posted on Moving Beyond the Word Cloud using the wonderful TAPoR, and Michael Widner just posted a thought-provoking article on using Voyeur Tools for the process of paper revision. With tools this easy to use, it won’t be long now before the first thing a humanist does when approaching a text (or a million texts) is to glance at all the high-level semantic features and various document visualizations before digging in for the close read.
As this blog is still quite new, and I’m still nigh-unknown, now would probably be a good time to mark my scholarly territory. Instead of writing a long description that nobody would read, I figured I’d take a cue from my own data-oriented research and analyze everything I’ve read over the last year. The pictures below give a pretty accurate representation of my research interests.
I’ll post a long tutorial on exactly how to replicate this later, but the process was fairly straightforward and required no programming or complicated data manipulation. First, I exported all my Zotero references since last October in BibTeX format, a common bibliographic standard. I imported that file into the Sci² Tool, a data analysis and visualization tool developed at the center I work in, and normalized all the words in the titles and abstracts. That is, “applied,” “applies” and “apply” were all merged into one entity. I got a raw count of word use and stuck it in everybody’s favorite word cloud tool, Wordle, and the results of that is the first image below. [Post-publication note: Angela does not approve of my word-cloud. I can’t say I blame her. Word clouds are almost wholly useless, but at least it’s still pretty.]
I then used Sci² to extract a word co-occurrence network, connecting two words if they appeared together within the title+abstract of a paper or book I’d read. If they appeared together once, they were appended with a score of 1, if they appeared together twice, 2, and so on. I then re-weighted the connections by exclusivity; that is, if two words appeared exclusively with one another, they scored higher. “Republ” appeared 32 times, “Letter” appeared 47 times, and 31 of those times they appeared together, so their connection is quite strong. On the other hand, “Scienc” appeared 175 times, “Concept” 120 times, but they only appeared together 32 times, so their connection is much weaker. “Republ” and “Letter” appeared with one another just as frequently as “Scienc” and “Concept,” but because “Scienc” and “Concept” were so much more widely used, their connection score is lower.
Once the general network was created, I loaded the data into Gephi, a great new network visualization tool. Gephi clustered the network based on what words co-occurred frequently, and colored the words and their connections based on that clustering. The results are below (click the image to enlarge it).
These images sum up my research interests fairly well, and a look at the network certainly splits my research into the various fields and subfields I often draw from. Neither of these graphics are particularly sophisticated, but they do give a good at-a-glance notion of the scholarly landscape from my perspective. In the coming weeks, I will post tutorials to create these and similar data visualizations or analyses with off-the-shelf tools, so stay-tuned.
UCLA’s Networks and Network Analysis for the Humanities this past weekend did not fail to impress. Tim Tangherlini and his mathemagical imps returned in true form, organizing a really impressively realized (and predictably jam-packed) conference that left the participants excited, exhausted, enlightened, and unanimously shouting for more next year (and the year after, and the year after that, and the year after that…) I cannot thank the ODH enough for facilitating this and similar events.
Some particular highlights included Graham Sack’s exceptionally robust comparative analysis of a few hundred early English novels (watch out for him, he’s going to be a Heavy Hitter), Sarah Horowitz‘s really convincing use of epistolary network analysis to weave the importance of women (specifically salonières) in holding together the fabric of French high society, Rob Nelson’s further work on the always impressive Mining the Dispatch, Peter Leonard‘s thoughtful and important discussion on combining text and network analysis (hint: visuals are the way to go), Jon Kleinberg‘s super fantastic wonderful keynote lecture, Glen Worthey‘s inspiring talk about not needing All Of It, Russell Horton’s rhymes, Song Chen‘s rigorous analysis of early Asian family ties, and, well, everyone else’s everything else.
Especially interesting were the discussions, raised most particularly by Kleinberg and Hoyt Long, about what particularly we were looking at when we constructed these networks. The union of so many subjective experiences surely is not the objective truth, but neither is it a proxy of objective truth – what, then, is it? I’m inclined to say that this Big Data aggregated from individual experiences provides us a baseline subjective reality that provides us local basins of attraction; that is, trends we see are measures of how likely a certain person will experience the world in a certain way when situated in whatever part of the network/world they reside. More thought and research must go into what the global and local meaning of this Big Data, and will definitely reveal very interesting results.
My talk on bias also seemed to stir some discussion. I gave up counting how many participants looked at me during their presentations and said “and of course the data is biased, but this is preliminary, and this is what I came up with and what justifies that conclusion.” And of course the issues I raised were not new; further, everybody in attendance was already aware of them. What I hoped my presentation to inspire, and it seems to have been successful, was the open discussion of data biases and constraints it puts on conclusions within the context of the presentation of those conclusions.
Some of us were joking that the issues of bias means “you don’t know, you can’t ever know what you don’t know, and you should just give up now.” This is exactly opposite to the point. As long as we’re open an honest about what we do not or cannot know, we can make claims around those gaps, inferring and guessing where we need to, and let the reader decide whether our careful analysis and historical inferences are sufficient to support the conclusions we draw. Honesty is more important than completeness or unshakable proof; indeed, neither of those are yet possible in most of what we study.
There was some twittertalk surrounding my presentation, so here’s my draft/notes for anyone interested (click ‘continue reading’ to view):
Last year, Tim Tangherlini and his magical crew of folkloric imps and applied mathematicians put together a most fantastic and exhausting workshop on networks and network analysis in the humanities. We called it #humnets for short. The workshop (one of the oh-so-fantastic ODH Summer Institutes) spanned two weeks, bringing together forward-thinking humanists and Big Deals in network science and computer science. Now, a year and a half later, we’re all reuniting (bouncing back?) at UCLA to show off all the fantastic network-y humanist-y projects we’ve come up with in the interim.
As of a few weeks ago, I was all set to present my findings from analyzing and modeling the correspondence networks of early-modern scholars. Unfortunately (for me, but perhaps fortunately for everyone else), some new data came in that Changed Everything and invalidated many of my conclusions. I was faced with a dilemma; present my research as it was before I learned about the new data (after all, it was still a good example of using networks in the humanities), or retool everything to fit the new data.
Unfortunately, there was no time to do the latter, and doing the former felt icky and dishonest. In keeping with Tony Beaver’s statement at UCLA last year (“Everything you can do I can do meta,”) I ultimately decided to present a paper on precisely the problem that foiled my presentation: systematic bias. Biases need not be an issue of methodology; you can do everything right methodologically, you can design a perfect experiment, and a systematic bias can still thwart the accuracy of a project. The bias can be due to the available observable data itself (external selection bias), it may be due to how we as researchers decide to collect that data (sample selection bias), or it may be how we decide to use the data we’ve collected (confirmation bias).
There is a small-but-growing precedent of literature on the effects of bias on network analysis. I’ll refer to it briefly in my talk at UCLA, but below is a list of the best references I’ve found on the matter. Most of them deal with sample selection bias, and none of them deal with the humanities.
For those of you who’ve read this far, congratulations! Here’s a preview of my Friday presentation (I’ll post the notes on Friday).
Effects of bias on network analysis condensed bibliography:
Achlioptas, Dimitris, Aaron Clauset, David Kempe, and Cristopher Moore. 2005. On the bias of traceroute sampling. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, 694. ACM Press. doi:10.1145/1060590.1060693. http://dl.acm.org/citation.cfm?id=1060693.
———. 2009. “On the bias of traceroute sampling.” Journal of the ACM 56 (June 1): 1-28. doi:10.1145/1538902.1538905.
Costenbader, Elizabeth, and Thomas W Valente. 2003. “The stability of centrality measures when networks are sampled.” Social Networks 25 (4) (October): 283-307. doi:10.1016/S0378-8733(03)00012-1.
Gjoka, M., M. Kurant, C. T Butts, and A. Markopoulou. 2010. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. In 2010 Proceedings IEEE INFOCOM, 1-9. IEEE, March 14. doi:10.1109/INFCOM.2010.5462078.
Gjoka, Minas, Maciej Kurant, Carter T Butts, and Athina Markopoulou. 2011. “Practical Recommendations on Crawling Online Social Networks.” IEEE Journal on Selected Areas in Communications 29 (9) (October): 1872-1892. doi:10.1109/JSAC.2011.111011.
Golub, B., and M. O. Jackson. 2010. “From the Cover: Using selection bias to explain the observed structure of Internet diffusions.” Proceedings of the National Academy of Sciences 107 (June 3): 10833-10836. doi:10.1073/pnas.1000814107.
Henzinger, Monika R., Allan Heydon, Michael Mitzenmacher, and Marc Najork. 2000. “On near-uniform URL sampling.” Computer Networks 33 (1-6) (June): 295-308. doi:10.1016/S1389-1286(00)00055-4.
Kim, P.-J., and H. Jeong. 2007. “Reliability of rank order in sampled networks.” The European Physical Journal B 55 (February 7): 109-114. doi:10.1140/epjb/e2007-00033-7.
Kurant, Maciej, Athina Markopoulou, and P. Thiran. 2010. On the bias of BFS (Breadth First Search). In Teletraffic Congress (ITC), 2010 22nd International, 1-8. IEEE, September 7. doi:10.1109/ITC.2010.5608727.
Lakhina, Anukool, John W. Byers, Mark Crovella, and Peng Xie. 2003. Sampling biases in IP topology measurements. In INFOCOM 2003. Twenty-Second Annual Joint Conference of the IEEE Computer and Communications. IEEE Societies, 1:332- 341 vol.1. IEEE, April 30. doi:10.1109/INFCOM.2003.1208685.
Latapy, Matthieu, and Clemence Magnien. 2008. Complex Network Measurements: Estimating the Relevance of Observed Properties. In IEEE INFOCOM 2008. The 27th Conference on Computer Communications, 1660-1668. IEEE, April 13. doi:10.1109/INFOCOM.2008.227.
Maiya, Arun S. 2011. Sampling and Inference in Complex Networks. Chicago: University of Illinois at Chicago, April. http://arun.maiya.net/papers/asmthesis.pdf.
Pedarsani, Pedram, Daniel R. Figueiredo, and Matthias Grossglauser. 2008. Densification arising from sampling fixed graphs. In Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 205. ACM Press. doi:10.1145/1375457.1375481. http://portal.acm.org/citation.cfm?doid=1375457.1375481.
Stumpf, Michael P. H., Carsten Wiuf, and Robert M. May. 2005. “Subnets of scale-free networks are not scale-free: Sampling properties of networks.” Proceedings of the National Academy of Sciences of the United States of America 102 (12) (March 22): 4221 -4224. doi:10.1073/pnas.0501179102.
Stutzbach, Daniel, Reza Rejaie, Nick Duffield, Subhabrata Sen, and Walter Willinger. 2009. “On Unbiased Sampling for Unstructured Peer-to-Peer Networks.” IEEE/ACM Transactions on Networking 17 (2) (April): 377-390. doi:10.1109/TNET.2008.2001730.
Effects of selection bias on historical/sociological research condensed bibliography:
Berk, Richard A. 1983. “An Introduction to Sample Selection Bias in Sociological Data.” American Sociological Review 48 (3) (June 1): 386-398. doi:10.2307/2095230.
Bryant, Joseph M. 1994. “Evidence and Explanation in History and Sociology: Critical Reflections on Goldthorpe’s Critique of Historical Sociology.” The British Journal of Sociology 45 (1) (March 1): 3-19. doi:10.2307/591521.
———. 2000. “On sources and narratives in historical social science: a realist critique of positivist and postmodernist epistemologies.” The British Journal of Sociology 51 (3) (September 1): 489-523. doi:10.1111/j.1468-4446.2000.00489.x.
Duncan Baretta, Silvio R., John Markoff, and Gilbert Shapiro. 1987. “The selective Transmission of Historical Documents: The Case of the Parish Cahiers of 1789.” Histoire & Mesure 2: 115-172. doi:10.3406/hism.1987.1328.
Goldthorpe, John H. 1991. “The Uses of History in Sociology: Reflections on Some Recent Tendencies.” The British Journal of Sociology 42 (2) (June 1): 211-230. doi:10.2307/590368.
———. 1994. “The Uses of History in Sociology: A Reply.” The British Journal of Sociology 45 (1) (March 1): 55-77. doi:10.2307/591525.
Jensen, Richard. 1984. “Review: Ethnometrics.” Journal of American Ethnic History 3 (2) (April 1): 67-73.
Kosso, Peter. 2009. Philosophy of Historiography. In A Companion to the Philosophy of History and Historiography, 7-25. http://onlinelibrary.wiley.com/doi/10.1002/9781444304916.ch2/summary.
Kreuzer, Marcus. 2010. “Historical Knowledge and Quantitative Analysis: The Case of the Origins of Proportional Representation.” American Political Science Review 104 (02): 369-392. doi:10.1017/S0003055410000122.
Lang, Gladys Engel, and Kurt Lang. 1988. “Recognition and Renown: The Survival of Artistic Reputation.” American Journal of Sociology 94 (1) (July 1): 79-109.
Lustick, Ian S. 1996. “History, Historiography, and Political Science: Multiple Historical Records and the Problem of Selection Bias.” The American Political Science Review 90 (3): 605-618. doi:10.2307/2082612.
Mariampolski, Hyman, and Dana C. Hughes. 1978. “The Use of Personal Documents in Historical Sociology.” The American Sociologist 13 (2) (May 1): 104-113.
Murphey, Murray G. 1973. Our Knowledge of the Historical Past. Macmillan Pub Co, January.
Murphey, Murray G. 1994. Philosophical foundations of historical knowledge. State Univ of New York Pr, July.
Rubin, Ernest. 1943. “The Place of Statistical Methods in Modern Historiography.” American Journal of Economics and Sociology 2 (2) (January 1): 193-210.
Schatzki, Theodore. 2006. “On Studying the Past Scientifically∗.” Inquiry 49 (4) (August): 380-399. doi:10.1080/00201740600831505.
Wellman, Barry, and Charles Wetherell. 1996. “Social network analysis of historical communities: Some questions from the present for the past.” The History of the Family 1 (1): 97-121. doi:10.1016/S1081-602X(96)90022-6.