Teaching Yourself to Code in DH

tl;dr Book-length introductions to programming or analytic methods (math / statistics / etc.) aimed at or useful for humanists with limited coding experience.


I’m collecting programming & methodological textbooks for humanists as part of a reflective study on DH, but figured it’d also be useful for those interested in teaching themselves to code, or teachers who need a textbook for their class. Though I haven’t read them all yet, I’ve organized them into very imperfect categories and provided (hopefully) some useful comments. Short coding exercises, books that assume some pre-existing knowledge of coding, and theoretical introductions are not listed here.

Thanks to @Literature_Geek, @ProgHist, @heatherfro, @electricarchaeo, @digitaldante, @kintopp, @dmimno, & @collinj for their contributions to the growing list. In the interest of maintaining scope, not all of their suggestions appear below.

Historical Analysis

  • The Programming Historian, 1st edition (2007). William J. Turkel and Alan MacEachern.
    • An open access introduction to programming in Python. Mostly web scraping and basic text analysis. Probably best to look to newer resources, due to the date. Although it’s aimed at historians, the methods are broadly useful to all text-based DH.
  • The Programming Historian, 2nd edition (ongoing). Afanador-Llach, Maria José, Antonio Rojas Castro, Adam Crymble, Víctor Gayol, Fred Gibbs, Caleb McDaniel, Ian Milligan, Amanda Visconti, and Jeri Wieringa, eds.
    • Constantly updating lessons, ostensibly aimed at historians, but useful to all of DH. Includes introductions to web development, text analysis, GIS, network analysis, etc. in multiple programming languages. Not a monograph, and no real order.
  • Computational Historical Thinking with Applications in R (ongoing). Lincoln Mullen.
    • A series of lessons in R, still under development with quite a few chapters missing. Probably the only programming book aimed at historians that actually focuses on historical questions and approaches.
  • The Rubyist Historian (2004). Jason Heppler.
    • A short introduction to programming in Ruby. Again, ostensibly aimed at historians, but really just focused on the fundamentals of coding, and useful in that context.
  • Natural Language Processing for Historical Texts (2012). Michael Piotrowski.
    • About natural language processing, but not an introduction to coding. Instead, an introduction to the methodological approaches of natural language processing specific to historical texts (OCR, spelling normalization, choosing a corpus, part of speech tagging, etc.). Teaches a variety of tools and techniques.
  • The Historian’s Macroscope (2015). Graham, Milligan, & Weingart.
    • Okay I’m cheating a bit here! This isn’t teaching you to program, but Shawn, Ian, and I spent a while writing this intro to digital methods for historians, so I figured I’d sneak a link in.

Literary & Linguistic Analysis

  • Text Analysis with R for Students of Literature (2014). Matthew Jockers.
    • Step-by-step introduction to learning R, specifically focused on literary text analysis, both for close and distant reading, with primers on the statistical approaches being used. Includes approaches to, e.g., word frequency distribution, lexical variety, classification, and topic modeling.
  • The Art of Literary Text Analysis (ongoing). Stéfan Sinclair & Geoffrey Rockwell.
    • A growing, interactive textbook similar in scope to Jockers’ book (close & distant reading in literary analysis), but in Python rather than R. Heavily focused on the code itself, and includes such methods as topic modeling and sentiment analysis.
  • Statistics for Corpus Linguistics (1998). Michael Oakes.
    • Don’t know anything about this one, sorry!

General Digital Humanities

Many of the above books are aimed at literary or historical analysis in name only, but are really useful for everyone in DH. The books below are similar in scope, but don’t aim themselves at any one particular group.

  • Humanities Data in R (2015). Lauren Tilton & Taylor Arnold.
    • General introduction to programming through R, and broadly focused on many approaches, including basic statistics, networks, maps, texts, and images. Teaches concepts and programmatic implementations.
  • Digital Research Methods with Mathematica (2015). William J. Turkel.
    • A Mathematica notebook (thus, not accessible unless you have an appropriate reader) teaching text, image, and geo-based analysis. Mathematica itself is an expensive piece of software without an institutional license, so this resource may be inaccessible to many learners. [NOTE: Arno Bosse wrote positive feedback on this textbook in a comment below.]
  • Exploratory Programming for the Arts and Humanities (2016). Nick Montfort.
    • An introduction to the fundamentals of programming specifically for the arts and humanities, using the languages Python and Processing, that works through statistics, text, sound, animation, images, and so forth. Much more expansive than many other options listed here, but less focused on the needs of text analysis (which is probably a good thing).
  • An Introduction to Text Analysis: A Coursebook (2016). Brandon Walsh & Sarah Horowitz.
    • A brief textbook with exercises and explanatory notes specific to text analysis for the study of literature and history. Not an introduction to programming, but covers some of the mathematical and methodological concepts used in these sorts of studies.
  • Python Programming for Humanists (ongoing). Folgert Karsdorp and Maarten van Gompel.
    • Interactive (Jupyter) notebooks teaching Python for statistical text analysis. Quite thorough, teaching methodological reasoning and examples, including quizzes and other lesson helpers, going from basic tokenization up through unsupervised learning, object-oriented programming, etc.

Statistical Methods & Machine Learning

  • Statistics for the Humanities (2014). John Canning.
    • Not an introduction to coding of any sort, but a solid intro to statistics geared toward the sort of stats needed by humanists (archaeologists, literary theorists, philosophers, historians, etc.). Reading it should give you a solid foundation in statistical methods (sampling, confidence intervals, bias, etc.).
  • Data Mining: Practical Machine Learning Tools and Techniques, 4th edition (2016). Witten, Frank, Hall, & Pal.
    • A practical intro to machine learning in Weka, Java-based software for data mining and modeling. Not aimed at humanists, but legible to the dedicated amateur. It really gets into the weeds of how machine learning works.
  • Text Mining with R (2017). Julia Silge and David Robinson.
    • Introduction to text mining aimed at data scientists in the statistical programming language R. Some knowledge of R is expected; the authors suggest using R for Data Science (2016) by Grolemund & Wickham to get up to speed. This is for those interested in current data science coding best-practices, though it does not get as in-depth as some other texts focused on literary text analysis. Good as a solid base to learn from.
  • The Curious Journalist’s Guide to Data (2016). Jonathan Stray.
    • Not an intro to programming or math, but rather a good guide to quantitatively thinking through evidence and argument. Aimed at journalists, but of potential use to more empirically-minded humanists.

Data Visualization, Web Development, & Related

  • D3.js in Action, 2nd edition (2017). Elijah Meeks.
    • Introduction to programmatic, online data visualization in JavaScript and the library D3.js. Not aimed at the humanities, but written by a digital humanist; easy to read and follow. The great thing about D3 is that it’s a library for visualizing data in whatever fashion you might imagine, so this is a good book for those who want to design their own visualizations rather than using off-the-shelf tools.
  • Drupal for Humanists (2016). Quinn Dombrowski.
    • Full-length introduction to Drupal, a web platform that allows you to build “environments for gathering, annotating, arranging, and presenting their research and supporting materials” on the web. Useful for those interested in getting started with the creation of web-based projects but who don’t want to dive head-first into from-scratch web development.
  • (Xe)LaTeX appliqué aux sciences humaines (2012). Maïeul Rouquette, Brendan Chabannes et Enimie Rouquette.
    • French introduction to LaTeX for humanists. LaTeX is the primary means scientists use to prepare documents (instead of MS Word or similar software), which allows for more sustainable, robust, and easily typeset scholarly publications. If humanists wish to publish in natural (or some social) science journals, this is an important skill.

Networks Demystified 9: Bimodal Networks

What do you think, is a year long enough to wait between Networks Demystified posts? I don’t think so, which is why it’s been a year and a month. Welcome back! A recent twitter back-and-forth culminated in a request for a discussion of “bimodal networks”, and my Networks Demystified series seemed like a perfect place for just such a discussion.

What’s a bimodal network, you ask? (Go on, ask aloud at your desk. Nobody will look at you funny, this is the age of Siri!) A bimodal network is one which connects two varieties of things. It’s also called a bipartite, 2-partite, or 2-mode network. A network of authors connected to the papers they write is bimodal, as are networks of books to topics, and people to organizations they are affiliated with.

A bimodal network.

This is a bimodal network which connects people and the clubs they belong to. Alice is a member of the Network Club and the We Love History Society, Bob‘s in the Network Club and the No Adults Allowed Club, and Carol‘s in the No Adults Allowed Club.

If this makes no sense, read my earlier Networks Demystified posts (the first two posts), or our Historian’s Macroscope chapter, for a primer on networks. If it does make sense, excellent! The rest of this post will hopefully take you out of your comfort zone, but remain understandable to someone who doesn’t speak math.

k-partite Networks & Projections

Bimodal networks are part of a larger class of k-partite networks. Unipartite/unimodal networks have only one type of node (remember, nodes are the stuff being connected by the edges), bipartite/bimodal networks have two types of nodes, tripartite/trimodal networks have three types of nodes, and so on to infinity.

The most common networks you’ll see being researched are unipartite. Who follows whom on Twitter? Who’s writing to whom in early modern Europe? What articles cite which other articles? All are examples of unipartite networks. It’s important to realize this isn’t necessarily determined by the dataset, but by the researcher doing the studying. For example, you can use the same organization affiliation dataset to create a unipartite network of who is in a club with whom, or a bipartite network of which person is affiliated with each organization.

The same dataset used to create a unipartite (left) and a bipartite (right) network.

The above illustration shows the same dataset used to create a unimodal and a bimodal network. The process of turning a pre-existing bimodal network into a unimodal network is called a bimodal projection. This process collapses one set of nodes into edges connecting the other set. In this case, because Alice and Bob are both members of the Network Club, the Network Club collapses into becoming an edge between those two people. The No Adults Allowed Club collapses into an edge between Bob and Carol. Because only Alice is a member of the We Love History Society, it does not collapse into an edge connecting any people.

You can also collapse the network in the opposite direction, connecting organizations who share people. No Adults Allowed and Network Club would share an edge (Bob), as would Network Club and We Love History Society (Alice).
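If you want to play with this yourself, here’s a minimal sketch in Python using networkx (my own toy code, not part of the original figures); the people and clubs simply mirror the illustration above.

```python
# A minimal sketch of the Alice / Bob / Carol affiliation network,
# built as a bipartite graph and projected in both directions.
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
people = ["Alice", "Bob", "Carol"]
clubs = ["Network Club", "We Love History Society", "No Adults Allowed Club"]
B.add_nodes_from(people, bipartite=0)  # one mode: people
B.add_nodes_from(clubs, bipartite=1)   # the other mode: clubs
B.add_edges_from([
    ("Alice", "Network Club"), ("Alice", "We Love History Society"),
    ("Bob", "Network Club"), ("Bob", "No Adults Allowed Club"),
    ("Carol", "No Adults Allowed Club"),
])

# Collapse the clubs into edges between people...
people_net = bipartite.projected_graph(B, people)
print(sorted(people_net.edges()))  # e.g. [('Alice', 'Bob'), ('Bob', 'Carol')]

# ...or collapse the people into edges between clubs.
club_net = bipartite.projected_graph(B, clubs)
print(sorted(club_net.edges()))
```

Note that the We Love History Society drops out of the people-to-people projection entirely, since Alice is its only member, just as described above.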

Why Bimodal Networks?

If the same dataset can be described with unimodal networks, which are less complex, why go to bi-, tri-, or multimodal? The answer to that is in your research question: different network representations suit different questions better.

Collaboration is a hot topic in bibliometrics. Who collaborates with whom? Why? Do your collaborators affect your future collaborations? Co-authorship networks are well-suited to some of these questions, since they directly connect collaborators who author a piece together. This is a unimodal network: I wrote The Historian’s Macroscope with Shawn Graham and Ian Milligan, so we draw an edge connecting each of us together.

Some of the more focused questions of collaboration, however, require a more nuanced view of the data. Let’s say you want to know how individual instances of collaboration affect individual research patterns going forward. In this case, you want to know more than the fact that I’ve co-authored two pieces with Shawn and Ian, and they’ve co-authored three pieces together.

For this added nuance, we can draw an edge from each of us to The Historian’s Macroscope (rather than to each other), then another set of edges to the piece we co-authored in The Programming Historian, and a last set of edges going from Shawn and Ian to the piece they wrote in the Journal of Digital Humanities. That’s three people nodes and three publication nodes.

Scott, Ian, and Shawn’s co-authorship network

Why Not Bimodal Networks?

Humanities data are often a rich array of node types: people, places, things, ideas, all connected to each other via a complex network. The trade-off is, the more complex and multimodal your dataset, the less you can reasonably do with it. This is one of the fundamental tensions between computational and traditional humanities. More categories lead to a richer understanding of the diversity of human experience, but are incredibly unhelpful when you want to count things.

Consider two pie-charts showing the religious makeup of the United States. The first chart groups together religions that fall under a similar umbrella, and the second does not. That is, the first chart groups religions like Calvinists and Lutherans together into the same pie slice (Protestants), and the second splits them into separate slices. The second, more complex chart obviously presents a richer picture of religious diversity in the United States, but it’s also significantly more difficult to read. It might trick you into thinking there are more Catholics than Protestants in the country, due to how the pie is split.

The same is true in network analysis. By creating a dataset with a hundred varieties of nodes, you lose your ability to see a bigger picture through meaningful aggregations.

Surely, you’re thinking, bimodal networks, with only two categories, should be fine! Wellllll, yes and no. You don’t bump into the same aggregation problem you do with very multimodal networks; instead, you bump into technical and mathematical issues. These issues are why I often warn non-technical researchers away from bimodal networks in research. They’re not theoretically unsound, they’re just difficult to work with properly unless you know what changes when you’re working with these complex networks.

The following section will discuss a few network metrics you may be familiar with, and what they mean for bimodal networks.

Network Metrics and Bimodality

The easiest thing to measure in a network is a node’s degree centrality. You’ll recall this is a measurement of how many edges are attached to a node, which gives a rough proxy for this concept we’ve come to call network “centrality”. It means different things depending on your data and your question: the most important or well-connected person in your social network; the point in the U.S. electrical grid which is most vulnerable to attack; the book that shares the most concepts with other books (the encyclopedia?); the city that the most traders pass through to get to their destination. These are all highly “central” in the networks they occupy.

A network with each node labeled with its degree centrality, via Wikipedia.

Degree centrality is the easiest such proxy to compute: how many connections does a node have? The idea is that nodes that are more highly connected are more central. The assumption only goes so far, and it’s easy to come up with nodes that are central but do not have a high degree, as with the network below.

The blue node is highly central, but only has a degree centrality of 3. [via]
That’s the thing with these metrics: if you know how they work, you know which networks they apply well to, and which they do not. If what you mean by “centrality” is “has more friends”, and we’re talking about a Facebook network, then degree centrality is a perfect metric for the job.

If what you mean is “an important stop for river trade”, and we’re talking about 12th century Russia, then degree centrality sucks. The below is an illustration of such a network by Pitts (1978):

Russian river trade routes. Numbers/nodes are cities, and edges are rivers between them.

Moscow is number 35, and pretty clearly the most central according to the above criteria (you’ll likely pass through it to reach other destinations). But it only has a degree centrality of four! Node 9 also has a degree centrality of four, but clearly doesn’t play as important a structural role as Moscow in this network.

We already see that depending on your question, your definitions, and your dataset, specific metrics will either be useful or not. Metrics may also change meaning entirely from one kind of network to the next, as when we move from unimodal to bimodal networks.

Consider what degree centrality means for the Alice, Bob, and Carol’s bimodal affiliation network above, where each is associated with a different set of clubs. Calculate the degree centralities in your head (hint: if you can’t, you haven’t learned what degree centrality means yet. Try again.).

Alice and Bob have a degree of 2, and Carol has a degree of 1. Is this saying anything about how central each is to the network? Not at all. Compare this to the unimodal projection, and you’ll see Bob is clearly the only structurally central actor in the network. In a bimodal network, degree centrality is nothing more than a count of affiliations with the other half of the network. It is much less likely to tell you something structurally useful than if you were looking at a unimodal network.
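You can check this for yourself on the toy network (my sketch again, in Python with networkx, not part of the original post):

```python
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph([
    ("Alice", "Network Club"), ("Alice", "We Love History Society"),
    ("Bob", "Network Club"), ("Bob", "No Adults Allowed Club"),
    ("Carol", "No Adults Allowed Club"),
])

# Degree in the bimodal network only counts club memberships...
print(dict(B.degree(["Alice", "Bob", "Carol"])))  # {'Alice': 2, 'Bob': 2, 'Carol': 1}

# ...while degree in the person-to-person projection reflects structure.
people_net = bipartite.projected_graph(B, ["Alice", "Bob", "Carol"])
print(dict(people_net.degree()))  # {'Alice': 1, 'Bob': 2, 'Carol': 1}
```

Bob only stands out once the clubs are collapsed away.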

Consider another common measurement: clustering coefficient. You’ll recall that a node’s local clustering coefficient is the extent to which its neighbors are neighbors to one another. If all my Facebook friends know each other, I have a high clustering coefficient; if none of them know each other, I have a low clustering coefficient. If all of a power plant’s neighbors directly connect to one another, it has a high clustering coefficient, and if they don’t, it has a low clustering coefficient.

Clustering coefficient, from largest to smallest. [via]
This measurement winds up being important for all sorts of reasons, but one way to interpret its meaning is as a proxy for the extent to which a node bridges diverse communities, the extent to which it is an important broker. In the 17th century, Henry Oldenburg was an important broker between disparate scholarly communities, in that he corresponded with people all across Europe, many of whom would never meet one another. The fact that they’d never meet is represented by the local clustering coefficient. It’s low, so we know his neighbors were unlikely to be neighbors of one another.

You can get creative (and network scientists often are) with what this metric means in the context of your own dataset. As long as you know how the algorithm works (taking the fraction of neighbors who are neighbors to one another), and the structural assumptions underlying your dataset, you can argue why clustering coefficient is a useful proxy for answering whatever question you’re asking.

Your argument may be pretty good, like if you say clustering coefficient is a decent (but not the best) proxy for revealing nodes that broker between disparate sections of a unimodal social network. Or your argument may be bad, like if you say clustering coefficient is a good proxy for organizational cohesion on the bimodal Alice, Bob, and Carol affiliation network above.

A thorough glance at the network, and a realization of our earlier definition of clustering coefficient (taking the fraction of neighbors who are neighbors to one another), should reveal why this is a bad justification. Alice’s clustering coefficient is zero. As is Bob’s. As is the Network Club’s. Every node has a clustering coefficient of zero, because no node’s neighbors connect to each other. That’s just the nature of bimodal networks: they connect across, rather than between, modes. Alice can never connect directly with Bob, and the Network Club can never connect directly with the We Love History Society.

Bob’s neighbors (the organizations) can never be neighbors with each other. There will never be a clustering coefficient as we defined it.
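You can verify this in a couple of lines (my sketch, same toy network as above):

```python
import networkx as nx

B = nx.Graph([
    ("Alice", "Network Club"), ("Alice", "We Love History Society"),
    ("Bob", "Network Club"), ("Bob", "No Adults Allowed Club"),
    ("Carol", "No Adults Allowed Club"),
])

# In a bipartite graph, no node's neighbors can be neighbors of one another,
# so the standard local clustering coefficient is 0.0 for every single node.
print(nx.clustering(B))
```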

In short, the simplest definition of clustering coefficient doesn’t work on bimodal networks. It’s obvious if you know how your network works, and how clustering coefficient is calculated, but if you don’t think about it before you press the easy “clustering coefficient” button in Gephi, you’ll be led astray.

Gephi doesn’t know if your network is bimodal or unimodal or ∞modal. Gephi doesn’t care. Gephi just does what you tell it to. You want Gephi to tell you the degree centralities in a bimodal network? Here ya go! You want it to give you the local clustering coefficients of nodes in a bimodal network? Voila! Everything still works as though these metrics would produce meaningful, sensible results.

But they won’t be meaningful on your network. You need to be your own network’s sanity check, and not rely on software to tell you something’s a bad idea. Think about your network, think about your algorithm, and try to work through what an algorithm means in the context of your data.

Using Bimodal Networks

This doesn’t mean you should stop using bimodal networks. Most of the easy network software out there comes with algorithms made for unimodal networks, but other algorithms exist and are available for more complex networks. Sometimes, though by no means always, you can project your bimodal network to a unimodal network, as described above, and run your unimodal algorithms on that new projection.

There are a number of times when this doesn’t work well. At 2,300 words, this tutorial is already too long, so I’ll leave thinking through why as an exercise for the reader. It’s less complicated than you’d expect, if you have a pen and paper and know how fractions work.

The better solution, usually, is to use an algorithm meant for bi- or multimodal networks. Tore Opsahl has put together a good primer on the subject with regard to clustering coefficient (slightly mathy, but you can get through it with ample use of Wikipedia). He argues that projection isn’t an optimal solution, and he gives a simple algorithm for finding bimodal clustering coefficients, along with directions for doing so in R. Essentially the algorithm extends the reach of the clustering coefficient, asking whether a node’s neighbors two hops away can also reach one another in two hops. Put another way, I don’t want to know what clubs Bob belongs to, but rather whether Alice and Carol can also connect to one another through a club.

It’s a bit difficult to write without the use of formulae, but looking at the bimodal network and thinking about what clustering coefficient ought to mean should get you on the right track.
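Opsahl’s own implementation lives in his R package, as he describes. If you happen to work in Python instead, networkx ships a related (though not identical) bipartite clustering measure, due to Latapy and colleagues, built on the same sort of two-hop reasoning; here’s a hedged sketch on the same toy network:

```python
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph([
    ("Alice", "Network Club"), ("Alice", "We Love History Society"),
    ("Bob", "Network Club"), ("Bob", "No Adults Allowed Club"),
    ("Carol", "No Adults Allowed Club"),
])

# Bipartite clustering (Latapy et al.): overlap between the neighborhoods of
# nodes two hops away, rather than direct neighbor-to-neighbor connections.
print(bipartite.clustering(B, mode="dot"))
```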

Bimodal networks aren’t an unsolved problem. If you search Google Scholar for bimodal, bipartite, and 2-mode networks, you’ll discover all sorts of clever methods for analyzing bimodal networks, including some great introductory texts by Borgatti and Everett.

The issue is there aren’t easy solutions through platforms like Gephi, and that’s probably on us as Digital Humanists.  I’ve found that DHers are much more likely to have bi- or multimodal datasets than most network researchers. If we want to be able to analyze them easily, we need to start developing our own plugins to Gephi, or our own tools, to do so. Push-button solutions are great if you know what’s happening when you push the button.

So let this be an addendum to my previous warnings against using bimodal networks: by all means, use them, but make sure you really think about the algorithms and your data, and what any given metric might imply when run on your network specifically. There are all sorts of free resources online you can find by googling your favorite algorithm. Use them.


For more information, read up on specific algorithms, methods, interpretations, etc. for two-mode networks from Tore Opsahl.

 

Digital History, Saturn’s Rings, and the Battle of Trafalgar

History and astronomy are a lot alike. When people claim history couldn’t possibly be scientific because you can’t do science without direct experimentation, astronomy is the immediate counterexample.

Astronomers and historians both view their subjects from great distances; too far to send instruments for direct measurement and experimentation. Things have changed a bit in the last century for astronomy, of course, with the advent of machines sensitive enough to create earth-based astronomical experiments. We’ve also built ships to take us to the farthest reaches, for more direct observations.

Voyager 1 Spacecraft, on the cusp of interstellar space. [via]
It’s unlikely we’ll invent a time machine any time soon, though, so historians are still stuck looking at the past in the same way we looked at the stars for so many thousands of years: through a glass, darkly. Like astronomers, we face countless observational distortions, twisting the evidence that appears before us until we’re left with an echo of a shadow of the past. We recreate the past through narratives, combining what we know of human nature with the evidence we’ve gathered, eventually (hopefully) painting ever-clearer pictures of a time we could never touch with our fingers.

Some take our lack of direct access as a good excuse to shake away all trappings of “scientific” methods. This seems ill-advised. Retaining what we’ve learned over the past 50 years about how we construct the world we see is important, but it’s not the whole story, and our situation has enough parallels with 17th-century astronomy that we might learn some lessons from that example.

Saturn’s Rings

In the summer of 1610, Galileo observed Saturn through a telescope for the first time. He wrote with surprise that

Galileo’s observation of Saturn through a telescope, 1610. [via]

the star of Saturn is not a single star, but is a composite of three, which almost touch each other, never change or move relative to each other, and are arranged in a row along the zodiac, the middle one being three times larger than the two lateral ones…

This curious observation would take half a century to resolve into what we today see as Saturn’s rings. Galileo wrote that others, using inferior telescopes, would report seeing Saturn as oblong, rather than as three distinct spheres. Lo and behold, within months, several observers reported an oblong Saturn.

Galileo’s Saturn in 1616.

What shocked Galileo even more, however, was an observation two years later, when the two smaller bodies disappeared entirely. They had appeared consistently, with every observation, and then one day, poof, they were gone. And when they eventually did come back, they looked remarkably odd.

Saturn sometimes looked as though it had “handles”, one connected to either side, but the nature of those handles was unknown to Galileo, as was the reason why sometimes it looked like Saturn had handles, sometimes moons, and sometimes nothing at all.

Saturn was just really damn weird. Take a look at these observations from Gassendi a few decades later:

Gassendi’s Saturn [via]
What the heck was going on? Many unsatisfying theories were put forward, but there was no real consensus.

Enter Christiaan Huygens, who in the 1650s was fascinated by the Saturn problem. He believed a better telescope was needed to figure out what was going on, and eventually got some help from his brother to build one.

The idea was successful. Within short order, Huygens developed the hypothesis that Saturn was encircled by a ring. This explanation, combined with the various angles from which we on Earth would be viewing Saturn and its ring, accounted for the multitude of appearances Saturn could take. The figure below explains this:

Huygens’ Saturn [via]
The explanation, of course, was not universally accepted. An opposing explanation by an anti-Copernican Jesuit contended that Saturn had six moons, the configuration of which accounted for the many odd appearances of the planet. Huygens countered that the only way such a hypothesis could be sustained would be with inferior telescopes.

While the exact details of the dispute are irrelevant, the proposed solution was very clever, and speaks to contemporary methods in digital history. The Accademia del Cimento devised an experiment that would, in a way, test the opposing hypotheses. They built two physical models of Saturn, one with a ring, and one with six satellites configured just-so.

The Model of Huygens’ Saturn [via]
In 1660, the experimenters at the academy put the model of a ringed Saturn at the end of a 75-meter / 250-foot hallway. Four torches illuminated the model but were obscured from observers, so they wouldn’t be blinded by the torchlight.  Then they had observers view the model through various quality telescopes from the other end of the hallway. The observers were essentially taken from the street, so they wouldn’t have preconceived notions of what they were looking at.

Depending on the distance and quality of the telescope, observers reported seeing an oblong shape, three small spheres, and other observations that were consistent with what astronomers had seen. When seen through a glass, darkly, a ringed Saturn does indeed form the most unusual shapes.

In short, the Accademia del Cimento devised an experiment, not to test the physical world, but to test whether an underlying reality could appear completely different through the various distortions that come along with how we observe it. If Saturn had rings, would it look to us as though it had two small satellites? Yes.

This did not prove Huygens’ theory, but it did show it to be a viable candidate given the observational instruments of the time. Within a short time, the ring theory became generally accepted.

The Battle of Trafalgar

So what’s Saturn’s ring have to do with the price of tea in China? What about digital history?

The importance is in the experiment and the model. You do not need direct access to phenomena, whether they be historical or astronomical, to build models, conduct experiments, or generally apply scientific-style methods to test, elaborate, or explore a theory.

In October 1805, Lord Nelson led the British navy to a staggering victory against the French and Spanish during the Napoleonic Wars. The win is attributed to Nelson’s unusual and clever battle tactic of dividing his forces into columns perpendicular to the single line of the enemy ships. Twenty-seven British ships defeated thirty-three Franco-Spanish ones. Nelson didn’t lose a single British ship, while the Franco-Spanish fleet lost twenty-two.

Horatio Nelson [via]
But let’s say the prevailing account is wrong. Let’s say, instead, due to the direction of the wind and the superior weaponry of the British navy, victory was inevitable: no brilliant naval tactician required.

This isn’t a question of counterfactual history; it’s simply a question of competing theories. But how can we support this new theory without venturing into counterfactual speculation? Obviously Nelson did lead the fleet, and obviously he did use novel tactics, and obviously a resounding victory ensued. These are indisputable historical facts.

It turns out we can use a similar trick to what the Accademia del Cimento devised in 1660: pretend as though things are different (Saturn has a ring; Nelson’s tactics did not win the battle), and see whether our observations would remain the same (Saturn looks like it is flanked by two smaller moons; the British still defeated the French and Spanish).

It turns out, further, that someone’s already done this. In 2003, two Italian physicists built a simulation of the Battle of Trafalgar, taking into account details of the ships, various strategies, wind direction, speed, and so forth. The simulation is a bit like a video game that runs itself: every ship has its own agency, with the ability to make decisions based on its environment, to attack and defend, and so forth.  It’s from a class of simulations called agent-based models.

When the authors directed the British ships to follow Lord Nelson’s strategy, of two columns, the fleet performed as expected: little loss of life on behalf of the British, major victory, and so forth. But when they ran the model without Nelson’s strategy, a combination of wind direction and superior British firepower still secured a British victory, even though the fleet was outnumbered.

…[it’s said] the English victory in Trafalgar is substantially due to the particular strategy adopted by Nelson, because a different plan would have led the outnumbered British fleet to lose for certain. On the contrary, our counterfactual simulations showed that English victory always occur unless the environmental variables (wind speed and direction) and the global strategies of the opposed factions are radically changed, which lead us to consider the British fleet victory substantially ineluctable.

Essentially, they tested assumptions of an alternative hypothesis, and found those assumptions would also lead to the observed results. A military historian might (and should) quibble with the details of their simplifying assumptions, but that’s all part of the process of improving our knowledge of the world. Experts disagree, replace simplistic assumptions with more informed ones, and then improve the model to see if the results still hold.

The Parable of the Polygons

This agent-based approach to testing theories about how society works is exemplified by the Schelling segregation model. This week the model shot to popularity through Vi Hart and Nicky Case’s Parable of the Polygons, a fabulous, interactive discussion of some potential causes of segregation. Go click on it, play through it, experience it. It’s worth it. I’ll wait.

Finished? Great! The model shows that massive segregation will still occur even if people move homes only when fewer than a third of their neighbors are the same color they are. That doesn’t seem like too absurd a notion: everyone being happy with 2/3rds of their neighbors as another color, and 1/3rd as their own, should lead to happy, well-integrated communities, right?

Wrong, apparently. It turns out that people wanting 33% of their neighbors to be the same color as they are is sufficient to cause segregated communities. Take a look at the community created in Parable of the Polygons under those conditions:

Parable of the Polygons

This shows that very light assumptions of racism can still easily lead to divided communities. It’s not making claims about racism, or about society: what it’s doing is showing that this particular model, where people want a third of their neighbors to be like them, is sufficient to produce what we see in society today. Much like Saturn having rings is sufficient to produce the observation of two small adjacent satellites.
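If you’d rather tinker with the mechanics than just click through, here’s a toy Schelling-style model in Python (my own sketch, not the code behind Parable of the Polygons; the grid size, empty fraction, and number of steps are arbitrary choices):

```python
import random

SIZE = 30          # 30 x 30 grid, wrapping at the edges
EMPTY_FRAC = 0.2   # roughly 20% of cells start empty
THRESHOLD = 1 / 3  # move if fewer than a third of neighbors share my color
STEPS = 100_000    # number of single-agent updates

def neighbors(grid, x, y):
    """Colors of the (up to) eight occupied cells around (x, y)."""
    cells = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if (dx, dy) != (0, 0):
                cells.append(grid.get(((x + dx) % SIZE, (y + dy) % SIZE)))
    return [c for c in cells if c is not None]

def unhappy(grid, x, y):
    nbrs = neighbors(grid, x, y)
    me = grid[(x, y)]
    return bool(nbrs) and sum(n == me for n in nbrs) / len(nbrs) < THRESHOLD

# Random starting arrangement of two "colors" plus some empty cells.
grid = {}
for x in range(SIZE):
    for y in range(SIZE):
        if random.random() > EMPTY_FRAC:
            grid[(x, y)] = random.choice(["triangle", "square"])
empties = [(x, y) for x in range(SIZE) for y in range(SIZE) if (x, y) not in grid]

for _ in range(STEPS):
    x, y = random.choice(list(grid))
    if unhappy(grid, x, y):
        dest = random.choice(empties)      # hop to a random empty cell
        empties.remove(dest)
        grid[dest] = grid.pop((x, y))
        empties.append((x, y))

# Average share of same-color neighbors: about 0.5 at the random start,
# noticeably higher after the unhappy agents have shuffled around.
shares = [sum(n == grid[(x, y)] for n in neighbors(grid, x, y)) / len(neighbors(grid, x, y))
          for (x, y) in grid if neighbors(grid, x, y)]
print(f"average same-color neighbor share: {sum(shares) / len(shares):.2f}")
```

Run it a few times and watch the final number: even with agents content to be a two-thirds minority, the grid sorts itself.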

More careful work is needed, then, to decide whether the model is an accurate representation of what’s going on, but establishing that base, that the model is a plausible description of reality, is essential before moving forward.

Digital History

Digital history is a ripe field for this sort of research. Like astronomers, we cannot (yet?) directly access what came before us, but we can still devise experiments to help support our research, in finding plausible narratives and explanations of the past. The NEH Office of Digital Humanities has already started funding workshops and projects along these lines, although they are most often geared toward philosophers and literary historians.

The person doing the most thoughtful theoretical work at the intersection of digital history and agent-based modeling is likely Marten Düring, who is definitely someone to keep an eye on if you’re interested in this area. An early innovator and strong practitioner in this field is Shawn Graham, who actively blogs about related issues.  This technique, however, is far from the only one available to historians for devising experiments with the past. There’s a lot we can still learn from 17th century astronomers.

Do historians need scientists?

[edit: I’m realizing I didn’t make it clear in this post that I’m aware many historians consider themselves scientists, and that there’s plenty of scientific historical archaeology and anthropology. That’s exactly what I’m advocating there be more of, and more varied.]

Short Answer: Yes.

Less Snarky Answer: Historians need to be flexible to fresh methods, fresh perspectives, and fresh blood. Maybe not that last one, I guess, as it might invite vampires. Okay, I suppose this answer wasn’t actually less snarky.

Long Answer

The long answer is that historians don’t necessarily need scientists, but that we do need fresh scientific methods. Perhaps as an accident of our association with the ill-defined “humanities”, or as a result of our being placed in an entirely different culture (see: C.P. Snow), most historians seem fairly content with methods rooted in thinking about text and other archival evidence. This isn’t true of all historians, of course – there are economic historians who use statistics, historians of science who recreate old scientific experiments, classical historians who augment their research with archaeological findings, archival historians who use advanced ink analysis,  and so forth. But it wouldn’t be stretching the truth to say that, for the most part, historiography is the practice of thinking cleverly about words to make more words.

I’ll argue here that our reliance on traditional methods (or maybe more accurately, our odd habit of rarely discussing method) is crippling historiography, and is making it increasingly likely that the most interesting and innovative historical work will come from non-historians. Sometimes these studies are ill-informed, especially when the authors decide not to collaborate with historians who know the subject, but to argue that a few ignorant claims about history negate the impact of these new insights is an exercise in pedantry.

In defending the humanities, we like to say that scientists and technologists with liberal arts backgrounds are more well-rounded, better citizens of the world, more able to contextualize their work. Non-humanists benefit from a liberal arts education in pretty much all the ways that are impossible to quantify (and thus, extremely difficult to defend against budget cuts). We argue this in the interest of rounding a person’s knowledge, to make them aware of their past, of their place in a society with staggering power imbalances and systemic biases.

Humanities departments should take a page from their own books. Sure, a few general ed requirements force some basic science and math… but I got an undergraduate history degree in a nice university, and I’m well aware how little STEM I actually needed to get through it. Our departments are just as guilty of narrowness as those of our STEM colleagues, and often because of it, we rely on applied mathematicians, statistical physicists, chemists, or computer scientists to do our innovative work for (or sometimes, thankfully, with) us.

Of course, there’s still lots of innovative work to be done from a textual perspective. I’m not downplaying that. Not everyone needs to use crazy physics/chemistry/computer science/etc. methods. But there’s a lot of low hanging fruit at the intersection of historiography and the natural sciences, and we’re not doing a great job of plucking it.

The story below is illustrative.

Gutenberg

Last night, Blaise Agüera y Arcas presented his research on Gutenberg to a packed house at our rare books library. He’s responsible for a lot of the cool things that have come out of Microsoft in the last few years, and just got a job at Google, where presumably he will continue to make cool things. Blaise has degrees in physics and applied mathematics. And, a decade ago, Blaise and historian/librarian Paul Needham sent ripples through the History of the Book community by showing that Gutenberg’s press did not work at all the way people expected.

It was generally assumed that Gutenberg employed a method called punchcutting in order to create a standard font. A letter carved into a metal rod (a “punch”) would be driven into a softer metal (a “matrix”) in order to create a mold. The mold would be filled with liquid metal which hardened to form a small block of a single letter (a “type”), which would then be loaded onto the press next to other letters, inked, and then impressed onto a page. Because the mold was metal, many duplicate “types” could be made of the same letter, thus allowing many uses of the same letter to appear identical on a single pressed page.

Punch matrix system. [via]
Type to be pressed. [via]
This process is what allowed all the duplicate letters to appear identical in Gutenberg’s published books. Except, of course, careful historians of early print noticed that letters weren’t, in fact, identical. In the 1980s, Paul Needham and a colleague attempted to produce an inventory of all the different versions of letters Gutenberg used, but they stopped after frequently finding 10 or more obviously distinct versions of the same letter.

Needham’s inventory of Gutenberg type. [via]
This was perplexing, but the subject was bracketed away for a while, until Blaise Agüera y Arcas came to Princeton and decided to work with Needham on the problem. Using extremely high-resolution imaging techniques, Blaise noted that there were in fact hundreds of versions of every letter. Not only that, there were actually variations and regularities in the smaller elements that made up letters. For example, an “n” was formed by two adjacent vertical lines, but occasionally the two vertical lines seem to have flipped places entirely. The extremely basic letter “i” itself had many variations, but within those variations, many odd self-similarities.

Variations in the letter “i” in Gutenberg’s type. [via]
Historians had, until this analysis, assumed most letter variations were due to wear of the type blocks. This analysis blew that hypothesis out of the water. These “i”s were clearly not all made in the same mold; but then, how had they been made? To answer this, they looked even closer at the individual letters.

 

Close up of Gutenberg letters, with light shining through page. [via]
It’s difficult to see at first glance, but they found something a bit surprising. The letters appeared to be formed of overlapping smaller parts: a vertical line, a diagonal box, and so forth. The below figure shows a good example of this. The glyphs on the bottom have a stem dipping below the bottom horizontal line, while the glyphs at the top do not.

Abbreviation of ‘per’. [via]
The conclusion Needham and Agüera y Arcas drew, eventually, was that the punchcutting method must not have been used for Gutenberg’s early material. Instead, a set of carved “strokes” were pushed into hard sand or soft clay, configured such that the strokes would align to form various letters, not unlike the formation of cuneiform. This mold would then be used to cast letters, creating the blocks we recognize from movable type. The catch is that this soft clay could only cast letters a few times before it became unusable and would need to be recreated. As Gutenberg needed multiple instances of individual letters per page, many of those letters would be cast from slightly different soft molds.

Low-Hanging Fruit

At the end of his talk, Blaise made an offhand comment: how is it that historians/bibliographers/librarians have been looking at these Gutenbergs for so long, discussing the triumph of their identical characters, and not noticed that the characters are anything but uniform? Or, of those who had noticed it, why hadn’t they raised any red flags?

The insights they produced weren’t staggering feats of technology. He used a nice camera, a light shining through the pages of an old manuscript, and a few simple image recognition and clustering algorithms. The clustering part could even have been done by hand, and actually had been, by Paul Needham. And yes, it’s true, everything is obvious in hindsight, but there were a lot of eyes on these bibles, and odds are if some of them had been historians who were trained in these techniques, this insight could have come sooner. Every year students do final projects and theses and dissertations, but what percent of those use techniques from outside historiography?
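To give a sense of just how low-tech that kind of analysis can be, here is a rough sketch of the general approach in Python (emphatically not Agüera y Arcas and Needham’s actual pipeline): flatten same-size crops of a single character into pixel vectors and cluster them. The folder name, file pattern, and choice of ten clusters are placeholders, and real glyphs would need alignment and normalization first.

```python
# Cluster crops of a single character to count distinct versions of the letter.
# Assumes a folder of same-size grayscale crops, e.g. glyphs/i_*.png (hypothetical).
from pathlib import Path

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

images = sorted(Path("glyphs").glob("i_*.png"))
X = np.stack([np.asarray(Image.open(p).convert("L"), dtype=float).ravel()
              for p in images])

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
for path, label in zip(images, kmeans.labels_):
    print(path.name, "-> cluster", label)
```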

In short, there are a lot of very basic assumptions we make about the past that could probably be updated significantly if we had the right skillset, or knew how to collaborate with those who did. I think people like William Newman, who performs Newton’s alchemical experiments, are on the right track. As is Shawn Graham, who reanimates the trade networks of ancient Rome using agent-based simulations, or Devon Elliott, who creates computational and physical models of objects from the history of stage magic. Elliott’s models have shown that certain magic tricks couldn’t possibly have worked as they were described to.

The challenge is how to encourage this willingness to reach outside traditional historiographic methods to learn about the past. Changing curricula to be more flexible is one way, but that is a slow and institutionally difficult process. Perhaps faculty could assign group projects to students taking their gen-ed history courses, encouraging disciplinary mixes and non-traditional methods. It’s an open question, and not an easy one, but it’s one we need to tackle.

Networks Demystified 8: When Networks are Inappropriate

A few hundred years ago, I promised to talk about when not to use networks, or when networks are used improperly. With The Historian’s Macroscope in the works, I’ve decided to finally start answering that question, and this Networks Demystified is my first attempt at doing so. If you’re new here, this is part of an annoyingly long series (1 network basics, 2 degree, 3 power laws, 4 co-citation analysis, 5 communities and PageRank, 6 this space left intentionally blank, 7 co-citation analysis II). I’ve issued a lot of vague words of caution without doing a great job of explaining them, so here is the first substantive part of that explanation.

Networks are great. They allow you to do things like understand the role of postal routes in the circulation of knowledge in early modern Europe, or the spread of the Black Death in the Middle Ages, or the diminishing importance of family ties in later Chinese governments. They’re versatile, useful, and pretty easy in today’s software environment. And they’re sexy, to boot. I mean, have you seen this visualization of curved lines connecting U.S. cities? I don’t even know what it’s supposed to represent, but it sure looks pretty enough to fund!

A really pretty network visualization. [via]

So what could possibly dissuade you from using a specific network, or the concept of networks in general? A lot of things, it turns out, and even a big subset of things that belong only to historians. I won’t cover all of them here, but I will mention a few big ones.

An Issue of Memory Loss

Okay, I lied about not knowing what the above network visualization represents. It turns out it’s a network of U.S. air travel pathways; if a plane goes from one city to another, an edge connects the two cities together. Pretty straightforward. And pretty useful, too, if you want to model something like the spread of an epidemic. You can easily see how someone with the newest designer virus flying into Texas might infect half-a-dozen people at the airport, who would in turn travel to other airports, and quickly infect most parts of the country with major airports. Transportation networks like this are often used by the CDC for just such a purpose, to determine what areas might need assistance/quarantine/etc.

The problem is that, although such a network might be useful for epidemiology, it’s not terribly useful for other seemingly intuitive questions. Take migration patterns: you want to know how people travel. I’ll give you another flight map that’s a bit easier to read.

Flight patterns over U.S. [via]
The first thing people tend to do when getting their hands on a new juicy network dataset is to throw it into their favorite software suite (say, Gephi) and run a bunch of analyses on it. Of those, people really like things like PageRank or Betweenness Centrality, which can give the researcher a sense of important nodes in the network based on how central they are; in this case, how many flights have to go through a particular city in order to get where they eventually intend to go.

Let’s look at Las Vegas. By anyone’s estimation it’s pretty important; well-connected to cities both near and far, and pretty central in the southwest. If I want to go from Denver to Los Angeles and a direct flight isn’t possible, Las Vegas seems to be the way to go. If we also had road networks, train networks, cell-phone networks, email networks, and so forth all overlaid on top of this one, looking at how cities interact with each other, we might be able to begin to extrapolate other information like how rumors spread, or where important trade hubs are.
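To make that structural reading concrete, here’s a toy example in Python with networkx; the routes are invented for illustration, not real flight data.

```python
import networkx as nx

# Invented toy routes: structurally, Las Vegas bridges Denver and Los Angeles.
flights = nx.Graph([
    ("Denver", "Las Vegas"), ("Las Vegas", "Los Angeles"),
    ("Denver", "Chicago"), ("Chicago", "Atlanta"), ("Atlanta", "Los Angeles"),
    ("Las Vegas", "Phoenix"),
])

# Betweenness dutifully reports Las Vegas as an important bridge,
# based purely on the structure of the routes.
print(nx.betweenness_centrality(flights))
```

Whether anyone actually travels through Las Vegas is another matter entirely.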

Here’s the problem: network structures are deceitful. They come with a few basic assumptions that are very helpful in certain areas, but extremely dangerous in others, and they are the reason why you shouldn’t analyze a network without thinking through what you’re implying by fitting your data to the standard network model. In this case, the assumption to watch out for is what’s known as a lack of memory.

The basic networks you learn about, with nodes and edges and maybe some attributes, embed no information on how those networks are generally traversed. They have no memories. For the purposes of disease tracking, this is just fine: all epidemiologists generally need to know is whether two people might accidentally happen to find themselves in the same place at the same time, and where they individually go from there. The structure of the network is enough to track the spread of a disease.

For tracking how people move, or how information spreads, or where goods travel, structure alone is rarely enough. It turns out that Las Vegas is basically a sink, not a hub, in the world of airline travel. People who travel there tend to stay for a few days before traveling back home. The fact that it happens to sit between Colorado and California is meaningless, because people tend not to go through Vegas to get from one to the other, even though individually, people from both states travel there with some frequency.

If the network had a memory to it, if it somehow knew not just that a lot of flights tended to go between Colorado and Vegas and between LA and Vegas, but also that the people who went to Vegas returned to where they came from, then you’d be able to see that Vegas isn’t the same sort of hub that, say, Atlanta is. Travel involving Vegas tends to be to or from, rather than through. In truth, all cities have their own unique profiles, and some may be extremely central to the network without necessarily being centrally important in questions about that network (like human travel patterns).

The same might be true of letter-writing networks in early modern Europe, my research of choice. We often find people cropping up as extremely central, connecting very important figures whom we did not previously realize were connected, only to find out later that, well, it’s not exactly what we thought. This new central figure, we’ll call him John Smith, happened to be the cousin of an important statesman, the neighbor of a famous philosopher, and the one-time business partner of some lawyer. None of the three ever communicated with John about any of the others, and though he was structurally central in the network, he was no one of any historical note. Because the network has no memory of the fact that information didn’t flow through John, only to or from him, my centrality measurements can often be far from the mark.

It turns out that in letter-writing networks, people have separate spheres: they tend to write about family with family members, their governmental posts with other officials, and their philosophies with other philosophers. The overarching structure we see obscures partitions between communities that seem otherwise closely-knit. When researching with networks, especially going from the visualization to the analysis phase, it’s important to keep in mind what the algorithms you use do, and what assumptions they and your network structure embed in the evidence they provide.

Sometimes, the only network you have might be the wrong network for the job. I have a lot of peers (me included) who try to understand the intellectual landscape of early modern Europe using correspondence networks, but this is a poor proxy indeed for what we are trying to understand. Because of the spurious structural connections, like that of our illustrious John Smith, early modern networks give us a sense of unity that might not have been present at the time.

And because we’re only looking on one axis (letters), we get an inflated sense of the importance of spatial distance in early modern intellectual networks. Best friends never wrote to each other; they lived in the same city and drank in the same pubs; they could just meet on a sunny afternoon if they had anything important to say. Distant letters were important, but our networks obscure the equally important local scholarly communities.

If there’s a moral to the story, it’s that there are many networks that can connect the same group of nodes, and many questions that can be asked of any given network, but before trying to use networks to study history, you should be careful to make sure the questions match the network.

Multimodality

As humanists asking humanistic questions, our networks tend to be more complex than the sort originally explored in network science. We don’t just have people connected to people or websites to websites, we’ve got people connected to institutions to authored works to ideas to whatever else, and we want to know how they all fit together. Cue the multimodal network, or a network that includes several types of nodes (people, books, places, etc.).

I’m going to pick on Elijah Meeks’ map of the DH2011 conference, because I know he didn’t actually use it to commit the sins I’m going to discuss. His network connected participants in the conference with their institutional affiliations and the submissions they worked on together.

Part of Elijah Meeks’ map of DH2011. [via]
From a humanistic perspective, and especially from a Latourian one, these multimodal networks make a lot of sense. There are obviously complex relationships between many varieties of entities, and the promise of networks is to help us understand these relationships. The issue here, however, is that many of the most common metrics you’ll find in tools like Gephi were not created for multimodal networks, and many of the basic assumptions of network research need to be re-aligned in light of this type of use.

Let’s take the local clustering coefficient as an example. It’s a measurement often used to see if a particular node spans several communities, and it’s calculated by seeing how many of a node’s connections are connected to each other. More concretely, if all of my friends were friends with one another, I would have a high local clustering coefficient; if, however, my friends tended not to be friends with one another, and I was the only person in common between them, my local clustering coefficient would be quite low. I’d be the bridge holding the disparate communities together.

If you study the DH2011 network, the problem should become clear: local clustering coefficient is meaningless in multimodal networks. If people are connected to institutions and conference submissions, but not to one another, then everyone must have the same local clustering coefficient: zero. Nobody’s immediate connections are connected to each other, by definition in this type of network.
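If you'd rather see that in code than take my word for it, here's a minimal sketch in Python using the networkx library (which is not part of anything discussed in this post; the people and institutions below are invented for illustration):

```python
import networkx as nx

# A toy multimodal network: people connect only to institutions and
# submissions, never directly to one another (all names are invented).
G = nx.Graph()
G.add_edges_from([
    ("Alice", "Some University"), ("Bob", "Some University"),
    ("Alice", "Submission 12"),   ("Bob", "Submission 12"),
    ("Carol", "Submission 12"),   ("Carol", "Another University"),
])

# Local clustering coefficient asks: how many of my neighbors know each other?
# In a strictly multimodal network of this kind the answer is always "none."
print(nx.clustering(G))
# every value comes out 0.0, no matter how central the node looks
```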

Local clustering coefficient is an extreme example, but many of the common metrics break down or mean something different when multiple node-types are introduced to the network. People are coming up with ways to handle these networks, but the methods haven’t yet made their way into popular software. Yet another reason that a researcher should have a sense of how the algorithms work and how they might interact with their own data.

No Network Zone

The previous examples pointed out when networks might be used inappropriately, but there are also times when there is no appropriate use for a network. This isn’t so much based on data (most data can become a network if you torture them enough), but on research questions. Networks seem to occupy a similar place in the humanities as power laws do in computational social sciences: they tend to crop up everywhere regardless of whether they actually add anything informative. I’m not in the business of calling out poor uses of networks, but a good rule of thumb on whether you should include a network in your poster or paper is to ask yourself whether its inclusion adds anything that your narrative doesn’t.

Alternatively, it’s also not uncommon to see over-explanations of networks, especially network visualizations. A narrative description isn’t always the best tool for conveying information to an audience; just as you wouldn’t want to see a table of temperatures over time when a simple line chart would do, you don’t want a two-page description of communities in a network when a simple visualization would do.

This post is a bit less concise and purposeful than the others in this series, but stay tuned for a revamped (and hopefully better) version to show up in The Historian’s Macroscope. In the meantime, as always, comments are welcome and loved and will confer good luck on all those who write them.

Networks Demystified 7: Doing Co-Citation Analyses

So this is awkward. I’ve published Networks Demystified 7: Doing Citation Analyses before Networks Demystified 6: Organizing Your Twitter Lists. What depraved lunatic would do such a thing? The kind of depraved lunatic that is teaching this very subject twice in the next two weeks: deal with it, you’ll get your twitterstructions soon, internet. In the meantime, enjoy the irregular nature of the scottbot irregular.

And this is part 7 of my increasingly inaccurately named trilogy of instructional network analysis posts (1 network basics, 2 degree, 3 power laws, 4 co-citation analysis, 5 communities and PageRank, 6 this space left intentionally blank). I’m covering how to actually do citation analyses, so it’s a continuation of part 4 of the series. If you want to know what citation analysis is and why to do it, as well as a laundry list of previous examples in the humanities and social sciences, go read that post. If you want to just finally be able to analyze citations, like you’ve always dreamed, read on. [1]

You’re going to need two things for these instructions: The Sci2 Tool, and either a subscription to the multi-gazillion dollar ISI Web of Science database, or this sample dataset. The Sci2 (Science of Science) Tool is a fairly buggy program (I’m allowed to say that because I’m kinda off-and-on the development team and I wrote half the user manual) that specializes in ingesting data of various formats and turning them into networks for analysis and visualization. It’s a good tool to use before you run to Gephi to make your networks pretty, and has a growing list of available plugins. If you already have the Sci2 Tool, download it again, because there’s a new version and it doesn’t auto-update. Go download it. It’s 80mb, I’ll wait.

Once you’ve registered for (not my decision, don’t blame me!) and downloaded the tool, extract the zip folder wherever you want, no install necessary. The first thing to do is increase the amount of memory available to the program, assuming you have at least a gig of RAM on your computer. We’re going to be doing some intensive analysis, so you’ll need the extra space. Edit sci2.ini; on Windows, that can be done by right-clicking on the file and selecting ‘edit’; on Mac, I dunno, elbow-click and press ‘CHANGO’? I have no idea how things work on Macs. (Sorry Mac-folk! We’ve actually documented in more detail how to increase memory – on both Windows and Mac – here)

Once you're editing the file, you'll see a nigh-unintelligible string of letters and numbers that ends in "-Xmx350m". Assuming you have more than a gig of RAM on your computer, change that to "-Xmx1000m". If you don't have more RAM, really, you should go get some. Or use only a quarter of the dataset provided. Save it and close the text editor.

Run Sci2.exe. We didn't pay Microsoft to register the app, so if you're on Windows, you may get an OHMYGODWARNING sign. Click 'run anyway' and safely let my team's software hack your computer and use it to send pictures of cats to famous network scientists. (No, we'll be good, promise). You'll get to a screen remarkably like Figure 7. Leave it open, and if you're at an institution that pays ISI Web of Science the big bucks, head there now. Otherwise ignore this and just download the sample dataset.

Downloading Data

I’m a historian of science, so let’s look for history of science articles. Search for ‘Isis’ as a ‘Publication Name’ from the drop-down menu (see Figure 1) and notice that, as of 9/23/2013, there are 14,858 results (see Figure 2).

Figure 1: Searching for Isis as the name of a publication.
Figure 2: Isis periodical search results.

This is a list of every publication in the journal ISIS. Each individual record includes bibliographic material, abstract, and the list of references that are cited in the article. To get a reasonable dataset to work with, we’re going to download every article ever published in ISIS, of which there are 1,189. The rest of the records are book reviews, notes, etc. Select only the articles by clicking the checkbox next to ‘articles’ on the left side of the results screen and clicking ‘refine’.

The next step is to download all the records. This web service limits you to 500 records per download, so you’re going to need to download 3 separate files (records 1-500, 501-1000, and 1001-1189) and combine them together, which is a fairly complicated step, so pay close attention. There’s a little “Send to:” drop-down menu at the top of the search results (Figure 3). Click it, and click ‘Other File Formats’.

Figure 3: Saving Web of Science records.

At the pop-up box, check the radio box for records 1 to 500 and enter those numbers, change the record content to ‘Full Record and Cited References’, and change the file format to ‘Plain Text’ (Figure 4). Save the file somewhere you’ll be able to find it. Do this twice more, changing the numbers to 501-1000 and 1001-1189, saving these files as well.

Figure 4: Parameters for downloading Web of Science files.

You’ll end up with three files, possibly named: savedrecs.txt, savedrecs(1).txt, and savedrecs(2).txt. If you open one up (Figure 5), you’ll see that each individual article gets its own several-dozen lines, and includes information like author, title, keywords, abstract, and (importantly in our case) cited references.

Figure 5: An example ISIS record.
Figure 6: The end of an ISIS record file.

You'll also notice (Figures 5 & 6) that the first two lines and the last line of every file are special header and footer lines. If we want to merge the three files so that the Sci2 Tool can understand them, we have to delete the footer of the first file, the header and footer of the second file, and the header of the last file, so that the new text file only has one header at the beginning, one footer at the end, and none in between. Those of you who are familiar enough with a text editor (and let's be honest, that should be everyone reading this) can go ahead and copy the three files into one huge file with only one header and footer. If you're feeling lazy, just download it here.
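And if you're only moderately lazy, a few lines of Python will do the stitching for you. This is just a sketch, not part of the Sci2 workflow: it assumes the three downloads kept their default names, and relies only on the fact mentioned above that the first two lines of each file are a header and the last line is a footer (double-check that against your own files, and adjust the encoding if your export differs).

```python
# Merge the three Web of Science downloads into one file with a single
# header (first two lines) and a single footer (last line).
files = ["savedrecs.txt", "savedrecs(1).txt", "savedrecs(2).txt"]

merged = []
for i, name in enumerate(files):
    with open(name, encoding="utf-8-sig") as f:
        lines = f.readlines()
    body = lines[:]
    if i != 0:
        body = body[2:]                # drop the header from the later files
    if i != len(files) - 1:
        body = body[:-1]               # drop the footer from the earlier files
    merged.extend(body)

with open("isis_all_records.txt", "w", encoding="utf-8") as out:
    out.writelines(merged)
```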

Creating a Citation Network

Now open the Sci2 Tool (Figure 7) and go to File->Load in the drop-down menu. Find your super file with all of ISIS and open it, loading it as an ‘ISI flat format’ file (Figure 8).

Figure 7: The Sci2 Tool.
Figure 8: Loading a file as an ISI flat format file.

If all goes correctly, two new files should appear in the Data Manager, the pane on the right-hand side of the software. I’ll take a bit of a detour here to explain the Sci2 Tool.

The main 'Console' pane on the top-left will include a complete log of your workflow, including all the various algorithms you use, what settings and parameters you use with them, and how to cite the various ones you use. When you close the program, a copy of the text in the 'Console' pane will save itself as a log file in the program directory so you can go back to it later and see what exactly you did.

The ‘Scheduler’ pane on the bottom is just that: it shows you what algorithms are currently running and what already ran. You can safely ignore it.

Along with the drop-down menus at the top, the already-mentioned 'Data Manager' pane on the right is where you'll be spending most of your time. Every time you load a file, it will appear in the data manager. Every time you run an algorithm on or manipulate that file in some way, a copy of it with the new changes will appear hierarchically nested below the original file. This is so, if you make a mistake, want to use an earlier version of the file, or want to run a different set of analyses, you can still do so. You can right-click on files in the data manager to view or save them in various file formats. It is important to remember to make sure that the appropriate file is selected in the data manager when you run an analysis, as it's easy to accidentally run an algorithm on some other random data file.

With that in mind, once your file is loaded, make sure to select (by left-clicking) the ‘1189 Unique ISI Records’ data file in the data manager. If you right-click and view the file, it should open up in Excel (Figure 9) or whatever your default *.csv viewer is, and you’ll see that the previous text file has been converted to a spreadsheet. You can look through it to see what the data look like.

Figure 9: All of the ISIS History of Science journal articles as a csv.

When you’re done ogling at all the pretty data, close the spreadsheet and go back to the tool. Making sure the ‘1189 Unique ISI Records’ file is selected, go to ‘Data Preparation -> Extract Paper Citation Network’ in the drop-down menu.

Voilà! You now have a history of science citation network. The algorithm spits out two files: 'Extracted paper-citation network', which is the network file itself, and 'Paper information', which is a spreadsheet that includes all the nodes in the network (in this case, articles that either were published in ISIS or are cited by them). It includes a 'localCitationCount' column, which tells you how frequently a work is cited within the dataset (Shapin's Leviathan and the Air-Pump is cited 16 times, you'll see if you open up the file), and a 'globalCitationCount' column, which is how many times ISI Web of Science thinks the article has been cited overall, not just within the dataset (Merton's "The Matthew Effect in Science, II" is cited 183 times overall). 'globalCitationCount' statistics are of course only available for the records you downloaded, so you have them for ISIS-published articles, but not for any of the other records.

Select ‘Extracted paper-citation network’ in the data manager. From the drop-down menu, run ‘Analysis -> Networks -> Network Analysis Toolkit (NAT)’. It’s a good idea to run this on any network you have, just to see the basic statistics of what you’re working with. The details will appear in the console window (Figure 10).

Figure 10: Network analysis toolkit output on the ISIS citation network.

There are a few things worth noting right away. The first is that there are 52,479 nodes; that means that our adorable little dataset of 1,189 articles actually referenced over 50,000 other works between them, about 50 refs/article. The second fact worth noting is that there are 54,915 directed edges, which is the total number of direct citations in the dataset. One directed edge is a citation from a citing node (an ISIS article) to a cited node (either an ISIS article, or a book, or whatever the author decides to reference).

The last bit worth pointing out is the number of weakly connected components, and the size of the largest connected component. Each weakly connected component is a chunk of the network connected by citation chains: if articles A and B are the only articles which cite article C, if article C cites nothing else, and if A and B are uncited by any other articles, they together make a weakly connected component. As soon as another citation link comes from or to one of them, the newly linked work becomes part of that component. In our case, the biggest component is 46,971 nodes, which means that most of the nodes in the network are connected to each other. That's important: it means history of science as represented by ISIS is relatively cohesive. There are 215 weakly connected components in all, small islands that are disconnected from the mainland.
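If you ever want to sanity-check numbers like these outside the Sci2 Tool, the same statistics take only a few lines of Python. A minimal sketch with the networkx library, run on an invented toy citation network rather than the ISIS data:

```python
import networkx as nx

# A toy directed citation network (invented labels): an edge A -> B means A cites B.
G = nx.DiGraph([
    ("Article 1", "Article 2"), ("Article 1", "Book X"),
    ("Article 2", "Book X"),    ("Article 3", "Book Y"),
])

print(G.number_of_nodes(), "nodes and", G.number_of_edges(), "directed edges")

# Weakly connected components ignore edge direction: any chain of citations,
# however the arrows point, holds a component together.
components = list(nx.weakly_connected_components(G))
print(len(components), "weakly connected components")
print("largest component has", max(len(c) for c in components), "nodes")
```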

If you have Gephi installed, you can visualize the network by selecting 'Extracted paper-citation network' in the data manager and clicking 'Visualization -> Networks -> Gephi', though what you do from there is beyond the scope of these instructions. It also probably won't make a heck of a lot of sense: there aren't many situations where visualizing a citation network is actually useful. It's what's called a Directed Acyclic Graph, and those are generally the most visually boring graphs around (don't cite me on this).

I do have a very important warning. You can tell it's important because it's bold. The Sci2 Tool was made by my advisor Katy Börner as a tool for people with similar research to her own, whose interests lie in modeling and predicting the spread of information on a network. As such, the direction of the citation edges created by the tool is opposite what many expect. They go from the cited source to the citing source, because the idea is that's the direction that information flows, rather than from the citing source to the cited source. As a historian, I'm more interested in considering the network in the reverse direction: citing to cited, as that gives more agency to the author. More details in the footnote. [2]
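If you do end up handling the exported network in code rather than in a spreadsheet or with regular expressions, flipping the direction is a one-liner. A minimal sketch with the networkx library (again, not part of the Sci2 workflow), using placeholder node names:

```python
import networkx as nx

# Sci2 points citation edges cited -> citing (the direction information flows).
# If, like me, you'd rather think citing -> cited, reverse every edge.
sci2_style = nx.DiGraph([("cited paper", "citing paper")])
citing_to_cited = sci2_style.reverse(copy=True)

print(list(citing_to_cited.edges()))  # [('citing paper', 'cited paper')]
```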

Great, now that that’s out of the way, let’s get to the more interesting analyses. Select ‘Extracted paper-citation network’ in the data manager and run ‘Data Preparation -> Extract Document Co-Citation Network’. And then wait. Have you waited for a while? Good, wait some more. This is a process. And 50,000 articles is a lot of articles. While you’re waiting, re-read Networks Demystified 4: Co-Citation Analysis to get an idea of what it is you’re doing and why you want to do it.

Okay, we're done (assuming you increased the allotted memory to the tool like we discussed earlier). You're now presented with the 'Co-citation Similarity Network' in the data manager; select it and, once again, run 'Analysis -> Networks -> Network Analysis Toolkit (NAT)'. This will also take some time, and you'll see why shortly.

Figure 11: Network analysis toolkit of the ISIS co-citation network.

Notice that while there are the same number of nodes (citing or cited articles) as before, 52,479, the number of edges went from 54,915 to 2,160,275, a 40x increase. Why? Because every time two articles are cited together, they get an edge between them and, according to the ‘Average degree’ in the console pane, each article or book is cited alongside an average of 82 other works.

In order to make the analysis and visualization of this network easier, we're going to significantly cut its size. Recall that document co-citation networks connect documents that are cited alongside each other, and that the weight of that connection increases the more often the two documents appear together in a bibliography. What we're going to do here is drastically reduce the network's size by deleting any edge between documents unless they've been cited together more than once. Select 'Co-citation Similarity Network' and run 'Preprocessing -> Networks -> Extract Edges Above or Below Value'. Use the default settings (Figure 12).

Note that when you’re doing a scholarly citation analysis, cutting all the edges below a certain value (called ‘thresholding’) is usually a bad idea unless you know exactly how it will affect your study. We’re doing it here to make the walkthrough easier.

Figure 12: Extracting edges to reduce the size of the network.

Run ‘Analysis -> Networks -> Network Analysis Toolkit (NAT)’ on the new ‘Edges above 1 by weight’ dataset, and note that the network has been reduced from two million edges to three thousand edges, a much more manageable number for our purposes. You’ll also see that there are 51,313 isolated nodes: nodes that are no longer connected to the network because we cut so many edges in our mindless rampage. Who cares about them? Let’s delete them too! Select ‘Edges above 1 by weight’ and run ‘Preprocessing -> Networks -> Delete Isolates’, and watch as fifty thousand precious history of science citations vanish in a puff of metadata. Gone.

If you run the Network Analysis Toolkit on the new network, you’ll see that we’re left with a small co-citation net of 1,166 documents and 3,344 co-citations between them. The average degree tells us that each document is connected to, on average, 6 other documents, and that the largest connected component contains 476 documents.
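Incidentally, both pruning steps translate directly if you ever find yourself working on a co-citation network in Python instead of Sci2. A minimal networkx sketch; the toy graph and its 'weight' attribute are stand-ins for the real co-citation weights:

```python
import networkx as nx

def prune(G, min_weight=2):
    """Keep only edges whose documents were cited together at least
    min_weight times, then drop any node left with no connections."""
    H = G.copy()
    weak = [(u, v) for u, v, w in H.edges(data="weight", default=1) if w < min_weight]
    H.remove_edges_from(weak)                    # threshold the edges
    H.remove_nodes_from(list(nx.isolates(H)))    # delete the isolates
    return H

# Toy co-citation network: edge weights are how often two works were cited together.
G = nx.Graph()
G.add_edge("A", "B", weight=3)
G.add_edge("B", "C", weight=1)   # only co-cited once; will be cut
pruned = prune(G)
print(pruned.edges(data=True))   # C disappears along with its weak edge
```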

So now’s the moment of truth, the time to visualize all your hard work. If you know how to use Gephi, and have it installed, select ‘With isolates removed’ in the data manager and run ‘Visualization -> Networks -> Gephi’. If you don’t, run ‘Visualization -> Networks -> GUESS’ instead, and give it a minute to load. You will be presented with this stunning work of art vaguely reminiscent of last night’s spaghetti and meatball dinner (Figure 13).

Figure 13: GUESS in all its glory.

Fear not! The first step to prettifying the network is to run ‘Layout -> GEM’ and then ‘Layout -> Bin Pack’. Better already, right? Then you can make edits using the graph modifier below (or using python commands in the interpreter), but the friendly folks at my lab have put together a script for you that will do that automatically. Run ‘Script -> Run Script’.

When you do, you will be presented with a godawful java applet that automatically sticks you in some horrible temp directory that you have to find your way out of. In the ‘Look In:’ navigation drop-down, find your way back to your desktop or your documents directory and then find wherever you installed the Sci2 Tool. In the Sci2 directory, there’s a folder called ‘scripts’, and in the ‘scripts’ folder, there’s a ‘GUESS’ folder, and in the ‘GUESS’ folder you will find the holy grail. Select ‘reference-co-occurrence-nw.py’ and press ‘open’.

Magic! Your document co-citation network is now all green and pretty, and you can zoom in and out using either the +/- button on the left, or using your mouse wheel and clicking and dragging on the network itself. It’ll look a bit like Figure 14.

Figure 14: Co-Citation network in GUESS.

If you feel more dangerous and cool, you can try visualizing the same network in Gephi, and it might come out something like Figure 15.

Figure 15: Gephi’s document co-citation network, with nodes sized by how frequently they’re cited in ISIS. Click to enlarge.

That’s it! You’ve co-cited a dataset. I hope you feel proud of yourself, because you should. And all without breaking a sweat. If you want (and you should want), you can save your results by right clicking the various files in the data manager you want to save. I’d recommend saving the most recent file, ‘With isolates removed’, and saving it as an NWB file, which is fairly easy to read and is the Sci2 Tool’s native format.

Stay tuned for the paradoxically earlier-numbered Networks Demystified 6, on organizing your twitter feed.

Notes:

  1. Part 4 also links to a few great tutorials on how to do this with programming, but if you don’t know the first thing about programming, start here instead.
  2. Those of you who know network basics, keep this in mind when running your analyses: PageRank, In & Out Degree, etc., may be the opposite of what you expect, with the papers that cite the most sources ending up with the highest In-Degree and PageRank. If this doesn't match your workflow, you can fairly easily change the data by hand in a spreadsheet editor or with regular expressions.

Networks Demystified 5: Communities, PageRank, and Sampling Caveats

The fifth and sixth (coming soon…) installments of Networks Demystified will be a bit more applied than the previous bunch (1 network basics, 2 degree, 3 power laws, 4 co-citation analysis). Like many of my recent posts, this one is in response to a Twitter conversation:

If you follow a lot of people on Twitter (Michael follows over a thousand), getting a grasp of them all and organizing them can be tough. Luckily network analysis can greatly ease the task of organizing twitter follows, and this and next post will teach you how to do that using NodeXL, a plugin for Microsoft Excel that (unfortunately) only works on Windows. It’s super easy, though, so if you have access to a Windows machine with Office installed, it’s worth trying it out despite the platform limitations.

This installment will explain the concept of modularity for group detection in networks, as well as why certain metrics like centrality should be avoided when using certain kinds of datasets. I’m going to be as gentle as I can be on the math, so this tutorial is probably best-suited for those just learning network techniques, but will fall short for those hoping for more detailed or specific information.

Next installment, Networks Demystified 6, will include the actual step-by-step instructions of how to run these analyses using NodeXL. I’m posting the description first, because I strongly believe you should learn the concepts before applying the techniques. At least that’s the theory: actually I’m posting this first because Twitter is rate-limiting the download of my follower/followee network, and I’m impatient and want to post this right away.

Modularity / Community Detection

Modularity is a technique for finding which groups of nodes in a network are more similar to each other than to other groups; it lets you spot communities.

It is unfortunate (for me) that modularity is one of the more popular forms of community detection, because it also happens to be one of the methods more difficult to explain without lots of strange symbols, which I'm trying to avoid. First off, the modularity technique is not one simple algorithm so much as a conceptual framework for thinking about communities in networks. The modularity you run in Gephi is different from the modularity in NodeXL, because there's more than one way to write the concept into an algorithm, and they're not all exactly the same.

Randomness

But to describe modularity itself, let's take a brief detour through random-network lane. Randomization is a popular tool among network scientists, statisticians, and late 20th century avant-garde music composers for a variety of reasons. Suppose you're having a high-stakes coin-flip contest with your friend, who winds up beating you 68/32. Before you run away crying that your friend cheated, because a fair coin should always land 50/50, remember that the universe is a random place. The 68/32 score could've appeared by chance alone, so you write up a quick computer program to flip a thousand coins a hundred times each, and if in those thousand computational coin-flip experiments a decent number come up around 68/32, you can reasonably assume your friend didn't cheat.
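To make that concrete, here's roughly what such a quick-and-dirty program might look like in Python (a sketch of the general idea only; I make no claims about what the simulation will tell you about your friend):

```python
import random

def flip_contest(n_flips=100):
    """Simulate one fair-coin contest and return how many flips your friend won."""
    return sum(random.random() < 0.5 for _ in range(n_flips))

# Run a thousand simulated contests and see how often a fair coin
# produces a result at least as lopsided as 68 out of 100.
trials = [flip_contest() for _ in range(1000)]
as_lopsided = sum(wins >= 68 for wins in trials)
print(f"{as_lopsided} of 1000 fair contests were at least as lopsided as 68/32")
```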

The use of a simulated random result to see if what you’ve noticed is surprising (or, sometimes, significant) is quite common. I used it on the Irregular when reviewing Matthew Jockers’ Macroanalysis, shown in the graphic halfway down the page and reproduced here. I asked, in an extremely simplistic way, whether the trends Jockers saw over time were plausible by creating four dummy universes where randomness ruled, to see if his results could be attributable to chance alone. By comparing his data to my fake data, I concluded that some of his results were probably very accurate, and some of them might have just been chance.

This example chart compares a potential “real” underlying publication rate against several simulated potential sample datasets Jockers might have, created by multiplying the “real” dataset by some random number between 0 and 1.

Network analysts use the same sort of technique all the time. Do you want to know if it's surprising that some actress is only six degrees away from Kevin Bacon (or anybody else on the network)? Generate a bunch of random networks with the same number of nodes (actors) and edges (connections between them if they star in a movie together), and see if, in most cases, you can get from any one actor to any other in only six hops. Odds are you could; that's just how random networks work.

What's surprising is that in these, as well as most other social networks, people tend to be much more tightly clustered together than expected from a random network. They form little groups and cliques. It is extremely unlikely that in such cliquish networks, where the same groups of actors appear with each other constantly, everyone would still be only six degrees away from one another. It's commonly known that social networks organize into what are called small worlds, where people tend to be much more closely connected to one another than one would expect given such tight cliques. This is the power of random networks: they help pick out the unusual.
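The same trick is easy to sketch for networks: build random graphs with the same number of nodes and edges as your real one and see how far apart nodes typically are. A minimal networkx sketch; the node and edge counts below are placeholders, not anyone's real data:

```python
import networkx as nx

n_nodes, n_edges = 1000, 5000   # placeholder sizes; use your real network's counts

# Generate a handful of random networks of the same size and measure
# how many hops typically separate any two nodes.
for i in range(5):
    R = nx.gnm_random_graph(n_nodes, n_edges, seed=i)
    if nx.is_connected(R):
        print(i, nx.average_shortest_path_length(R))
    else:
        # measure the largest connected piece if the random graph is fragmented
        giant = R.subgraph(max(nx.connected_components(R), key=len))
        print(i, nx.average_shortest_path_length(giant))
```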

Modularity Explained

Which brings us back to modularity. With some careful thinking, one would come up with a quick solution for finding communities in networks: find clusters of nodes that have more internal edges between them than external edges to other groups.

What network communities should look like. [via]
There’s a lurking problem with this idea, though. If you were just counting the number of in-group connections vs. out-group connections, you could come up with an optimal solution very quickly if you say the entire network is one community: voila! no outgoing connections, and lots of internal connections. If instead you say in advance that you want two communities, or you only want communities of a certain size, it mitigates the problem somewhat, but then you’re stuck with needing to set the number of communities beforehand, which is a difficult constraint if you’re not sure what that number should be.


The key is randomness. You want to find communities of nodes for which there are more internal links than you would expect given that the graph was random, and fewer external links than you would expect given the graph was random. Mark Newman defines modularity as: “the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random.”

Modularity is thus a network-level measurement, and it can change based on what communities you choose in your network. For example, in the figure above, most of the edges in the network are within the Freakish Grey Blobs (hereafter FGBs), and within the FGBs the edges are very dense. In that case, we would expect the modularity to be quite high. However, imagine we drew the FGBs around different nodes in the network instead: if we made four FGBs instead of three, splitting the left group into two, we’d find that a larger fraction of the edges are falling outside of groups, thus decreasing the overall network’s modularity score.

Similarly, let's say we made two FGBs instead of three. We merge the two groups on the right into one supergroup (group 2), and leave the group on the left (group 1) the same. What would happen to the modularity? In that case, because group 2 is now less dense (defining density as the number of edges within the group compared to the total possible number of edges within it), it looks a bit more like what we'd expect from a random network, so the overall network's modularity score would (again) decrease slightly.

That’s modularity in a nutshell. The method of finding the appropriate groupings in a network varies, but essentially, all the algorithms keep drawing FGBs around different groups of nodes until the overall modularity score of the network is as high as possible. Find the right configuration of FGBs such that the modularity score is very high, and then label the nodes in each separate FGB as their own community. In the figure above, there are three communities, and your favorite network analysis software will label them as such.
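If you want to peek under the hood without Gephi or NodeXL, the networkx library ships with one implementation of exactly this sort of modularity maximization. A minimal sketch on a toy network that comes bundled with the library; this is not the specific algorithm either tool uses, just the same conceptual framework:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.karate_club_graph()  # a classic small social network that ships with networkx

# Keep redrawing the blobs until the modularity score is as high as the
# algorithm can manage, then report the resulting communities.
communities = greedy_modularity_communities(G)
print(len(communities), "communities found")
print("modularity score:", modularity(G, communities))
```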

Some metrics to avoid (with caveats)

There’s a stubbornly persistent desire, when analyzing a tasty new network dataset, to just run every algorithm in the box and see what comes up. PageRank and centrality? Sure! Clustering? Sounds great! Unfortunately, each algorithm makes certain underlying assumptions about the data, and our twitter network breaks many of those assumptions.

The most important one worth mentioning is that we've already sinned. Remember how we plan on calculating modularity, and remember how I defined it earlier? Nothing was mentioned about whether or not the edges were directed. Asymmetrical edges (like asymmetries between follower and followee) are not understood by the modularity algorithm we described, which assumes there is no difference between a follower, a followee, or a reciprocal connection of both. Running modularity on a directed network is, in general, a bad idea: in most networks, the direction of an edge is very important for determining community involvement. We can safely ignore this issue here, as we're dealing with the fairly low-stakes problem of letting the computer help us organize our twitter network, but in publications or higher-stakes circumstances, this would be something to avoid unless you've thought through the implications very carefully.

A network metric that might seem more appropriate to the forthcoming twitter dataset, PageRank, is similarly inadequate without a few key changes. As I haven’t demystified PageRank yet, here’s a short description, with the promise to expand on it later.

PageRank is Google's algorithm for ranking websites in their search results, and it's inspired by citation analysis, but it turns out to be useful in various other circumstances. There are two ways to explain the algorithm, both equally accurate. The first has to do with probability: what is the probability that, if someone just starts clicking links on the web at random, they'll eventually land on your website? The higher the chance that someone clicking links at random will reach your site, the higher your PageRank.

PageRank’s other definition makes a bit more ‘on-the-ground’ sense; given a large, directed network (like websites linking to other websites), those sites that are very popular can determine another site’s score by whether or not they link to it. Say a really famous website, like BBC, links to your site; you get lots of points. If Sam’s New England Crab Shack & Duck Farm links to your site, however, you won’t get many points. Seemingly paradoxically, the more points your website has, the more points you can give to sites that you link to. Sites that get linked to a lot are considered reputable, and in turn they link to other sites and pass that reputation along. But, the clever bit is that your site can only pass a fraction of its reputation along based on how many other sites it links to, thus if your site only links to the Scottbot Irregular, the Irregular will get lots of points from it, but if it links to ten sites including the Irregular, my site would only get a tenth of the potential points.

How PageRank works(-ish). Those sites which have more points in turn confer more points to others. [via]
This generalizes pretty easily to all sorts of networks including, as it happens, twitter follow networks. Those who are followed by lots of people are scored highly; if one of those highly scoring individuals follows only a select few, that select few will also receive a significant increase in rank. When a user is followed by many other users with very high scores, that user is scored the highest of them all. PageRank, then, is a neat way of looking at who has the power in a twitter network. Those at the top are those who even the relatively popular find interesting and worth following.
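Here's the idea in miniature: a sketch on a tiny invented web of links rather than any real data, using networkx's generic pagerank implementation (which is not Google's actual ranking system):

```python
import networkx as nx

# A toy directed network of links (all site names invented or borrowed
# from the jokes above): an edge A -> B means A links to B.
web = nx.DiGraph([
    ("BBC", "Scottbot Irregular"),
    ("BBC", "Some Other Blog"),
    ("Crab Shack & Duck Farm", "Scottbot Irregular"),
    ("Some Other Blog", "BBC"),
    ("Scottbot Irregular", "BBC"),
])

# Reputation flows along links: being linked to by a well-linked-to site
# is worth far more than being linked to by an obscure one.
for site, score in sorted(nx.pagerank(web).items(), key=lambda x: -x[1]):
    print(f"{site}: {score:.3f}")
```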


Which brings us to this, the network we’re creating to organize our twitter neighborhood. The network type is right: a directed, unweighted network. The algorithm will work fine. It will tell you, for example, that you are (or are nearly) the most popular person in your twitter neighborhood. And why wouldn’t it? Most of the people in your neighborhood follow you, or follow people who follow you, so the math is inevitable.

And the problem is obvious. Your sampling strategy (the criteria you used to gather your data) inherently biases this particular network metric, and most other metrics within the same family. You’ve used what’s called snowball sampling, so-named because your sample snowballs into a huge network in relatively short order, starting from a single person: you. It’s you, then those you follow, then those they follow, and so forth. You are inevitably at the center of your snowball, and the various network centrality measurements will react accordingly.

Well, you might ask, what if you just ignore yourself when looking at the network? Nope. Because PageRank (among other algorithms) takes everyone’s score into account when calculating others’ scores; even if you close your eyes whenever your name pops up, your presence will still exert an invisible influence on the network. In the case of PageRank, because your score is so high, you’ll be conferring a much higher score to (potentially) otherwise unpopular people you happen to follow.

The short-term solution is to remove yourself from the network before you run any of your analyses. This actually still isn’t perfect, for reasons I don’t feel like getting into because the post is already too long, but it will give at least a better idea of PageRank centrality within your twitter neighborhood.

While you're at it, you should also remove yourself before running community detection. Because you might be the bridge connecting two otherwise disconnected communities, and because for the purpose of this exercise you're trying to organize people separately from your own influence on them, running modularity on the network without you in it will likely give you a better sense of your neighborhood.
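In practice that removal is a single line, whatever tool you end up in. A minimal networkx sketch, where the follow network and all the handles are invented:

```python
import networkx as nx

def without_me(G, my_handle):
    """Copy the follow network and drop your own node (and every edge
    to or from you) before computing PageRank or communities."""
    H = G.copy()
    if my_handle in H:
        H.remove_node(my_handle)
    return H

# Toy follow network (handles invented): an edge A -> B means A follows B.
follows = nx.DiGraph([("alice", "you"), ("bob", "you"), ("you", "carol"),
                      ("alice", "bob"), ("bob", "alice")])
print(nx.pagerank(without_me(follows, "you")))
```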

Continuing

Stay tuned for the next exciting installment of Networks Demystified, wherein I'll give step-by-step instructions on how to actually do the things I've described using NodeXL. If you want a head-start, go ahead and download it and start playing with it.

The Historian’s Macroscope

Whelp, it appears the cat’s out of the bag. Shawn Graham, Ian Milligan, and I have signed our ICP contract and will shortly begin the process of writing The Historian’s Macroscope, a book introducing the process and rationale of digital history to a broad audience. The book will be a further experiment in live-writing: as we have drafts of the text, they will go online immediately for comments and feedback. The publishers have graciously agreed to allow us to keep the live-written portion online after the book goes on sale, and though what remains online will not be the final copy-edited and typeset version, we (both authors and publishers) feel this is a good compromise to prevent the cannibalization of book sales while still keeping much of the content open and available for those who cannot afford the book or are looking for a taste before they purchase it. Thankfully, this plan also fits well with my various pledges to help make a more open scholarly world.

Microscope / Telescope / Macroscope [via The Macroscope by Joël de Rosnay]
We’re announcing the project several months earlier than we’d initially intended. In light of the American Historical Association’s recent statement endorsing the six year embargo of dissertations on the unsupported claim that it will help career development, we wanted to share our own story to offset the AHA’s narrative. Shawn, Ian, and I have already worked together on a successful open access chapter in The Programming Historian, and have all worked separately releasing public material on our respective blogs. It was largely because of our open material that we were approached to write this book, and indeed much of the material we’ve already posted online will be integrated into the final publication. It would be an understatement to say our publisher’s liaison Alice jumped at this opportunity to experiment with a semi-open publication.

The disadvantage to announcing so early is that we don’t have any content to tease you with. Stay tuned, though. By September, we hope to have some preliminary content up, and we’d love to read your thoughts and comments; especially from those not already aligned with the DH world.

Networks Demystified 4: Co-Citation Analysis

This installment of Networks Demystified is the first one that’s actually applied. A few days ago, a discussion arose over twitter involving citation networks, and this post fills the dual purpose of continuing that discussion, and teaching a bit about basic citation analysis. If you’re looking for the very basics of networks, see part 1 and part 2. Part 3 is a warning for anyone who feels the urge to say “power law.” To recap: nodes are the dots/points in the network, edges are the lines/arrows/connections.

Understanding Sociology, Philosophy, and Literary Theory using One Easy Method™!

The growing availability of humanities and social science (HSS) citation data in databases like ISI's Web of Science (warning: GIANT paywall. Good luck getting access if your university doesn't subscribe.) has led to a groundswell of recent blog activity in the area, mostly by the humanists and social scientists themselves. Which is a good thing, because citation analyses of HSS will happen whether we're involved in doing them or not, so if humanists start becoming familiar with the methods, at least we can begin getting humanistically informed citation analyses of our own data.

The size of ISI’s Web of Science paywall. You shall not pass. [via]
This is a sort of weird post. It's about history and philosophy of science, by way of social history, by way of literary theory, by way of philosophy, by way of sociology. About this time last year, Dan Wang asked the question Is There a Canon in Economic Sociology (pdf)? Wang was searching for a set of core texts for economic sociology, using a set of 52 syllabi on the subject. It's a reasonable first pass at the question, counting how often each article appears in the syllabi (plus some more complex measurements) as well as how often individual authors appear. Those numbers are used to support the hypothesis that there is a strongly present canon, both of authors and individual articles, in economic sociology. This is an example of an extremely simple bimodal network analysis, where there are two varieties of node: syllabi and articles. Each syllabus cites multiple articles, and several of those articles are cited by multiple syllabi. The top part of Figure 1 is what this would look like in a basic network representation.

Figure 1: basic bimodal network (top) and the resulting co-citation network (bottom). [via Mark Newman, PNAS]
Wang was also curious how instructors felt these articles fit together, so he used a common method called co-citation analysis to answer the question. The idea is that if two articles are cited in the same syllabus, they are probably related, so they get an edge drawn between them. He further restricted his analysis so that articles had to appear together in the same class session, rather than just the same syllabus, to be considered related to each other. What results is a new network (Figure 1, bottom) of article similarity based on how frequently they appear together (how frequently they are cited by the same source). In Figure 1, you can see that because article H and article F are both cited in class session 3, they get an edge drawn between them.

A further restriction was then placed on the network, what’s called a threshold. Two articles would only get an edge drawn between them if they were cited by at least 2 different class sessions (threshold = 2). The resulting economic sociology syllabus co-citation network looked like Figure 2, pulled from the original article. From this picture, one can begin to develop a clear sense of the demarcations of subjects and areas within economic sociology, thus splitting the canon into its constituent parts.

Figure 2: Co-citation network in economic sociology. Edge thickness represents how often articles appear together in syllabi, and node size is based on a measure of centrality. [via]
In short order, Kieran Healy blogged a reply to this study, providing his own interpretations of the graph and what the various clusters represented. Remember Healy’s name, as it’s important later in the story. Two days after Healy’s blog post, Neal Caren took inspiration and created a co-citation analysis of sociology more broadly–not just economic sociology–using data he downloaded from ISI’s Web of Science (remember the giant paywall from before?). Instead of using syllabi, Caren looked at articles found in American Journal of Sociology, American Sociological Review, Social Forces and Social Problems since 2008. Web of Science gave him a list of every citation from every article in those journals, and he performed the same sort of co-citation analysis as Dan Wang did with syllabi, but at a much larger scale.

Because the dataset Caren used was so much larger, he had to enforce much stricter thresholds to keep the visualization manageable. Whereas Wang's graph showed all articles, and connected them if they appeared together in more than 2 class sessions, Caren's graph only connected articles which were cited together more than 4 times (threshold = 4). Further, a cited article wouldn't even appear on the network visualization unless the article itself had been cited 8 or more times, thus reducing the number of articles appearing on the visualization overall. The final network had 397 nodes (articles) and 1,597 edges (connections between articles). He also used a popular community detection algorithm to color the different article nodes based on which other articles they were most related to. Figure 3 shows the resulting network, and clicking on it will lead to an interactive version.

Figure 3: Neal Caren’s sociology co-citation analysis. Click the picture to see the interactive version. [via]
Caren adds a bit of contextual description in his blog post, explaining what the various clusters represent and why this visualization is a valid and useful one for the field of sociology. Notably, at the end of the post, he shares his raw data, a python script for analyzing it, and all the code for visualizing the network and making it interactive and pretty.

Jump forward a year. Kieran Healy, the one who wrote the original post inspiring Neal Caren’s, decides to try his own hand at a citation analysis using some of the code and methods that Neal Caren had posted about. Healy’s blog post, created just a few days ago, looks at the field of philosophy through the now familiar co-citation analysis. Healy’s analysis covers 20 years of four major philosophy journals, consisting of 2,200 articles. These articles together make over 34,000 citations, although many of the cited articles are duplicates of articles that had already been cited. Healy writes:

The more often any single paper is cited, the more important it’s likely to be. But the more often any two papers are cited together, the more likely they are to be part of some research question or ongoing problem or conversation topic within the discipline.

With a dataset this large, the resulting co-citation network wound up having over a million edges, or connections between co-cited articles. Healy decides to focus only on the 500 most highly-cited items in the journals (not the best practice for a co-citation analysis, but I'll address that in a later post), so only articles cited more than 10 times within the four-journal dataset appear in the network. Figure 4 shows the resulting network, which, like Figure 3, can be clicked on to reach the interactive version.

Figure 4: Kieran Healy’s co-citation analysis of four philosophy journals. Click for interactivity. [via]
The post goes on to provide a fairly thorough and interesting analysis of the various communities formed by article clusters, thus giving a description of the general philosophy landscape as it currently stands. The next day, Healy posted a follow-up delving further into citations of philosopher David Lewis, and citation frequencies by gender. Going through the most highly cited 500 or so philosophy articles by hand, Healy finds that 3.6% of the articles are written by women; 6.3% are written by David Lewis; the overwhelming majority are written by white men. It’s not lost on me that the overwhelming majority of people doing these citation analyses are also white men – someone please help change that? Healy posted a second follow-up a few days later, worth reading, on his reasoning behind which journals he used and why he looked at citations in general. He concludes “The 1990s were not the 1950s. And yet essentially none of the women from this cohort are cited in the conversation with anything close to the same frequency, despite working in comparable areas, publishing in comparable venues, and even in many cases having jobs at comparable departments.”

Mere days after Healy's posts, Jonathan Goodwin became inspired, using the same code Healy and Caren used, to perform a co-citation analysis of literary theory journals. He began by concluding that these co-citation analyses were much more useful (better) than his previous attempts at direct citation analysis. About four decades of bibliometric research backs up Goodwin's claim. Figure 5 shows Goodwin's literary theory co-citation network, drawn from five journals and clickable for the interactive version, where he adds a bit of code so that the user can determine herself what threshold she wants to cut off co-citation weights at. Goodwin describes the code to create the effect on his github account. In a follow-up post, directly inspired by Healy's, Goodwin looks at citations to women in literary theory. His results? When a feminist theory journal is included, 8 of the top 30 authors are women (27%); when that journal is not included, only 2 of the top 30 authors are women (7%).

Figure 5: Goodwin’s literary theory co-citation network. [via]

At the Speed of Blog

Just after these blog posts were published, a quick twitter exchange between Jonathan Goodwin, John Theibault, and myself (part of it readable here) spurred Goodwin, in the space of 20 minutes, to download, prepare, and visualize the co-citation data of four social history journals over 40 years. He used ISI Web of Science data, Neal Caren’s code, a bit of his own, and a few other bits of open script which he generously cites and links to. All of this is to highlight not only the phenomenal speed of research when unencumbered by the traditional research process, but also the ease with which these sorts of analysis can be accomplished. Most of this is done using some (fairly simple) programming, but there are just as easy solutions if you don’t know how to or don’t care to code–one specifically which I’ll mention later, the Sci2 Tool. From data to visualization can take a matter of minutes; a first pass at interpretation won’t take much longer. These are fast analyses, pretty useful for getting a general overview of some discipline, and can provide quite a bit of material for deeper analysis.

The social history dataset is now sitting on Goodwin’s blog just waiting to be interpreted by the right expert. If you or anyone you know is familiar with social history, take a stab at figuring out what the analysis reveals, and then let us all know in a blog post of your own. I’ll be posting a little more about it as well soon, though I’m no expert of the discipline. Also, if you’re interested in citation analysis in the humanities, and you’ll be at DH2013 in Nebraska, I’ll be chairing a session all about citations in the humanities featuring an impressive lineup of scholars. Come join us and bring questions, July 17th at 10:30am.

Discovering History and Philosophy of Science

Before I wrap up, it’s worth mentioning that in one of Kieran Healy’s blog posts, he thanks Brad Wray for pointing out some corrections in the dataset. Brad Wray is one of the few people to have published a recent philosophy citation analysis in a philosophy journal. Wray is a top-notch philosopher, but his citation analysis (Philosophy of Science: What are the Key Journals in the Field?, Erkenntnis, May 2010 72:3, paywalled) falls a bit short of the mark, and as this is an instructional piece on co-citation analysis, it’s worth taking some time here to explore why.

Wray’s article’s thesis is that “there is little evidence that there is such a field as the history and philosophy of science (HPS). Rather, philosophy of science is most properly conceived of as a sub-field of philosophy.” He arrives at this conclusion via a citation analysis of three well-respected monographs, A Companion to the Philosophy of Science, The Routledge Companion to Philosophy of Science, and The Philosophy of Science edited by David Papineau, in total comprising 149 articles. Wray then counts how many times major journals are cited within each article, and shows that in most cases, the most frequently cited journals across the board are strict philosophy of science journals.

The data used to support Wray’s thesis–that there is no such field as history & philosophy of science (HPS)–is this coarse-level journal citation data. No history of science journal is listed in the top 10-15 journals cited by the three monographs, and HPS journals appear, but very infrequently. Of the evidence, Wray writes “if there were such a field as history and philosophy of science, one would expect scholars in that field to be citing publications in the leading history of science journal. But, it appears that philosophy of science is largely independent of the history of science.”

It is curious that Wray would suggest that total citations from strict philosophy of science companions can be used as evidence of whether a related but distinct field, HPS, actually exists; low citation counts from philosophy of science to history of science are his evidence. Instead, a more nuanced approach to this problem would be similar to the approach above: co-citation analysis. Perhaps HPS can be found by analyzing citations from journals which are ostensibly HPS, rather than analyzing three focused philosophy of science monographs. If a cluster of articles should appear in a co-citation analysis, this would be strong evidence that such a discipline currently exists among citing articles. If such a cluster does not appear, this would not be evidence of the non-existence of HPS (absence of evidence ≠ evidence of absence), but that the dataset or the analysis type is not suited to finding whatever HPS might be. A more thorough analysis would be required to actually disprove the existence of HPS, although one imagines it would be difficult explaining that disproof to the people who think that's what they are.

With this in mind, I decided to perform the same sort of co-citation analysis as Dan Wang, Kieran Healy, Neal Caren, and Jonathan Goodwin, and see what could be found. I drew from 15 journals classified in ISI's Web of Science as "History & Philosophy of Science" (British Journal for the Philosophy of Science, Journal of Philosophy, Synthese, Philosophy of Science, Studies in History and Philosophy of Science, Annals of Science, Archive for History of Exact Sciences, British Journal for the History of Science, Historical Studies in the Natural Sciences, History and Philosophy of the Life Sciences, History of Science, Isis, Journal for the History of Astronomy, Osiris, Social Studies of Science, Studies in History and Philosophy of Modern Physics, and Technology and Culture). In all I collected 12,510 articles dating from 1956, with over 300,000 citations between them. For the purpose of not wanting to overheat my laptop, I decided to restrict my analysis to looking only at those articles within the dataset; that is, if any article from any of the 15 journals cited any other article from one of the 15 journals, it was included in the analysis.

I also changed my unit of analysis from the article to the author. I didn't want to see how often two articles were cited by some third article–I wanted to see how often two authors were cited together within some article. The resulting co-citation analysis gives author-author pairs rather than article-article pairs, like the examples above. In all, there were 7,449 authors in the dataset, and 10,775 connections between author pairs; I did not threshold edges, so some authors in the network were cited together only once, and some as many as 60 times. To perform the analysis I used the Science of Science (Sci2) Tool, no programming required (full advertisement disclosure: I'm on the development team), and some co-authors and I have written up how to do a similar analysis in the documentation tutorials.

The resulting author co-citation network, in Figure 6, reveals two fairly distinct clusters of authors, one primarily history of science, the other primarily philosophy of science. At first glance, Wray’s hypothesis appears to be corroborated by the visualization; there’s not much in the way of a central cluster between the two. That said, a closer look at the middle, in Figure 7, highlights a group of people who either have considered themselves within HPS, or whom others have considered part of HPS.

Figure 6: Author co-citation network of 15 history & philosophy of science journals. Two authors are connected if they are cited together in some article, and connected more strongly if they are cited together frequently. [via me!]
Figure 7: Author co-citation analysis of history and philosophy of science journals, zoomed in on the area between history and philosophy, with authors highlighted who might be considered HPS.

Figures 6 & 7 don’t prove anything, but they do suggest that, within citation patterns, history of science and philosophy of science are each clearly more cohesive than some combined HPS might be. Figure 7 suggests there might be more to the story, and the next step in trying to pin down HPS–if indeed it exists as some sort of cohesive unit–is to find articles that specifically self-identify as HPS and, through their citation and language patterns, see what they have in common with and what separates them from the larger community. A more thorough set of analytics, visualizations, and tables, which I’ll explain further at some point, can be found here (apologies for the pdf; this was originally made in preparation for another project).

The reason I bring up this example is not to disparage Wray, whose work did a good job of finding the key journals in philosophy of science, but to argue that we as humanists need to make sure the methods we borrow match the questions we ask. Co-citation analysis happens to be a pretty good method for exploring the question Wray asked in his thesis, but there are many more situations where it wouldn’t be particularly useful. The recent influx of blog posts on the subject, and the upcoming DH2013 session, are exciting, because they mean humanists are beginning to take citation analysis seriously and are exploring the various situations in which its methods are appropriate. I look forward to seeing what comes out of the Social History data analysis, as well as future directions this research will take.

Friends don’t let friends calculate p-values (without fully understanding them)

I wrote this in 2012 in response to a twitter conversation with Mike Taylor, who was patient enough to read a draft of this post in late November and point out all the various ways it could be changed and improved. No doubt if I had taken the time to take his advice and change the post accordingly, this post would be a beautiful thing and well-worth reading. I’d like to thank him for his patient comments, and apologize for not acting on them, as I don’t foresee being able to take the time in the near future to get back to this post. I don’t really like seeing a post sit in my draft folder for months, so I’ll release it out to the world as naked and ugly as the day it was born, with the hopes that I’ll eventually sit down and write a better one that takes all of Mike’s wonderful suggestions into account. Also, John Kruschke apparently published a paper with a similar title; as it’s a paper I’ve read before but forgot about, I’m guessing I inadvertently stole his fantastic phrase. Apologies to John!

——————–

I recently [okay, at one point this was ‘recent’] got into a few discussions on Twitter stemming from a tweet about p-values. For the lucky and uninitiated, the p-value is basically the go-to statistic used by giant swaths of the academic world to show whether or not their data are statistically significant. Generally, the idea is that if the value of p is less than 0.05, you have results worth publishing.

Introducing p-values

There’s a lot to unpack in p-values: a lot of history, a lot of baggage, and a lot of reasons why it’s a pretty poor choice for quantitative analysis. I won’t unpack all of it, but I will give a brief introduction to how the statistic works. At its simplest, according to Wikipedia, “the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.” No, don’t run away! That actually makes sense; just let me explain it.

We’ll start with a coin toss. Heads. Another: tails. Repeat: heads, heads, tails, heads, heads, heads, heads, tails. Tally up the score: of the 10 tosses, 7 were heads and 3 were tails. At this point, if you were a gambler, you might start guessing the coin is weighted or rigged. I mean, seriously, you toss the coin 10 times and only 3 of those were tails? What’s the probability of that, if the coin is fair?

That’s exactly what the p-value is supposed to tell us: what’s the probability of getting 3 or fewer tails out of 10 coin tosses, assuming the coin is fair? That assumption that the coin is fair is what statisticians call the null hypothesis; if we could show that it’s less than 5% likely (p < 0.05) that we’d see 3 or fewer tails in 10 coin flips, assuming a fair coin, statisticians tell us we can safely reject the null hypothesis. If this were a scientific experiment and p did come out below 0.05, we’d say our results are significant; we could reject the null hypothesis that the coin is a fair one, because it’s pretty unlikely that the experiment would have turned out this way if the coin were fair. Huzzah! Callooh! Callay! We could publish our findings that this coin just isn’t fair in the world-renowned Journal of Trivial Results and Bad Bar Bets (JTRB3).

So how is this calculated? The idea is that we have to look at the universe of possible results given the null hypothesis; that is, if I had a coin that was truly fair, how often would flipping it ten times yield a result at least as extreme as 7 heads and 3 tails? Obviously, given a fair coin, most scenarios will yield about a 50/50 split on heads and tails. It is, however, not completely impossible to get our 7/3 result. What we have to calculate is: if we flipped a fair coin in an infinite number of experiments, what percent of those experiments would give us 7 or more heads? If it’s under 5% (p < 0.05), we can be reasonably confident that our result implies a stacked coin, because in fewer than 5% of experiments on a fair coin would the results look as extreme as the results we see.
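For concreteness, here is that calculation for the fixed-ten-flips design, written as a small Python sketch (this is just the one-sided binomial tail, not anything from a particular paper). Under this design the number comes out to roughly 0.17, which is worth keeping in mind when halting conditions come up below.

```python
from math import comb

# Probability of a result at least as extreme as 7 heads in 10 flips,
# assuming a fair coin (the null hypothesis): a one-sided binomial tail.
n_flips, n_heads = 10, 7
p_value = sum(comb(n_flips, k) for k in range(n_heads, n_flips + 1)) / 2**n_flips
print(p_value)  # 176/1024 ≈ 0.172
```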

We use 0.05, or 5%, mostly out of a long-standing social agreement. The 5% mark is low enough that we can reject the null hypothesis beyond most reasonable doubts. That said, it’s worth remembering that the difference between 4.9% and 5.1% is itself often not statistically significant, and though 5% is our agreed-upon convention, it’s still pretty arbitrary. The best thing anyone’s ever written on this point was by Rosnow & Rosenthal (1989):

“We are not interested in the logic itself, nor will we argue for replacing the .05 alpha with another level of alpha, but at this point in our discussion we only wish to emphasize that dichotomous significance testing has no ontological basis. That is, we want to underscore that, surely, God loves the .06 nearly as much as the .05. Can there be any doubt that God views the strength of evidence for or against the null as a fairly continuous function of the magnitude of p?”

How p-values work. [via Wikipedia]

I’m really nailing this point home because p-values are so often misunderstood. Wikipedia lists seven common misunderstandings of p-values, and I can almost guarantee that a majority of scientists who use the statistic are guilty of at least some of them. I’m going to quote the misconceptions here, because they’re really important.

  1. The p-value is not the probability that the null hypothesis is true.
    In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very close to unity (if there is no alternative hypothesis with a large enough a priori probability and which would explain the results more easily). This is Lindley’s paradox.
  2. The p-value is not the probability that a finding is “merely a fluke.”
    As the calculation of a p-value is based on the assumption that a finding is the product of chance alone, it patently cannot also be used to gauge the probability of that assumption being true. This is different from the real meaning which is that the p-value is the chance of obtaining such results if the null hypothesis is true.
  3. The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called prosecutor’s fallacy.
  4. The p-value is not the probability that a replicating experiment would not yield the same conclusion. Quantifying the replicability of an experiment was attempted through the concept of P-rep (which is heavily criticized).
  5. 1 − (p-value) is not the probability of the alternative hypothesis being true (see (1)).
  6. The significance level of the test (denoted as alpha) is not determined by the p-value.
    The significance level of a test is a value that should be decided upon by the agent interpreting the data before the data are viewed, and is compared against the p-value or any other statistic calculated after the test has been performed. (However, reporting a p-value is more useful than simply saying that the results were or were not significant at a given level, and allows readers to decide for themselves whether to consider the results significant.)
  7. The p-value does not indicate the size or importance of the observed effect (compare with effect size). The two do vary together however – the larger the effect, the smaller sample size will be required to get a significant p-value.

One problem with p-values (and it often surprises people)

Okay, I’m glad that explanation is over, because now we can get to the interesting stuff, and it’s all based around assumptions. I worded the above coin-toss experiment pretty carefully: you keep flipping a coin until you wind up with 7 heads and 3 tails, 10 flips in all. The experimental process seems pretty straightforward: you flip a coin a bunch of times, and record what you get. Not much in the way of underlying assumptions to get in your way, right?

Wrong. There are actually a number of ways this experiment could have been designed, each yielding the same exact observed data, but each yielding totally different p-values. I’ll focus on two of those possible experiments here. Let’s say we went into the experiment saying “Alright, coin, I’m going to flip you 10 times, and then count how often you land on heads and how often you land on tails.” That’s probably how you assumed the experiment was run, but I never actually described it like that; all I said was that you wound up flipping a coin 10 times, seeing 7 heads and 3 tails. With that information, you could also have said “I’m going to flip this coin until I see 3 tails, and then stop.” If you remember the sequence above (heads, tails, heads, heads, tails, heads, heads, heads, heads, tails), that experimental design also fits our data perfectly. We kept flipping a coin until we saw 3 tails, and in the interim, we also observed 7 heads.

The problem here is in how p-values work. When you ask what the probability is that your result came from chance alone, a bunch of assumptions underlie the question, the key one here being the halting conditions of your experiment. Calculating how often 10 coin-tosses of a fair coin will result in a 7/3 split (or 8/2 or 9/1 or 10/0) will give you a different result than calculating how often waiting until the third tail will give you a 7/3 split (or 8/3 or 9/3 or 10/3 or 11/3 or 12/3 or…); the sketch below works out both numbers. The space of possibilities changes, and the actual p-value of your experiment changes, based on assumptions built into your experimental design. If you collect data with one set of halting conditions, and it turns out your p-value isn’t as low as you’d like it to be, you can just pretend you went into your experiment with different halting conditions and, voila!, your results become significant.
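Here is a small sketch working out the one-sided p-value for the same 7 heads and 3 tails under each halting condition; the arithmetic is mine, offered only to make the contrast visible.

```python
from math import comb

# Same observed data -- 7 heads, 3 tails -- under two experimental designs,
# both assuming a fair coin (the null hypothesis).

# Design 1: "flip exactly 10 times." One-sided p-value: the probability
# of 7 or more heads in 10 flips (a binomial tail).
p_fixed_flips = sum(comb(10, k) for k in range(7, 11)) / 2**10

# Design 2: "flip until the 3rd tail appears." One-sided p-value: the
# probability of seeing 7 or more heads before the 3rd tail, which is the
# same as getting at most 2 tails in the first 9 flips (a negative binomial tail).
p_fixed_tails = sum(comb(9, k) for k in range(0, 3)) / 2**9

print(round(p_fixed_flips, 3))  # ≈ 0.172
print(round(p_fixed_tails, 3))  # ≈ 0.090
```

Neither design happens to cross 0.05 with these particular flips, but the two p-values differ by nearly a factor of two from identical data; with slightly more extreme data (say, 9 heads before the 3rd tail), the stop-at-the-third-tail design dips below 0.05 while the fixed-number-of-flips design does not.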

The moral of the story is that p-values can be pretty sensitive to assumptions that are often left unspoken in an experiment. The same observed data can yield wildly different p-values. With enough mental somersaults regarding experimental assumptions, you can find that a 4/6 split in a coin toss actually resulted from flipping a coin that isn’t fair. Much of statistics is rife with unspoken assumptions about how data are distributed, how the experiment is run, and so on. It is for this reason that I’m trying desperately to get quantitative humanists using non-parametric and Bayesian methods from the very beginning, before our methodology becomes canonized and set. Bayesian methods are also, of course, rife with assumptions, but at least most of those assumptions are made explicit at the outset (see the small sketch below). Really, the most important thing is to learn not only how to run a statistic, but also what it means to do so. There are appropriate times to use p-values, but those times are probably not as frequent as is often assumed.
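To close with the contrast I’m gesturing at, here is a minimal sketch of a Bayesian treatment of the same coin, using scipy; the flat Beta(1, 1) prior is my assumption for illustration, not a recommendation, and that is rather the point: the assumption sits right there in the code.

```python
from scipy.stats import beta

# Update a flat Beta(1, 1) prior on the probability of heads with the
# observed 7 heads and 3 tails; the posterior is Beta(1 + 7, 1 + 3).
heads, tails = 7, 3
posterior = beta(1 + heads, 1 + tails)

print(posterior.mean())   # ≈ 0.667 -- posterior mean probability of heads
print(posterior.sf(0.5))  # ≈ 0.887 -- posterior probability the coin favors heads
```

Instead of a yes/no verdict at an arbitrary threshold, you get a posterior probability you can argue about, and the prior you would have to defend is written down explicitly.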