Submissions to Digital Humanities 2015 (pt. 1)

It’s that time of the year again! The 2015 Digital Humanities conference will take place next summer in Australia, and as per usual, I’m going to summarize what is being submitted to the conference and, eventually, how those submissions become accepted. Each year reviewers get the chance to “bid” on conference submissions, and this lets us get a peek inside the general trends in DH research. This post (pt. 1) will focus solely on this year’s submissions, and the next post will compare them to previous years and locations.

It’s important to keep in mind that trends in the conference over the last three years may be temporal, geographic, or accidental. The 2013 conference took place in Nebraska, 2014 in Switzerland, 2015 in Australia, and 2016 is set to happen in Poland; it’s to be expected that regional differences will significantly inform who is submitting pieces and what topics will be discussed.

This year, 358 pieces were submitted to the conference (about as many as were submitted to Nebraska in 2013, but more on that in the follow-up post). As with previous years, authors could submit four varieties of works: long papers, short papers, posters, and panels / multi-paper sessions. Long papers comprised 54% of submissions, panels 4%, posters 15%, and short papers 30%.

In total, there were 859 named authors on submissions – this number counts authors more than once if they appear on multiple submissions. Of those, 719 authors are unique.[1] Over half the submissions are multi-authored (58%), with 2.4 authors per submission on average, a median of 2 authors per submission, and a max of 10 authors on one submission. While the majority of submissions included multiple authors, the sheer number of single-authored papers still betrays the humanities roots of DH. The histogram is below.

A histogram of authors-per-submission.
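For readers who want to run the same kind of tally on their own data, here is a minimal sketch of the author counts described above. The submission list and author names are hypothetical toy data, not the conference data, and the variable names are my own.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy data: each inner list holds the author names on one submission.
submissions = [
    ["A. Author"],
    ["A. Author", "B. Writer"],
    ["C. Scholar", "D. Editor", "E. Coder"],
    ["F. Solo"],
]

authors_per_submission = [len(authors) for authors in submissions]

# Named authors, counting a person once per submission they appear on.
total_author_slots = sum(authors_per_submission)

# Unique authors, counting each person once overall.
unique_authors = {name for authors in submissions for name in authors}

# Share of submissions with more than one author.
multi_authored_share = sum(n > 1 for n in authors_per_submission) / len(submissions)

# Histogram: how many submissions have 1, 2, 3, ... authors.
histogram = Counter(authors_per_submission)

print(total_author_slots, len(unique_authors), multi_authored_share)
print(mean(authors_per_submission), median(authors_per_submission), max(authors_per_submission))
print(sorted(histogram.items()))
```

The unique-author count is only as good as the name matching behind it, which is why the note at the end of this post matters.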

As with previous years, authors may submit articles in any of a number of languages. The theme of this year’s conference is “Global Digital Humanities”, but if you expected a multi-lingual conference, you might be disappointed. Of the 358 submissions, 353 are in English. The rest are in French (2), Italian (2), and German (1).

Submitting authors could select from a controlled vocabulary to tag their submissions with topics. There were 95 topics to choose from, and their distribution is not especially surprising. Two submissions were each tagged with 25 topics, suggesting they are impressively far reaching, but for the most part submissions stuck to 5-10 topics. The breakdown of submissions by topic is below, where the percentage represents the percentage of submissions which are tagged by a specific topic. My interpretation is below that.

Percentage of submissions tagged with a specific topic.

A full 21% of submissions include some form of Text Analysis, and a similar number claim Text or Data Mining as a topic. Other popular methodological topics are Visualizations, Network Analysis, Corpus Analysis, and Natural Language Processing. The DH-o-sphere is still pretty text-heavy; Audio, Video, and Multimedia are pretty low on the list, GIS even lower, and Image Analysis (surprisingly) even lower still. Bibliographic methods, Linguistics, and other approaches more traditionally associated with the humanities appear pretty far down the list. Other tech-y methods, like Stylistics and Agent-Based Modeling, are near the bottom. If I had to guess, the former is on its way down, and the latter on its way up.

Unsurprisingly, regarding disciplinary affiliations, Literary Studies is at the top of the food chain (I’ll talk more about how this compares to previous years in the next post), with Archives and Repositories not far behind. History is near the top tier, but not quite there, which is pretty standard. I don’t recall the exact link, but Ben Schmidt argued pretty convincingly that this may be because there are simply fewer new people in History than in Literary Studies. Digitization seems to be regaining some of the ground it lost in previous years. The information science side (UX Design, Knowledge Representation, Information Retrieval, etc.) seems reasonably strong. Cultural Studies is pretty well-represented, and Media Studies, English Studies, Art History, Anthropology, and Classics are among the other DH-inflected communities out there.

Thankfully we’re not completely an echo chamber yet; only about a tenth of the submissions are about DH itself – not great, not terrible. We still seem to do a lot of talking about ourselves, and I’d like to see that number decrease over the next few years. Pedagogy-related submissions are also still a bit lower than I’d like, hovering around 10%. Submissions on the “World Wide Web” are decreasing, which is to be expected, and TEI isn’t far behind.

All in all, I don’t really see the trend toward “Global Digital Humanities” that the conference is themed to push, but perhaps a more complex content analysis will reveal a more global DH than we’ve seen in the past. The self-written Keyword tags (as opposed to the Topic tags, not a controlled vocabulary) reveal a bit more internationalization, although I’ll leave that analysis for a future post.

It’s worth pointing out there’s a statistical property at play that makes it difficult to see deviations from the norm. Shakespeare appears prominently because many still write about him, but even if Shakespearean research is outnumbered by work on more international playwrights, it’d be difficult to catch, because I have no category for “international playwright” – each one would be siphoned off into its own category. Thus, even if the less well-known long tail topics significantly outweigh the more popular topics, that fact would be tough to catch.

All in all, it looks like DH2015 will be an interesting continuation of the DH tradition. Perhaps the most surprising aspect of my analysis was that nothing in it surprised me; half-way around the globe, and the trends over there are pretty identical to those in Europe and the Americas. It’ll take some more searching to see if this is a function of the submitting authors being the same as previous years (whether they’re all simply from the Western world), or whether it is actually indicative of a fairly homogeneous global digital humanities.

Stay tuned for Part 2, where I compare the analysis to previous years’ submissions, and maybe even divine future DH conference trends using tea leaves or goat entrails or predictive modeling (whichever seems the most convincing; jury’s still out).

Notes:

  1. As far as I can tell – I used all the text similarity methods I could think of to unify the nearly-duplicate names.
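As an illustration of the kind of name unification mentioned in this note, here is a minimal sketch using the standard library's `difflib.SequenceMatcher`. This is only one of many possible similarity measures and is not necessarily what was used for the counts above. The names, the 0.85 threshold, and the helper functions are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when two normalized names are at least `threshold` similar."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def unify(names, threshold: float = 0.85):
    """Greedy grouping: each name joins the first cluster whose first member it resembles."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similar(name, cluster[0], threshold):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical names, not taken from the submission data.
names = ["Jane Q. Doe", "Jane Q Doe", "jane q. doe", "John Smith"]
clusters = unify(names)
print(len(clusters), "unique authors after unification")
```

A greedy pass like this can both over-merge (two different people with similar names) and under-merge (initials versus full first names), so the resulting count is best treated as approximate.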

Acceptances to Digital Humanities 2014 (part 1)

It’s that time again! The annual Digital Humanities conference schedule has been released, and this time it’s in Switzerland. In an effort to console myself for not having the funding to make it this year, I’ve gone ahead and analyzed the nitty-gritty of acceptances and rejections to the conference. For those interested in this sort of analysis, you can find my take on submissions to DH2013, acceptances at DH2013, and submissions to DH2014. If you’re visiting this page from the future, you can find any future DH conference analyses at this tag link.

The overall acceptance rate to DH2014 was 59%, although that includes many papers and panels that were accepted as posters. There were 589 submissions this year (compared to 348 submissions last year), of which 345 were accepted. By submission medium, this is the breakdown:

  • Long papers: 62% acceptance rate (lower than last year)
  • Short papers: 52% acceptance rate (lower than last year)
  • Panels: 57% acceptance rate (higher than last year)
  • Posters: 64% acceptance rate (didn’t collect this data last year)
Figure 1: Acceptances to DH2014 by submission medium.

A surprising number of submitted papers switched from one medium to another when they were accepted. A number of panels became long papers, a bunch of short papers became long papers, and a bunch of long papers became short papers. Although a bunch of submissions became posters, no posters wound up “breaking out” to become some other medium. I was most surprised by the short papers which became long (13 in all), which leads me to believe some of them may have been converted for scheduling reasons. This is idle speculation on my part – the organizers may reply otherwise. [Edit: the organizers did reply, and assured us this was not the case. I see no reason to doubt that, so congratulations to those 13 short papers that became long papers!]

Figure 2: Medium switches in DH2014 between submission and acceptance.
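A tally like the one behind Figure 2 can be sketched by counting (submitted medium, accepted medium) pairs. The records below are hypothetical and far smaller than the real data. The acceptance rate here counts a submission as accepted if it was accepted in any medium, which matches how the overall rate above is described.

```python
from collections import Counter

# Hypothetical records: (medium as submitted, medium as accepted, or None if not accepted).
records = [
    ("long", "long"),
    ("long", "short"),
    ("short", "long"),
    ("short", None),
    ("panel", "long"),
    ("poster", "poster"),
    ("long", "poster"),
]

# How often each submitted -> accepted pair occurs.
transitions = Counter(records)

# Only the accepted submissions whose medium changed.
switched = {
    pair: count
    for pair, count in transitions.items()
    if pair[1] is not None and pair[0] != pair[1]
}

def acceptance_rate(medium: str) -> float:
    """Share of submissions in `medium` that were accepted in any medium."""
    submitted = [r for r in records if r[0] == medium]
    accepted = [r for r in submitted if r[1] is not None]
    return len(accepted) / len(submitted)

print(switched)
print(acceptance_rate("short"))
```

Because withdrawals are not recorded, a `None` in this sketch stands for "not accepted", not necessarily "rejected".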

It’s worth keeping in mind, in all analyses listed here, that I do not have access to any withdrawals; accepted papers were definitely accepted, but those not accepted may have been withdrawn rather than rejected.

Figures 3 and 4 both present the same data, but shed slightly different lights on digital humanities. Each shows the acceptance rate by various topics, but they’re ordered slightly differently. All submitting authors needed to select from a limited list of topics to label their submissions, in order to aid with selecting peer reviewers and categorization.

Figure 3 sorts topics by the total number that were accepted to DH2014. This is at odds with Figure 2 from my post on DH2014 submissions, which sorts by total number of topics submitted. The figure from my previous post gives a sense of what digital humanists are doing and submitting, whereas Figure 3 from this post gives a sense of what the visitor to DH2014 will encounter.

Figure 3: Topical acceptance to DH2014 sorted by total number of accepted papers tagged with a particular topic.

The visitor to DH2014 won’t see a hugely different topical landscape than the visitor to DH2013 (see analysis here). Literary studies, text analysis, and text mining still reign supreme, with archives and repositories not far behind. Visitors will see quite a bit fewer studies dedicated to the internet and the world wide web, and quite a bit more dedicated to historical and corpus-based research. More details can be seen by comparing the individual figures.

Figure 4, instead, sorts the topics by their acceptance rate. The most frequently accepted topics appear at the left, and the least frequently appear at the right. A lighter red line is used to show acceptance rates of the same topics for 2013. This graph shows what peers consider to be more credit-worthy, and how this has changed since 2013.

Figure 4: Topical acceptance to DH2014 sorted by percentage of acceptance for each topic.

It’s worth pointing out that the highest and lowest acceptance rates shouldn’t be taken very seriously; with so few submitted articles, the rates are as likely random as indicative of any particularly interesting trend. Also, for comparisons with 2013, keep in mind the North American and European traditions of digital humanities may be driving the differences.

There are a few acceptance ratios worthy of note. English studies and GLAM (Galleries, Libraries, Archives, Museums) both have acceptance rates well above average, and also quite a bit higher than their acceptance rates from the previous year. Studies of XML are accepted slightly above the average acceptance rate, and also accepted proportionally more frequently than they were in 2013. Acceptance rates for both literary and historical studies papers are about average, and haven’t changed much since 2013 (even though there were quite a few more historical submissions than the previous year).

Along with an increase in GLAM acceptance rates, there was a big increase in rates for studies involving archives and repositories. It may be they are coming back in style, or it may be indicative of a big difference between European and North American styles. There was a pretty big drop in acceptance rates for ontology and semantic web research, as well as in pedagogy research across the board. Pedagogy had a weak foothold in DH2013, and has an even weaker foothold in 2014, with both fewer submitted articles, and a lower rate of acceptance on those submitted articles.

In the next blog post, I plan on drilling a bit into author-supplied keywords, the role of gender on acceptance rates, and the geography of submissions. As always, I’m happy to share data, but in this case I will only share sufficiently aggregated/anonymized data, because submitting authors who did not get accepted have an expectation of privacy that I intend to keep.

Submissions to Digital Humanities 2014

Submissions for the 2014 Digital Humanities conference just closed. It’ll be in Switzerland this time around, which unfortunately means I won’t be able to make it, but I’ll be eagerly following along from afar. Like last year, reviewers are allowed to preview the submitted abstracts. Also like last year, I’m going to be a reviewer, which means I’ll have the opportunity to revisit the submissions to DH2013 to see how the submissions differed this time around. No doubt when the reviews are in and the accepted articles are revealed, I’ll also revisit my analysis of DH conference acceptances.

To start with, the conference organizers received a record number of submissions this year: 589. Last year’s Nebraska conference only received 348 submissions. The general scope of the submissions hasn’t changed much; authors were still supposed to tag their submissions using a controlled vocabulary of 95 topics, and were also allowed to submit keywords of their own making. Like last year, authors could submit long papers, short papers, panels, or posters, but unlike last year, multilingual submissions were encouraged (English, French, German, Italian, or Spanish). [edit: Bethany Nowviskie, patient awesome person that she is, has noticed yet another mistake I’ve made in this series of posts. Apparently last year they also welcomed multilingual submissions, and it is standard practice.]

Digital Humanities is known for its collaborative nature, and not much has changed in that respect between 2013 and 2014 (Figure 1). Submissions had, on average, between two and three authors, with 60% of submissions in both years having at least two authors. This year, a few fewer papers have single authors, and a few more have two authors, but the difference is too small to be attributable to anything but noise.

Figure 1. Number of authors per paper.

The distribution of topics being written about has changed mildly, though rarely in extreme ways. Any changes visible should also be taken with a grain of salt, because a trend over a single year is hardly statistically robust to small changes, say, in the location of the event.

The grey bars in Figure 2 show what percentage of DH2014 submissions are tagged with a certain topic, and the red dotted outlines show what the percentages were in 2013. The upward trends to note this year are text analysis, historical studies, cultural studies, semantic analysis, and corpora and corpus activities. Text analysis was tagged to 15% of submissions in 2013 and is now tagged to 20% of submissions, or one out of every five. Corpus analysis similarly bumped from 9% to 13%. Clearly this is an important pillar of modern DH.

Figure 2. Topics from DH2014 ordered by the percent of submissions which fall in that category. The red dotted outlines represent the percentage from DH2013.
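When reading year-over-year shifts like these, it helps to separate the change in percentage points from the relative change. The sketch below uses the two pairs of shares quoted above. The helper functions and the toy tag sets in the usage line are my own illustration, not the code behind the figure.

```python
# Shares of submissions tagged with each topic, in percent, as quoted in the
# text above: (DH2013, DH2014).
quoted_shares = {
    "Text Analysis": (15, 20),
    "Corpora and Corpus Activities": (9, 13),
}

def topic_share(tagged_submissions, topic):
    """Percent of submissions whose tag set includes `topic`."""
    hits = sum(topic in tags for tags in tagged_submissions)
    return 100 * hits / len(tagged_submissions)

def describe_change(before, after):
    """Return (percentage-point change, relative change) between two shares."""
    return after - before, (after - before) / before

for topic, (share_2013, share_2014) in quoted_shares.items():
    points, relative = describe_change(share_2013, share_2014)
    print(f"{topic}: {points:+d} points, {relative:+.0%} relative")

# Usage with hypothetical tag sets: two of four submissions carry topic "a".
print(topic_share([{"a"}, {"b"}, {"a", "b"}, {"c"}], "a"))
```

A five-point rise from a base of 15% is a one-third relative increase, which is why modest-looking point changes can still be substantial for smaller topics.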

I’ve pointed out before that History is secondary compared to Literary Studies in DH (although Ted Underwood has convincingly argued, using Ben Schmidt’s data, that the numbers may merely be due to fewer people studying history). This year, however, historical studies nearly doubled in presence, from 10% to 17%. I haven’t yet collected enough years of DH conference data to see if this is a trend in the discipline at large, or more of a difference between European and North American DH. Semantic analysis jumped from 1% to 7% of the submissions, cultural studies went from 10% to 14%, and literary studies stayed roughly equivalent. Visualization, one of the hottest topics of DH2013, has become even hotter in 2014 (14% to 16%).

The most visible drops in coverage came in pedagogy, scholarly editions, user interfaces, and research involving social media and the web. At DH2013, submissions on pedagogy had a surprisingly low acceptance rate, which, combined with the drop in pedagogy submissions this year (11% to 8% in “Digital Humanities – Pedagogy and Curriculum” and 7% to 4% in “Teaching and Pedagogy”), might suggest a general decline in interest in the DH world in pedagogy. “Scholarly Editing” went from 11% to 7% of the submissions, and “Interface and User Experience Design” from 13% to 8%, which is yet more evidence for the lack of research going into the creation of scholarly editions compared to several years ago. The most surprising drops for me were those in “Internet / World Wide Web” (12% to 8%) and “Social Media” (8.5% to 5%), which I would have guessed would be growing rather than shrinking.

The last thing I’ll cover in this post is the author-chosen keywords. While authors needed to tag their submissions from a list of 95 controlled vocabulary words, they were also encouraged to tag their entries with keywords they could choose themselves. In all they chose nearly 1,700 keywords to describe their 589 submissions. In last year’s analysis of these keywords, I showed that visualization seemed to be the glue that held the DH world together; whether discussing TEI, history, network analysis, or archiving, all the disparate communities seemed to share visualization as a primary method. The 2014 keyword map (Figure 3) reveals the same trend: visualization is squarely in the middle. In this graph, two keywords are linked if they appear together on the same submission, thus creating a network of keywords as they co-occur with one another. Words appear bigger when they span communities.

Figure 3. Co-occurrence of DH2014 author-submitted keywords.

Despite the multilingual conference, the large component of the graph is still English. We can see some fairly predictable patterns: TEI is coupled quite closely with XML; collaboration is another keyword that binds the community together, as is (obviously) “Digital Humanities.” Linguistics and literature are tightly coupled, much more so than, say, linguistics and history. It appears the distant reading of poetry is becoming popular, which I’d guess is a relatively new phenomenon, although I haven’t gone back and checked.

This work has been supported by an ACH microgrant to analyze DH conferences and the trends of DH through them, so keep an eye out for more of these posts forthcoming that look through the last 15 years. Though I usually share all my data, I’ll be keeping these to myself, as the submitters to the conference did so under an expectation of privacy if their proposals were not accepted.

[edit: there was some interest on twitter last night for a raw frequency of keywords. Because keywords are author-chosen and I’m trying to maintain some privacy on the data, I’m only going to list those keywords used at least twice. Here you go (Figure 4)!]

Figure 4. Keywords used in DH2014 submissions ordered by frequency.

Acceptances to Digital Humanities 2013 (part 1)

The 2013 Digital Humanities conference in Nebraska just released its program with a list of papers and participants. As some readers may recall, when the initial round of reviews went out for the conference, I tried my hand at analyzing submissions to DH2013. Now that the schedule has been released, the data available puts us in a unique position to compare proposed against accepted submissions, thus potentially revealing how what research is being done compares with what research the DH community (through reviews) finds good or interesting. In my last post, I showed that literary studies and data/text mining submissions were at the top of the list; only half as many studies were historical rather than literary. Archive work and visualizations were also near the top of the list, above multimedia, web, and content analyses, though each of those was high as well.

A keyword analysis showed that while Visualization wasn’t necessarily at the top of the list, it was the most central concept connecting the rest of the conference together. Nobody knows (and few care) what DH really means; however, these analyses present the factors that bind together those who call themselves digital humanists and submit to its main conference. The post below explores to what extent submissions and acceptances align. I preserve anonymity wherever possible, as submitting authors did not do so with the expectation that turned down submission data would be public.

It’s worth starting out with a few basic acceptance summary statistics. As I don’t have access to poster data yet, nor do I have access to withdrawals, I can’t calculate the full acceptance rate, but there are a few numbers worth mentioning. Just take all of the percentages as lower bounds, where withdrawals or posters might make the acceptance rate higher. Of the 144 long papers submitted, 66.7% of them (96) were accepted, although only 57.6% (83) were accepted as long papers; another 13 were accepted as short papers instead. Half of the submitted panels were accepted, although curiously, one of the panels was accepted instead as a long paper. For short papers, only 55.9% of those submitted were accepted. There were 66 poster submissions, but I do not know how many of those were accepted, or how many other submissions were accepted as posters instead. In all, excluding posters, 60.9% of submitted proposals were accepted. More long papers than short papers were submitted, but roughly equal numbers of both were accepted. People who were turned down should feel comforted by the fact that they faced some stiff competition.
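The long-paper figures can be checked directly from the counts given above. This short sketch uses only those three numbers; the variable names are my own.

```python
# Counts quoted in the paragraph above.
long_submitted = 144
accepted_as_long = 83
accepted_as_short = 13

# Long-paper submissions accepted in either medium.
accepted_any = accepted_as_long + accepted_as_short

# Rates as fractions of all long-paper submissions.
rate_any = accepted_any / long_submitted
rate_as_long = accepted_as_long / long_submitted

print(accepted_any)
print(f"{rate_any:.1%} accepted in some form")
print(f"{rate_as_long:.1%} accepted as long papers")
```

Both rates are lower bounds in the sense described above, since long papers accepted as posters are not in the data.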

As with most quantitative analyses, the interesting bits come more when comparing internal data than when looking at everything in aggregate. The first three graphs do just that, and are in fact the same data, but ordered differently. When authors submitted their papers to the conference, they could pick any number of keywords from a controlled vocabulary. Looking at how many times each keyword was submitted with a paper (Figure 1) can give us a basic sense of what people are doing in the digital humanities. From Figure 1 we see (again, as a version of this viz appeared in the last post) that “Literary Studies” and “Text Mining” are the most popular keywords among those who submitted to DH2013; the rest you can see for yourself. The total height of the bar (red + yellow) represents the number of total submissions to the conference.

Figure 1: Acceptance rates of DH2013 by Keywords attached to submissions, sorted by number of submissions.

Figure 2 shows the same data as Figure 1, but sorted by acceptance rates rather than the total number of submissions. As before, because we don’t know about poster acceptance rates or withdrawals, you should take these data with a grain of salt, but assuming a fairly uniform withdrawal/poster rate, we can still make some basic observations. It’s also worth pointing out that the fewer overall submissions to the conference with a certain keyword, the less statistically meaningful the acceptance rate; with only one submission, whether or not it’s accepted could as much be due to chance as due to some trend in the minds of DH reviewers.
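One way to make the small-sample caveat concrete is to put an interval around each acceptance rate. The sketch below uses the Wilson score interval, which is my choice for illustration and not something computed for the figures in this post. The counts in the usage lines are hypothetical.

```python
from math import sqrt

def wilson_interval(accepted: int, submitted: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    p = accepted / submitted
    denom = 1 + z**2 / submitted
    center = (p + z**2 / (2 * submitted)) / denom
    half_width = z * sqrt(p * (1 - p) / submitted + z**2 / (4 * submitted**2)) / denom
    return center - half_width, center + half_width

# Hypothetical counts: a keyword with one submission versus one with a hundred.
low_one, high_one = wilson_interval(1, 1)
low_many, high_many = wilson_interval(60, 100)

print(f"1 of 1 accepted:    {low_one:.2f} to {high_one:.2f}")
print(f"60 of 100 accepted: {low_many:.2f} to {high_many:.2f}")
```

A keyword with a single accepted submission has a nominal 100% acceptance rate, but the interval around it is so wide that it says very little about reviewer preferences.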

With those caveats in mind, Figure 2 can be explored. One thing that immediately pops out is that “Literary Studies” and “Text Mining” both have higher than average acceptance rates, suggesting that not only are a lot of DHers doing that kind of research, but that kind of research is still interesting enough that a large portion of it is getting accepted as well. Contrast this with the topic of “Visualization,” whose acceptance rate is closer to 40%, significantly lower than the average acceptance rate of 60%. Perhaps this means that most reviewers thought visualizations worked better as posters, the data for which we do not have, or perhaps it means that the relatively low barrier to entry on visualizations and their ensuing proliferation make them more fun to do than interesting to read or review.

“Digitisation – Theory and Practice” has a nearly 60% acceptance rate, yet “Digitisation; Resource Creation; and Discovery” has around 40%, suggesting that perhaps reviewers are more interested in discussions about digitisation than the actual projects themselves, even though far more “Digitisation; Resource Creation; and Discovery” papers were submitted than “Digitisation – Theory and Practice.” The imbalance between what was submitted and what was accepted on that front is particularly telling, and worth a more in-depth exploration by those who are closer to the subject. Also tucked at the bottom of the acceptance rate list are three related keywords, “Digital Humanities – Institutional Support,” “Digital Humanities – Facilities,” & “Glam: Galleries; Libraries; Archives; Museums,” each with a 25% acceptance rate. It’s clear the reviewers were not nearly as interested in digital humanities infrastructure as they were in digital humanities research. As I’ve noted a few times before, “Historical Studies” is also not well-represented, with both a lower acceptance rate than average and a lower submission rate than average. Modern digital humanities, at least as it is represented by this conference, appears far more literary than historical.

Figure 2. Acceptance rates of DH2013 by Keywords attached to submissions, sorted by acceptance rate.

Figure 3, once again, has the same data as Figures 2 and 1, but is this time sorted simply by accepted papers and panels. This is the front face of DH2013; the landscape of the conference (and by proxy the discipline) as seen by those attending. While this reorientation of the graph doesn’t show us much we haven’t already seen, it does emphasize the oddly low acceptance rates of infrastructural submissions (facilities, libraries, museums, institutions, etc.). While visualization acceptance rates were a bit low, attendees of the conference will still see a great number of them, because the initial submission rate was so high. Conference goers will see that DH maintains a heavy focus on the many aspects of text: its analysis, its preservation, its interfaces, and so forth. The web also appears well-represented, both in the study of it and development on it. Metadata is perhaps not as strong a focus as it once was (historical DH conference analysis would help in confirming this speculation on my part), and reflexivity, while high (nearly 20 “Digital Humanities – Nature and Significance” submissions), is far from overwhelming.

A few dozen papers will be presented on multimedia beyond simple text – a small but not insignificant subgroup. Fewer still are papers on maps, stylometry, or medieval studies, three subgroups I imagine once had greater representation. They currently each show about the same force as gender studies, which had a surprisingly high acceptance rate of 85% and is likely up-and-coming in the DH world. Pedagogy was much better represented in submissions than acceptances, and a newcomer to the field coming to the conference for the first time would be forgiven for thinking pedagogy was less of an important subject in DH than veterans might think it is.

Figure 3. Acceptance rates of DH2013 by Keywords attached to submissions, sorted by number of accepted papers.

As what’s written so far is already a giant wall of text, I’ll go ahead and leave it at this for now. When next I have some time I’ll start analyzing some networks of keywords and titles to find which keywords tend to be used together, and whatever other interesting things might pop up. Suggestions and requests, as always, are welcome.

 

Analyzing submissions to Digital Humanities 2013

Digital Humanities 2013 is on its way; submissions are closed, peers will be reviewing them shortly, and (most importantly for this post) the people behind the conference are experimenting with a new method of matching submissions to reviewers. It’s a bidding process; reviewers take a look at the many submissions and state their reviewing preferences or, when necessary, conflicts of interest. It’s unclear the extent to which these preferences will be accommodated, as this is an experiment on their part. Bethany Nowviskie describes it here. As a potential reviewer, I just went through the process of listing my preferences, and managed to do some data scraping while I was there. How could I not? All 348 submission titles were available to me, as well as their authors, topic selections, and keywords, and given that my submission for this year is all about quantitatively analyzing DH, it was an opportunity I could not pass up. Given that these data are sensitive, and those who submitted did so under the assumption that rejected submissions would remain private, I’m opting not to release the data or any non-aggregated information. I’m also doing my best not to actually read the data in the interest of the privacy of my peers; I suppose you’ll all just have to trust me on that one, though.

So what are people submitting? According to the topics authors assigned to their 348 submissions, 65 submitted articles related to “literary studies,” trailed closely by 64 submissions which pertained to “data mining/ text mining.” Work on archives and visualizations is also up near the top, and only about half as many authors submitted historical studies (37) as those who submitted literary ones (65). This confirms my long suspicion that our current wave of DH (that is, what’s trending and exciting) focuses quite a bit more on literature than history. This makes me sad. You can see the breakdown in Figure 1 below, and further analysis can be found after.

Figure 1: Number of documents with each topic authors assigned to submissions for DH2013.

The majority of authors attached fewer than five topics to their submissions; a small handful included over 15. Figure 2 shows the number of topics assigned to each document.

Figure 2: The number of topics attached to each document, in order of rank.

I was curious how strongly each topic coupled with other topics, and how topics tended to cluster together in general, so I extracted a topic co-occurrence network. That is, whenever two topics appear on the same document, they are connected by an edge (see Networks Demystified Pt. 1 for a brief introduction to this sort of network); the more times two topics co-occur, the stronger the weight of the edge between them.

Topping off the list at 34 co-occurrences were “Data Mining/ Text Mining” and “Text Analysis,” not terrifically surprising as the latter generally requires the former, followed by “Data Mining/ Text Mining” and “Content Analysis” at 23 co-occurrences, “Literary Studies” and “Text Analysis” at 22 co-occurrences, “Content Analysis” and “Text Analysis” at 20 co-occurrences, and “Data Mining/ Text Mining” and “Literary Studies” at 19 co-occurrences. Basically what I’m saying here is that Literary Studies, Mining, and Analysis seem to go hand-in-hand.
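For anyone who wants to build this kind of weighted co-occurrence network from their own tagged documents, a minimal sketch follows. The four tagged submissions are hypothetical, and the function name is my own; only the counting rule (one unit of edge weight per document in which two topics appear together) comes from the description above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: the set of topics attached to each submission.
submissions = [
    {"Literary Studies", "Text Analysis", "Data Mining/ Text Mining"},
    {"Text Analysis", "Data Mining/ Text Mining"},
    {"Literary Studies", "Text Analysis"},
    {"Historical Studies"},
]

def cooccurrence_weights(tagged):
    """Count, for every pair of topics, the number of submissions carrying both."""
    weights = Counter()
    for topics in tagged:
        # Sorting gives each undirected pair a single canonical key.
        for a, b in combinations(sorted(topics), 2):
            weights[(a, b)] += 1
    return weights

weights = cooccurrence_weights(submissions)
for (a, b), weight in weights.most_common():
    print(f"{weight}  {a} <-> {b}")
```

A submission with a single topic contributes no edges, so topics that only ever appear alone will be missing from the network.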

Knowing my readers, about half of you are already angry with me for counting co-occurrences, and rightly so. That measurement is heavily biased by the sheer number of times a topic is used; if “literary studies” is attached to 65 submissions, it’s much more likely to co-occur with any particular topic than topics (like “teaching and pedagogy”) which simply appear less frequently. The highest frequency topics will co-occur with one another simply by an accident of magnitude.
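To put a rough number on that accident of magnitude, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that topics are attached to submissions independently of one another, which they certainly are not; the helper name `expected_cooccurrences` is hypothetical. The counts are the ones quoted above: 65 “literary studies” submissions, 64 “data mining/ text mining” submissions, 348 submissions in total, and 19 observed co-occurrences between the two.

```python
def expected_cooccurrences(count_a, count_b, total_documents):
    """Co-occurrences expected by chance if two topics were assigned
    independently (an illustrative assumption, not a claim about the data)."""
    return count_a * count_b / total_documents


# Counts quoted in the post.
expected = expected_cooccurrences(65, 64, 348)
observed = 19
ratio = observed / expected

print(f"expected by chance: {expected:.2f}")
print(f"observed / expected: {ratio:.2f}")
```

Under that assumption, roughly a dozen of the 19 co-occurrences would turn up from topic popularity alone, which is why raw counts between two popular topics need some kind of normalization before they mean much.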

To account for this, I measured the neighborhood overlap of each node on the topic network. This involves first finding the number of other topics a pair of topics shares. For example, “teaching and pedagogy” and “digital humanities – pedagogy and curriculum” each co-occur with several of the same topics, including “programming,” “interdisciplinary collaboration,” and “project design, organization, management.” I summed up the number of topical co-occurrences between each pair of topics, and then divided that total by the number of co-occurrences each node in the pair had individually. In short, I looked at which pairs of topics tended to share similar other topics, making sure to take into account that some topics which are used very frequently might need some normalization. There are better normalization algorithms out there, but I opted to use this one for its simplicity, for pedagogical reasons. The method does a great job leveling the playing field between pairs of infrequently-used topics compared to pairs of frequently-used topics, but doesn’t fare so well when looking at a pair where one topic is popular and the other is not. The algorithm is well described in Figure 3, where the darker the edge, the higher the neighborhood overlap.

Figure 3: The neighborhood overlap between two nodes is how many neighbors (or connections) that pair of nodes shares. As such, A and B share very few connections, so their overlap is low, whereas D and E have quite a high overlap. Via Jaroslav Kuchar.

Neighborhood overlap paints a slightly different picture of the network. The pair of topics with the largest overlap was “Internet / World Wide Web” and “Visualization,” with 90% of their neighbors overlapping. Unsurprisingly, the next-strongest pair was “Teaching and Pedagogy” and “Digital Humanities – Pedagogy and Curriculum.” The data might be used to suggest multiple topics that could be merged into one, and this pair seems to be a pretty good candidate. “Visualization” also closely overlaps “Data Mining/ Text Mining”, which itself (as we saw before) overlaps with “Cultural Studies” and “Literary Studies.” What we see from this close clustering, both in overlap and in connection strength, are the traces of a fairly coherent subfield within DH, that of quantitative literary studies. We see a similarly tight-knit cluster between topics concerning archives, databases, analysis, the web, visualizations, and interface design, which suggests another genre in the DH community: the (relatively) recent boom of user interfaces as workbenches for humanists exploring their archives. Figure 4 represents the pairs of topics which overlap to the highest degree; topics without high degrees of pair correspondence don’t appear on the network graph.

Figure 4: Network of topical neighborhood overlap. Edges between topics are weighted according to how structurally similar the two topics are. Topics that are structurally isolated are not represented in this network visualization.

The topics authors chose for each submission were from a controlled vocabulary. Authors also had the opportunity to attach their own keywords to submissions, which unsurprisingly yielded a much more diverse (and often redundant) network of co-occurrences. The resulting network revealed a few surprises: for example, “topic modeling” appears to be much more closely coupled with “visualization” than with “text analysis” or “text mining.” Of course some pairs are not terribly surprising, as with the close connection between “Interdisciplinary” and “Collaboration.” The graph also shows that the organizers have done a pretty good job putting the curated topic list together, as a significant chunk of the keywords that survive a high threshold are also available in the topic list, with a few notable exceptions. “Scholarly Communication,” for example, is a frequently used keyword but not available as a topic – perhaps next year, this sort of analysis can be used to help augment the curated topic list. The keyword network appears in Figure 5. I’ve opted not to include a truly high resolution image to dissuade readers from trying to infer individual documents from the keyword associations.

Figure 5: Which keywords are used together on documents submitted to DH2013? Nodes are colored by cluster, and edges are weighted by number of co-occurrences.

There’s quite a bit of rich data here to be explored, and anyone who does have access to the bidding can easily see that the entire point of my group’s submission is exploring the landscape of DH, so there’s definitely more to come on the subject from this blog. I especially look forward to seeing what decisions wind up being made in the peer review process, and whether or how that skews the scholarly landscape at the conference.

On a more reflexive note, looking at the data makes it pretty clear that DH isn’t as fractured as some occasionally suggest (New Media vs. Archives vs. Analysis, etc.). Every document is related to a few others, and all of them together are connected in a rich family, a network, of Digital Humanities. There are no islands or isolates. While there might be no “The” Digital Humanities, no unifying factor connecting all research, there are Wittgensteinian family resemblances connecting all of these submissions together, in a cohesive enough whole to suggest that yes, we can reasonably continue to call our confederation a single community. Certainly, there are many sub-communities, but there still exists an internal cohesiveness that allows us to differentiate ourselves from, say, geology or philosophy of mind, which themselves have their own internal cohesiveness.