Categories
personal research

Submissions to Digital Humanities 2014

Submissions for the 2014 Digital Humanities conference just closed. It’ll be in Switzerland this time around, which unfortunately means I won’t be able make it, but I’ll be eagerly following along from afar. Like last year, reviewers are allowed to preview the submitted abstracts. Also like last year, I’m going to be a reviewer, which means I’ll have the opportunity to revisit the submissions to DH2013 to see how the submissions differed this time around. No doubt when the reviews are in and the accepted articles are revealed, I’ll also revisit my analysis of DH conference acceptances.

To start with, the conference organizers received a record number of submissions this year: 589. Last year’s Nebraska conference only received 348 submissions. The general scope of the submissions haven’t changed much; authors were still supposed to tag their submissions using a controlled vocabulary of 95 topics, and were also allowed to submit keywords of their own making. Like last year, authors could submit long papers, short papers, panels, or posters, but unlike last year, multilingual submissions were encouraged (English, French, German, Italian, or Spanish). [edit: Bethany Nowviskie, patient awesome person that she is, has noticed yet another mistake I’ve made in this series of posts. Apparently last year they also welcomed multilingual submissions, and it is standard practice.]

Digital Humanities is known for its collaborative nature, and not much has changed in that respect between 2013 and 2014 (Figure 1). Submissions had, on average, between two and three authors, with 60% of submissions in both years having at least two authors. This year, a few fewer papers have single authors, and a few more have two authors, but the difference is too small to be attributable to anything but noise.

Figure 1. Number of authors per paper.
Figure 1. Number of authors per paper.

The distribution of topics being written about has changed mildly, though rarely in extreme ways. Any changes visible should also be taken with a grain of salt, because a trend over a single year is hardly statistically robust to small changes, say, in the location of the event.

The grey bars in Figure 2 show what percentage of DH2014 submissions are tagged with a certain topic, and the red dotted outlines show what the percentages were in 2013. The upward trends to note this year are text analysis, historical studies, cultural studies, semantic analysis, and corpora and corpus activities. Text analysis was tagged to 15% of submissions in 2013 and is now tagged to 20% of submissions, or one out of every five. Corpus analysis similarly bumped from 9% to 13%. Clearly this is an important pillar of modern DH.

Figure 2. Topics from DH2014 ordered by the percent of submissions which fall in that category. The dotted lines represent the percentage from DH2013.
Figure 2. Topics from DH2014 ordered by the percent of submissions which fall in that category. The red dotted outlines represent the percentage from DH2013.

I’ve pointed out before that History is secondary compared to Literary Studies in DH (although Ted Underwood has convincingly argued, using Ben Schmidt’s data, that the numbers may merely be due to fewer people studying history). This year, however, historical studies nearly doubled in presence, from 10% to 17%. I haven’t yet collected enough years of DH conference data to see if this is a trend in the discipline at large, or more of a difference between European and North American DH. Semantic analysis jumped from 1% to 7% of the submissions, cultural studies went from 10% to 14%, and literary studies stayed roughly equivalent. Visualization, one of the hottest topics of DH2013, has become even hotter in 2014 (14% to 16%).

The most visible drops in coverage came in pedagogy, scholarly editions, user interfaces, and research involving social media and the web. At DH2013, submissions on pedagogy had a surprisingly low acceptance rate, which combined the drop in pedagogy submissions this year (11% to 8% in “Digital Humanities – Pedagogy and Curriculum” and 7% to 4% in “Teaching and Pedagogy”) might suggest a general decline in interest in the DH world in pedagogy. “Scholarly Editing” went from 11% to 7% of the submissions, and “Interface and User Experience Design” from 13% to 8%, which is yet more evidence for the lack of research going into the creation of scholarly editions compared to several years ago. The most surprising drops for me were those in “Internet / World Wide Web” (12% to 8%) and “Social Media” (8.5% to 5%), which I would have guessed would be growing rather than shrinking.

The last thing I’ll cover in this post is the author-chosen keywords. While authors needed to tag their submissions from a list of 95 controlled vocabulary words, they were also encouraged to tag their entries with keywords they could choose themselves. In all they chose nearly 1,700 keywords to describe their 589 submissions. In last year’s analysis of these keywords, I showed that visualization seemed to be the glue that held the DH world together; whether discussing TEI, history, network analysis, or archiving, all the disparate communities seemed to share visualization as a primary method. The 2014 keyword map (Figure 3) reveals the same trend: visualization is squarely in the middle. In this graph, two keywords are linked if they appear together on the same submission, thus creating a network of keywords as they co-occur with one another. Words appear bigger when they span communities.

Figure 3. Co-occurrence of DH2014 author-submitted keywords.
Figure 3. Co-occurrence of DH2014 author-submitted keywords.

Despite the multilingual conference, the large component of the graph is still English. We can see some fairly predictable patterns: TEI is coupled quite closely with XML; collaboration is another keyword that binds the community together, as is (obviously) “Digital Humanities.” Linguistic and literature are tightly coupled, much moreso than, say, linguistic and history. It appears the distant reading of poetry is becoming popular, which I’d guess is a relatively new phenomena, although I haven’t gone back and checked.

This work has been supported by an ACH microgrant to analyze DH conferences and the trends of DH through them, so keep an eye out for more of these posts forthcoming that look through the last 15 years. Though I usually share all my data, I’ll be keeping these to myself, as the submitters to the conference did so under an expectation of privacy if their proposals were not accepted.

[edit: there was some interest on twitter last night for a raw frequency of keywords. Because keywords are author-chosen and I’m trying to maintain some privacy on the data, I’m only going to list those keywords used at least twice. Here you go (Figure 4)!]

Figure 4. Keywords used in DH2014 submissions ordered by frequency.
Figure 4. Keywords used in DH2014 submissions ordered by frequency.
Categories
method

Networks Demystified 5: Communities, PageRank, and Sampling Caveats

The fifth and sixth (coming soon…) installment of Networks Demystified will be a bit more applied than the previous bunch (1 network basics, 2 degree, 3 power laws, 4 co-citation analysis). Like many of my recent posts, this one is in response to a Twitter conversation:

If you follow a lot of people on Twitter (Michael follows over a thousand), getting a grasp of them all and organizing them can be tough. Luckily network analysis can greatly ease the task of organizing twitter follows, and this and next post will teach you how to do that using NodeXL, a plugin for Microsoft Excel that (unfortunately) only works on Windows. It’s super easy, though, so if you have access to a Windows machine with Office installed, it’s worth trying it out despite the platform limitations.

This installment will explain the concept of modularity for group detection in networks, as well as why certain metrics like centrality should be avoided when using certain kinds of datasets. I’m going to be as gentle as I can be on the math, so this tutorial is probably best-suited for those just learning network techniques, but will fall short for those hoping for more detailed or specific information.

Next installment, Networks Demystified 6, will include the actual step-by-step instructions of how to run these analyses using NodeXL. I’m posting the description first, because I strongly believe you should learn the concepts before applying the techniques. At least that’s the theory: actually I’m posting this first because Twitter is rate-limiting the download of my follower/followee network, and I’m impatient and want to post this right away.

Modularity / Community Detection

Modularity is a technique for finding which groups of nodes in a network are more similar to each other than to other groups; it lets you spot communities.

It is unfortunate (for me) that modularity is one of the more popular forms of community detection, because it also happens to be one of the methods more difficult to explain without lots of strange symbols, which I’m trying to avoid. First off, the modularity technique is not one simple algorithm, as much as it is a conceptual framework for thinking about communities in networks. There modularity you run in Gephi is different than modularity in NodeXL, because there’s more than one way to write the concept into an algorithm, and they’re not all exactly the same.

Randomness

But to describe modularity itself, let’s take a brief detour through random-network lane. Randomization is a popular tool among network scientists, statisticians, and late 20th century avant-garde music composers for a variety of reasons. Suppose you’re having a high-stakes coin-flip contest with your friend, who winds up beating you 68/32. Before you run away crying that your friend cheated, because a fair coin should always land 50/50, remember that the universe is a random place. The 68/32 score could’ve appeared by chance alone, so you write up a quick computer program to flip a thousand coins a hundred times each, and if in those thousand computational coin-flip experiments, a decent amount come up around 68/32, you can reasonably assume your friend didn’t cheat.

The use of a simulated random result to see if what you’ve noticed is surprising (or, sometimes, significant) is quite common. I used it on the Irregular when reviewing Matthew Jockers’ Macroanalysis, shown in the graphic halfway down the page and reproduced here. I asked, in an extremely simplistic way, whether the trends Jockers saw over time were plausible by creating four dummy universes where randomness ruled, to see if his results could be attributable to chance alone. By comparing his data to my fake data, I concluded that some of his results were probably very accurate, and some of them might have just been chance.

This example chart compares a potential "real" underlying publication rate against several simulated potential sample datasets Jockers might have, created by multiplying the "real" dataset by some random number between 0 and 1.
This example chart compares a potential “real” underlying publication rate against several simulated potential sample datasets Jockers might have, created by multiplying the “real” dataset by some random number between 0 and 1.

Network analysts use the same sort of technique all the time. Do you want to know if it’s surprising that some actress is only six degrees away from Kevin Bacon (or anybody else on the network)? Generate a bunch of random networks with the same amount of nodes (actors) and edges (connections between them if they star in a movie together), and see if, in most cases, you can get from any one actor to any other in only six hops. Odds are you could; that’s just how random networks work.

What’s surprising is that in these, as well as most other social networks, people tend to be much more tightly clustered together than expected from a random network. They form little groups and cliques. It is significantly unlikely that in such cliquish networks, where the same groups of actors tend to appear with each other constantly, that everyone would still be only six degrees away from one another. It’s commonly known that social networks organize in what are called small-worlds, where people tend to be much more closely connected to one another than one would expect when they’re in such tight cliques. This is the power of random networks: they help pick out the unusual.

Modularity Explained

Which brings us back to modularity. With some careful thinking, one would come up with a quick solutions to figuring out how to find communities in networks: find clusters of nodes that have more internal edges between them than external edges to other groups.

What a network community should look like. [via]
What network communities should look like. [via]
There’s a lurking problem with this idea, though. If you were just counting the number of in-group connections vs. out-group connections, you could come up with an optimal solution very quickly if you say the entire network is one community: voila! no outgoing connections, and lots of internal connections. If instead you say in advance that you want two communities, or you only want communities of a certain size, it mitigates the problem somewhat, but then you’re stuck with needing to set the number of communities beforehand, which is a difficult constraint if you’re not sure what that number should be.

 

The key is randomness. You want to find communities of nodes for which there are more internal links than you would expect given that the graph was random, and fewer external links than you would expect given the graph was random. Mark Newman defines modularity as: “the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random.”

Modularity is thus a network-level measurement, and it can change based on what communities you choose in your network. For example, in the figure above, most of the edges in the network are within the Freakish Grey Blobs (hereafter FGBs), and within the FGBs the edges are very dense. In that case, we would expect the modularity to be quite high. However, imagine we drew the FGBs around different nodes in the network instead: if we made four FGBs instead of three, splitting the left group into two, we’d find that a larger fraction of the edges are falling outside of groups, thus decreasing the overall network’s modularity score.

Similarly, let’s say we made two FGBs instead of three. We merge the two groups in the right into one supergroup (group 1), and leave the group on the left (group 1) the same. What would happen to the modularity? In that case, because group 2 is now less dense (defining density as the number of edges within the group compared to the total possible number of edges within it), and we’d expect a random network to look a bit more similar, so the overall network’s modularity score would (again) decrease slightly.

That’s modularity in a nutshell. The method of finding the appropriate groupings in a network varies, but essentially, all the algorithms keep drawing FGBs around different groups of nodes until the overall modularity score of the network is as high as possible. Find the right configuration of FGBs such that the modularity score is very high, and then label the nodes in each separate FGB as their own community. In the figure above, there are three communities, and your favorite network analysis software will label them as such.

Some metrics to avoid (with caveats)

There’s a stubbornly persistent desire, when analyzing a tasty new network dataset, to just run every algorithm in the box and see what comes up. PageRank and centrality? Sure! Clustering? Sounds great! Unfortunately, each algorithm makes certain underlying assumptions about the data, and our twitter network breaks many of those assumptions.

The most important worth mentioning is that we’ve already sinned. Remember how we plan on calculating modularity, and remember how I defined it earlier? Nothing was mentioned about whether or not the edges were directed. Asymmetrical edges (like asymmetries between follower and followee) are not understood by the modularity algorithm we described, which assumes there would be no difference between a follower, a followee, or a reciprocal connection of both. Running modularity on a directed network is, in general, a bad idea: in most networks, the direction of an edge is very important for determining community involvement. We can safely ignore this issue here, as we’re dealing with the fairly low-stakes problem of letting the computer help us organize our twitter network, but in publications or higher-stakes circumstances, this would be something to avoid without thinking through the implications very carefully.

A network metric that might seem more appropriate to the forthcoming twitter dataset, PageRank, is similarly inadequate without a few key changes. As I haven’t demystified PageRank yet, here’s a short description, with the promise to expand on it later.

PageRank is Google’s algorithm for ranking websites in their search results, and it’s inspired by citation analysis, but it turns out to be useful in various other circumstances. There are two ways to explain the algorithm, both equally accurate. The first has to do with probability: what is the probability that, if someone just starts clicking links on the web at random, they’ll eventually land on your website. The higher the chance that someone clicking links at random will reach your site, the higher your PageRank.

PageRank’s other definition makes a bit more ‘on-the-ground’ sense; given a large, directed network (like websites linking to other websites), those sites that are very popular can determine another site’s score by whether or not they link to it. Say a really famous website, like BBC, links to your site; you get lots of points. If Sam’s New England Crab Shack & Duck Farm links to your site, however, you won’t get many points. Seemingly paradoxically, the more points your website has, the more points you can give to sites that you link to. Sites that get linked to a lot are considered reputable, and in turn they link to other sites and pass that reputation along. But, the clever bit is that your site can only pass a fraction of its reputation along based on how many other sites it links to, thus if your site only links to the Scottbot Irregular, the Irregular will get lots of points from it, but if it links to ten sites including the Irregular, my site would only get a tenth of the potential points.

How PageRank works(-ish).  Those sites which have more points in turn confer more points to others. [via]
How PageRank works(-ish). Those sites which have more points in turn confer more points to others. [via]
This generalizes pretty easily to all sorts of networks including, as it happens, twitter follow networks. Those who are followed by lots of people are scored highly; if one of those highly scoring individuals follows only a select few, that select few will also receive a significant increase in rank. When a user is followed by many other users with very high scores, that user is scored the highest of them all. PageRank, then, is a neat way of looking at who has the power in a twitter network. Those at the top are those who even the relatively popular find interesting and worth following.

 

Which brings us to this, the network we’re creating to organize our twitter neighborhood. The network type is right: a directed, unweighted network. The algorithm will work fine. It will tell you, for example, that you are (or are nearly) the most popular person in your twitter neighborhood. And why wouldn’t it? Most of the people in your neighborhood follow you, or follow people who follow you, so the math is inevitable.

And the problem is obvious. Your sampling strategy (the criteria you used to gather your data) inherently biases this particular network metric, and most other metrics within the same family. You’ve used what’s called snowball sampling, so-named because your sample snowballs into a huge network in relatively short order, starting from a single person: you. It’s you, then those you follow, then those they follow, and so forth. You are inevitably at the center of your snowball, and the various network centrality measurements will react accordingly.

Well, you might ask, what if you just ignore yourself when looking at the network? Nope. Because PageRank (among other algorithms) takes everyone’s score into account when calculating others’ scores; even if you close your eyes whenever your name pops up, your presence will still exert an invisible influence on the network. In the case of PageRank, because your score is so high, you’ll be conferring a much higher score to (potentially) otherwise unpopular people you happen to follow.

The short-term solution is to remove yourself from the network before you run any of your analyses. This actually still isn’t perfect, for reasons I don’t feel like getting into because the post is already too long, but it will give at least a better idea of PageRank centrality within your twitter neighborhood.

While you’re at it, you should also remove yourself before running community detection. As you might be the connection that bridges two otherwise disconnected communities together, and for the purpose of this study you’re trying to organize people separate from your own influence on them, running modularity on the network without you in it will likely give you a better sense of your neighborhood.

Continuing

Stay-tuned for the next exciting installment of Networks Demystified, wherein I’ll give step-by-step instructions on how to actually do the things I’ve described using NodeXL. If you want a head-start, go ahead and download and start playing with it.

Categories
miscellanea

Improving the Journal of Digital Humanities

Twitter and the digital humanities blogosphere has been abuzz recently over an ill-fated special issue of the Journal of Digital Humanities (JDH) on Postcolonial Digital Humanities. I won’t get too much into what happened and why, not because I don’t think it’s important, but because I respect both parties too much and feel I am too close to the story to provide an unbiased opinion. Summarizing, the guest editors felt they were treated poorly, in part because of the nature of their content, and in part because of the way the JDH handles its publications.

I wrote earlier on twitter that I no longer want to be involved in the conversation, by which I meant, I no longer want to be involved in the conversation about what happened and why. I do want to be involved in a discussion on how to get the JDH move beyond the issues of bias, poor communication, poor planning, and microaggression, whether or not any or all of those existed in this most recent issue. As James O’Sullivan wrote in a comment, “as long as there is doubt, this will be an unfortunate consequence.”

Journal of Digital Humanities
Journal of Digital Humanities

The JDH is an interesting publication, operating in part under the catch-the-good model of seeing what’s already out there and getting discussed, and aggregating it all into a quarterly journal. In some cases, that means re-purposing pre-existing videos and blog posts and social media conversations into journal “articles.” In others, it means soliciting original reviews or works that fit with the theme of a current important issue in DH. Some articles are reviewed barely at all – especially the videos – and some are heavily reviewed. The structure of the journal itself, over its five issues thus-far, has changed drastically to fit the topic and the experimental whims of editors and guest editors.

The issue that Elijah Meeks and I guest edited changed in format at least three times in the month or so we had to solidify the issue. It’s fast-paced, not always organized, and generally churns out good scholarship that seems to be cited heavily on blogs and in DH syllabi, but not yet so much in traditional press articles or books. The flexibility, I think, is part of its charm and experimental nature, but as this recent set of problems shows, it is not without its major downsides. The editors, guest editors, and invited authors are rarely certain of what the end product will look like, and if there is the slightest miscommunication, this uncertainty can lead to disaster. The variable nature of the editing process also opens the door for bias of various sorts, and because there is not a clear plan from the beginning, that bias (and the fear of bias) is hard to guard against. These are issues that need to be solved.

Roopika RisamMatt Burton, and I, among others, have all weighed in on the best way to move forward, and I’m drawing on these previous comments for this plan. It’s not without its holes and problems, and I am hoping there will be comments to improve the proposed process, but hopefully something like what I’m about to propose can let the JDH retain its flexibility while preventing further controversies of this particular variety.

  • Create a definitive set of guidelines and mission statement that is distributed to guest editors and authors before the process of publication begins. These guidelines do not need to set the publication process in stone, but can elucidate the roles of each individual and make clear the experimental nature of the JDH. This document cannot be deviated from within an issue publication cycle, but can be amended yearly. Perhaps, as with the open intent of the journal, part of this process can be crowdsourced from the previous year’s editors-at-large of DHNow.
  • Have a week at the beginning of each issue planning phase where authors (if they’ve been chosen yet), guest editors, and editors discuss what particular format the forthcoming issue will take, how it will be reviewed, and so forth. This is formalized into a binding document and will not be changed. The editorial staff has final say, but if the guest editors or authors do not like the final document, they have ample opportunity to leave.
  • Change the publication rate from quarterly to thrice-yearly. DH changes quickly, it shouldn’t be any slower than that, but quarterly seems to be a bit too tight for this process to work smoothly–especially with the proposed week-long committee session to figure out how the issue be run.
  • Make the process of picking special issue topics more open. I know the special issue I worked on came about by Elijah asking the JDH editors if they’d be interested in a topic modeling issue, and after (I imagine) some internal discussion, they agreed. The dhpoco special issue may have had a similar history. Even a public statement of “these people came to us, and this is why we thought the topic was relevant” would likely go a long way in fostering trust in the community.
  • Make the process of picking articles and authors more open; this might be the job of special issue guest editors, as Elijah and I were the ones who picked most of the content. Everyone has their part to play. What’s clear is there is a lot of confusion right now about how it works; some on Twitter recently have pointed out that, until recently, they’d assumed all articles came from the DHNow filter. Making content choice more clear in an introductory editorial would be useful.

Obviously this is not a cure for all ills, but hopefully it’s good ground to start on the path forward. If the JDH takes this opportunity to reform some of their policies, my hope is that it will be seen as an olive branch to the community, ensuring to the best of their ability that there will be no question of whether bias is taking place, implicit or otherwise. Further suggestions in the comments are welcome.

Addendum: In private communication with Matt Burton, he and I realized that the ‘special issue’ and ‘guest editor’ role is not actually one that seems to be aligned with the initial intent of the JDH, which seemed instead to be about reflecting the DH discourse from the previous quarter. Perhaps a movement away from special issues, or having a separate associated entity for special issues with its own set of rules, would be another potential path forward.

Categories
method personal research

The Historian’s Macroscope

Whelp, it appears the cat’s out of the bag. Shawn Graham, Ian Milligan, and I have signed our ICP contract and will shortly begin the process of writing The Historian’s Macroscope, a book introducing the process and rationale of digital history to a broad audience. The book will be a further experiment in live-writing: as we have drafts of the text, they will go online immediately for comments and feedback. The publishers have graciously agreed to allow us to keep the live-written portion online after the book goes on sale, and though what remains online will not be the final copy-edited and typeset version, we (both authors and publishers) feel this is a good compromise to prevent the cannibalization of book sales while still keeping much of the content open and available for those who cannot afford the book or are looking for a taste before they purchase it. Thankfully, this plan also fits well with my various pledges to help make a more open scholarly world.

Microscope / Telescope / Macroscope [via The Macroscope by Joël de Rosnay]
Microscope / Telescope / Macroscope [via The Macroscope by Joël de Rosnay]
We’re announcing the project several months earlier than we’d initially intended. In light of the American Historical Association’s recent statement endorsing the six year embargo of dissertations on the unsupported claim that it will help career development, we wanted to share our own story to offset the AHA’s narrative. Shawn, Ian, and I have already worked together on a successful open access chapter in The Programming Historian, and have all worked separately releasing public material on our respective blogs. It was largely because of our open material that we were approached to write this book, and indeed much of the material we’ve already posted online will be integrated into the final publication. It would be an understatement to say our publisher’s liaison Alice jumped at this opportunity to experiment with a semi-open publication.

The disadvantage to announcing so early is that we don’t have any content to tease you with. Stay-tuned, though. By September, we hope to have some preliminary content up, and we’d love to read your thoughts and comments; especially from those not already aligned with the DH world.

Categories
personal research

Acceptances to Digital Humanities 2013 (part 1)

The 2013 Digital Humanities conference in Nebraska just released its program with a list of papers and participants. As some readers may recall, when the initial round of reviews went out for the conference, I tried my hand at analyzing submissions to DH2013. Now that the schedule has been released, the data available puts us in a unique position to compare proposed against accepted submissions, thus potentially revealing how what research is being done compares with what research the DH community (through reviews) finds good or interesting. In my last post, I showed that literary studies and data/text mining submissions were at the top of the list; only half as many studies were historical rather than literary. Archive work and visualizations were also near the top of the list, above multimedia, web, and content analyses, though each of those were high as well.

A keyword analysis showed that while Visualization wasn’t necessarily at the top of the list, it was the most central concept connecting the rest of the conference together. Nobody knows (and few care) what DH really means; however, these analyses present the factors that bind together those who call themselves digital humanists and submit to its main conference. The post below explores to what extent submissions and acceptances align. I preserve anonymity wherever possible, as submitting authors did not do so with the expectation that turned down submission data would be public.

It’s worth starting out with a few basic acceptance summary statistics. As I don’t have access to poster data yet, nor do I have access to withdrawals, I can’t calculate the full acceptance rate, but there are a few numbers worth mentioning. Just take all of the percentages as a lower bounds, where withdrawals or posters might make the acceptance rate higher. Of the 144 long papers submitted, 66.6% of them (96) were accepted, although only 57.6% (83) were accepted as long papers; another 13 were accepted as short papers instead. Half of the submitted panels were accepted, although curiously, one of the panels was accepted instead as a long paper. For short papers, only 55.9% of those submitted were accepted. There were 66 poster submissions, but I do not know how many of those were accepted, or how many other submissions were accepted as posters instead. In all, excluding posters, 60.9% of submitted proposals were accepted. More long papers than short papers were submitted, but roughly equal numbers of both were accepted. People who were turned down should feel comforted by the fact that they faced some stiff competition.

As with most quantitative analyses, the interesting bits come more when comparing internal data than when looking at everything in aggregate. The first three graphs do just that, and are in fact the same data, but ordered differently. When authors submitted their papers to the conference, they could pick any number of keywords from a controlled vocabulary. Looking at how many times each keyword was submitted with a paper (Figure 1) can give us a basic sense of what people are doing in the digital humanities. From Figure 1 we see (again, as a version of this viz appeared in the last post) that “Literary Studies” and “Text Mining” are the most popular keywords among those who submitted to DH2013; the rest you can see for yourself. The total height of the bar (red + yellow) represents the number of total submissions to the conference.

Acceptance rates of DH2013 by Keywords attached to submissions, sorted by number of submissions.
Figure 1: Acceptance rates of DH2013 by Keywords attached to submissions, sorted by number of submissions. (click to enlarge)

Figure 2 shows the same data as Figure 1, but sorted by acceptance rates rather than the total number of submissions. As before, because we don’t know about poster acceptance rates or withdrawals, you should take these data with a grain of salt, but assuming a fairly uniform withdrawal/poster rate, we can still make some basic observations. It’s also worth pointing out that the fewer overall submissions to the conference with a certain keyword, the less statistically meaningful the acceptance rate; with only one submission, whether or not it’s accepted could as much be due to chance as due to some trend in the minds of DH reviewers.

With those caveats in mind, Figure 2 can be explored. One thing that immediately pops out is that “Literary Studies” and “Text Mining” both have higher than average acceptance rates, suggesting that not only are a lot of DHers doing that kind of research; that kind of research is still interesting enough that a large portion of it is getting accepted, as well. Contrast this with the topic of “Visualization,” whose acceptance rate is closer to 40%, significantly fewer than the average acceptance rate of 60%. Perhaps this means that most reviewers thought visualizations worked better as posters, the data for which we do not have, or perhaps it means that the relatively low barrier to entry on visualizations and their ensuing proliferation make them more fun to do than interesting to read or review.

“Digitisation – Theory and Practice” has a nearly 60% acceptance rate, yet “Digitisation; Resource Creation; and Discovery” has around 40%, suggesting that perhaps reviewers are more interested in discussions about digitisation than the actual projects themselves, even though far more “Digitisation; Resource Creation; and Discovery” papers were submitted than “”Digitisation – Theory and Practice.” The imbalance between what was submitted and what was accepted on that front is particularly telling, and worth a more in-depth exploration by those who are closer to the subject. Also tucked at the bottom of the acceptance rate list are three related keywords “Digital Humanities – Institutional Support, “Digital Humanities – Facilities,” & “Glam: Galleries; Libraries; Archives; Museums,” each with a 25% acceptance rate. It’s clear the reviewers were not nearly as interested in digital humanities infrastructure as they were in digital humanities research. As I’ve noted a few times before, “Historical Studies” is also not well-represented, with both a lower acceptance rate than average and a lower submission rate than average. Modern digital humanities, at least as it is represented by this conference, appears far more literary than historical.

Figure 2. Acceptance rates of DH2013 by Keywords attached to submissions, sorted by number of accepted papers.
Figure 2. Acceptance rates of DH2013 by Keywords attached to submissions, sorted by number of accepted papers. (click to enlarge)

Figure 3, once again, has the same data as Figures 2 and 1, but is this time sorted simply by accepted papers and panels. This is the front face of DH2013; the landscape of the conference (and by proxy the discipline) as seen by those attending. While this reorientation of the graph doesn’t show us much we haven’t already seen, it does emphasize the oddly low acceptance rates of infrastructural submissions (facilities, libraries, museums, institutions, etc.) While visualization acceptance rates were a bit low, attendees of the conference will still see a great number of them, because the initial submission rate was so high. Conference goers will see that DH maintains a heavy focus on the many aspects of text: its analysis, its preservation, its interfaces, and so forth. The web also appears well-represented, both in the study of it and development on it. Metadata is perhaps not as strong a focus as it once was (historical DH conference analysis would help in confirming this speculation on my part), and reflexivity, while high (nearly 20 “Digital Humanities – Nature and Significance” submissions), is far from overwhelming.

A few dozen papers will be presented on multimedia beyond simple text – a small but not insignificant subgroup. Fewer still are papers on maps, stylometry, or medieval studies, three subgroups I imagine once had greater representation. They currently each show about the same force as gender studies, which had a surprisingly high acceptance rate of 85% and is likely up-and-coming in the DH world. Pedagogy was much better represented in submissions than acceptances, and a newcomer to the field coming to the conference for the first time would be forgiven in thinking pedagogy was less of an important subject in DH than veterans might think it is.

Figure 3. Acceptance rates of DH2013 by Keywords attached to submissions, sorted by acceptance rate. (click to enlarge)
Figure 3. Acceptance rates of DH2013 by Keywords attached to submissions, sorted by acceptance rate. (click to enlarge)

As what’s written so far is already a giant wall of text, I’ll go ahead and leave it at this for now. When next I have some time I’ll start analyzing some networks of keywords and titles to find which keywords tend to be used together, and whatever other interesting things might pop up. Suggestions and requests, as always, are welcome.

 

Categories
reviews

Liveblogged Review of Macroanalysis by Matthew L. Jockers, Part 2

I just got Matthew L. Jocker’s Macroanalysis in the mail, and I’m excited enough about it to liveblog my review. Here’s the review of part II (Analysis), chapter 5 (metadata). Read Part 1, Part 3, …

Part II: Analysis

Part II of Macroanalysis moves from framing the discussion to presenting a series of case studies around a theme, starting fairly simply in claims and types of analyses and moving into the complex. This section takes up 130 of the 200 pages; in a discipline (or whatever DH is) which has coasted too long on claims that the proof of its utility will be in the pudding (eventually), it’s refreshing to see a book that is at least 65% pudding. That said, with so much substance – particularly with so much new substance – Jockers opens his arguments up for specific critiques.

Aiming for more pudding-based scholarly capital in DH. via brenthor.
Aiming for more pudding-based scholarly capital in DH. via brenthor.

Quantitative arguments must by their nature be particularly explicit, without the circuitous language humanists might use to sidestep critiques. Elijah Meeks and others have been arguing for some time now that the requirement to solidify an argument in such a way will ultimately be a benefit to the humanities, allowing faster iteration and improvement on theories. In that spirit, for this section, I offer my critiques of Jockers’ mathematical arguments not because I think they are poor quality, but because I think they are particularly good, and further fine-tuning can only improve them. The review will now proceed one chapter at a time.

Metadata

Jockers begins his analysis exploring what he calls the “lowest hanging fruit of literary history.” Low hanging fruit can be pretty amazing, as Ted Underwood says, and Jockers wields some fairly simple data in impressive ways. The aim of this chapter is to show that powerful insights can be achieved using long-existing collections of library metadata, using a collection of nearly 800 Irish American works over 250 years as a sample dataset for analysis. Jockers introduces and offsets his results against the work of Charles Fanning, whom he describes as the expert in Irish American fiction in aggregate. A pre-DH scholar, Fanning was limited to looking through only the books he had time to read; an impressive many, according to Jockers, but perhaps not enough. He profiles 300 works, fewer than half of those represented in Jockers’ database.

The first claim made in this chapter is one that argues against a primary assumption of Fanning’s. Fanning expends considerable effort explaining why there was a dearth of Irish American literature between 1900-1930; Jockers’ data show this dearth barely existed. Instead, the data suggest, it was only eastern Irish men who had stopped writing. The vacuum did not exist west of the Mississippi, among men or women. Five charts are shown as evidence, one of books published over time, and the other four breaking publication down by gender and location.

Jockers is careful many times to make the point that, with so few data, the results are suggestive rather than conclusive. This, to my mind, is too understated. For the majority of dates in question, the database holds fewer than 6 books per year. When breaking down by gender and location, that number is twice cut in half. Though the explanations of the effects in the graphs are plausible, the likelihood of noise outweighing signal at this granularity is a bit too high to be able to distinguish a just-so story from a credible explanation. Had the data been aggregated in five- or ten-year intervals (as they are in a later figure 5.6), rather than simply averaged across them, the results may have been more credible. The argument may be brought up that, when aggregating across larger intervals, the question of where to break up the data becomes important; however, cutting the data into yearly chunks from January to December is no more arbitrary than cutting them into decades.

There are at least two confounding factors one needs to take into account when doing a temporal analysis like this. The first is that what actually happened in history may be causally contingent, which is to say, there’s no particularly useful causal explanation or historical narrative for a trend. It’s just accidental; the right authors were in the right place at the right time, and all happened to publish books in the same year. Generally speaking, if only around five books are published a year, though sometimes that number is zero and sometimes than number is ten, any trends that we see (say, five years with only a book or two) may credibly be considered due to chance alone, rather than some underlying effect of gender or culture bias.

The second confound is the representativeness of the data sample to some underlying ground truth. Datasets are not necessarily representative of anything, however as defined by Jockers, his dataset ought to be representative of all Irish American literature within a 250 year timespan. That’s his gold standard. The dataset obviously does not represent all books published under this criteria, so the question is how well do his publication numbers match up with the actual numbers he’s interested in. Jockers is in a bit of luck here, because what he’s interested in is whether or not there was a resounding silence among Irish authors; thus, no matter what number his charts show, if they’re more than one or two, it’s enough to disprove Fanning’s hypothesized silence. Any dearth in his data may be accidental; any large publications numbers are not.

This example chart compares a potential "real" underlying publication rate against several simulated potential sample datasets Jockers might have, created by multiplying the "real" dataset by some random number between 0 and 1.
This example chart compares a potential “real” underlying publication rate against several simulated potential sample datasets Jockers might have, created by multiplying the “real” dataset by some random number between 0 and 1.

I created the above graphic to better explain the second confounding factor of problematic samples. The thick black line, we can pretend, is the actual number of books published by Irish American authors between 1900 and 1925. As mentioned, Jockers would only know about a subset of those books, so each of the four dotted lines represents a possible dataset that he could be looking at in his database instead of the real, underlying data. I created these four different dotted lines by just multiplying the underlying real data by a random number between 0 and 1 1. From this chart it should be clear that it would not be possible for him to report an influx of books when there was a dearth (for example, in 1910, no potential sample dataset would show more than two books published). However, if Jockers wanted to make any other claims besides whether or not there was a dearth (as he tentatively does later on), his available data may be entirely misleading. For example, looking at the red line, Run 4, would suggest that ever-more books were being published between 1910 and 1918, when in fact that number should have decreased rapidly after about 1912.

The correction included in Macroanalysis for this potential difficulty was to use 5-year moving averages for the numbers rather than just showing the raw counts. I would suggest that, because the actual numbers are so small and a change of a small handful of books would look like a huge shift on the graph, this method of aggregation is insufficient to represent the uncertainty of the data. Though his charts show moving averages, they still shows small changes year-by-year, which creates a false sense of precision. Jockers’ chart 5.6, which aggregates by decade and does not show these little changes, does a much better job reflecting the uncertainty. Had the data showed hundreds of books per year, the earlier visualizations would have been more justifiable, as small changes would have amounted to less emphasized shifts in the graph.

It’s worth spending extra time on choices of visual representation, because we have not collectively arrived at a good visual language for humanities data, uncertain as they often are. Nor do we have a set of standard practices in place, as quantitative scientists often do, to represent our data. That lack of standard practice is clear in Macroanalysis; the graphs all have subtitles but no titles, which makes immediate reading difficult. Similarly, axis labels (“count” or “5-year average”) are unclear, and should more accurately reflect the data (“books published per year”), putting the aggregation-level in either an axis subtitle or the legend. Some graphs have no axis labels at all (e.g., 5.12-5.17). Their meanings are clear enough to those who read the text, or those familiar with ngram-style analyses, but should be more clear at-a-glance.

Questions of visual representation and certainty aside, Jockers still provides several powerful observations and insights in this chapter. Figure 5.6, which shows Irish American fiction per capita, reveals that westerners published at a much higher relative rate than easterners, which is a trend worth explaining (and Jockers does) that would not have been visible without this sort of quantitative analysis. The chapter goes on to list many other credible assessments and claims in light of the available data, as well as a litany of potential further questions that might be explored with this sort of analysis.  He also makes the important point that, without quantitative analysis, “cherry-picking of evidence in support of a broad hypothesis seems inevitable in the close-reading scholarly traditions.” Jockers does not go so far as to point out the extension of that rule in data analysis; with so many visible correlations in a quantitative study, one could also cherry-pick those which support one’s hypothesis. That said, cherry-picking no longer seems inevitable. Jockers makes the point that Fanning’s dearth thesis was false because his study was anecdotal, an issue Jockers’ dataset did not suffer from. Quantitative evidence, he claims, is not in competition with evidence from close reading; both together will result in a “more accurate picture of our subject.”

The second half of the chapter moves from publication counting to word analysis. Jockers shows, for example, that eastern authors are less likely to use words in book titles that identify their work as ‘Irish’ than western authors, suggesting lower prejudicial pressures west of the Mississippi may be the cause. He then complexifies the analysis further, looking at “lexical diversity” across titles in any given year – that is, a year is more lexically diverse if the titles of books published that year are more unique and dissimilar from one another. Fanning suggests the years of the famine were marked by a lack of imagination in Irish literature; Jockers’ data supports this claim by showing those years had a lower lexical diversity among book titles. Without getting too much into the math, as this review of a single chapter has already gone on too long, it’s worth pointing out that both the number of titles and the average length of titles in a given year can affect the lexical diversity metric. Jockers points this out in a footnote, but there should have been a graph comparing number of titles per year, length per year, and lexical diversity, to let the readers decide whether the first two variables accounted for the third, or whether to trust the graph as evidence for Fanning’s lack-of-imagination thesis.

One of the particularly fantastic qualities about this sort of research is that readers can follow along at home, exploring on their own if they get some idea from what was brought up in the text. For example, Jockers shows that the word ‘century’ in British novel titles is popular leading up to and shortly after the turn of the nineteenth century. Oddly, in the larger corpus of literature (and it seems English language books in general), we can use bookworm.culturomics.org to see that, rather than losing steam around 1830, use of ‘century’ in most novel titles actually increases until about 1860, before dipping briefly. Moving past titles (and fiction in general) to full text search, google ngrams shows us a small dip around 1810 followed by continued growth of the word ‘century’ in the full text of published books. These different patterns are interesting particularly because they suggest there was something unique about the British novelists’ use of the word ‘century’ that is worth explaining. Oppose this with Jockers’ chart of the word ‘castle’ in British book titles, whose trends actually correspond quite well to the bookworm trend until the end of the chart, around 1830. [edit: Ben Schmidt points out in the comments that bookworm searches full text, not just metadata as I assumed, so this comparison is much less credible.]

Use of the word 'castle' in the metadata of books provided by OpenLibrary.org. Compare with figure 5.14. via bookworm.
Use of the word ‘castle’ in the metadata of books provided by OpenLibrary.org. Compare with figure 5.14. via bookworm.

Jockers closes the chapter suggesting that factors including gender, geography, and time help determine what authors write about. That this idea is trivial makes it no less powerful within the context of this book: the chapter is framed by the hypothesis that certain factors influence Irish American literature, and then uses quantitative, empirical evidence to support those claims. It was oddly satisfying reading such a straight-forward approach in the humanities. It’s possible, I suppose, to quibble over whether geography determines what’s written about or whether the sort of person who would write about certain things is also the sort of person more likely to go west, but there can be little doubt over the causal direction of the influence of gender. The idea also fits well with the current complex systems approach to understanding the world, which mathematically suggests that environmental and situational constraints (like gender and location) will steer the unfolding of events in one direction or another. It is not a reductionist environmental determinism so much as a set of probabilities, where certain environments or situations make certain outcomes more likely.

Stay tuned for Part the Third!

Notes:

  1. If this were a more serious study, I’d have multiplied by a more credible pseudo-random value keeping the dataset a bit closer to the source, but this example works fine for explanatory value
Categories
reviews

Liveblogged Review of Macroanalysis by Matthew L. Jockers, Part 1

I just got Matthew L. Jocker’s Macroanalysis in the mail, and I’m excited enough about it to liveblog my review. Here’s my review of part I (Foundation), all chapters. Read Part 2, Part 3, …

Macroanalysis: Digital Methods & Literary History is a book whose time has come. “Individual creativity,” Matthew L. Jockers writes, “is highly constrained, even determined, by factors outside of what we consider to be a writer’s conscious control.” Although Jockers’ book is a work of impressive creativity, it also fits squarely within a larger set of trends. The scents of ‘Digital Humanities’ (DH) and ‘Big Data’ are in the air, the funding-rich smells attracting predators from all corners, and Jockers’ book floats somewhere in the center of it all. As with many DH projects, Macroanalysis attempts the double goal of explaining a new method and exemplifying the type of insights that can be achieved via this method. Unlike many projects, Jockers succeeds masterfully at both. Macroanalysis introduces its readers to large scale quantitative methods for studying literary history, and through those methods explores the nature of creativity and influence in general and the place of Irish literature within its larger context in particular.

I’ve apparently gained a bit of a reputation for being overly critical, and it’s worth pointing out at the beginning of this review that this trend will continue for Macroanalysis. That said, I am most critical of the things I love the most, and readers who focus on any nits I might pick without reading the book themselves should keep in mind that the overall work is staggering in its quality, and if it does fall short in some small areas, it is offset by the many areas it pushes impressively forward.

Macroanalysis arrives on bookshelves eight years after Franco Moretti’s Graphs, Maps, and Trees (2005), and thirteen years after Moretti’s “Conjectures on World Literature” went to press in early 2000, where he coined the phrase “distant reading.” Moretti’s distant reading is a way of seeing literature en masse, of looking at text at the widest angle and reporting what structures and forms only become visible at this scale. Moretti’s early work paved the way, but as might be expected with monograph published the same year as the initial release of Google Books, lack of available data made it stronger in theory than in computational power.

From Moretti's Graphs, Maps, and Trees
From Moretti’s Graphs, Maps, and Trees

In 2010, Moretti and Jockers, the author of Macroanalysis, co-founded the Stanford Lit Lab for the quantitative and digital research of literature. The two have collaborated extensively,  and Jockers acknowledge’s Moretti’s influence on his monograph. That said, in his book, Jockers distances himself slightly from Moretti’s notion of distant reading, and it is not the first time he has done so. His choice of “analysis” over “reading” is an attempt to show that what his algorithms are doing at this large scale is very different from our normal interpretive process of reading; it is simply gathering and aggregating data, the output of which can eventually be read and interpreted instead of or in addition to the texts themselves. The term macroanalysis was inspired by the difference between macro- and microeconomics, and Jockers does a good job justifying the comparison. Given that Jockers came up with the comparison in 2005, one does wonder if he would have decided on different terminology after our recent financial meltdown and the ensuing large-scale distrust of macroeconomic methods. The quantitative study of history, cliometrics, also had its origins in economics and suffered its own fall from grace decades ago; quantitative history still hasn’t recovered.

Part I: Foundation

I don’t know whether the allusion was intended, but lovers of science fiction and quantitative cultural studies will enjoy the title of Part I: “Foundation.” It shares a name with a series of books by Isaac Asimov, centering around the ability to combine statistics and human-centric research to understand and predict people’s behaviors. Punny titles aside, the section provides the structural base of the monograph.

The story of Foundation in a nutshell. Via c0ders.
The story of Foundation in a nutshell. Via c0ders.

Much of the introductory chapters are provocative statements about the newness of the study at hand, and they are not unwarranted. Still, I can imagine that the regular detractors of technological optimism might argue their usual arguments in response to Jockers’ pronouncements of a ‘revolution.’ The second chapter, on Evidence, raises some particularly important (and timely) points that are sure to raise some hackles. “Close reading is not only impractical as a means of evidence gathering in the digital library, but big data render it totally inappropriate as a method of studying literary history.” Jockers hammers home this point again and again, that now that anecdotal evidence based on ‘representative’ texts is no longer the best means of understanding literature, there’s no reason it should still be considered the gold standard of evidentiary support.

Not coming from a background of literary history or criticism, I do wonder a bit about these notions of representativeness (a point also often brought up by Ted Underwood, Ben Schmidt, and Jockers himself). This is probably something lit-researchers worked out in the 70s, but it strikes me that the questions being asked of a few ‘exemplary, representative texts’ are very different than the ones that ought to be asked of whole corpora of texts. Further, ‘representative’ of what? As this book appears to be aimed not only at traditional literary scholars, it would have been beneficial for Jockers to untangle these myriad difficulties.

One point worth noting is that, although Jockers calls his book Macroanalysis, his approach calls for a mixed method, the combination of the macro/micro, distant/close. The book is very careful and precise in its claims that macroanalysis augments and opens new questions, rather than replaces. It is a combination of both approaches, one informing the other, that leads to new insights. “Today’s student of literature must be adept at reading and gathering evidence from individual texts and equally adept at accessing and mining digital-text repositories.” The balance struck here is impressive: to ignore macroanalysis as a superior source of evidence for many types of large questions would be criminal, but its adoption alone does not make for good research (further, either without the other would be poorly done). For example, macroanalysis can augment close reading approaches by contextualizing a text within its broad historical and cultural moment, showing a researcher precisely where their object of research fits in the larger picture.

Historians would do well to heed this advice, though they are not the target audience. Indeed, historians play a perplexing role in Jockers’ narrative; not because his description is untrue, but because it ought not be true. In describing the digital humanities, Jockers calls it an “ambiguous and amorphous amalgamation of literary formalists, new media theorists, tool builders, coders, and linguists.” What place historians? Jockers places their role earlier, tracing the wide-angle view to the Annales historians and their focus on longue durée history. If historian’s influence ends there, we are surely in a sad state; that light, along with those of cliometrics and quantitative history, shone brightest in the 1970s before a rapid decline. Unsworth recently attributed the decline to the fallout following Time on the cross (Fogel & Engerman, 1974), putting quantitative methods in history “out of business for decades.” The ghost of cliometrics still haunts historians to such an extent that the best research in that area, to this day, comes more from information scientists and applied mathematicians than from historians. Digital humanities may yet exorcise that ghost, but it has not happened yet, as evidenced in part by the glaring void in Jockers’ introductory remarks.

It is with this framing in mind that Jockers embarks on his largely computational and empirical study of influence and landscape in British and American literature.

Categories
personal research

Topic Modeling in the Humanities

The cat is out of the bag: The Journal of Digital Humanities (2:1), special issue on topic modeling, has been released. It’s a fairly apt phrase, because the process of editing the issue felt a bit like stuffing a cat in a bag. When Elijah Meeks approached the JDH editors about he and I guest editing an issue on topic modeling, I don’t think either of us quite realized exactly what that would entail. This post is not about the issue or its contents; Elijah and I already wrote that introduction, where we trace the history of topic modeling in the humanities and frame the articles in the issue. Instead, I’d like to take a short post waxing a bit more reflexive than is usual for this blog, discussing my first experience guest editing a journal and how it all came together. Elijah’s similar post can be found here.

We began with the idea that topic modeling’s relationship to the humanities was just now reaching an important historical moment. Discussions were fast-paced, interesting, and spread across a wide array of media. Better still, humanists were contributing to the understanding of a machine learning algorithm! If that isn’t exciting to you, then… well, you’re probably a normal, well-functioning human being. But we found it exciting, and we thought the JDH, with its catch-the-good post-publication model, would be the perfect place to bring it all together. We quickly realized the difficulty in in stuffing the DH/Topic Modeling cat into the JDH bag.

Firstly, there was just so much of it out there. Discussions meandered between twitter and blogs and conferences; no snapshot of the conversation could ever be fully inclusive. We threw around a bunch of ideas, including a 20-person Google+ Hangout Panel discussing the benefits and pitfalls of the approach, but most of our ideas proved fairly untenable. Help came from the editors of the JDH,  particularly Joan Fragaszy Troyano, who tirelessly worked with us and helped us to get everything organized and together, while allowing us the freedom to take the issue where we wanted it to go. She was able to help us set up something new to the journal, a space which would aggregate tweets and comments about the issue in the month following its release, which Elijah and I will then put together and release as a community appendix in May, hoping to capture some of the rich interchange on topic modeling.

From Graham & Milligan's review of MALLET.
Topic model from Graham & Milligan’s review of MALLET.

One particularly troublesome difficulty, which we never resolved to our liking, was one of gender and representation. It has been pointed out before that the JDH was not as diverse or gender-balanced as we might want it to be, despite most of its staff being women. The editors have pointed out that DH is unfortunately homogeneous, and have worked to increase representation in their issues. Even after realizing the homogeneity in our issue (only two of our initially selected contributors were women, and all were white), we were unable to find other authors who both fit within the theme of the issue and were interested in contributing. I’m certain we must have missed someone crucial, for which I humbly apologize, but I honestly don’t know the best way to remedy this situation. Others have spoken much more eloquently on the subject and have had much better ideas than I ever could. If we had more time and space in the issue, diversity is the one area I would hope to improve.

Once the contributors were selected, the process of getting everything perfect began. Some articles, like Goldstone’s and Underwood’s piece on topic modeling the PMLA, were complete enough that we were happy putting the piece up as-is. One of our contributors was a bit worried, due to the post-publication process and the lack of standard peer-review, that this was more akin to a vanity press than a scholarly publication. I disagree (and hopefully we convinced the contributor to disagree as well); the JDH has several layers of peer review, as the editors and DH community filter the best available pieces through increasingly fine steps, until the selected articles represent the best of what was recently and publicly released. The pieces then went through a rigorous review process from the editorial staff. The original and greatly expanded posts particularly went through several iterations over a matter of months so they would fit as well as possible, and be the best they could be. Because of this process, we actually fell a bit behind schedule, but the resulting quality made the delays worth it.

I cannot stress enough how supportive the JDH editorial staff has been in making this issue work, particularly Joan, who helped Elijah and I figure out what we were doing and nudged us when we needed to be nudged, which was more frequently than I like admitting. I hope you all like the issue as much we do, and will contribute to the conversation on twitter or in blogs. If you post anything about the issue, just share a link in a tweet and comment and we’ll be sure to include you in the appendix.

Happy modeling!

p.s. I am sad that my favorite line of my and Elijah’s editorial was edited, though it was for good reason. The end of the first paragraph now reads “Were a critic of digital humanities to dream up the worst stereotype of the field, he or she would likely create something very much like this, and then name a popular implementation of it after a hammer.” The line (written by Elijah) originally read “Were Stanley Fish [emphasis added] to dream up the worst stereotype of the field, he would likely create something very much like this, and then name a popular implementation of it after a hammer.” The new version is more understandable to a wider audience, but I know some of my readers will appreciate this one more.

Categories
miscellanea

Call for Computational Folkloristics Papers

What’s this? Two CFPs at the Irregular in quick succession? That’s right, first Marten Düring’s fabulous Historical Network Research cfp comes out, and it has been followed closely by a call for papers by the great and powerful Tim Tangherlini. Those of you who don’t know him, should. Tangherlini organized the wildly successful Networks and Network Analysis for the Humanities NEH Summer Workshop and followup conference, is the co-author on a wonderful piece on computational folkloristics, and is a great guy to boot. He also dances comfortably on the bleeding edge of computational humanities research. All of these should be reason enough to either submit to or wait in eager anticipation of Tim’s forthcoming special issue of the Journal of American Folklore, the CFP for which is bellow.

I should point out that the Journal of American Folklore is not Open Access. If this is something you care about (and you should), but you’re interested in submitting an article, consider emailing the editor of JAF and asking for the journal to join the admirable ranks of Open Folklore, a Bloomington-based initiative that hopes to increase access to folklore material of all varieties. The initiative is also part of the American Folklore Society, which is responsible for the above-mentioned Journal of American Folklore.

One of Tangherlini's many neat analytic analyses of folklore.
One of Tangherlini’s many neat analytic analyses of folklore. via ACM.

——————————————————————————————————–

Over the course of the past decade, a revolution has occurred in the materials available for the study of folklore. The scope of digital archives of traditional expressive forms has exploded, and the magnitude of machine-readable materials available for consideration has increased by many orders of magnitude. Many national archives have made significant efforts to make their archival resources machine-readable, while other smaller initiatives have focused on the digitization of archival resources related to smaller regions, a single collector, or a single genre. Simultaneously, the explosive growth in social media, web logs (blogs), and other Internet resources have made previously hard to access forms of traditional expressive culture accessible at a scale so large that it is hard to fathom. These developments, coupled to the development of algorithmic approaches to the analysis of large, unstructured data and new methods for the visualization of the relationships discovered by these algorithmic approaches—from mapping to 3-D embedding, from time-lines to navigable visualizations—offer folklorists new opportunities for the analysis of traditional expressive forms. We label approaches to the study of folklore that leverage the power of these algorithmic approaches “Computational Folkloristics” (Abello, Broadwell, Tangherlini 2012).

The Journal of American Folklore invites papers for consideration for inclusion in a special issue of the journal edited by Timothy Tangherlini that focuses on “Computational Folkloristics.” The goal of the special issue is to reveal how computational methods can augment the study of folklore, and propose methods that can extend the traditional reach of the discipline. To avoid confusion, we term those approaches “computational” that make use of algorithmic methods to assist in the interpretation of relationships or structures in the underlying data. Consequently, “Computational Folkloristics” is distinct from Digital Folklore in the application of computation to a digital representation of a corpus.

We are particularly interested in papers that focus on: the automatic discovery of narrative structure; challenges in Natural Language Processing (NLP) related to unlabeled, multilingual data including named entity detection and resolution; topic modeling and other methods that explore latent semantic aspects of a folklore corpus; the alignment of folklore data with external historical datasets such as census records; GIS applications and methods; network analysis methods for the study of, among other things, propagation, community detection and influence; rapid classification of unlabeled folklore data; search and discovery on and across folklore corpora; modeling of folklore processes; automatic labeling of performance phenomena in visual data; automatic classification of audio performances. Other novel approaches to the study of folklore that make use of algorithmic approaches will also be considered.

A significant challenge of this special issue is to address these issues in a manner that is directly relevant to the community of folklorists (as opposed to computer scientists). Articles should be written in such a way that the argument and methods are accessible and understandable for an audience expert in folklore but not expert in computer science or applied mathematics. To that end, we encourage team submissions that bridge the gap between these disciplines. If you are in doubt about whether your approach or your target domain is appropriate for consideration in this special issue, please email the issue editor, Timothy Tangherlini at tango@humnet.ucla.edu, using the subject line “Computational Folkloristics—query”. Deadline for all queries is April 1, 2013.

All papers must conform to the Journal of American Folklore’s style sheet for authors. The guidelines for article submission are as follows: Essay manuscripts should be no more than 10,000 words in length, including abstract, notes, and bibliography. The article must begin with a 50- to 75-word abstract that summarizes the essential points and findings of the article. Whenever possible, authors should submit two copies of their manuscripts by email attachment to the editor of the special issue at: tango@humnet.ucla.edu. The first copy should be sent in Microsoft Word or Rich Text Format (rtf) and should include the author’s name. Figures should not be included in this document, but “call outs” should be used to designate where figures should be placed (e.g., “<insert Figure 1 here>”). A list at the end of the article (placed after the bibliography) should detail the figures to be included, along with their captions. The second copy of the manuscript should be sent in Portable Document Format (pdf). This version should not include the author’s name or any references within the text that would identify the author to the manuscript reviewers. Passages that would identify the author can be marked in the following manner to indicate excised words: (****). Figures should be embedded in this version just as they would ideally be placed in the published text. Possible supplementary materials (e.g., additional photographs, sound files, video footage, etc.) that might accompany the article in its online version should be described in a cover letter addressed to the editor. An advisory board for the special issue consisting of folklorists and computer scientists will initially consider all papers. Once accepted for the special issue, all articles will be subject to the standard refereeing procedure for the journal. Deadline for submissions for consideration is June 15, 2013. Initial decisions will be made by August 1, 2013. Final decisions will be made by October 1, 2013. We expect the issue to appear in 2014.

Categories
miscellanea

CfP: “Historical Network Research” at Sunbelt, May 21-26, Germany

Marten Düring, an altogether wonderful researcher who is responsible for this brilliant bibliography of networks in history, has issues a call for papers to participate in this year’s Sunbelt Conference, which is one of the premier social network analysis conferences in the world.

Historical network. via.
Historical network. via Marten Düring.

————————-

Call for papers “Historical Network Research” at the XXXIII. Sunbelt Conference, May 21-26 – University of Hamburg, Germany

 

The concepts and methods of social network analysis in historical research are recently being used not only as a mere metaphor but are increasingly applied in practice. In the last decades several studies in the social sciences proved that formal methods derived from social network analysis can be fruitfully applied to selected bodies of historical data as well. These studies however tend to be strongly influenced by concerns, standards of data processing, and, above all, epistemological paradigms that have their roots in the social sciences. Among historians, the term network has been used in a metaphorical sense alone for a long time. It was only recently that this has changed.
We invite papers which successfully integrate social network analysis methods and historical research methods and reflect on the added value of their methodologies. Topics could cover (but are not limited to) network analyses of correspondences, social movements, kinship or economic systems in any historical period.
Submission will be closing on December 31 at 11:59:59 EST. Please limit your abstract to 250 words. Please submit your abstract here: http://www.abstractserver.com/sunbelt2013/absmgm/
and select “Historical Network Research” as session title in the drop down box on the submission site. Please put a note in the “additional notes” box on the abstract submission form that states Marten During and Martin Stark as the session organizers.
For further information on the venue and conference registration see: http://hamburg-sunbelt2013.org/, for any questions regarding the panel, please get in touch with the session organizers.

Session organizers:
Marten During, Radboud University Nijmegen, martenduering@gmail.com
Martin Stark, University of Hamburg, martin.stark@wiso.uni-hamburg.de

Check https://sites.google.com/site/historicalnetworkresearch/ for a detailed bibliography, conferences, screencasts and other resources.

css.php