This post’s about the topical coverage of DH2015 in Australia. If you’re curious about how the landscape compares to previous years, see this post. You’ll see a lot of text, literature, and visualizations this year, as well as archives and digitisation projects. You won’t see a lot of presentations in other languages, or presentations focused on non-text sources. Gender studies is pretty much nonexistent. If you want to get accepted, submit pieces about visualization, text/data, literature, or archives. If you want to get rejected, submit pieces about pedagogy, games, knowledge representation, anthropology, or cultural studies.
I’m sorry. This post is going to contain a lot of giant pictures, because I’m in the mountains of Australia and I’d much rather see beautiful vistas than create interactive visualizations in d3. Deal with it, dweebs. You’re just going to have to do a lot of scrolling down to see the next batch of text.
This year’s conference presents a mostly-unsurprising continuation of the status quo (see 2014’s and 2013’s topical landscapes). Figure 1, below, shows the top author-chosen topic words of DH2015, as a proportion of the total presentations at the conference. For example, an impressive quarter, 24%, of presentations at DH2015 are about “text analysis”. The authors were able to choose multiple topics for each presentation, which is why the percentages add up to way more than 100%.
Scroll down for the rest of the post.
Text analysis, visualization, literary studies, data mining, and archives take top billing. History’s a bit lower, but at least there’s more history than the abysmal showing at DH2013. Only a tenth of DH2015 presentations are about DH itself, which is maybe impressive given how much we talk about ourselves? (cf. this post)
As usual, gender studies representation is quite low (1%), as are foreign language presentations and presentations not centered around text. I won’t do a lot of interpretation this post, because it’d mostly be a repeat of earlier years. At any rate, acceptance rate is a bit more interesting than coverage this time around. Figure 2 shows acceptance rates of each topic, ordered by volume. Figure 3 shows the same, sorted by acceptance rate.
The topics that appear most frequently at the conference are on the far left, and the red line shows the percent of submitted articles that will be presented at DH2015. The horizontal black line is the overall acceptance rate to the conference, 72%, just to show which topics are above or below average.
Notice that all the most well-represented topics at DH2015 have a higher-than-average acceptance rate, possibly suggesting a bit of path-dependence on the part of peer reviewers or editors. Alternatively, it could mean that, since a majority of peer reviewers were also authors in the conference, and since (as I’ve shown) the majority of authors lean toward text, lit, and visualization, that’s also what they’re likely to rate highly in peer review.
The first dips we see under the average acceptance rate are “Interdisciplinary Studies” and “Historical Studies” (☹), but the dips aren’t all that low, and we ought not to read too much into them without comparing them to earlier conferences. More significant are the low rates for “Cultural Studies”, and lower still are the two categories on Teaching, Pedagogy, and Curriculum. Both categories’ acceptance rates are about 20% under the average, and although they’re obviously correlated with one another, the acceptance rates are similar to 2014 and 2013. In short, DH peer reviewers or editors are less likely to accept submissions on pedagogy than on most other topics, even though they sometimes represent a decent chunk of submissions.
Other low points worth pointing out are “Anthropology” (huh, no ideas there), “Games and Meaningful Play” (that one came as a surprise), and “Other” (can’t help you here). Beyond that, the submission counts are too low to read any meaningful interpretations into the data. The Game Studies dip is curious, and isn’t reflected in earlier conferences, so it could just be noise for 2015. The low acceptance rates in Anthropology are consistent 2013-2015, and it’d be worth looking more into that.
Topical Co-Occurrence, 2013-2015
Figure 4, below, shows how topics appear together on submissions to DH2013, DH2014, and DH2015. Technically this has nothing to do with acceptances, and little to do with this year specifically, but the visualization should provide a little context to the above analysis. Topics connect to one another if they appear on a submission together, and the line connecting them gets thicker the more connections two topics share.
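For anyone who wants to build this sort of network themselves, here’s a minimal Python sketch of how the edge weights could be counted. It is not the code behind Figure 4; the function name and the toy submissions are made up for illustration, and I’m assuming each submission is just a list of topic tags.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(submissions):
    """Count how often each unordered pair of topics shares a submission."""
    weights = Counter()
    for topics in submissions:
        # Sort and de-duplicate so ("A", "B") and ("B", "A") land on the same edge.
        for pair in combinations(sorted(set(topics)), 2):
            weights[pair] += 1
    return weights


# Hypothetical toy data, NOT the real DH2013-2015 submissions.
toy_submissions = [
    ["Visualization", "Text Analysis", "Archives"],
    ["Visualization", "Text Analysis"],
    ["Archives", "Digitisation"],
]

edge_weights = cooccurrence_weights(toy_submissions)
for (topic_a, topic_b), weight in edge_weights.most_common():
    print(f"{topic_a} -- {topic_b}: {weight}")
```

The count for each pair is what would drive line thickness in the drawing; any network tool that accepts a weighted edge list could take it from there.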
Although the “Interdisciplinary Collaboration” topic has a low acceptance rate, it understandably ties the network together; other topics that play a similar role are “Visualization”, “Programming”, “Content Analysis”, “Archives”, and “Digitisation”. All unsurprising for a conference where people come together around method and material. In fact, this reinforces our “DH identity” along those lines, at least insofar as it is represented by the annual ADHO conference.
There’s a lot to unpack in this visualization, and I may go into more detail in the next post. For now, I’ve got a date with the Blue Mountains west of Sydney.
[Update!] Melissa Terras pointed out I probably made a mistake on 2015 long paper -> short paper numbers. I checked, and she was right. I’ve updated the figures accordingly.
Part 1 is about sheer numbers of acceptances to DH2015 and comparisons with previous years. DH is still growing, but the conference locale likely prohibited a larger conference this year than last. Acceptance rates are higher this year than previous years. Long papers still reign supreme. Papers with more authors are more likely to be accepted.
It’s that time of the year again, when all the good little boys, girls, and other genders of DH gather around the scottbot irregular in pointless meta-analysis quiet self-reflection. As most of you know, the 2015 Digital Humanities conference occurs next week in Sydney, Australia. They’ve just released the final program, full of pretty exciting work, which means I can compare it to my analysis of submissions to DH2015 (1, 2, & 3) to see how DH is changing, how work gets accepted or rejected, etc. This is part of my series on analyzing DH conferences.
Part 1 will focus on basic counts, just looking at percentages of acceptance and rejection by the type of presentation, and comparing it with previous years. Later posts will cover topical, gender, geography, and kangaroos. NOTE: When I say “acceptances”, I really mean “presentations that appear on the final program.” More presentations were likely accepted and withdrawn due to the expense of traveling to Australia, so take these numbers with appropriate levels of skepticism.1
Around 270 papers, posters, and workshops are featured in this year’s conference program, down from last year’s ≈350 but up from DH2013’s ≈240. Although this is the first conference since 2010 with fewer presentations than the previous year’s, I suspect this is due largely to geographic and monetary barriers, and we’ll see a massive uptick next year in Poland and the following in (probably) North America. Whether or not the trend will continue to increase in 2018’s Antarctic locale, or 2019’s special Lunar venue, has yet to be seen.
As you can see from the chart above, even given this year’s dip, both DH2015 and the annual DHSI event in Victoria reveal DH is still on the rise. It’s also worth noting that last year’s DHSI was likely the first where more people attended it than the international ADHO conference.
A full 72% of submissions to DH2015 will be presented in Sydney next week. That’s significantly more inclusive than previous years: 59% of submitted manuscripts made it to DH2014 in Lausanne, and 64% to DH2013.
At first blush, the loss of exclusivity may seem a bad sign of a conference desperate for attendees, but to my mind the exact opposite is true: this is a great step forward. Conference peer review & acceptance decisions aren’t particularly objective, so using acceptance as a proxy for quality or relevance is a bit of a misdirection. And if we can’t aim for consistent quality or relevance in the peer review process, we ought to aim at least for inclusivity, or higher acceptance rates, and let the participants themselves decide what they want to attend.
Acceptance rates broken down by form (panel, poster, short paper, long paper) aren’t surprising, but are worth noting.
73% of submitted long papers were accepted, but only 45% of them were accepted as long papers. The other 28% were accepted as posters or short papers.
61% of submitted short papers were accepted, but only 51% as short papers; the other 10% became posters.
85% of posters were accepted, all of them as posters.
85% of panels were accepted, but one of them was accepted as a long paper.
A few papers/panels were converted into workshops.
Weirdly, short papers tend to have a lower acceptance rate than long papers over the last three years. I think that’s because if a long paper is rejected, it’s usually far enough along in the process that it’s more likely to be secondarily accepted as a poster, but even that doesn’t account for the entire differential in the acceptance rate. Anyone have any thoughts on this?
Looking over time, we see an increasingly large slice of the DH conference pie is taken up by long papers. My guess is this is just a natural growth as authors learn the difference between long and short papers, a distinction which was only introduced relatively recently.
This is simply wrong with the updated data (tip of the hat to Melissa Terras for pointing it out); the ratio of long papers to short papers is still in flux. My “guess” from earlier was just that, a post-hoc explanation attached to an incorrect analysis. Matthew Lincoln has a great description about why we should be wary of these just-so stories. Go read it.
The breakdown of acceptance rates for each conference isn’t very informative, due in part to the fact I only have the last three years. In another few years this will probably become interesting, but for those who just can’t get enough o’ them sweet sweet numbers, here they are, special for you:
DH is still pretty single-author-heavy. It’s getting better; over the last 10 years we’ve seen an upward trend in number of authors per paper (more details in a future blog post), but the last three years have remained pretty stagnant. This year, 35% of presentations & posters will be by a single author, 25% by two authors, 13% by 3 authors, and so on down the line. The numbers are unremarkably consistent with 2013 and 2014.
We do however see an interesting trend in acceptance rates by number of authors. The more authors on your presentation, the more likely your presentation is to be accepted. This is true of 2013, 2014, and 2015. Single-authored works are 54% likely to be accepted, while works authored by two authors are 67% likely to be accepted. If your submission has more than 7 authors, you’re incredibly unlikely to get rejected.
Obviously this is pure description and correlation; I’m not saying multi-authored works are higher quality or anything else. Sometimes, works with more authors simply have more recognizable names, and thus are more likely to be accepted. That said, it is interesting that large projects seem to be favored in the peer review process for DH conferences.
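If you want to run the same kind of breakdown on your own data, here’s a minimal sketch of grouping acceptance rates by author count. The records below are hypothetical toy values (the real submission data isn’t reproduced here), and the function name is mine.

```python
from collections import defaultdict


def acceptance_by_author_count(records):
    """Map each author count to the fraction of its submissions accepted.

    `records` is an iterable of (n_authors, was_accepted) pairs.
    """
    tallies = defaultdict(lambda: [0, 0])  # n_authors -> [accepted, submitted]
    for n_authors, was_accepted in records:
        tallies[n_authors][1] += 1
        if was_accepted:
            tallies[n_authors][0] += 1
    return {n: acc / sub for n, (acc, sub) in sorted(tallies.items())}


# Hypothetical toy records, NOT the real conference data.
toy_records = [
    (1, True), (1, False),
    (2, True), (2, True), (2, False),
    (3, True),
]

rates = acceptance_by_author_count(toy_records)
for n_authors, rate in rates.items():
    print(f"{n_authors} author(s): {rate:.0%} accepted")
```

As with the numbers in the post, this only describes a correlation; it says nothing about why larger teams fare better.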
Stay tuned for parts 2, π, 16, and 4, which will cover such wonderful subjects as topicality, gender, and other things that seem neat.
The appropriate level of skepticism here is 19.27
Do you like the digital humanities? Me too! You better like it, because this is the 700th or so in a series of posts about our annual conference, and I can’t imagine why else you’d be reading it.
My last post went into some summary statistics of submissions to DH2015, concluding in the end that this upcoming conference, the first outside the Northern Hemisphere, with the theme “Global Digital Humanities”, is surprisingly similar to the DH we’ve seen before. This post will compare this year to submissions to the previous two conferences, in Switzerland and Nebraska. Part 3 will go into some more detail of geography and globalizing trends.
I can only compare the sheer volume of submissions this year to 2013 and 2014, which is as far back as I’ve got hard data. As many pieces were submitted for DH2015 as were submitted for DH2013 in Nebraska – around 360. Submissions to DH2014 shot up to 589, and it’s not yet clear whether the subsequent dip is an accident of location (Australia being quite far away from most regular conference attendees), or whether this signifies the leveling out of what’s been fairly impressive growth in the DH world.
This graph shows a pretty significant recent upward trend in DH by volume; if acceptance rates to DH2015 are comparable to recent years (60-65%), then DH2015 will represent a pretty significant drop in presentation volume. My gut intuition is this is because of the location, and not a downward trend in DH, but only time will tell.
Replying to my most recent post, Jordan T. T-H commented on his surprise at how many single-authored works were submitted to the conference. I suggested this was because of our humanistic disciplinary roots, and that further analysis would likely reveal a trend of increasing co-authorship. My prediction was wrong: at least over the last three years, co-authorship numbers have been stagnant.
Roughly 40% of submissions to DH conferences over the past three years have been single-authored; the trend has not significantly changed any further down the line, either. Nickoal Eichmann and I are looking into data from the past few decades, but it’s not ready yet at the time of this blog post. This result honestly surprised me; just from watching and attending conferences, I had the impression we’ve become more multi-authored over the past few years.
Topically, we are noticing some shifts. As a few people noted on Twitter, topics are not perfect proxies for what’s actually going on in a paper; every author makes different choices on how they tag their submissions. Still, it’s the best we’ve got, and I’d argue it’s good enough to run this sort of analysis on, especially as we start getting longitudinal data. This is an empirical question, and if we wanted to test my assumption, we’d gather a bunch of DHers in a room and see to what extent they all agree on submission topics. It’s an interesting question, but beyond the scope of this casual blog post.
Below is the list of submission topics in order of how much topical coverage has changed since 2013. For example, this year 21% of submissions were tagged as involving Text Analysis. By contrast, only 15% were tagged as Text Analysis in 2013, resulting in a growth of 6 percentage points over the last two years. Similarly, this year Internet and World Wide Web studies comprised 7% of submissions, whereas that number was 12% in 2013, showing coverage shrank by 5 percentage points. My more detailed evaluation of the results is below the figure.
We see, as I previously suggested, that Text Analysis (unsurprisingly) has gained a lot of ground. Given the location, it should be unsurprising as well that Asian Studies has grown in coverage, too. Some more surprising results are the re-uptake of Digitisation, which has been pretty low recently, and the growth of GLAM (Galleries, Libraries, Archives, Museums), for which I suspect we’d spot a consistent upward trend if we could look even further back. I’d guess it’s due to the proliferation of DH Alt-Ac careers within the GLAM world.
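To make the arithmetic explicit, here’s a small Python sketch of the change-in-coverage calculation, using only the two pairs of shares quoted above. The function names are mine; the relative-change helper is an extra I’ve added to show how different the two ways of measuring “growth” can look.

```python
def coverage_change(share_then, share_now):
    """Change in topical coverage, in percentage points."""
    return share_now - share_then


def relative_change(share_then, share_now):
    """Change relative to the starting share, as a percent of that share."""
    return 100 * (share_now - share_then) / share_then


# Shares quoted in the post: percent of submissions tagged with each topic.
shares_2013 = {"Text Analysis": 15, "Internet / World Wide Web": 12}
shares_2015 = {"Text Analysis": 21, "Internet / World Wide Web": 7}

changes = {
    topic: coverage_change(shares_2013[topic], shares_2015[topic])
    for topic in shares_2015
}
# Order topics from biggest gain to biggest loss.
ranked = sorted(changes.items(), key=lambda item: item[1], reverse=True)

for topic, change in ranked:
    rel = relative_change(shares_2013[topic], shares_2015[topic])
    print(f"{topic}: {change:+d} points ({rel:+.0f}% relative)")
```

A gain of 6 points from a base of 15% is a much bigger relative jump than the raw number suggests, which is worth keeping in mind when comparing topics of very different sizes.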
Not all of the trends are consistent: Historical Studies rose significantly between 2013 and 2014, but dropped a bit in submissions this year to 15%. Still, it’s growing, and I’m happy about that. Literary Studies, on the other hand, has covered a fifth of all submissions in 2013, 2014, and 2015, remaining quite steady. And I don’t see it dropping any time soon.
Visualizations are clearly on the rise, year after year, which I’m going to count as a win. Even if we’re not branching outside of text as much as we ought, the fact that visualizations are increasingly important means DHers are willing to move beyond text as a medium for transmission, if not yet as a medium of analysis. The use of Networks is also growing pretty well.
As Jacqueline Wernimont just pointed out, representation of Gender Studies is incredibly low. And, as the above chart shows, it’s even lower this year than it was in both previous years. Perhaps this isn’t so surprising, given the gender ratio of authors at DH conferences recently.
Some categories involving Maps and GIS are increasing, while others are decreasing, suggesting small fluctuations in labeling practices, but probably no significant upward or downward trend in their methodological use. Unfortunately, most non-text categories dropped over the past three years: Music, Film & Cinema Studies, Creative/Performing Arts, and Audio/Video/Multimedia all dropped. Image Studies grew, but only slightly, and it’s too soon to say if this represents a trend.
We see the biggest drops in XML, Encoding, Scholarly Editing, and Interface & UX Design. This won’t come as a surprise to anyone, but it does show how much the past generation’s giant (putting together, cleaning, and presenting scholarly collections) is making way for the new behemoth (analytics). Internet / World Wide Web is the other big coverage loss, but I’m not comfortable giving any causal explanation for that one.
This analysis offers the same conclusion as the earlier one: with the exception of the drop in submissions, nothing is incredibly surprising. Even the drop is pretty well-expected, given how far the conference is from the usual attendees. The fact that the status is pretty quo is worthy of note, because many were hoping that a global DH would seem more diverse, or appreciably different, in some way. In Part 3, I’ll start picking apart geographic and deeper topical data, and maybe there we’ll start to see the difference.
It’s that time of the year again! The 2015 Digital Humanities conference will take place next summer in Australia, and as per usual, I’m going to summarize what is being submitted to the conference and, eventually, how those submissions become accepted. Each year reviewers get the chance to “bid” on conference submissions, and this lets us get a peek inside the general trends in DH research. This post (pt. 1) will focus solely on this year’s submissions, and next post will compare them to previous years and locations.
It’s important to keep in mind that trends in the conference over the last three years may be temporal, geographic, or accidental. The 2013 conference took place in Nebraska, 2014 in Switzerland, 2015 in Australia, and 2016 is set to happen in Poland; it’s to be expected that regional differences will significantly inform who is submitting pieces and what topics will be discussed.
This year, 358 pieces were submitted to the conference (about as many as were submitted to Nebraska in 2013, but more on that in the follow-up post). As with previous years, authors could submit four varieties of works: long papers, short papers, posters, and panels / multi-paper sessions. Long papers comprised 54% of submissions, panels 4%, posters 15%, and short papers 30%.
In total, there were 859 named authors on submissions – this number counts authors more than once if they appear on multiple submissions. Of those, 719 authors are unique.1 Over half the submissions are multi-authored (58%), with 2.4 authors per submission on average, a median of 2 authors per submission, and a max of 10 authors on one submission. While the majority of submissions included multiple authors, the sheer number of single-authored papers still betrays the humanities roots of DH. The histogram is below.
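Here’s a quick sketch of how those summary numbers fall out. The three totals come straight from the post; the per-submission list is a hypothetical toy sample (the real per-submission counts aren’t reproduced here), and the function name is mine.

```python
from statistics import mean, median

# Totals quoted in the post.
total_author_slots = 859  # named authors, counting repeats
unique_authors = 719
submissions = 358

authors_per_submission = total_author_slots / submissions
repeat_appearances = total_author_slots - unique_authors


def summarize(author_counts):
    """Summary statistics for a list of per-submission author counts."""
    return {
        "mean": mean(author_counts),
        "median": median(author_counts),
        "max": max(author_counts),
        "multi_authored_share": sum(n > 1 for n in author_counts) / len(author_counts),
    }


# Hypothetical toy sample, NOT the real DH2015 counts.
toy_summary = summarize([1, 1, 2, 3, 5])

print(f"{authors_per_submission:.1f} authors per submission on average")
print(f"{repeat_appearances} author slots are repeat appearances")
print(toy_summary)
```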
As with previous years, authors may submit articles in any of a number of languages. The theme of this year’s conference is “Global Digital Humanities”, but if you expected a multi-lingual conference, you might be disappointed. Of the 358 submissions, 353 are in English. The rest are in French (2), Italian (2), and German (1).
Submitting authors could select from a controlled vocabulary to tag their submissions with topics. There were 95 topics to choose from, and their distribution is not especially surprising. Two submissions each were tagged with 25 topics, suggesting they are impressively far reaching, but for the most part submissions stuck to 5-10 topics. The breakdown of submissions by topic is below, where the percentage represents the percentage of submissions which are tagged by a specific topic. My interpretation is below that.
A full 21% of submissions include some form of Text Analysis, and a similar number claim Text or Data Mining as a topic. Other popular methodological topics are Visualizations, Network Analysis, Corpus Analysis, and Natural Language Processing. The DH-o-sphere is still pretty text-heavy; Audio, Video, and Multimedia are pretty low on the list, GIS even lower, and Image Analysis (surprisingly) even lower still. Bibliographic methods, Linguistics, and other approaches more traditionally associated with the humanities appear pretty far down the list. Other tech-y methods, like Stylistics and Agent-Based Modeling, are near the bottom. If I had to guess, the former is on its way down, and the latter on its way up.
Unsurprisingly, regarding disciplinary affiliations, Literary Studies is at the top of the food chain (I’ll talk more about how this compares to previous years in the next post), with Archives and Repositories not far behind. History is near the top tier, but not quite there, which is pretty standard. I don’t recall the exact link, but Ben Schmidt argued pretty convincingly that this may be because there are simply fewer new people in History than in Literary Studies. Digitization seems to be regaining some of the ground it lost in previous years. The information science side (UX Design, Knowledge Representation, Information Retrieval, etc.) seems reasonably strong. Cultural Studies is pretty well-represented, and Media Studies, English Studies, Art History, Anthropology, and Classics are among the other DH-inflected communities out there.
Thankfully we’re not completely an echo chamber yet; only about a tenth of the submissions are about DH itself – not great, not terrible. We still seem to do a lot of talking about ourselves, and I’d like to see that number decrease over the next few years. Pedagogy-related submissions are also still a bit lower than I’d like, hovering around 10%. Submissions on the “World Wide Web” are decreasing, which is to be expected, and TEI isn’t far behind.
All in all, I don’t really see the trend toward “Global Digital Humanities” that the conference is themed to push, but perhaps a more complex content analysis will reveal a more global DH than we’ve seen in the past. The self-written Keyword tags (as opposed to the Topic tags, not a controlled vocabulary) reveal a bit more internationalization, although I’ll leave that analysis for a future post.
It’s worth pointing out there’s a statistical property at play that makes it difficult to see deviations from the norm. Shakespeare appears prominently because many still write about him, but even if Shakespearean research is outnumbered by work on more international playwrights, it’d be difficult to catch, because I have no category for “international playwright” – each one would be siphoned off into its own category. Thus, even if the less well-known long tail topics significantly outweigh the more popular topics, that fact would be tough to catch.
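A toy example makes the problem concrete. The tags below are entirely made up (they are not real keyword data); the point is only that a ranked list puts the single popular subject on top even when the long tail collectively outweighs it.

```python
from collections import Counter

# Hypothetical keyword tags: one popular subject plus twelve one-off subjects.
tags = ["Shakespeare"] * 5 + [f"Playwright {i}" for i in range(1, 13)]

counts = Counter(tags)
top_tag, top_count = counts.most_common(1)[0]
# Everything that is not the top tag forms the long tail.
tail_total = sum(n for tag, n in counts.items() if tag != top_tag)

print(f"Top of the ranked list: {top_tag} ({top_count})")
print(f"Long tail combined: {tail_total} across {len(counts) - 1} separate tags")
```

Unless the tail is grouped under a shared category, a top-N chart would show only the popular subject and hide the larger combined total.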
All in all, it looks like DH2015 will be an interesting continuation of the DH tradition. Perhaps the most surprising aspect of my analysis was that nothing in it surprised me; half-way around the globe, and the trends over there are pretty identical to those in Europe and the Americas. It’ll take some more searching to see if this is a function of the submitting authors being the same as previous years (whether they’re all simply from the Western world), or whether it is actually indicative of a fairly homogeneous global digital humanities.
Stay tuned for Part 2, where I compare the analysis to previous years’ submissions, and maybe even divine future DH conference trends using tea leaves or goat entrails or predictive modeling (whichever seems the most convincing; jury’s still out).
As far as I can tell – I used all the text similarity methods I could think of to unify the nearly-duplicate names.
The overall acceptance rate to DH2014 was 59%, although that includes many papers and panels that were accepted as posters. There were 589 submissions this year (compared to 348 submissions last year), of which 345 were accepted. By submission medium, this is the breakdown:
Long papers: 62% acceptance rate (lower than last year)
Short papers: 52% acceptance rate (lower than last year)
Panels: 57% acceptance rate (higher than last year)
Posters: 64% acceptance rate (didn’t collect this data last year)
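The overall figure is just accepted over submitted; here’s that arithmetic as a tiny Python sketch, using the two counts quoted above (the function name is mine, purely for illustration).

```python
def acceptance_rate(accepted, submitted):
    """Acceptance rate as a percentage of submissions."""
    if submitted <= 0:
        raise ValueError("submitted must be a positive count")
    return 100 * accepted / submitted


# Counts quoted in the post for DH2014.
dh2014_rate = acceptance_rate(accepted=345, submitted=589)
print(f"DH2014 overall acceptance rate: {dh2014_rate:.0f}%")
```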
A surprising number of submitted papers switched from one medium to another when they were accepted. A number of panels became long papers, a bunch of short papers became long papers, and a bunch of long papers became short papers. Although a bunch of submissions became posters, no posters wound up “breaking out” to become some other medium. I was most surprised by the short papers which became long (13 in all), which leads me to believe some of them may have been converted for scheduling reasons. This is idle speculation on my part – the organizers may reply otherwise. [Edit: the organizers did reply, and assured us this was not the case. I see no reason to doubt that, so congratulations to those 13 short papers that became long papers!]
It’s worth keeping in mind, in all analyses listed here, that I do not have access to any withdrawals; accepted papers were definitely accepted, but papers that were not accepted may have been withdrawn rather than rejected.
Figures 3 and 4 both present the same data, but shed slightly different lights on digital humanities. Each shows the acceptance rate by various topics, but they’re ordered slightly differently. All submitting authors needed to select from a limited list of topics to label their submissions, in order to aid with selecting peer reviewers and categorization.
Figure 3 sorts topics by the total number that were accepted to DH2014. This is at odds with Figure 2 from my post on DH2014 submissions, which sorts by total number of topics submitted. The figure from my previous post gives a sense of what digital humanists are doing and submitting, whereas Figure 3 from this post gives a sense of what the visitor to DH2014 will encounter.
The visitor to DH2014 won’t see a hugely different topical landscape than the visitor to DH2013 (see analysis here). Literary studies, text analysis, and text mining still reign supreme, with archives and repositories not far behind. Visitors will see quite a bit fewer studies dedicated to the internet and the world wide web, and quite a bit more dedicated to historical and corpus-based research. More details can be seen by comparing the individual figures.
Figure 4, instead, sorts the topics by their acceptance rate. The most frequently accepted topics appear at the left, and the least frequently appear at the right. A lighter red line is used to show acceptance rates of the same topics for 2013. This graph shows what peers consider to be more credit-worthy, and how this has changed since 2013.
It’s worth pointing out that the highest and lowest acceptance rates shouldn’t be taken very seriously; with so few submitted articles, the rates are as likely random as indicative of any particularly interesting trend. Also, for comparisons with 2013, keep in mind the North American and European traditions of digital humanities may be driving the differences.
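One way to see how shaky the small-topic rates are is to put an interval around them. The sketch below uses the Wilson score interval, which is my choice for illustration and not something used in the original analysis; the two example topics and their counts are hypothetical.

```python
from math import sqrt


def wilson_interval(accepted, submitted, z=1.96):
    """Approximate 95% Wilson score interval for an acceptance proportion."""
    if submitted <= 0:
        raise ValueError("submitted must be a positive count")
    p = accepted / submitted
    denom = 1 + z**2 / submitted
    center = (p + z**2 / (2 * submitted)) / denom
    half_width = z * sqrt(p * (1 - p) / submitted + z**2 / (4 * submitted**2)) / denom
    return center - half_width, center + half_width


# Hypothetical topics with the same 80% observed acceptance rate.
rare_topic = wilson_interval(accepted=4, submitted=5)
common_topic = wilson_interval(accepted=80, submitted=100)

for label, (low, high) in [("rare", rare_topic), ("common", common_topic)]:
    print(f"{label} topic: {low:.0%} to {high:.0%}")
```

Both topics show the same observed rate, but the interval for the rare one is several times wider, which is the reason not to read much into the extremes of the chart.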
There are a few acceptance ratios worthy of note. English studies and GLAM (Galleries, Libraries, Archives, Museums) both have acceptance rates well above average, and also quite a bit higher than their acceptance rates from the previous year. Studies of XML are accepted slightly above the average acceptance rate, and also accepted proportionally more frequently than they were in 2013. Acceptance rates for both literary and historical studies papers are about average, and haven’t changed much since 2013 (even though there were quite a few more historical submissions than the previous year).
Along with an increase in GLAM acceptance rates, there was a big increase in rates for studies involving archives and repositories. It may be they are coming back in style, or it may be indicative of a big difference between European and North American styles. There was a pretty big drop in acceptance rates for ontology and semantic web research, as well as in pedagogy research across the board. Pedagogy had a weak foothold in DH2013, and has an even weaker foothold in 2014, with both fewer submitted articles, and a lower rate of acceptance on those submitted articles.
In the next blog post, I plan on drilling a bit into author-supplied keywords, the role of gender on acceptance rates, and the geography of submissions. As always, I’m happy to share data, but in this case I will only share sufficiently aggregated/anonymized data, because submitting authors who did not get accepted have an expectation of privacy that I intend to keep.
This post is mostly just thinking out loud, musing about two related barriers to scholarship: a stigma related to self-plagiarism, and various copyright concerns. It includes a potential way to get past them.
When Jonah Lehrer’s plagiarism scandal first broke, it sounded a bit silly. Lehrer, it turned out, had taken some sentences he’d used in earlier articles, and reused them in a few New Yorker blog posts. Without citing himself. Oh no, I thought. Surely, this represents the height of modern journalistic moral depravity.
Of course, later it was revealed that he’d bent facts, and plagiarized from others without reference, and these were all legitimately upsetting. And plagiarizing himself without reference was mildly annoying, though certainly not something that should have attracted national media attention. But it raises an interesting question: why is self-plagiarism wrong? And it’s as wrong in academia as it is in journalism.
1. It’s wrong to directly lift from any source without adequate citation. This only applies to non-cited self-plagiarism, obviously.
2. It’s wrong to double-dip. The currency of the academy is publications / CV lines, and if you reuse work to fill your CV, you’re getting an unfair advantage.
3. Confusion. Which version should people reference if you have so many versions of a similar work?
4. Copyright. You just can’t reuse stuff, because your previous publishers own the copyright on your earlier work.
That about covers it. Let’s pretend academics always cite their own works (because, hell, it gives them more citations), so we can do away with #1. Regular readers will know my position on publisher-owned copyright, so I just won’t get into #4 here to save you my preaching. The others are a bit more difficult to write off, but before I go on to try to do that, I’d like to talk a bit about my own experience of self-plagiarism as a barrier to scholarship.
I was recently invited to speak at the Universal Decimal Classification seminar, where I presented on the history of trees as a visual metaphor for knowledge classification. It’s not exactly my research area, but it was such a fun subject, I’ve decided to write an article about it. The problem is, the proceedings of the UDC seminar were published, and about 50% of what I wanted to write is already sitting in a published proceedings that, let’s face it, not many people will ever read. And if I ever want to add to it, I have to change the already-published material significantly if I want to send it out again.
Since I presented, my thesis has changed slightly, I’ve added a good chunk of more material, and I fleshed out the theoretical underpinnings. I now have a pretty good article that’s ready to be sent out for peer review, but if I want to do that, I can’t just have a reference saying “half of this came from a published proceeding.” Well, I could, but apparently there’s a slight taboo against this. I was told to “be careful,” that I’d have to “rephrase” and “reword.” And, of course, I’d have to cite my earlier publication.
I imagine most of this comes from the fear of scholars double-dipping, or padding their CVs. Which is stupid. Good scholarship should come first, and our methods of scholarly attribution should mold themselves to it. Right now, scholarship is enslaved to the process of attribution and publication. It’s why we willingly donate our time and research to publishing articles, and then have our universities buy back our freely-given scholarship in expensive subscription packages, when we could just have the universities pay for the research upfront and then release it for free.
The question of copyright is pretty clear: how much will the publisher charge if I want to reuse a significant portion of my work somewhere else? The publisher to which I refer, Ergon Verlag, I’ve heard is pretty lenient about such things, but what if I were reprinting from a different publisher?
There’s an additional, more external, concern about my materials. It’s a history of illustrations, and the manuscript itself contains 48 illustrations in all. If I want to use them in my article, for demonstrative purposes, I not only need to cite the original sources (of course), I need to get permission to use the illustrations from the publishers who scanned them – and this can be costly and time consuming. I priced a few of them so far, and they range from free to hundreds of dollars.
A Potential Solution – Iterative Writing
To recap, there are two things currently preventing me from sending out a decent piece of scholarship for peer-review:
A taboo against self-plagiarism, which requires quite a bit of time for rewriting, permission from the original publisher to reuse material, and/or the dissolution of such a taboo.
The cost and time commitment of tracking down copyright holders to get permission to reproduce illustrations.
I believe the first issue is largely a historical artifact of print-based media. Scholars have this sense of citing the source because, for hundreds of years, nearly every print of a single text was largely identical. Sure, there were occasionally a handful of editions, some small textual changes, some page number changes, but citing a text could easily be done, and so we developed a huge infrastructure around citations and publications that exists to this day. It was costly and difficult to change a printed text, and so it wasn’t done often, and now our scholarly practices are based around the idea that scholarly material has to be permanent and unchanging, finished, if it is to enter into the canon and become a citeable source.
In the age of Wikipedia, this is a weird idea. Texts grow organically, they change, they revert. Blog posts get updated. A scholarly article, though, is relatively constant, even those in online-only publications. One major exception is ArXiv-like pre-print repositories, which allow an article to go through several versions before the final one goes off to print. But generally, once the final version goes to print, no further changes are made.
The reasons behind this seem logical: it’s the way we’ve always done it, so why change a good thing? It’s hard to cite something that’s constantly changing; how do we know the version we cited will be preserved?
In an age of cheap storage and easily tracked changes, this really shouldn’t be a concern. Wikipedia does this very well: you can easily cite the version of an article from a specific date and, if you want, easily see how the article changed between then and any other date.
This would be more difficult to implement in academia because article hosting isn’t centralized. It’s difficult to be certain that the URL hosting a journal article now will persist for 50 years, both because of ownership and design changes, and it’s difficult to trust that whoever owns the article or the site won’t change the content and not preserve every single version, or a detailed description of changes they’ve made.
There’s an easy solution: don’t just reference everything you cite, embed everything you cite. If you cite a picture, include the picture. If you cite a book, include the book. If you cite an article, include the article. Storage is cheap: if your book cites a thousand sources, and includes a copy of every single one, it’ll be at most a gigabyte. Probably, it would be a good deal smaller. That way, if the material changes down the line, everyone reading your research will still be able to refer to the original material. Further, because you include a full reference, people can go and look the material up to see if it has changed or updated in the time since you cited it.
Of course, this idea can’t work – copyright wouldn’t let it. But again, this is a situation where the industry of academia is getting in the way of potential improvements to the way scholarship can work.
The important thing, though, is that self-plagiarization would become a somewhat irrelevant concept. Want to write more about what you wrote before? Just iterate your article. Add some new references, a paragraph here or there, change the thesis slightly. Make sure to keep a log of all your changes.
I don’t know if this is a good solution, but it’s one of many improvements to scholarship – or at least, a removal of barriers to publishing interesting things in a timely and inexpensive fashion – which is currently impossible because of copyright concerns and institutional barriers to change. Cameron Neylon, from PLOS, recently discussed how copyright put up some barriers to his own interesting ideas. Academia is not a nimble beast, and because of it, we are stuck with a lot of scholarly practices which are, in part, due to the constraints of old media.
In short: academic writing is tough. There are ways it could be easier, that would allow good scholarship to flow more freely, but we are constrained by path dependency from choices we made hundreds of years ago. It’s time to be a bit more flexible and be more willing to try out new ideas. This isn’t anywhere near a novel concept on my part, but it’s worth repeating.
The last big barrier to self-plagiarism, double dipping to pad one’s CV, still seems tricky to get past. I’m not thrilled with the way we currently assess scholarship, and “CV size” is just one of the things I don’t like about it, but I don’t have any particularly clever fixes on that end.
Submissions for the 2014 Digital Humanities conference just closed. It’ll be in Switzerland this time around, which unfortunately means I won’t be able to make it, but I’ll be eagerly following along from afar. Like last year, reviewers are allowed to preview the submitted abstracts. Also like last year, I’m going to be a reviewer, which means I’ll have the opportunity to revisit the submissions to DH2013 to see how the submissions differed this time around. No doubt when the reviews are in and the accepted articles are revealed, I’ll also revisit my analysis of DH conference acceptances.
To start with, the conference organizers received a record number of submissions this year: 589. Last year’s Nebraska conference only received 348 submissions. The general scope of the submissions hasn’t changed much; authors were still supposed to tag their submissions using a controlled vocabulary of 95 topics, and were also allowed to submit keywords of their own making. Like last year, authors could submit long papers, short papers, panels, or posters, but unlike last year, multilingual submissions were encouraged (English, French, German, Italian, or Spanish). [edit: Bethany Nowviskie, patient awesome person that she is, has noticed yet another mistake I’ve made in this series of posts. Apparently last year they also welcomed multilingual submissions, and it is standard practice.]
Digital Humanities is known for its collaborative nature, and not much has changed in that respect between 2013 and 2014 (Figure 1). Submissions had, on average, between two and three authors, with 60% of submissions in both years having at least two authors. This year, a few fewer papers have single authors, and a few more have two authors, but the difference is too small to be attributable to anything but noise.
The distribution of topics being written about has changed mildly, though rarely in extreme ways. Any changes visible should also be taken with a grain of salt, because a trend over a single year is hardly statistically robust to small changes, say, in the location of the event.
The grey bars in Figure 2 show what percentage of DH2014 submissions are tagged with a certain topic, and the red dotted outlines show what the percentages were in 2013. The upward trends to note this year are text analysis, historical studies, cultural studies, semantic analysis, and corpora and corpus activities. Text analysis was tagged to 15% of submissions in 2013 and is now tagged to 20% of submissions, or one out of every five. Corpus analysis similarly bumped from 9% to 13%. Clearly this is an important pillar of modern DH.
I’ve pointed out before that History is secondary compared to Literary Studies in DH (although Ted Underwood has convincingly argued, using Ben Schmidt’s data, that the numbers may merely be due to fewer people studying history). This year, however, historical studies nearly doubled in presence, from 10% to 17%. I haven’t yet collected enough years of DH conference data to see if this is a trend in the discipline at large, or more of a difference between European and North American DH. Semantic analysis jumped from 1% to 7% of the submissions, cultural studies went from 10% to 14%, and literary studies stayed roughly equivalent. Visualization, one of the hottest topics of DH2013, has become even hotter in 2014 (14% to 16%).
The most visible drops in coverage came in pedagogy, scholarly editions, user interfaces, and research involving social media and the web. At DH2013, submissions on pedagogy had a surprisingly low acceptance rate, which combined with the drop in pedagogy submissions this year (11% to 8% in “Digital Humanities – Pedagogy and Curriculum” and 7% to 4% in “Teaching and Pedagogy”) might suggest a general decline in interest in the DH world in pedagogy. “Scholarly Editing” went from 11% to 7% of the submissions, and “Interface and User Experience Design” from 13% to 8%, which is yet more evidence for the lack of research going into the creation of scholarly editions compared to several years ago. The most surprising drops for me were those in “Internet / World Wide Web” (12% to 8%) and “Social Media” (8.5% to 5%), which I would have guessed would be growing rather than shrinking.
The last thing I’ll cover in this post is the author-chosen keywords. While authors needed to tag their submissions from a list of 95 controlled vocabulary words, they were also encouraged to tag their entries with keywords they could choose themselves. In all they chose nearly 1,700 keywords to describe their 589 submissions. In last year’s analysis of these keywords, I showed that visualization seemed to be the glue that held the DH world together; whether discussing TEI, history, network analysis, or archiving, all the disparate communities seemed to share visualization as a primary method. The 2014 keyword map (Figure 3) reveals the same trend: visualization is squarely in the middle. In this graph, two keywords are linked if they appear together on the same submission, thus creating a network of keywords as they co-occur with one another. Words appear bigger when they span communities.
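For the curious, the linking rule described above (two keywords get an edge whenever they share a submission) is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not the code behind Figure 3: the toy submissions and the function name `cooccurrence_edges` are made up, and it does nothing about sizing words by the communities they span.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each submission is the set of keywords its authors chose.
submissions = [
    {"visualization", "tei", "xml"},
    {"visualization", "network analysis"},
    {"tei", "xml"},
]

def cooccurrence_edges(keyword_sets):
    """Count how many submissions each unordered keyword pair shares."""
    edges = Counter()
    for keywords in keyword_sets:
        # sorted() gives every pair a single canonical (a, b) ordering
        for a, b in combinations(sorted(keywords), 2):
            edges[(a, b)] += 1
    return edges

edges = cooccurrence_edges(submissions)
```

Each key in `edges` is one link in the network, and its count is how many submissions carried both keywords.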
Despite the multilingual conference, the large component of the graph is still English. We can see some fairly predictable patterns: TEI is coupled quite closely with XML; collaboration is another keyword that binds the community together, as is (obviously) “Digital Humanities.” Linguistic and literature are tightly coupled, much more so than, say, linguistic and history. It appears the distant reading of poetry is becoming popular, which I’d guess is a relatively new phenomenon, although I haven’t gone back and checked.
This work has been supported by an ACH microgrant to analyze DH conferences and the trends of DH through them, so keep an eye out for more of these posts forthcoming that look through the last 15 years. Though I usually share all my data, I’ll be keeping these to myself, as the submitters to the conference did so under an expectation of privacy if their proposals were not accepted.
[edit: there was some interest on twitter last night for a raw frequency of keywords. Because keywords are author-chosen and I’m trying to maintain some privacy on the data, I’m only going to list those keywords used at least twice. Here you go (Figure 4)!]
[edit: I’ve been told the word I’m looking for is actually preservation, not sustainability. Whoops.]
Sustainability’s a tricky word. I don’t mean whether the scottbot irregular is carbon neutral, or whether it’ll make me enough money to see me through retirement. This post is about whether scholarly blog posts will last beyond their author’s ability or willingness to sustain them technically and financially.
A colleague approached me at a conference last week, telling me she loved one of my blog posts, had assigned it to her students, and then had freaked out when my blog went down and she didn’t have a backup of the post. She framed it as being her fault, for not thinking to back up the material.
Of course, it wasn’t her fault that my site was down. As a grad student trying to save some money, I use the dirt-cheap bluehost for hosting my site. It goes down a lot. At this point, now that I’m blogging more seriously, I know I should probably migrate to a more serious hosting solution, but I just haven’t found the time, money, or inclination to do so.
This is not a new issue by any means, but my colleague’s comment brought it home to me for the first time. A lot has already been written on this subject by archivists, I know, but I’m not directly familiar with any of the literature. As someone who’s attempting to seriously engage with the scholarly community via my blog (excepting the occasional Yoda picture), I’m only now realizing how much of the responsibility of sustainability in these situations lies with the content creator, rather than with an institution or library or publishing house. If I finally decide to drop everything and run away with the circus (it sometimes seems like the more financially prudent option in this academic job market), *poof* the bulk of my public academic writings go the way of Keyser Söze.
So now I’m going to you for advice. If we’re aiming to make blogs good enough to cite, to make them countable units in the scholarly economy that can be traded in for things like hiring and tenure, to make them lasting contributions to the development of knowledge, what are the best practices for ensuring their sustainability? I feel like I haven’t been treating this bluehost-hosted blog with the proper respect it needs, if the goal of academic respectability is to be achieved. Do I self-archive every blogpost in my institution’s dspace? Does the academic community need to have a closer partnership with something like archive.org to ensure content persistence?
So this is awkward. I’ve published Networks Demystified 7: Doing Citation Analyses before Networks Demystified 6: Organizing Your Twitter Lists. What depraved lunatic would do such a thing? The kind of depraved lunatic that is teaching this very subject twice in the next two weeks: deal with it, you’ll get your twitterstructions soon, internet. In the meantime, enjoy the irregular nature of the scottbot irregular.
And this is part 7 of my increasingly inaccurately named trilogy of instructional network analysis posts (1 network basics, 2 degree, 3 power laws, 4 co-citation analysis, 5 communities and PageRank, 6 this space left intentionally blank). I’m covering how to actually do citation analyses, so it’s a continuation of part 4 of the series. If you want to know what citation analysis is and why to do it, as well as a laundry list of previous examples in the humanities and social sciences, go read that post. If you want to just finally be able to analyze citations, like you’ve always dreamed, read on. 1
You’re going to need two things for these instructions: The Sci2 Tool, and either a subscription to the multi-gazillion dollar ISI Web of Science database, or this sample dataset. The Sci2 (Science of Science) Tool is a fairly buggy program (I’m allowed to say that because I’m kinda off-and-on the development team and I wrote half the user manual) that specializes in ingesting data of various formats and turning them into networks for analysis and visualization. It’s a good tool to use before you run to Gephi to make your networks pretty, and has a growing list of available plugins. If you already have the Sci2 Tool, download it again, because there’s a new version and it doesn’t auto-update. Go download it. It’s 80mb, I’ll wait.
Once you’ve registered for (not my decision, don’t blame me!) and downloaded the tool, extract the zip folder wherever you want, no install necessary. The first thing to do is increase the amount of memory available to the program, assuming you have at least a gig of RAM on your computer. We’re going to be doing some intensive analysis, so you’ll need the extra space. Edit sci2.ini; on Windows, that can be done by right-clicking on the file and selecting ‘edit’; on Mac, I dunno, elbow-click and press ‘CHANGO’? I have no idea how things work on Macs. (Sorry Mac-folk! We’ve actually documented in more detail how to increase memory – on both Windows and Mac – here)
Once you’re editing the file, you’ll see a nigh-unintelligible string of letters and numbers that ends in “-Xmx350m”. Assuming you have more than a gig of RAM on your computer, change that to “-Xmx1000m”. If you don’t have more RAM, really, you should go get some. Or use only a quarter of the dataset provided. Save it and close the text editor.
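If hand-editing config files makes you nervous, the same change is a one-line substitution. Here's a minimal sketch, assuming sci2.ini is plain text and the setting looks like the “-Xmx350m” described above; the function name and the sample string are mine, and the real file has more in it than this stand-in.

```python
import re

def raise_max_heap(ini_text, new_value="1000m"):
    """Replace the value of every -Xmx setting in the given ini text."""
    return re.sub(r"-Xmx\d+[kKmMgG]?", "-Xmx" + new_value, ini_text)

# Hypothetical stand-in for the file's contents, not the real sci2.ini.
example = "other settings here -Xmx350m"
updated = raise_max_heap(example)
```

To use it for real you'd read the file's text, pass it through `raise_max_heap`, and write the result back, ideally after making a backup copy.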
Run Sci2.exe. We didn’t pay Microsoft to register the app, so if you’re on Windows, you may get an OHMYGODWARNING sign. Click ‘run anyway’ and safely let my team’s software hack your computer and use it to send pictures of cats to famous network scientists. (No, we’ll be good, promise). You’ll get to a screen remarkably like Figure 7. Leave it open, and if you’re at an institution that pays ISI Web of Science the big bucks, head there now. Otherwise ignore this and just download the sample dataset.
I’m a historian of science, so let’s look for history of science articles. Search for ‘Isis‘ as a ‘Publication Name’ from the drop-down menu (see Figure 1) and notice that, as of 9/23/2013, there are 14,858 results (see Figure 2).
This is a list of every publication in the journal ISIS. Each individual record includes bibliographic material, abstract, and the list of references that are cited in the article. To get a reasonable dataset to work with, we’re going to download every article ever published in ISIS, of which there are 1,189. The rest of the records are book reviews, notes, etc. Select only the articles by clicking the checkbox next to ‘articles’ on the left side of the results screen and clicking ‘refine’.
The next step is to download all the records. This web service limits you to 500 records per download, so you’re going to need to download 3 separate files (records 1-500, 501-1000, and 1001-1189) and combine them together, which is a fairly complicated step, so pay close attention. There’s a little “Send to:” drop-down menu at the top of the search results (Figure 3). Click it, and click ‘Other File Formats’.
At the pop-up box, check the radio box for records 1 to 500 and enter those numbers, change the record content to ‘Full Record and Cited References’, and change the file format to ‘Plain Text’ (Figure 4). Save the file somewhere you’ll be able to find it. Do this twice more, changing the numbers to 501-1000 and 1001-1189, saving these files as well.
You’ll end up with three files, possibly named: savedrecs.txt, savedrecs(1).txt, and savedrecs(2).txt. If you open one up (Figure 5), you’ll see that each individual article gets its own several-dozen lines, and includes information like author, title, keywords, abstract, and (importantly in our case) cited references.
You’ll also notice (Figures 5 & 6) that the first two lines and last line of every file are special header and footer lines. If we want to merge the three files so that the Sci2 Tool can understand it, we have to delete the footer of the first file, the header and footer of the second file, and the header of the last file, so that the new text file only has one header at the beginning, one footer at the end, and none in between. Those of you who are familiar enough with a text editor (and let’s be honest, it should be everyone reading this) go ahead and copy the three files into one huge file with only one header and footer. If you’re feeling lazy, just download it here.
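If you'd rather script the merge than copy and paste, something like the following would do it. It's a sketch under the assumptions stated above (a two-line header and a one-line footer in every export), plus my own assumption that blank lines at the very start or end of a file can be thrown away. The function name and the placeholder H1/H2/F lines are invented for the example.

```python
def merge_isi_exports(file_texts, header_lines=2, footer_lines=1):
    """Merge exports, keeping only the first header and the last footer."""
    merged = []
    last = len(file_texts) - 1
    for i, text in enumerate(file_texts):
        lines = text.strip().splitlines()  # drop blank lines at the file's edges
        start = 0 if i == 0 else header_lines
        end = len(lines) if i == last else len(lines) - footer_lines
        merged.extend(lines[start:end])
    return "\n".join(merged) + "\n"

# Hypothetical miniature "files": H1/H2 stand in for the header, F for the footer.
parts = [
    "H1\nH2\nrecord A\nF\n",
    "H1\nH2\nrecord B\nF\n",
    "H1\nH2\nrecord C\nF\n",
]
merged = merge_isi_exports(parts)
```

With the real downloads, you'd read the text of your three saved files (in record order) into the list and write `merged` out to a new text file.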
Creating a Citation Network
Now open the Sci2 Tool (Figure 7) and go to File->Load in the drop-down menu. Find your super file with all of ISIS and open it, loading it as an ‘ISI flat format’ file (Figure 8).
If all goes correctly, two new files should appear in the Data Manager, the pane on the right-hand side of the software. I’ll take a bit of a detour here to explain the Sci2 Tool.
The main ‘Console’ pane on the top-left will include a complete log of your workflow, including all the various algorithms you use, what settings and parameters you use with them, and how to cite the various ones you use. When you close the program, a copy of the text in the ‘Console’ pane will save itself as a log file in the program directory so you can go back to it later and see what exactly you did.
The ‘Scheduler’ pane on the bottom is just that: it shows you what algorithms are currently running and what already ran. You can safely ignore it.
Along with the drop-down menus at the top, the already-mentioned ‘Data Manager’ pane on the right is where you’ll be spending most of your time. Every time you load a file, it will appear in the data manager. Every time you run an algorithm on or manipulate that file in some way, a copy of it with the new changes will appear hierarchically nested below the original file. This is so, if you make a mistake, want to use an earlier version of the file, or want to run a different set of analyses, you can still do so. You can right-click on files in the data manager to view or save them in various file formats. It is important to remember to make sure that the appropriate file is selected in the data manager when you run an analysis, as it’s easy to accidentally run an algorithm on some other random data file.
With that in mind, once your file is loaded, make sure to select (by left-clicking) the ‘1189 Unique ISI Records’ data file in the data manager. If you right-click and view the file, it should open up in Excel (Figure 9) or whatever your default *.csv viewer is, and you’ll see that the previous text file has been converted to a spreadsheet. You can look through it to see what the data look like.
When you’re done ogling at all the pretty data, close the spreadsheet and go back to the tool. Making sure the ‘1189 Unique ISI Records’ file is selected, go to ‘Data Preparation -> Extract Paper Citation Network’ in the drop-down menu.
Voilà! You now have a history of science citation network. The algorithm spits out two files: ‘Extracted paper-citation network’, which is the network file itself, and ‘Paper information’, which is a spreadsheet that includes all the nodes in the network (in this case, articles that either were published in ISIS or are cited by them). It includes a ‘localCitationCount’ column, which tells you how frequently a work is cited within the dataset (Shapin’s ‘Leviathan and the Air Pump’ is cited 16 times, you’ll see if you open up the file), and a ‘globalCitationCount’ column, which is how many times ISI Web of Science thinks the article has been cited overall, not just within the dataset (Merton’s “The Matthew effect in science II” is cited 183 times overall). ‘globalCitationCount’ statistics are of course only available for the records you downloaded, so you have them for ISIS published articles, but none of the other records.
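To make the ‘localCitationCount’ idea concrete: going by the description above, it amounts to counting how often each work shows up on the cited end of a citation inside the dataset. A toy sketch, with invented data and an invented function name; I'm not claiming this is how the Sci2 Tool computes the column internally.

```python
from collections import Counter

# Hypothetical toy citation list as (citing, cited) pairs.
citations = [
    ("article A", "book X"),
    ("article B", "book X"),
    ("article B", "article A"),
]

def local_citation_counts(pairs):
    """Count how many times each work is cited inside this dataset."""
    return Counter(cited for _citing, cited in pairs)

counts = local_citation_counts(citations)
```

Anything that never appears on the cited side (like "article B" here) simply has a count of zero.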
Select ‘Extracted paper-citation network’ in the data manager. From the drop-down menu, run ‘Analysis -> Networks -> Network Analysis Toolkit (NAT)’. It’s a good idea to run this on any network you have, just to see the basic statistics of what you’re working with. The details will appear in the console window (Figure 10).
There are a few things worth noting right away. The first is that there are 52,479 nodes; that means that our adorable little dataset of 1,189 articles actually referenced over 50,000 other works between them, about 50 refs/article. The second fact worth noting is that there are 54,915 directed edges, which is the total number of direct citations in the dataset. One directed edge is a citation from a citing node (an ISIS article) to a cited node (either an ISIS article, or a book, or whatever the author decides to reference).
The last bit worth pointing out is the number of weakly connected components, and the size of the largest connected component. Each weakly connected component is a chunk of the network connected by citation chains: if article A and B are the only articles which cite article C, if article C cites nothing else, and if A and B are uncited by any other articles, they together make a weakly connected component. As soon as another citation link comes from or to them, it becomes part of that component. In our case, the biggest component is 46,971 nodes, which means that most of the nodes in the network are connected to each other. That’s important, it means history of science as represented by ISIS is relatively cohesive. There are 215 weakly connected components in all, small islands that are disconnected from the mainland.
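If the definition of a weakly connected component feels slippery, it may help to see it as code. This sketch ignores the direction of each citation and groups together everything reachable by following links either way. It's a plain graph traversal I wrote for illustration, not the toolkit's own implementation, and it only sees nodes that appear in at least one edge.

```python
from collections import defaultdict

def weakly_connected_components(edges):
    """Group nodes that are linked by citations, ignoring edge direction."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for other in neighbors[node] - seen:
                seen.add(other)
                stack.append(other)
        components.append(component)
    return components

# The A, B, C example from the text, plus a separate hypothetical island D -> E.
edges = [("A", "C"), ("B", "C"), ("D", "E")]
components = weakly_connected_components(edges)
largest = max(components, key=len)
```

Here A, B, and C form the mainland and D and E are a small island; the real network just has a much bigger mainland and a couple hundred islands.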
If you have Gephi installed, you can visualize the network by selecting ‘Extracted paper-citation network’ in the data manager and clicking ‘Visualization -> Networks -> Gephi’, though what you do from there is beyond the scope of these instructions. It also probably won’t make a heck of a lot of sense: there aren’t many situations where visualizing a citation network is actually useful. It’s what’s called a Directed Acyclic Graph, which are generally the most visually boring graphs around (don’t cite me on this).
I do have a very important warning. You can tell it’s important because it’s bold. The Sci2 Tool was made by my advisor Katy Börner as a tool for people with similar research to her own, whose interests lie in modeling and predicting the spread of information on a network. As such, the direction of citation edges created by the tool is the opposite of what many expect. They go from the cited source to the citing source, because the idea is that’s the direction that information flows, rather than from the citing source to the cited source. As a historian, I’m more interested in considering the network in the reverse direction: citing to cited, as that gives more agency to the author. More details in the footnote. 2
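Flipping the direction is conceptually trivial: swap the two ends of every edge. Here's a toy illustration of what that does to in-degree. The data and function names are invented, and this works on a bare list of pairs rather than on the tool's actual network file.

```python
def reverse_edges(edges):
    """Flip every (source, target) pair to (target, source)."""
    return [(target, source) for source, target in edges]

def in_degree(edges, node):
    """Number of edges pointing at the given node."""
    return sum(1 for _source, target in edges if target == node)

# Hypothetical toy data in the tool's direction: cited work -> citing article.
cited_to_citing = [("book X", "article A"), ("book X", "article B")]
citing_to_cited = reverse_edges(cited_to_citing)
```

In the tool's direction the twice-cited "book X" has an in-degree of zero; after the flip, its in-degree is two, which is what most people expect a citation count to look like.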
Great, now that that’s out of the way, let’s get to the more interesting analyses. Select ‘Extracted paper-citation network’ in the data manager and run ‘Data Preparation -> Extract Document Co-Citation Network’. And then wait. Have you waited for a while? Good, wait some more. This is a process. And 50,000 articles is a lot of articles. While you’re waiting, re-read Networks Demystified 4: Co-Citation Analysis to get an idea of what it is you’re doing and why you want to do it.
Okay, we’re done (assuming you increased the allotted memory to the tool like we discussed earlier). You’re now presented with the ‘Co-citation Similarity Network’ in the data manager, and you should, once again, run ‘Analysis -> Networks -> Network Analysis Toolkit (NAT)’ in the Data Manager. This as well will take some time, and you’ll see why shortly.
Notice that while there are the same number of nodes (citing or cited articles) as before, 52,479, the number of edges went from 54,915 to 2,160,275, a 40x increase. Why? Because every time two articles are cited together, they get an edge between them and, according to the ‘Average degree’ in the console pane, each article or book is cited alongside an average of 82 other works.
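The explosion in edges makes sense once you see the rule written out: a bibliography of n works produces n(n-1)/2 co-cited pairs. Below is a rough sketch of the idea in Python, working from (citing, cited) pairs. It's my own toy version with made-up data, not the algorithm the Sci2 Tool actually runs.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cocitation_edges(citations):
    """Weight each pair of works by how many bibliographies contain both."""
    bibliographies = defaultdict(set)
    for citing, cited in citations:
        bibliographies[citing].add(cited)
    weights = Counter()
    for cited_works in bibliographies.values():
        # every pair of works in the same bibliography is co-cited once
        for a, b in combinations(sorted(cited_works), 2):
            weights[(a, b)] += 1
    return weights

# Hypothetical toy data as (citing, cited) pairs.
citations = [
    ("article A", "book X"), ("article A", "book Y"), ("article A", "book Z"),
    ("article B", "book X"), ("article B", "book Y"),
]
weights = cocitation_edges(citations)
```

Five citations turn into three co-citation edges here, and the X and Y pair gets a weight of two because both articles cite them together.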
In order to make the analysis and visualization of this network easier we’re going to significantly cut its size. Recall that document co-citation networks connect documents that are cited alongside each other, and that the weight of that connection is increased the more often the two documents appear together in a bibliography. What we’re going to do here is drastically reduce the network’s size by deleting any edge between documents unless they’ve been cited together more than once. Select ‘Co-citation Similarity Network’ and run ‘Preprocessing -> Networks -> Extract Edges Above or Below Value’. Use the default settings (Figure 12).
Note that when you’re doing a scholarly citation analysis, cutting all the edges below a certain value (called ‘thresholding’) is usually a bad idea unless you know exactly how it will affect your study. We’re doing it here to make the walkthrough easier.
Run ‘Analysis -> Networks -> Network Analysis Toolkit (NAT)’ on the new ‘Edges above 1 by weight’ dataset, and note that the network has been reduced from two million edges to three thousand edges, a much more manageable number for our purposes. You’ll also see that there are 51,313 isolated nodes: nodes that are no longer connected to the network because we cut so many edges in our mindless rampage. Who cares about them? Let’s delete them too! Select ‘Edges above 1 by weight’ and run ‘Preprocessing -> Networks -> Delete Isolates’, and watch as fifty thousand precious history of science citations vanish in a puff of metadata. Gone.
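Those two steps, thresholding and then deleting isolates, look like this in miniature. I'm assuming "more than once" means a weight strictly greater than 1, which is what the text describes; I haven't checked what the plugin's default parameters literally are, and the names and toy weights below are invented.

```python
def edges_above(weights, threshold=1):
    """Keep only edges whose weight is strictly greater than the threshold."""
    return {pair: w for pair, w in weights.items() if w > threshold}

def connected_nodes(weights):
    """Nodes that still touch at least one edge; everything else is an isolate."""
    return {node for pair in weights for node in pair}

# Hypothetical toy co-citation weights.
weights = {
    ("book X", "book Y"): 2,
    ("book X", "book Z"): 1,
    ("book Y", "book Z"): 1,
}
all_nodes = connected_nodes(weights)
kept = edges_above(weights)
survivors = connected_nodes(kept)
isolates = all_nodes - survivors
```

Only the X and Y edge survives the cut, which strands "book Z" as an isolate, the same fate that befell most of the network above.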
If you run the Network Analysis Toolkit on the new network, you’ll see that we’re left with a small co-citation net of 1,166 documents and 3,344 co-citations between them. The average degree tells us that each document is connected to, on average, 6 other documents, and that the largest connected component contains 476 documents.
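You can sanity-check those numbers yourself. Assuming the toolkit treats the co-citation network as undirected and reports average degree as twice the edge count divided by the node count (my assumption, though it matches the figures in this walkthrough), the arithmetic is:

```python
def average_degree(node_count, edge_count):
    """Mean degree of an undirected network: each edge touches two nodes."""
    return 2 * edge_count / node_count

# Node and edge counts quoted in this walkthrough.
full = average_degree(52479, 2160275)   # before cutting edges
trimmed = average_degree(1166, 3344)    # after thresholding and deleting isolates
remaining_nodes = 52479 - 51313         # nodes left once isolates are gone
```

That works out to roughly 82 for the full network and a little under 6 for the trimmed one, and the node count after removing isolates lands on 1,166.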
So now’s the moment of truth, the time to visualize all your hard work. If you know how to use Gephi, and have it installed, select ‘With isolates removed’ in the data manager and run ‘Visualization -> Networks -> Gephi’. If you don’t, run ‘Visualization -> Networks -> GUESS’ instead, and give it a minute to load. You will be presented with this stunning work of art vaguely reminiscent of last night’s spaghetti and meatball dinner (Figure 13).
Fear not! The first step to prettifying the network is to run ‘Layout -> GEM’ and then ‘Layout -> Bin Pack’. Better already, right? Then you can make edits using the graph modifier below (or using python commands in the interpreter), but the friendly folks at my lab have put together a script for you that will do that automatically. Run ‘Script -> Run Script’.
When you do, you will be presented with a godawful java applet that automatically sticks you in some horrible temp directory that you have to find your way out of. In the ‘Look In:’ navigation drop-down, find your way back to your desktop or your documents directory and then find wherever you installed the Sci2 Tool. In the Sci2 directory, there’s a folder called ‘scripts’, and in the ‘scripts’ folder, there’s a ‘GUESS’ folder, and in the ‘GUESS’ folder you will find the holy grail. Select ‘reference-co-occurrence-nw.py’ and press ‘open’.
Magic! Your document co-citation network is now all green and pretty, and you can zoom in and out using either the +/- button on the left, or using your mouse wheel and clicking and dragging on the network itself. It’ll look a bit like Figure 14.
If you feel more dangerous and cool, you can try visualizing the same network in Gephi, and it might come out something like Figure 15.
That’s it! You’ve co-cited a dataset. I hope you feel proud of yourself, because you should. And all without breaking a sweat. If you want (and you should want), you can save your results by right clicking the various files in the data manager you want to save. I’d recommend saving the most recent file, ‘With isolates removed’, and saving it as an NWB file, which is fairly easy to read and is the Sci2 Tool’s native format.
Stay tuned for the paradoxically earlier-numbered Networks Demystified 6, on organizing your Twitter feed.
Part 4 also links to a few great tutorials on how to do this with programming, but if you don’t know the first thing about programming, start here instead. ↩
Those of you who know network basics, keep this in mind when running your analyses: PageRank, In & Out Degree, etc., may be opposite of what you expect, with the papers that cite the most sources as those with the highest In-Degree and PageRank. If this is opposite your workflow, you can fairly easily change the data by hand in a spreadsheet editor or with regular expressions. ↩
The fifth and sixth (coming soon…) installments of Networks Demystified will be a bit more applied than the previous bunch (1 network basics, 2 degree, 3 power laws, 4 co-citation analysis). Like many of my recent posts, this one is in response to a Twitter conversation:
Some day, I need to go back through my lists of ppl I follow and organize them better.
If you follow a lot of people on Twitter (Michael follows over a thousand), getting a grasp of them all and organizing them can be tough. Luckily network analysis can greatly ease the task of organizing Twitter follows, and this post and the next will teach you how to do that using NodeXL, a plugin for Microsoft Excel that (unfortunately) only works on Windows. It’s super easy, though, so if you have access to a Windows machine with Office installed, it’s worth trying it out despite the platform limitations.
This installment will explain the concept of modularity for group detection in networks, as well as why certain metrics like centrality should be avoided when using certain kinds of datasets. I’m going to be as gentle as I can be on the math, so this tutorial is probably best-suited for those just learning network techniques, but will fall short for those hoping for more detailed or specific information.
Next installment, Networks Demystified 6, will include the actual step-by-step instructions of how to run these analyses using NodeXL. I’m posting the description first, because I strongly believe you should learn the concepts before applying the techniques. At least that’s the theory: actually I’m posting this first because Twitter is rate-limiting the download of my follower/followee network, and I’m impatient and want to post this right away.
Modularity / Community Detection
Modularity is a technique for finding which groups of nodes in a network are more similar to each other than to other groups; it lets you spot communities.
It is unfortunate (for me) that modularity is one of the more popular forms of community detection, because it also happens to be one of the methods more difficult to explain without lots of strange symbols, which I’m trying to avoid. First off, the modularity technique is not one simple algorithm, as much as it is a conceptual framework for thinking about communities in networks. The modularity you run in Gephi is different than modularity in NodeXL, because there’s more than one way to write the concept into an algorithm, and they’re not all exactly the same.
But to describe modularity itself, let’s take a brief detour through random-network lane. Randomization is a popular tool among network scientists, statisticians, and late 20th century avant-garde music composers for a variety of reasons. Suppose you’re having a high-stakes coin-flip contest with your friend, who winds up beating you 68/32. Before you run away crying that your friend cheated, because a fair coin should always land 50/50, remember that the universe is a random place. The 68/32 score could’ve appeared by chance alone, so you write up a quick computer program to flip a thousand coins a hundred times each, and if in those thousand computational coin-flip experiments, a decent amount come up around 68/32, you can reasonably assume your friend didn’t cheat.
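If you’d like to see what that quick computer program might look like, here is a minimal sketch in Python. The function name, the fixed seed, and the choice to count every experiment that lands at least as far from 50/50 as the observed score (in either direction) are my own assumptions for illustration, not a prescribed method.

```python
import random

def fraction_as_lopsided(observed_heads=68, flips=100, experiments=1000, seed=0):
    """Share of simulated fair-coin experiments at least as lopsided as the observed score."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    threshold = abs(observed_heads - flips / 2)  # how far the real score is from 50/50
    extreme = 0
    for _ in range(experiments):
        # one experiment: flip a fair coin `flips` times and count the heads
        heads = sum(rng.random() < 0.5 for _ in range(flips))
        if abs(heads - flips / 2) >= threshold:
            extreme += 1
    return extreme / experiments

share = fraction_as_lopsided()
print(f"{share:.1%} of fair-coin experiments were as lopsided as 68/32")
```

If a decent share of the simulated experiments come out that lopsided, chance alone is a plausible explanation; if almost none do, you have more reason to be suspicious of your friend.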
The use of a simulated random result to see if what you’ve noticed is surprising (or, sometimes, significant) is quite common. I used it on the Irregular when reviewing Matthew Jockers’ Macroanalysis, shown in the graphic halfway down the page and reproduced here. I asked, in an extremely simplistic way, whether the trends Jockers saw over time were plausible by creating four dummy universes where randomness ruled, to see if his results could be attributable to chance alone. By comparing his data to my fake data, I concluded that some of his results were probably very accurate, and some of them might have just been chance.
Network analysts use the same sort of technique all the time. Do you want to know if it’s surprising that some actress is only six degrees away from Kevin Bacon (or anybody else on the network)? Generate a bunch of random networks with the same number of nodes (actors) and edges (connections between them if they star in a movie together), and see if, in most cases, you can get from any one actor to any other in only six hops. Odds are you could; that’s just how random networks work.
What’s surprising is that in these, as well as most other social networks, people tend to be much more tightly clustered together than expected from a random network. They form little groups and cliques. It is significantly unlikely that in such cliquish networks, where the same groups of actors tend to appear with each other constantly, everyone would still be only six degrees away from one another. It’s commonly known that social networks organize in what are called small-worlds, where people tend to be much more closely connected to one another than one would expect when they’re in such tight cliques. This is the power of random networks: they help pick out the unusual.
Which brings us back to modularity. With some careful thinking, one would come up with a quick solution for finding communities in networks: find clusters of nodes that have more internal edges between them than external edges to other groups.
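The random-network check can be sketched in a few lines of Python. Everything here is an illustrative assumption: the network size (200 actors, 800 co-starring edges), the way edges are placed between random pairs, and the function names. Real actor networks are far larger, and you would generate many random networks rather than one.

```python
import random
from collections import deque

def random_network(n_nodes, n_edges, seed=0):
    """Build an undirected network with edges placed between random pairs of nodes."""
    rng = random.Random(seed)
    neighbors = {node: set() for node in range(n_nodes)}
    placed = 0
    while placed < n_edges:
        a, b = rng.sample(range(n_nodes), 2)  # two distinct nodes
        if b not in neighbors[a]:             # skip edges we already placed
            neighbors[a].add(b)
            neighbors[b].add(a)
            placed += 1
    return neighbors

def average_hops(neighbors):
    """Average shortest-path length over all reachable pairs, via breadth-first search."""
    total, pairs = 0, 0
    for start in neighbors:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        total += sum(distance.values())
        pairs += len(distance) - 1  # everyone reached, minus the start node
    return total / pairs

network = random_network(n_nodes=200, n_edges=800)
print(f"average hops between actors: {average_hops(network):.2f}")
```

Comparing that average against the one measured on your real network is the whole trick: the random version tells you what to expect when nothing interesting is going on.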
There’s a lurking problem with this idea, though. If you were just counting the number of in-group connections vs. out-group connections, you could come up with an optimal solution very quickly if you say the entire network is one community: voila! no outgoing connections, and lots of internal connections. If instead you say in advance that you want two communities, or you only want communities of a certain size, it mitigates the problem somewhat, but then you’re stuck with needing to set the number of communities beforehand, which is a difficult constraint if you’re not sure what that number should be.
The key is randomness. You want to find communities of nodes for which there are more internal links than you would expect given that the graph was random, and fewer external links than you would expect given the graph was random. Mark Newman defines modularity as: “the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random.”
Modularity is thus a network-level measurement, and it can change based on what communities you choose in your network. For example, in the figure above, most of the edges in the network are within the Freakish Grey Blobs (hereafter FGBs), and within the FGBs the edges are very dense. In that case, we would expect the modularity to be quite high. However, imagine we drew the FGBs around different nodes in the network instead: if we made four FGBs instead of three, splitting the left group into two, we’d find that a larger fraction of the edges are falling outside of groups, thus decreasing the overall network’s modularity score.
Similarly, let’s say we made two FGBs instead of three. We merge the two groups on the right into one supergroup (group 2), and leave the group on the left (group 1) the same. What would happen to the modularity? In that case, group 2 is now less dense (defining density as the number of edges within the group compared to the total possible number of edges within it), and we’d expect a random network to look a bit more similar, so the overall network’s modularity score would (again) decrease slightly.
That’s modularity in a nutshell. The method of finding the appropriate groupings in a network varies, but essentially, all the algorithms keep drawing FGBs around different groups of nodes until the overall modularity score of the network is as high as possible. Find the right configuration of FGBs such that the modularity score is very high, and then label the nodes in each separate FGB as their own community. In the figure above, there are three communities, and your favorite network analysis software will label them as such.
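For those who don’t mind a little code in place of strange symbols, here is a rough sketch of scoring one particular set of FGBs. It assumes one common way of writing Newman’s definition for an undirected, unweighted network: for each community, take the share of all edges that fall inside it and subtract the share you’d expect at random given the degrees of its nodes. The toy network and the function name are hypothetical, and this only scores a grouping you hand it; it is not the search procedure Gephi or NodeXL use to find the best grouping.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity of an undirected, unweighted network for one choice of communities."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends inside a community
    degree_sum = defaultdict(int)  # total degree of the nodes in a community
    for a, b in edges:
        degree_sum[community_of[a]] += 1
        degree_sum[community_of[b]] += 1
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += 1
    # observed share of internal edges minus the share expected at random
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Hypothetical toy network: two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

print(modularity(edges, two_groups))  # one FGB around each triangle
print(modularity(edges, one_group))   # one giant FGB around everything
```

Drawing one FGB around each triangle scores about 0.36, while drawing a single FGB around the whole network scores exactly 0. That is the randomness correction at work: lumping everything into one community no longer counts as a win, because a random network would do just as well.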
Some metrics to avoid (with caveats)
There’s a stubbornly persistent desire, when analyzing a tasty new network dataset, to just run every algorithm in the box and see what comes up. PageRank and centrality? Sure! Clustering? Sounds great! Unfortunately, each algorithm makes certain underlying assumptions about the data, and our twitter network breaks many of those assumptions.
The most important point worth mentioning is that we’ve already sinned. Remember how we plan on calculating modularity, and remember how I defined it earlier? Nothing was mentioned about whether or not the edges were directed. Asymmetrical edges (like asymmetries between follower and followee) are not understood by the modularity algorithm we described, which assumes there would be no difference between a follower, a followee, or a reciprocal connection of both. Running modularity on a directed network is, in general, a bad idea: in most networks, the direction of an edge is very important for determining community involvement. We can safely ignore this issue here, as we’re dealing with the fairly low-stakes problem of letting the computer help us organize our Twitter network, but in publications or higher-stakes circumstances, this would be something to avoid without thinking through the implications very carefully.
A network metric that might seem more appropriate to the forthcoming twitter dataset, PageRank, is similarly inadequate without a few key changes. As I haven’t demystified PageRank yet, here’s a short description, with the promise to expand on it later.
PageRank is Google’s algorithm for ranking websites in their search results, and it’s inspired by citation analysis, but it turns out to be useful in various other circumstances. There are two ways to explain the algorithm, both equally accurate. The first has to do with probability: what is the probability that, if someone just starts clicking links on the web at random, they’ll eventually land on your website? The higher the chance that someone clicking links at random will reach your site, the higher your PageRank.
PageRank’s other definition makes a bit more ‘on-the-ground’ sense; given a large, directed network (like websites linking to other websites), those sites that are very popular can determine another site’s score by whether or not they link to it. Say a really famous website, like BBC, links to your site; you get lots of points. If Sam’s New England Crab Shack & Duck Farm links to your site, however, you won’t get many points. Seemingly paradoxically, the more points your website has, the more points you can give to sites that you link to. Sites that get linked to a lot are considered reputable, and in turn they link to other sites and pass that reputation along. But, the clever bit is that your site can only pass a fraction of its reputation along based on how many other sites it links to, thus if your site only links to the Scottbot Irregular, the Irregular will get lots of points from it, but if it links to ten sites including the Irregular, my site would only get a tenth of the potential points.
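Here is a minimal sketch of that points-passing idea, written as a simple repeated-update loop. A few things in it are my assumptions rather than anything described above: the damping value of 0.85 (the chance a random clicker follows a link instead of jumping somewhere at random), the choice to spread the score of link-less sites evenly over everyone, the fixed number of rounds, and the made-up site names.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1 / n for node in nodes}  # everyone starts out equal
    for _ in range(iterations):
        # baseline score from random jumps
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # a site splits its score evenly among the sites it links to
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # a site with no outgoing links spreads its score over everyone
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical web: three fans link to "bbc", which links only to "irregular";
# "crabshack" has no incoming links and links only to "duckblog".
links = {
    "fan1": ["bbc"], "fan2": ["bbc"], "fan3": ["bbc"],
    "bbc": ["irregular"],
    "crabshack": ["duckblog"],
}
ranks = pagerank(links)
for site, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{site:10s} {score:.3f}")
```

In this toy web, "irregular" ends up scoring higher than "duckblog" even though each has exactly one incoming link, because the site linking to "irregular" is itself well linked.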
This generalizes pretty easily to all sorts of networks including, as it happens, twitter follow networks. Those who are followed by lots of people are scored highly; if one of those highly scoring individuals follows only a select few, that select few will also receive a significant increase in rank. When a user is followed by many other users with very high scores, that user is scored the highest of them all. PageRank, then, is a neat way of looking at who has the power in a twitter network. Those at the top are those who even the relatively popular find interesting and worth following.
Which brings us to this, the network we’re creating to organize our twitter neighborhood. The network type is right: a directed, unweighted network. The algorithm will work fine. It will tell you, for example, that you are (or are nearly) the most popular person in your twitter neighborhood. And why wouldn’t it? Most of the people in your neighborhood follow you, or follow people who follow you, so the math is inevitable.
And the problem is obvious. Your sampling strategy (the criteria you used to gather your data) inherently biases this particular network metric, and most other metrics within the same family. You’ve used what’s called snowball sampling, so-named because your sample snowballs into a huge network in relatively short order, starting from a single person: you. It’s you, then those you follow, then those they follow, and so forth. You are inevitably at the center of your snowball, and the various network centrality measurements will react accordingly.
Well, you might ask, what if you just ignore yourself when looking at the network? Nope. Because PageRank (among other algorithms) takes everyone’s score into account when calculating others’ scores; even if you close your eyes whenever your name pops up, your presence will still exert an invisible influence on the network. In the case of PageRank, because your score is so high, you’ll be conferring a much higher score to (potentially) otherwise unpopular people you happen to follow.
The short-term solution is to remove yourself from the network before you run any of your analyses. This actually still isn’t perfect, for reasons I don’t feel like getting into because the post is already too long, but it will give at least a better idea of PageRank centrality within your twitter neighborhood.
While you’re at it, you should also remove yourself before running community detection. As you might be the connection that bridges two otherwise disconnected communities together, and for the purpose of this study you’re trying to organize people separate from your own influence on them, running modularity on the network without you in it will likely give you a better sense of your neighborhood.
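If your follow network lives in a plain list of (follower, followee) pairs, removing yourself takes only a line or two. This is a sketch under that assumption; the account names and helper functions are hypothetical, and NodeXL has its own way of handling this.

```python
def without_user(follows, user):
    """Return the follow edges with every edge into or out of `user` removed."""
    return [(follower, followee) for follower, followee in follows
            if user not in (follower, followee)]

def follower_counts(follows):
    """Count how many followers each account has in the edge list."""
    counts = {}
    for follower, followee in follows:
        counts[followee] = counts.get(followee, 0) + 1
    return counts

# Hypothetical snowball sample collected starting from "me".
follows = [("me", "alice"), ("me", "bob"), ("alice", "me"),
           ("bob", "me"), ("alice", "bob"), ("carol", "me")]

trimmed = without_user(follows, "me")
print(follower_counts(follows))  # "me" sits on top, as the snowball guarantees
print(follower_counts(trimmed))  # the neighborhood without your own influence
```

Whatever you run next, PageRank or community detection, should be run on the trimmed list rather than the original one.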
Stay tuned for the next exciting installment of Networks Demystified, wherein I’ll give step-by-step instructions on how to actually do the things I’ve described using NodeXL. If you want a head-start, go ahead and download and start playing with it.