Barriers to Scholarship & Iterative Writing

This post is mostly just thinking out loud, musing about two related barriers to scholarship: a stigma related to self-plagiarism, and various copyright concerns. It includes a potential way to get past them.

Self-Plagiarism

When Jonah Lehrer’s plagiarism scandal first broke, it sounded a bit silly. Lehrer, it turned out, had taken some sentences he’d used in earlier articles, and reused them in a few New Yorker blog posts. Without citing himself. Oh no, I thought. Surely, this represents the height of modern journalistic moral depravity.

Of course, later it was revealed that he’d bent facts, and plagiarized from others without reference, and these were all legitimately upsetting. And plagiarizing himself without reference was mildly annoying, though certainly not something that should have attracted national media attention. But it raises an interesting question: why is self-plagiarism wrong? And it’s as wrong in academia as it is in journalism.

Lehrer chart from Slate. [via]
I can’t speak for journalists (though Alberto Cairo can, and he lists some of the good reasons why non-referenced self-plagiarism is bad and links to not one, but two articles about it), but for academia, the reasons behind the wrongness seem pretty clear.

  1. It’s wrong to directly lift from any source without adequate citation. This only applies to non-cited self-plagiarism, obviously.
  2. It’s wrong to double-dip. The currency of the academy is publications / CV lines, and if you reuse work to fill your CV, you’re getting an unfair advantage.
  3. Confusion. Which version should people reference if you have so many versions of a similar work?
  4. Copyright. You just can’t reuse stuff, because your previous publishers own the copyright on your earlier work.

That about covers it. Let’s pretend academics always cite their own works (because, hell, it gives them more citations), so we can do away with #1. Regular readers will know my position on publisher-owned copyright, so I just won’t get into #4 here to save you my preaching. The others are a bit more difficult to write off, but before I go on to try to do that, I’d like to talk a bit about my own experience of self-plagiarism as a barrier to scholarship.

I was recently invited to speak at the Universal Decimal Classification seminar, where I presented on the history of trees as a visual metaphor for knowledge classification. It’s not exactly my research area, but it was such a fun subject, I’ve decided to write an article about it. The problem is, the proceedings of the UDC seminar were published, and about 50% of what I wanted to write is already sitting in a published proceedings that, let’s face it, not many people will ever read. And if I ever want to add to it, I have to change the already-published material significantly if I want to send it out again.

Since I presented, my thesis has changed slightly, I’ve added a good chunk of more material, and I fleshed out the theoretical underpinnings. I now have a pretty good article that’s ready to be sent out for peer review, but if I want to do that, I can’t just have a reference saying “half of this came from a published proceeding.” Well, I could, but apparently there’s a slight taboo against this. I was told to “be careful,” that I’d have to “rephrase” and “reword.” And, of course, I’d have to cite my earlier publication.

I imagine most of this comes from the fear of scholars double-dipping, or padding their CVs. Which is stupid. Good scholarship should come first, and our methods of scholarly attribution should mold themselves to it. Right now, scholarship is enslaved to the process of attribution and publication. It’s why we willingly donate our time and research to publishing articles, and then have our universities buy back our freely-given scholarship in expensive subscription packages, when we could just have the universities pay for the research upfront and then release it for free.

Copyright

The question of copyright is pretty clear: how much will the publisher charge if I want to reuse a significant portion of my work somewhere else? The publisher to which I refer, Ergon Verlag, is, I’ve heard, pretty lenient about such things, but what if I were reprinting from a different publisher?

There’s an additional, more external, concern about my materials. It’s a history of illustrations, and the manuscript itself contains 48 illustrations in all. If I want to use them in my article, for demonstrative purposes, I not only need to cite the original sources (of course), I need to get permission to use the illustrations from the publishers who scanned them – and this can be costly and time-consuming. I priced a few of them so far, and they range from free to hundreds of dollars.

A Potential Solution – Iterative Writing

To recap, there are two things currently preventing me from sending out a decent piece of scholarship for peer-review:

  1. A taboo against self-plagiarism, which requires quite a bit of time for rewriting, permission from the original publisher to reuse material, and/or the dissolution of such a taboo.
  2. The cost and time commitment of tracking down copyright holders to get permission to reproduce illustrations.

I believe the first issue is largely a historical artifact of print-based media. Scholars have this sense of citing the source because, for hundreds of years, nearly every print of a single text was largely identical. Sure, there were occasionally a handful of editions, some small textual changes, some page number changes, but citing a text could easily be done, and so we developed a huge infrastructure around citations and publications that exists to this day. It was costly and difficult to change a printed text, and so it wasn’t done often, and now our scholarly practices are based around the idea that scholarly material has to be permanent and unchanging, finished, if it is to enter the canon and become a citeable source.

In the age of Wikipedia, this is a weird idea. Texts grow organically, they change, they revert. Blog posts get updated. A scholarly article, though, is relatively constant, even those in online-only publications. One major exception is ArXiv-like pre-print repositories, which allow an article to go through several versions before the final one goes off to print. But generally, once the final version goes to print, no further changes are made.

The reasons behind this seem logical: it’s the way we’ve always done it, so why change a good thing? It’s hard to cite something that’s constantly changing; how do we know the version we cited will be preserved?

In an age of cheap storage and easily tracked changes, this really shouldn’t be a concern. Wikipedia does this very well: you can easily cite the version of an article from a specific date and, if you want, easily see how the article changed between then and any other date.

Changes between versions of the Wikipedia entry on History.

This would be more difficult to implement in academia because article hosting isn’t centralized. It’s difficult to be certain that the URL hosting a journal article now will persist for 50 years, both because of ownership and design changes, and it’s difficult to trust that whoever owns the article or the site won’t change the content and not preserve every single version, or a detailed description of changes they’ve made.

There’s an easy solution: don’t just reference everything you cite, embed everything you cite. If you cite a picture, include the picture. If you cite a book, include the book. If you cite an article, include the article. Storage is cheap: if your book cites a thousand sources and includes a copy of every single one – at, say, a megabyte or so per source – that’s about a gigabyte, and probably a good deal less. That way, if the material changes down the line, everyone reading your research will still be able to refer to the original material. Further, because you include a full reference, people can go and look the material up to see if it has changed or updated in the time since you cited it.

Of course, this idea can’t work – copyright wouldn’t let it. But again, this is a situation where the industry of academia is getting in the way of potential improvements to the way scholarship can work.

The important thing, though, is that self-plagiarism would become a somewhat irrelevant concept. Want to write more about what you wrote before? Just iterate your article. Add some new references, a paragraph here or there, change the thesis slightly. Make sure to keep a log of all your changes.
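Keeping that log doesn’t require new infrastructure; ordinary version control already does it, and even Python’s standard library can produce one. Here is a minimal sketch, assuming two hypothetical plain-text drafts of the same article (draft_v1.txt and draft_v2.txt):

```python
# Minimal sketch of an iterative-writing change log, assuming two hypothetical
# plain-text drafts of the same article: draft_v1.txt and draft_v2.txt.
import difflib
from datetime import date

with open("draft_v1.txt") as f:
    old = f.readlines()
with open("draft_v2.txt") as f:
    new = f.readlines()

# Produce a unified diff, the same format version-control tools use.
diff = difflib.unified_diff(old, new, fromfile="draft_v1.txt", tofile="draft_v2.txt")

# Append the dated diff to a running change log that readers could consult.
with open("changelog.txt", "a") as log:
    log.write(f"--- revision of {date.today().isoformat()} ---\n")
    log.writelines(diff)
```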

I don’t know if this is a good solution, but it’s one of many possible improvements to scholarship – or at least, removals of barriers to publishing interesting things in a timely and inexpensive fashion – that are currently impossible because of copyright concerns and institutional barriers to change. Cameron Neylon, from PLOS, recently discussed how copyright put up some barriers to his own interesting ideas. Academia is not a nimble beast, and because of that, we are stuck with a lot of scholarly practices which exist, in part, due to the constraints of old media.

In short: academic writing is tough. There are ways it could be made easier, ways that would allow good scholarship to flow more freely, but we are constrained by path dependency from choices made hundreds of years ago. It’s time to be a bit more flexible and more willing to try out new ideas. This isn’t anywhere near a novel concept on my part, but it’s worth repeating.

The last big reason behind the self-plagiarism taboo, double dipping to pad one’s CV, still seems tricky to get past. I’m not thrilled with the way we currently assess scholarship, and “CV size” is just one of the things I don’t like about it, but I don’t have any particularly clever fixes on that end.

A quick note on blog sustainability

[edit: I’ve been told the word I’m looking for is actually preservation, not sustainability. Whoops.]

Sustainability’s a tricky word. I don’t mean whether the scottbot irregular is carbon neutral, or whether it’ll make me enough money to see me through retirement. This post is about whether scholarly blog posts will last beyond their author’s ability or willingness to sustain them technically and financially.

A colleague approached me at a conference last week, telling me she loved one of my blog posts, had assigned it to her students, and then had freaked out when my blog went down and she didn’t have a backup of the post. She framed it as being her fault, for not thinking to back up the material.

[via]
Of course, it wasn’t her fault that my site was down. As a grad student trying to save some money, I use the dirt-cheap bluehost for hosting my site. It goes down a lot. At this point, now that I’m blogging more seriously, I know I should probably migrate to a more serious hosting solution, but I just haven’t found the time, money, or inclination to do so.

This is not a new issue by any means, but my colleague’s comment brought it home to me for the first time. A lot has already been written on this subject by archivists, I know, but I’m not directly familiar with any of the literature. As someone who’s attempting to seriously engage with the scholarly community via my blog (excepting the occasional Yoda picture), I’m only now realizing how much of the responsibility of sustainability in these situations lies with the content creator, rather than with an institution or library or publishing house. If I finally decide to drop everything and run away with the circus (it sometimes seems like the more financially prudent option in this academic job market), *poof* the bulk of my public academic writings go the way of Keyser Söze.

So now I’m coming to you for advice. If we’re aiming to make blogs good enough to cite, to make them countable units in the scholarly economy that can be traded in for things like hiring and tenure, to make them lasting contributions to the development of knowledge, what are the best practices for ensuring their sustainability? I feel like I haven’t been treating this bluehost-hosted blog with the proper respect it needs, if the goal of academic respectability is to be achieved. Do I self-archive every blogpost in my institution’s dspace? Does the academic community need to have a closer partnership with something like archive.org to ensure content persistence?
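One stopgap I can act on right now, whatever the longer-term answer turns out to be, is asking the Internet Archive to capture each post. A minimal sketch, assuming a hypothetical list of post URLs and the Wayback Machine’s save endpoint:

```python
# Minimal sketch: request Wayback Machine captures of a list of blog posts.
# The post URLs below are hypothetical placeholders, not real permalinks.
import requests

posts = [
    "http://www.scottbot.net/?p=example-post-1",
    "http://www.scottbot.net/?p=example-post-2",
]

for url in posts:
    # Hitting https://web.archive.org/save/<url> asks the Internet Archive
    # to take a fresh snapshot of that page.
    resp = requests.get("https://web.archive.org/save/" + url, timeout=60)
    print(url, "->", resp.status_code)
```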

Improving the Journal of Digital Humanities

Twitter and the digital humanities blogosphere have been abuzz recently over an ill-fated special issue of the Journal of Digital Humanities (JDH) on Postcolonial Digital Humanities. I won’t get too much into what happened and why, not because I don’t think it’s important, but because I respect both parties too much and feel I am too close to the story to provide an unbiased opinion. To summarize, the guest editors felt they were treated poorly, in part because of the nature of their content, and in part because of the way the JDH handles its publications.

I wrote earlier on twitter that I no longer want to be involved in the conversation, by which I meant, I no longer want to be involved in the conversation about what happened and why. I do want to be involved in a discussion on how to help the JDH move beyond the issues of bias, poor communication, poor planning, and microaggression, whether or not any or all of those existed in this most recent issue. As James O’Sullivan wrote in a comment, “as long as there is doubt, this will be an unfortunate consequence.”

Journal of Digital Humanities

The JDH is an interesting publication, operating in part under the catch-the-good model of seeing what’s already out there and getting discussed, and aggregating it all into a quarterly journal. In some cases, that means re-purposing pre-existing videos and blog posts and social media conversations into journal “articles.” In others, it means soliciting original reviews or works that fit with the theme of a current important issue in DH. Some articles are reviewed barely at all – especially the videos – and some are heavily reviewed. The structure of the journal itself, over its five issues thus far, has changed drastically to fit the topic and the experimental whims of editors and guest editors.

The issue that Elijah Meeks and I guest edited changed in format at least three times in the month or so we had to solidify the issue. It’s fast-paced, not always organized, and generally churns out good scholarship that seems to be cited heavily on blogs and in DH syllabi, but not yet so much in traditional press articles or books. The flexibility, I think, is part of its charm and experimental nature, but as this recent set of problems shows, it is not without its major downsides. The editors, guest editors, and invited authors are rarely certain of what the end product will look like, and if there is the slightest miscommunication, this uncertainty can lead to disaster. The variable nature of the editing process also opens the door for bias of various sorts, and because there is not a clear plan from the beginning, that bias (and the fear of bias) is hard to guard against. These are issues that need to be solved.

Roopika Risam, Matt Burton, and I, among others, have all weighed in on the best way to move forward, and I’m drawing on these previous comments for this plan. It’s not without its holes and problems, and I am hoping there will be comments to improve the proposed process, but hopefully something like what I’m about to propose can let the JDH retain its flexibility while preventing further controversies of this particular variety.

  • Create a definitive set of guidelines and mission statement that is distributed to guest editors and authors before the process of publication begins. These guidelines do not need to set the publication process in stone, but can elucidate the roles of each individual and make clear the experimental nature of the JDH. This document cannot be deviated from within an issue publication cycle, but can be amended yearly. Perhaps, as with the open intent of the journal, part of this process can be crowdsourced from the previous year’s editors-at-large of DHNow.
  • Have a week at the beginning of each issue planning phase where authors (if they’ve been chosen yet), guest editors, and editors discuss what particular format the forthcoming issue will take, how it will be reviewed, and so forth. This is formalized into a binding document and will not be changed. The editorial staff has final say, but if the guest editors or authors do not like the final document, they have ample opportunity to leave.
  • Change the publication rate from quarterly to thrice-yearly. DH changes quickly, and the journal shouldn’t be any slower than that, but quarterly seems to be a bit too tight for this process to work smoothly – especially with the proposed week-long committee session to figure out how the issue will be run.
  • Make the process of picking special issue topics more open. I know the special issue I worked on came about by Elijah asking the JDH editors if they’d be interested in a topic modeling issue, and after (I imagine) some internal discussion, they agreed. The dhpoco special issue may have had a similar history. Even a public statement of “these people came to us, and this is why we thought the topic was relevant” would likely go a long way in fostering trust in the community.
  • Make the process of picking articles and authors more open; this might be the job of special issue guest editors, as Elijah and I were the ones who picked most of the content. Everyone has their part to play. What’s clear is that there is a lot of confusion right now about how it works; some on Twitter have pointed out that, until recently, they’d assumed all articles came from the DHNow filter. Making content choice more clear in an introductory editorial would be useful.

Obviously this is not a cure for all ills, but hopefully it’s good ground to start on the path forward. If the JDH takes this opportunity to reform some of their policies, my hope is that it will be seen as an olive branch to the community, ensuring to the best of their ability that there will be no question of whether bias is taking place, implicit or otherwise. Further suggestions in the comments are welcome.

Addendum: In private communication with Matt Burton, he and I realized that the ‘special issue’ and ‘guest editor’ role is not actually one that seems to be aligned with the initial intent of the JDH, which seemed instead to be about reflecting the DH discourse from the previous quarter. Perhaps a movement away from special issues, or having a separate associated entity for special issues with its own set of rules, would be another potential path forward.

The Historian’s Macroscope

Whelp, it appears the cat’s out of the bag. Shawn Graham, Ian Milligan, and I have signed our ICP contract and will shortly begin the process of writing The Historian’s Macroscope, a book introducing the process and rationale of digital history to a broad audience. The book will be a further experiment in live-writing: as we have drafts of the text, they will go online immediately for comments and feedback. The publishers have graciously agreed to allow us to keep the live-written portion online after the book goes on sale, and though what remains online will not be the final copy-edited and typeset version, we (both authors and publishers) feel this is a good compromise to prevent the cannibalization of book sales while still keeping much of the content open and available for those who cannot afford the book or are looking for a taste before they purchase it. Thankfully, this plan also fits well with my various pledges to help make a more open scholarly world.

Microscope / Telescope / Macroscope [via The Macroscope by Joël de Rosnay]
We’re announcing the project several months earlier than we’d initially intended. In light of the American Historical Association’s recent statement endorsing the six year embargo of dissertations on the unsupported claim that it will help career development, we wanted to share our own story to offset the AHA’s narrative. Shawn, Ian, and I have already worked together on a successful open access chapter in The Programming Historian, and have all worked separately releasing public material on our respective blogs. It was largely because of our open material that we were approached to write this book, and indeed much of the material we’ve already posted online will be integrated into the final publication. It would be an understatement to say our publisher’s liaison Alice jumped at this opportunity to experiment with a semi-open publication.

The disadvantage to announcing so early is that we don’t have any content to tease you with. Stay tuned, though. By September, we hope to have some preliminary content up, and we’d love to read your thoughts and comments, especially from those not already aligned with the DH world.

Another Step in Keeping Pledges

Long-time readers of this blog might remember that, a while ago, I pledged to do pretty much Open Everything. Last week, a friend in my department asked how I managed that without having people steal my ideas. It’s a tough question, and I’m still not certain whether my answer has more to do with idealist naïveté or actual forward-thought. Time will tell. As it is, the pool of people doing similar work to mine is small, and they pretty much all know about this blog, so I’m confident the crowd of rabid academics will keep each other in check. Still, I suppose we all have to be on guard for the occasional evil professor, wearing his white lab coat, twirling his startling mustachio,  and just itching to steal the idle musings of a still-very-confused Ph.D. student.

In the interest of keeping up my pledge, I’ve decided to open up yet another document, this time for the purpose of student guidance. In 2010, I applied for the NSF Graduate Research Fellowship Program, a shockingly well-paying program that’ll surely help with the rising (and sometimes prohibitive) costs of graduate school. By several strokes of luck and (I hope) a decent project, the NSF decided to fund me later that year, and I’ve had more time to focus on research ever since. In the interest of helping future applicants, I’ve posted my initial funding proposal on figshare. Over the next few weeks, there are a few other documents and datasets I plan on making public, and I’ll start a new page on this blog that consolidates all the material that I’ve opened, inspired by Ted Underwood’s similar page.

Click to get my NSF proposal.

Do you have grants or funding applications that’ve been accepted? Do you have publications out that are only accessible behind a drastic paywall? I urge you to post preprints, drafts, or whatever else you can to make scholarship a freer and more open endeavor for the benefit of all.

The Networked Structure of Scientific Growth

Well, it looks like Digital Humanities Now scooped me on posting my own article. As some of you may have read, I recently did not submit a paper on the Republic of Letters, opting instead to hold off until I could submit it to a journal which allowed authorial preprint distribution. Preprints are a vital part of rapid knowledge exchange in our ever-quickening world, and while some disciplines have embraced the preprint culture, many others have yet to. I’d love the humanities to embrace that practice, and in the spirit of being the change you want to see in the world, I’ve decided to post a preprint of my Republic of Letters paper, which I will be submitting to another journal in the near future. You can read the full first draft here.

The paper, briefly, is an attempt to contextualize the Republic of Letters and the Scientific Revolution using modern computational methodologies. It draws from secondary sources on the Republic of Letters itself (especially from my old mentor R.A. Hatch), as well as from network analysis in sociology and statistical physics, modeling, human dynamics, and complexity theory. All of this is combined through datasets graciously donated by the Dutch Circulation of Knowledge group and Oxford’s Cultures of Knowledge project, totaling about 100,000 letters worth of metadata. Because it favors large-scale quantitative analysis over an equally important close and qualitative analysis, the paper is a contribution to historiographic methodology rather than historical narrative; that is, it doesn’t say anything particularly novel about history, but it does offer a (fairly) new way of looking at and contextualizing it.

A visualization of the Dutch Republic of Letters using Sci2 & Gephi

At its core, the paper suggests that by looking at how scholarly networks naturally grow and connect, we as historians can have new ways to tease out what was contingent upon the period and situation. It turns out that social networks of a certain topology are basins of attraction similar to those I discussed in Flow and Empty Space. With enough time and any of a variety of facilitating social conditions and technologies, a network similar in shape and influence to the Republic of Letters will almost inevitably form. Armed with this knowledge, we as historians can move back to the microhistories and individuated primary materials to find exactly what those facilitating factors were, who played the key roles in the network, how the network may differ from what was expected, and so forth. Essentially, this method is one base map we can use to navigate and situate historical narrative.

Of course, I make no claims of this being the right way to look at history, or the only quantitative base map we can use. The important point is that it raises new kinds of questions and is one mechanism to facilitate the re-integration of the individual and the longue durée, the close and the distant reading.

The project casts a necessarily wide net. I do not yet, and probably could not ever, have mastery over each and every disciplinary pool I draw from. With that in mind, I welcome comments, suggestions, and criticisms from historians, network analysts, modelers, sociologists, and whoever else cares to weigh in. Whoever helps will get a gracious acknowledgement in the final version, good scholarly karma, and a cookie if we ever meet in person. The draft will be edited and submitted in the coming months, and if you have ideas, please post them in the comment section below. Also, if you use ideas from the paper, please cite it as an unpublished manuscript or, if it gets published, cite that version instead.

On Keeping Pledges

A few months back, I posted a series of pledges about being a good scholarly citizen. Among other things, I pledged to keep my data and code open whenever possible, and to fight to retain the right to distribute materials pending and following their publication. I also signed the Open Access Pledge. Since then, a petition boycotting Elsevier cropped up with very similar goals, and as of this writing has nearly 7,000 signatures.

As a young scholar with as yet no single-authored publications (although one is pending in the forward-thinking Journal of Digital Humanities, which you should all go and peer review), I had to think very carefully in making these pledges. It’s a dangerous world out there for people who aren’t free to publish in whatever journal they like; reducing my publication options is not likely to win me anything but good karma.

With that in mind, I actually was careful never to pledge explicitly that I would not publish in closed access venues; rather, I pledged to “Freely distribute all published material for which I have the right, and to fight to retain those rights in situations where that is not the case.” The pressure of the eventual job market prevented me from saying anything stronger.

Today, my resolve was tested. A recent CFP solicited papers about “Shaping the Republic of Letters: Communication, Correspondence and Networks in Early Modern Europe.” This is, essentially, the exact topic that I’ve been studying and analyzing for the past several years, and I recently finished a draft of a paper on precisely this topic. The paper utilizes methodologies not yet prevalent in the humanities, and I’d like the opportunity to spread the technique as quickly and widely as possible, in the hopes that some might find it useful or at least interesting. I also feel strongly that the early and open dissemination of scholarly production is paramount to a healthy research community.

I e-mailed the editor asking about access rights, and he sent a very kind reply, saying that, unfortunately, any article in the journal must be unpublished (even on the internet), and cannot be republished for two years following its publication. The journal itself is part of a small press, and as such is probably trying to get itself established and sold to libraries, so their reticence is (perhaps) understandable. However, I was faced with a dilemma: submit my article to them, going against the spirit – though not the letter – of my pledge, or risk losing a golden opportunity to submit my first single-authored article to a journal where it would actually fit.

In the end, it was actually the object of my study itself – the Republic of Letters – that convinced me to make a stand and not submit my article. The Republic, a self-titled community of 17th-century scholars communicating widely by post, embodied the ideal of universal citizenship and the free flow of knowledge. While they did not live up to this ideal, in large part because of the technologies of the time, we now are closer to being able to do so. I need to do my part in bringing about this ideal by taking a stand on the issues of open access and dissemination.

The below was my e-mail to the editor:

Many thanks for your fast reply.

Unfortunately, I cannot submit my article unless those conditions are changed. I fear they represent a policy at odds with the past ideals and present realities of scholarly dissemination. The ideals of the Republic of Letters, regarding the free flow of information and universal citizenship, are finally becoming attainable (at least in some parts of the world) with nigh-ubiquitous web access. In a world as rapidly changing as our own, immediate access to the materials of scholarly production is becoming an essential element not just of science, in the English sense of the word, but wissenschaft at large. Numerous studies have shown that the open availability of electronic prints for an article increases readership and citations (both to the author and to the journal), reduces the time to the adoption of new ideas, and facilitates a more rapidly innovating and evolving literature in the scholarly world. While I empathize that you represent a fairly small press and may be worried that the availability of pre-prints would affect sales, 1 I have seen no studies showing this to be the case, although I would of course be open to reading such research if you know of some. In either case, it has been shown that pre-prints at worst do not affect scholarly use and dissemination in the least, and at best increase readership, citation, and impact by up to 250%.

Good luck with your journal, and I look forward to reading the upcoming issue when it becomes available.

It’s a frightening world out there. I considered not posting about this interaction, for fear of the possibility of angering or being blacklisted by the editorial or advisory board of the press, some of whom are respected names in my intended field of study. However, fear is the enemy of change, and the support of Bethany Nowviskie and a host of tweeters convinced me that this was the right thing to do.

With that in mind, I herewith post a draft of my article analyzing the Republic of Letters, currently titled The Networked Structure of Scientific Growth. Please feel free to share it for non-commercial use, citing it if you use it (but making sure to cite the published version if it eventually becomes so), and I’d love your comments if you have any. I’ll dedicate a separate post to this release later, but I figured you all deserved this after reading the whole post.

Notes:

  1. Big thanks to Andrew Simpson for pointing out the error of my ways!

Early Modern Letters Online

Early modern history! Science! Letters! Data! Four of my favoritest things have been combined in this brand new beta release of Early Modern Letters Online from Oxford University.

EMLO Logo

Summary

EMLO (what an adorable acronym, I kind of want to tickle it) is Oxford’s answer to a metadata database (metadatabase?) of, you guessed it, early modern letters. This is pretty much a gold standard metadata project. It’s still in beta, so there are some interface kinks and desirable features not yet implemented, but it has all the right ingredients for a great project:

  • Information is free and open; I’m even told it will be downloadable at some point.
  • Developed by a combination of historians (via Cultures of Knowledge) and librarians (via the Bodleian Library) working in tandem.
  • The interface is fast, easy, and includes faceted browsing.
  • Has a fantastic interface for adding your own data.
  • Actually includes citation guidelines thank you so much.
  • Visualizations for at-a-glance understanding of data.
  • Links to full transcripts, abstracts, and hard-copies where available.
  • Lots of other fantastic things.

Sorry if I go on about how fantastic this catalog is – like I said, I love letters so much. The index itself includes roughly 12,000 people, 4,000 locations, 60,000 letters, 9,000 images, and 26,000 additional comments. It is without a doubt the largest public letters database currently available. Between the data being compiled by this group, along with that of the CKCC in the Netherlands, the Electronic Enlightenment Project at Oxford, Stanford’s Mapping the Republic of Letters project, and R.A. Hatch’s research collection, there will without a doubt soon be hundreds of thousands of letters which can be tracked, read, and analyzed with absolute ease. The mind boggles.

Bodleian Card Catalogue Summaries

Without a doubt, the coolest and most unique feature this project brings to the table is the digitization of the Bodleian Card Catalogue, a fifty-two drawer index-card cabinet filled with summaries of nearly 50,000 letters held in the library, all compiled by the Bodleian staff many years ago. In lieu of full transcriptions, digitizations, or translations, these summary cards are an amazing resource by themselves. Many of the letters in the EMLO collection include these summaries as full-text abstracts.

One of the Bodleian summaries showing Heinsius looking far and wide for primary sources, much like we’re doing right now…

The collection also includes the correspondences of John Aubrey (1,037 letters), Comenius (526), Hartlib (4,589, many including transcripts), Edward Lhwyd (2,139, many including transcripts), Martin Lister (1,141), John Selden (355), and John Wallis (2,002). The advanced search allows you to look for only letters with full transcripts or abstracts available. As someone who’s worked with a lot of letters catalogs of varying qualities, it is refreshing to see this one being upfront about unknown/uncertain values. It would, however, be nice if they included the editor’s best guess of dates and locations, or perhaps inferred locations/dates from the other information available. (For example, if birth and death dates are known, it is likely a letter was not written by someone before or after those dates.)

Visualizations

In the interest of full disclosure, I should note that, much like with the CKCC letters interface, I spent some time working with the Cultures of Knowledge team on visualizations for EMLO. Their group was absolutely fantastic to work with, with impressive resources and outstanding expertise. The result of the collaboration was the integration of visualizations in metadata summaries, the first of which is a simple bar chart showing the number of letters written, received, and mentioned in, per year, for any given individual in the catalog. Besides being useful for getting an at-a-glance idea of the data, these charts actually proved really useful for data cleaning.

Sir Robert Crane (1604-1643)

In the above screenshot from previous versions of the data, Robert Crane is shown to have been addressed letters in the mid 1650s, several years after his reported death. While these could also have been spotted automatically, there are many instances where a few letters are dated very close to a birth or death date, and they often turn out to be mis-reported. Visualizations can be great tools for data cleaning as a form of sanity test. This is the new, corrected version of Robert Crane’s page. They are using d3.js, a fantastic javascript library for building visualizations.
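The same sanity test is easy to script once the metadata is tabular. Below is a minimal sketch in Python, with an illustrative pandas table and death-year lookup standing in for the real EMLO data:

```python
# Minimal sketch: flag letters addressed to someone after their recorded death year.
# The DataFrame and the death-year lookup are illustrative stand-ins, not real data.
import pandas as pd

letters = pd.DataFrame({
    "recipient": ["Robert Crane", "Robert Crane", "John Wallis"],
    "year": [1640, 1656, 1660],
})
death_year = {"Robert Crane": 1643, "John Wallis": 1703}

# Attach each recipient's death year, then keep only the impossible letters.
letters["recipient_death"] = letters["recipient"].map(death_year)
suspect = letters[letters["year"] > letters["recipient_death"]]
print(suspect)  # the 1656 letter to Robert Crane gets flagged for manual review
```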

Because I can’t do anything with letters without looking at them as a network, I decided to put together some visualizations using Sci2 and Gephi. In both cases, the Sci2 tool was used for data preparation and analysis, and the final network was visualized in GUESS and Gephi, respectively. The first graph shows the network in detail with edges, with names visible for the most “central” correspondents. The second visualization is without edges, with each correspondent clustered according to their place in the overall network, with the most prominent figures in each cluster visible.

Built with Sci2/Guess
Built with Sci2/Gephi

The graphs show us that this is not a fully connected network. There are many islands of one or two letters or a small handful of letters. These can be indicative of a prestige bias in the data. That is, the collection contains many letters from the most prestigious correspondents, and increasingly fewer as the prestige of the correspondent decreases. Put another way, there are many letters from a few, and few letters from many. This is a characteristic shared with power law and other “long tail” distributions. The jumbled community structure at the center of the second graph is especially interesting, and it would be worth comparing these communities against institutions and informal societies at the time. Knowledge of large-scale patterns in a network can help determine what sort of analyses are best for the data at hand. More on this in particular will be coming in the next few weeks.
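For anyone who wants to poke at these large-scale patterns themselves, here is a minimal sketch in Python with networkx, using a toy edge list of correspondents in place of the real data:

```python
# Minimal sketch: degree distribution ("many letters from a few, few from many")
# and a simple community partition, computed on a toy correspondence network.
import networkx as nx
from collections import Counter

edges = [("Hartlib", "Comenius"), ("Hartlib", "Wallis"),
         ("Hartlib", "Aubrey"), ("Aubrey", "Lhwyd")]  # toy stand-in data
G = nx.Graph(edges)

# How many correspondents have degree 1, 2, 3, ...? A heavy skew toward
# low degrees is the long-tail shape described above.
degree_counts = Counter(dict(G.degree()).values())
print("degree -> number of correspondents:", dict(degree_counts))

# One of several community-detection options available in networkx.
communities = nx.algorithms.community.greedy_modularity_communities(G)
for i, members in enumerate(communities):
    print(f"community {i}: {sorted(members)}")
```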

It’s also worth pointing out these visualizations as another tool for data-checking. You may notice, on the bottom left-hand corner of the first network visualization, two separate Edward Lhwyds with virtually the same networks of correspondence. This meant there were two distinct entities in their database referring to the same individual – a problem which has since been corrected.
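That check, too, can be scripted: two entities whose sets of correspondents overlap almost completely are good candidates for being the same person. A minimal sketch, again on toy data and with an arbitrary similarity threshold:

```python
# Minimal sketch: flag possible duplicate entities by comparing their correspondents.
# The graph and the 0.8 threshold are illustrative assumptions, not EMLO's method.
from itertools import combinations
import networkx as nx

G = nx.Graph([("Lhwyd", "Lister"), ("Lhwyd", "Aubrey"),
              ("Lhwyd (dup)", "Lister"), ("Lhwyd (dup)", "Aubrey"),
              ("Lister", "Wallis")])

def neighbor_similarity(a, b):
    """Jaccard similarity of two nodes' neighbor sets."""
    na, nb = set(G.neighbors(a)), set(G.neighbors(b))
    return len(na & nb) / len(na | nb) if na | nb else 0.0

for u, v in combinations(G.nodes, 2):
    sim = neighbor_similarity(u, v)
    if sim > 0.8:
        print(f"possible duplicate: {u!r} / {v!r} (similarity {sim:.2f})")
```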

More Letters!

Notice that the EMLO site makes it very clear that they are open to contributions. There are many letters datasets out there, some digitized, some still languishing idly on dead trees, and until they are all combined, we will be limited in the scope of the research possible. We can always use more. If you are in any way responsible for an early-modern letters collection, meta-data or full-text, please help by opening that collection up and making it integrable with the other sets out there. It will do the scholarly world a great service, and get us that much closer to understanding the processes underlying scholarly communication in general. The folks at Oxford are providing a great example, and I look forward to watching this project as it grows and improves.

Pledges

I know I’m a little late to the game, but open access is important year-round, and I only just recently got the chance to write these up. Below are my pledges to open access, which can also be found on the navigation tab above.

The system of pay-to-subscribe journals that spent so many centuries helping the scholarly landscape coordinate and collaborate is now obsolete; a vestigial organ in the body of science.

These days, most universities offer free web access and web hosting. These two elements are necessary, though not sufficient, for a free knowledge economy. We also need peer review (or some other, better form of quality control), improved reputation management (citations++), and some assurance that data/information will last. These come at a cost, but those costs can be paid by the entire scholarly market, and the fruits enjoyed within and without.

If you think open access is important, you should also consider pledging to support open access. Publishing companies have a lot of money invested in keeping things as they are, and only a concerted effort on behalf of the scholars feeding and using the system will be able to change it.

Scholarship is no longer local, and it’s about time our distribution system followed suit.

—-

I pledge to be a good scholarly citizen. This includes:

  • Opening all data generated by me for the purpose of a publication at the time of publication. 1
  • Opening all code generated by me for the purpose of a publication at the time of publication.
  • Freely distributing all published material for which I have the right, and fighting to retain those rights in situations where that is not the case.
  • Fighting for open access of all materials worked on as a co-author, participant in a grant, or consultant on a project.
I pledge to support open access by:
  • Only reviewing for journals which plan to release their publications openly.
  • Donating to free open source software initiatives where I would otherwise have paid for proprietary software.
  • Citing open publications if there is a choice between two otherwise equivalent sources.
I pledge never to let work get in the way of play.
I pledge to give people chocolate occasionally if I think they’re awesome.

Notes:

  1. unless there are human subjects and privacy concerns