So, blogging a project that didn’t work – good idea or not? Let’s see…
The project was to get the content of the TOCS-IN citation database into the free, open-access bibliographic software Zotero (which David Pettegrew discusses today; his post kicked me over my hesitation about blogging this project). I wanted to do this for two reasons: to draw increased attention to TOCS-IN, which is an excellent, open-access bibliographic resource for Classicists, and make it especially accessible to Zotero users; and to make the TOCS-IN content potentially available as Linked Open Data, because Zotero can export files in BIBO, a linked open data format for bibliographic citations.
My steps were:
1. Get permission from P.M.W. Matheson of the University of Toronto, the manager of the volunteer-driven TOCS-IN project, to use the available data files for this purpose. She was helpful and supportive – thank you!
2. Write a Python script to convert the data files from a custom SGML markup to RIS, a common format for bibliographic citations (supported by Zotero as well as by EndNote and other reference managers). I am not a programmer, but happily my husband is; this piece would not have been possible without his help, although I did big chunks of it All By Myself.
3. Add the RIS-formatted citations to a Zotero Group library. This turned out to be the problem. In theory, there is no limit to the number of bibliographic citations that a Zotero user can store. In practice, once I had uploaded about 40,000 of the ca. 80,000 citations, my Zotero standalone software began freezing every time I attempted to do anything (like stubbornly adding another several thousand citations) and refused to sync with the online Group Library. A question posted in the Zotero forums got swift and helpful confirmation that the sync process simply cannot handle such large datasets well, and that I would not be the only one affected: any users who tried to use this large group library would start crashing their Zotero instances as well.
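For readers curious about what the conversion in step 2 involves, here is a minimal sketch of the general idea, not the script I actually used. The input tag names (`rec`, `au`, `ti`, `jn`, `vol`, `yr`, `pp`) are invented for illustration, since the real TOCS-IN markup is not reproduced in this post; the RIS tags (`TY`, `AU`, `TI`, `JO`, `VL`, `PY`, `SP`, `EP`, `ER`) are standard.

```python
import re

# Hypothetical SGML tag names mapped to standard RIS tags.
# The real TOCS-IN markup is different; these are placeholders.
TAG_TO_RIS = {"au": "AU", "ti": "TI", "jn": "JO", "vol": "VL", "yr": "PY"}

RECORD_RE = re.compile(r"<rec>(.*?)</rec>", re.DOTALL)
FIELD_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def record_to_ris(record_body):
    """Turn the fields of one record into a block of RIS lines."""
    lines = ["TY  - JOUR"]  # every citation is treated as a journal article
    for tag, value in FIELD_RE.findall(record_body):
        value = " ".join(value.split())  # collapse stray whitespace and newlines
        if tag == "pp":
            # Split a page range such as "1-20" into start and end pages.
            start, _, end = value.partition("-")
            lines.append(f"SP  - {start.strip()}")
            if end.strip():
                lines.append(f"EP  - {end.strip()}")
        elif tag in TAG_TO_RIS:
            lines.append(f"{TAG_TO_RIS[tag]}  - {value}")
    lines.append("ER  - ")  # RIS end-of-record marker
    return "\n".join(lines)


def convert(sgml_text):
    """Return one RIS block per <rec> element found in the input text."""
    return [record_to_ris(body) for body in RECORD_RE.findall(sgml_text)]


sample = (
    "<rec><au>Smith, J.</au><ti>A sample\ntitle</ti><jn>Phoenix</jn>"
    "<vol>50</vol><yr>1996</yr><pp>1-20</pp></rec>"
)
print("\n\n".join(convert(sample)))
```

A regular-expression approach like this only works when the markup is simple and regular; anything nested or inconsistent would call for a real parser. Note too that a clean conversion does nothing about the sync problem described in step 3, which comes from the size of the library rather than from the format of the files.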
It’s possible that Zotero, which is actively under development, will eventually support very large citation libraries. Zotero once could not handle even a couple of thousand citations in one library, and now it does so with ease (as, for example, the ASCSA Group Library of 2553 items demonstrates). But it may not be a priority for Zotero’s developers to move in that direction; most people use Zotero for personal citation libraries, not as de facto mirror sites for large bibliographic indices.
I have looked at BibSoup/BibServer, related projects that allow the open-access presentation of bibliographic data online, deal with a wide variety of formats (BibTeX, MARC, RIS, BibJSON, RDF), and are relevant to the Linked Open Data goal of this project (they offer a full RESTful API). I really liked Zotero simply because it is already very popular with humanities-oriented users and likely to become more so (it seems especially popular among graduate students). BibSoup is geared toward STEM academics, and currently only has about 17,000 citations total (and I’m a little hesitant about breaking things after my Zotero experience!); BibServer requires a server and IT chops which I lack. I do think these applications have a lot of potential, but I don’t think they will work for my project right now. I’d welcome an argument on this point, or any other suggestions.
Finally, I’d like to add a quick recap and appreciation of what TOCS-IN is and comprises. TOCS-IN is a bibliographic database that is fully open-access (searchable at Toronto and at Louvain) and entirely crowd-sourced – that is to say, made possible by the contributions of volunteers who transcribe or copy and paste journal tables of contents and format them for inclusion in the database. A list of volunteers is available at the site, as is a list of journals currently needing a volunteer. Do consider joining us; I am currently covering three journals, and the time burden is minimal, especially if the journal publishes its table of contents online (much less typing!).
The basic portion of TOCS-IN is about 80,000 citations, comprising the tables of contents of about 180 journals, all among those indexed by the subscription database L’Année philologique. The project began in 1992, so chronological coverage mostly starts there. A comprehensive list of titles, volumes, and issue numbers is available at the Toronto site. TOCS-IN at Toronto and Louvain currently also searches an additional ca. 56,000 citations, including tables of contents of some TOCS-IN journals dating before 1992 (listed at Louvain), as well as edited volumes, festschrifts, etc. (listed at Toronto).