I have recently launched DBLPfeeds, a simple service providing RSS feeds with the latest papers from over a thousand DBLP-indexed computer science journals and conferences (one RSS feed per journal/conference). I received some positive feedback, which is of course very nice. Paul Groth suggested grouping feeds into categories, e.g. journals and conferences related to AI. I liked this idea (a great feature), but I lacked the required data, i.e., the tags (e.g. “Information Processing and Management is about Digital Libraries”, or “ACM Transactions on (Office) Information Systems is about Information Retrieval”). Here is how I solved the problem.

Data at hand

DBLPfeeds are generated using XML dumps of DBLP, which are available under the ODC-BY 1.0 license. For each article indexed in DBLP I have its title, authors, link to full text, publication year and publication venue (conference or journal). This is exactly what I need to generate RSS feeds for each venue.

There is another valuable source of information: arXiv. Many computer scientists deposit their preprints there before the papers appear in journals or conference proceedings. Articles deposited in arXiv are classified using tags such as cs.AI (for Artificial Intelligence), cs.CC (for Computational Complexity), or cs.DL (for Digital Libraries). For a comprehensive description of the tags go here. Everyone has convenient access to the metadata of articles deposited in arXiv via the OAI-PMH protocol.

To sum up, there are two openly accessible sources of data (DBLP and arXiv), which – combined – contain the information I need.

Merging

I have harvested arXiv using OAI-PMH (metadataPrefix=arXiv, set=cs), which produced approx. 58,000 records. From each record I took the title and the categories starting with “cs”. Next, I combined that with title and venue fields in approx. 2,100,000 records taken from DBLP.
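The extraction step can be sketched roughly as follows. This is only an illustration, not the code I actually ran: the sample response is hand-written and trimmed to the two fields used here, the function name is made up for this example, and the element names and namespaces are assumed to follow the arXiv metadata format returned for a ListRecords request with metadataPrefix=arXiv and set=cs.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed for OAI-PMH responses in the arXiv metadata format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "arxiv": "http://arxiv.org/OAI/arXiv/",
}

# Hand-written sample (illustrative only, not real harvested data).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <title>Proof-Pattern Recognition in ACL2</title>
          <categories>cs.AI cs.LO math.LO</categories>
        </arXiv>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def extract_titles_and_cs_categories(xml_text):
    """Return a list of (title, [cs categories]) pairs, one per record."""
    root = ET.fromstring(xml_text)
    results = []
    for meta in root.iterfind(".//arxiv:arXiv", NS):
        raw_title = meta.findtext("arxiv:title", default="", namespaces=NS)
        # Titles may be wrapped over several lines; collapse the whitespace.
        title = " ".join(raw_title.split())
        # Categories are a space-separated list; keep only the cs ones.
        categories = meta.findtext("arxiv:categories", default="", namespaces=NS).split()
        cs_categories = [c for c in categories if c.startswith("cs.")]
        results.append((title, cs_categories))
    return results


if __name__ == "__main__":
    for title, cats in extract_titles_and_cs_categories(SAMPLE):
        print(title, cats)
```

A real harvest would also have to follow the resumption tokens that OAI-PMH uses to page through large result sets; that part is left out here.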

Figure: Merging DBLP and arXiv records.

To match the titles, I lowercased the strings and removed all non-alphanumeric characters, including whitespace. Thus, for example, “Proof-Pattern Recognition in ACL2” became “proofpatternrecognitioninacl2”. I then counted how many times each venue co-occurs with each arXiv category. The table is available at figshare (CC-0 license).
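In Python, the normalization and counting could look roughly like the sketch below. The function names, the record layout and the placeholder venue "venue/A" are assumptions made for this example, and only ASCII letters and digits are treated as alphanumeric.

```python
import re
from collections import Counter


def normalize_title(title):
    """Lowercase the title and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def count_cooccurrences(dblp_records, arxiv_records):
    """Count (venue, category) pairs for papers whose normalized titles match.

    dblp_records:  iterable of (title, venue) pairs
    arxiv_records: iterable of (title, list of categories) pairs
    """
    # Index arXiv categories by normalized title.
    categories_by_key = {}
    for title, categories in arxiv_records:
        key = normalize_title(title)
        if key:  # skip titles that normalize to an empty string
            categories_by_key.setdefault(key, set()).update(categories)

    # Look up each DBLP title and count one hit per matched category.
    counts = Counter()
    for title, venue in dblp_records:
        for category in categories_by_key.get(normalize_title(title), ()):
            counts[(venue, category)] += 1
    return counts


if __name__ == "__main__":
    # "venue/A" is a placeholder, not the paper's real venue.
    dblp = [("Proof-Pattern Recognition in ACL2.", "venue/A")]
    arxiv = [("Proof-Pattern Recognition in ACL2", ["cs.AI", "cs.LO"])]
    print(normalize_title(arxiv[0][0]))
    print(count_cooccurrences(dblp, arxiv))
```

Since the join key is the title alone, two different papers with the same normalized title end up sharing their categories.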

Finally, I had to select the most representative venues (journals, conferences) for a given tag. I arbitrarily chose the following criterion: a tag is assigned to a venue if at least 30% of the papers at the venue, and at least 5 papers, have the tag.
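The criterion can be written down as a small filter. This is a sketch under one assumption that is not spelled out above: the 30% share is computed against the number of papers at the venue that were matched to an arXiv record. The function name, parameter names and sample numbers are made up for illustration.

```python
def assign_tags(cooccurrences, matched_papers, min_share=0.30, min_count=5):
    """Return {venue: sorted list of tags} for pairs that pass both thresholds.

    cooccurrences:  {(venue, category): number of papers with that category}
    matched_papers: {venue: number of papers matched to arXiv} (assumed denominator)
    """
    tags = {}
    for (venue, category), n in cooccurrences.items():
        total = matched_papers.get(venue, 0)
        # Both conditions must hold: enough papers in absolute terms
        # and a large enough share of the venue's matched papers.
        if total > 0 and n >= min_count and n / total >= min_share:
            tags.setdefault(venue, []).append(category)
    return {venue: sorted(cats) for venue, cats in tags.items()}


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    cooccurrences = {
        ("venue/A", "cs.DL"): 12,  # 12 of 20 = 60%, passes both thresholds
        ("venue/A", "cs.IR"): 4,   # fewer than 5 papers, rejected
        ("venue/B", "cs.AI"): 6,   # 6 of 50 = 12%, below 30%, rejected
    }
    matched_papers = {"venue/A": 20, "venue/B": 50}
    print(assign_tags(cooccurrences, matched_papers))
```

Both thresholds are ordinary parameters, so trying out a different heuristic on the published table only means changing two numbers.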

Summary

Firstly, a trivial observation: Open Access is great. By combining two publicly available data sources I was able to add a nice feature to DBLPfeeds.

Now about methodology: I took a quick and dirty approach, which leaves a lot of room for improvement. The papers are joined by looking at titles only (author names are ignored), so one can easily imagine both false positives and false negatives. I used totally arbitrary criteria for assigning tags, but the complete data set is there, so feel encouraged to find a better heuristic.

One more obvious shortcoming: if a journal does not permit self-archiving, preprints of its papers will not appear on arXiv and, consequently, it will not be tagged with arXiv subject area codes. Oh well, that’s another small reason to go green, I guess ;)

Open code, open data

The code is publicly available on GitHub under a BSD license (take a look at code/tags.sh), while the table of co-occurrences is publicly available on figshare under the CC-0 license.