Open data essentials for scholarly publishers

Every scholarly publisher is familiar with the problem of uniquely identifying an entity (be that an organisation, an individual or a piece of content) which may be referred to in different ways in different data sources. Fortunately, a number of free and open reference datasets exist which can help with this.

Identifying organisations

For the past few years, the GRID database has provided the best free coverage of research organisations worldwide. It contains over 101,000 records, with location data (aiding the disambiguation of organisations with the same name), alternative names (helping identify cases where the same organisation has been referred to in different ways), and information about parent/child relationships (placing the organisation in a hierarchical context).

More recently, ROR (the Research Organization Registry) has been set up to provide a standard way for the scholarly community to reference the organisations that employ, fund or publish researchers. This is very similar to GRID (not least because it has so far been based on seed data from GRID), but with the key difference that it is community-led, with the ultimate aim that organisations will maintain their own records. The founders of ROR (who include Digital Science, the producers of GRID) always intended that it would eventually replace GRID in the public domain, and have just announced that the last release of GRID will be in the last quarter of 2021.
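Because ROR was seeded from GRID, a ROR record can carry the GRID ID it corresponds to, which makes a simple crosswalk possible. The following is a minimal sketch rather than a definitive implementation: it assumes records shaped like ROR's v1 JSON dump (an "external_ids" object with a "GRID" entry), so the field names should be checked against the schema version you actually download. The sample records and the function name grid_to_ror are invented for illustration.

```python
import json

# Invented sample shaped like a ROR v1 data dump (a JSON list of records).
# The IDs and names are placeholders, not real registry entries.
SAMPLE_ROR_DUMP = json.loads("""
[
  {"id": "https://ror.org/00example1", "name": "Example University",
   "external_ids": {"GRID": {"preferred": "grid.00000.1", "all": "grid.00000.1"}}},
  {"id": "https://ror.org/00example2", "name": "Example Institute",
   "external_ids": {}}
]
""")


def grid_to_ror(records):
    """Build a {GRID ID: ROR ID} lookup from ROR-style records."""
    lookup = {}
    for record in records:
        grid = record.get("external_ids", {}).get("GRID")
        if not grid:
            continue  # this record has no GRID identifier to map from
        # Assumption: "all" may be a single string or a list of strings.
        grid_ids = grid.get("all") or grid.get("preferred") or []
        if isinstance(grid_ids, str):
            grid_ids = [grid_ids]
        for grid_id in grid_ids:
            lookup[grid_id] = record["id"]
    return lookup


crosswalk = grid_to_ror(SAMPLE_ROR_DUMP)
print(crosswalk.get("grid.00000.1"))
```

With a lookup like this, any existing data keyed on GRID IDs can be re-keyed on ROR IDs, and the GRID IDs that find no match can be listed for manual review.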

A third dataset, Crossref’s Funder Registry, was developed with the aim of standardising the way funders are referred to in research papers. Its approach is similar to that of GRID/ROR, but it differs in scope because of its concentration on funding bodies. Although much smaller than GRID/ROR (currently just under 28,000 records), it contains records they don’t and provides a useful supplement to those datasets.

Identifying individuals

The ORCID registry of researchers provides a similar point of reference for individuals: researchers register for their own personal ID, which they can then use to identify themselves in their article submissions, funding applications, and so on. This too helps with identification (e.g. where first names have been abbreviated differently, or a Chinese name is reordered in western style) and disambiguation (the ID, as well as organisational affiliations and published papers, distinguishes two people of the same name).
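One practical detail: the final character of an ORCID iD is a check digit (calculated with the ISO 7064 MOD 11-2 algorithm), so mistyped IDs can often be caught before any lookup is attempted. The sketch below shows one way this check might be written; the function names are our own, and it only tests that an ID is well formed, not that it has actually been registered.

```python
import re


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the string is a well-formed ORCID iD with a correct check digit."""
    # Accept the bare ID or the full https://orcid.org/... form.
    compact = re.sub(r"^https?://orcid\.org/", "", orcid.strip(), flags=re.IGNORECASE)
    compact = compact.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # True (ORCID's documented example iD)
print(is_valid_orcid("0000-0002-1825-0096"))  # False (wrong check digit)
```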

Identifying content

For the final piece of the puzzle, Crossref acts as a registration agency which assigns DOIs (digital object identifiers) to pieces of scholarly content. Again, this helps address issues of identification (article titles changing between first draft and publication) and disambiguation (articles with identical or similar titles).
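The same DOI often appears in several guises across data sources (as a bare string, with a "doi:" prefix, or as a full doi.org link), and DOIs are case-insensitive, so it usually pays to normalise them before trying to match on them. Here is a minimal sketch of that step. The function name and the loose "starts with 10." shape check are our own assumptions, and the DOIs used are placeholders.

```python
import re

# Prefixes commonly wrapped around a DOI in source data.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)

# Loose shape check only (an assumption, not the full DOI syntax).
_DOI_SHAPE = re.compile(r"10\.\d{4,9}/\S+")


def normalise_doi(raw: str) -> str:
    """Strip common prefixes and lower-case a DOI so variants compare equal."""
    doi = _DOI_PREFIX.sub("", raw.strip()).lower()
    if not _DOI_SHAPE.fullmatch(doi):
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return doi


variants = [
    "10.5555/ABC-123",
    "doi:10.5555/abc-123",
    "https://doi.org/10.5555/Abc-123",
]
print({normalise_doi(v) for v in variants})  # a single normalised value
```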

How can DataSalon help with this?

All these datasets can be integrated into DataSalon products, and our specially developed set of tools will help you to make the most of them.

Using the IDs from these datasets, we can link up data from various sources and provide a single standardised view for each organisation, individual or piece of content.
We can represent the different levels of data, linking pieces of content with their authors and reviewers, and in turn linking those authors and reviewers with the organisations they work for.
In many cases where IDs are missing, our automatching tools still allow data to be linked, through the use of fuzzy searching, email and web domains, and our own lists of abbreviations, synonyms and alternative names.
We can handle all the different data formats involved (e.g. ROR’s JSON files and Funder Registry’s RDF files).
We can use APIs to enrich source data (e.g. the ORCID iD of an author or researcher can be used to retrieve full affiliation data from the ORCID API, while the DOI of an article can be used to pull its final publication date from the Crossref API).
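To give a flavour of how matching without IDs can work, here is a deliberately simplified sketch which combines an email domain check, an abbreviation list and fuzzy name comparison. It is an illustration of the general technique only, not a description of DataSalon's own tools: the reference records, the abbreviation list, the 0.85 threshold and all function names are invented, and Python's standard difflib stands in for whatever fuzzy search a production system would use.

```python
from difflib import SequenceMatcher

# Hypothetical reference records: ID, preferred name, alternative names, web domain.
REFERENCE = [
    {"id": "org-1", "name": "Example University",
     "aliases": ["Example Univ", "EU"], "domain": "example.edu"},
    {"id": "org-2", "name": "Sample Institute of Technology",
     "aliases": ["SIT"], "domain": "example.org"},
]

# Hypothetical abbreviation list used to expand names before comparison.
ABBREVIATIONS = {"univ": "university", "inst": "institute", "tech": "technology"}


def normalise(name):
    """Lower-case, drop simple punctuation and expand known abbreviations."""
    words = name.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def match_organisation(name, email=None, threshold=0.85):
    """Return the ID of the best-matching reference record, or None."""
    # Step 1: an email domain is strong evidence, so try it first.
    if email and "@" in email:
        domain = email.rsplit("@", 1)[1].lower()
        for record in REFERENCE:
            if domain == record["domain"] or domain.endswith("." + record["domain"]):
                return record["id"]
    # Step 2: fuzzy comparison against preferred and alternative names.
    target = normalise(name)
    best_id, best_score = None, 0.0
    for record in REFERENCE:
        for candidate in [record["name"], *record["aliases"]]:
            score = SequenceMatcher(None, target, normalise(candidate)).ratio()
            if score > best_score:
                best_id, best_score = record["id"], score
    return best_id if best_score >= threshold else None


print(match_organisation("Exampel Univ."))                        # misspelt and abbreviated
print(match_organisation("Physics Dept", email="a@mail.example.org"))  # matched by domain
```

The threshold is the main design choice here: set it too low and different organisations are merged, set it too high and genuine variants are missed, so in practice borderline scores are best routed to a person for review.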

We’re constantly refining these tools and keeping up with updates and changes to these datasets. So please do get in touch to discuss how we can use our expertise to help you – whether that’s advising on integrating these datasets with your data, or handling your transition from GRID to ROR.
