A review of OpenAlex

Here at DataSalon, we’re always keeping an eye out for new sources that we can use to enhance our clients’ data in our MasterVision and PaperStack services. Recently we’ve been doing some initial exploration of OpenAlex, a dataset whose beta version was launched last year.

What is OpenAlex?

OpenAlex (named after the ancient library of Alexandria) is a free and open catalogue of scholarly papers, researchers, journals and institutions, and of the connections between them. It’s intended as a replacement for Microsoft Academic Graph (which was retired at the end of 2021) and is funded by a charitable grant. It aims to cover all scholarly research worldwide, and currently contains over 200 million records.

What are its sources?

Its main sources of information are Microsoft Academic Graph and Crossref, but there are also others, including ORCID, ROR, and subject and institutional repositories.

How can it be accessed?

The data can be accessed free of charge in two ways: via an API (no authentication is required), or by downloading a database snapshot in JSON Lines format, which is updated about once a month. The snapshot is extremely large, but it is divided into date-stamped compressed files, so once a local copy is set up it can be updated incrementally rather than downloaded in full each time.
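As a rough illustration of the two access routes, here is a minimal Python sketch using only the standard library. The base URL and the `search` and `per-page` query parameters follow our reading of the public API documentation at the time of writing, and the helper names (`build_url`, `fetch_json`, `read_json_lines`) are our own. We also assume that each snapshot part is a gzip-compressed JSON Lines file, so please check the current OpenAlex documentation before relying on any of this.

```python
import gzip
import json
import urllib.parse
import urllib.request

# Assumed base URL of the public API.
API_BASE = "https://api.openalex.org"


def build_url(entity, params=None):
    """Build an API URL for a record type such as 'works' or 'authors'."""
    url = f"{API_BASE}/{entity}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


def fetch_json(url, timeout=30):
    """Fetch a URL and decode its JSON body. This makes a network request."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def read_json_lines(path):
    """Yield one record per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

For example, `fetch_json(build_url("works", {"search": "open access", "per-page": 5}))` should return a page of matching works, while `read_json_lines` can be pointed at a single downloaded snapshot file to process it one record at a time without loading the whole file into memory.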

What types of record does it contain?

OpenAlex has five types of record:

works (journal articles, books, datasets, theses)
authors
venues (where works are hosted – journals, conferences, repositories)
institutions
concepts (providing a hierarchical subject framework for works)
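
Each record has an OpenAlex ID, and as far as we can tell the first letter of the ID indicates which of the five types it belongs to (W, A, V, I or C). The sketch below relies on that assumption; the mapping and the `record_type` helper are ours rather than part of any official client.

```python
# Assumed mapping from the first letter of an OpenAlex ID to its record type.
ID_PREFIX_TO_TYPE = {
    "W": "works",
    "A": "authors",
    "V": "venues",
    "I": "institutions",
    "C": "concepts",
}


def record_type(openalex_id):
    """Return the record type for an ID given as a bare ID or a full URL."""
    # Keep only the last path segment, e.g. 'W123' from '.../W123'.
    short_id = openalex_id.rstrip("/").rsplit("/", 1)[-1]
    if not short_id:
        raise ValueError("empty OpenAlex ID")
    try:
        return ID_PREFIX_TO_TYPE[short_id[0].upper()]
    except KeyError:
        raise ValueError(f"unrecognised OpenAlex ID: {openalex_id!r}") from None
```

A helper like this is handy when following links between records, since it tells you which kind of record an ID refers to before you look it up.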

What fields does it have?

The short answer is that each type of record contains large amounts of metadata about the entity it describes. Potentially of the greatest interest to users of PaperStack are:

citation data, at the work, author and institution level
retraction data – although incomplete, this could have uses in article fraud detection
links between record types – for example, lists of authors’ works give a fuller picture of activity than is available from an individual publisher’s submissions data
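
To make these fields a little more concrete, here is a small sketch that pulls them out of a single work record. The field names (`cited_by_count`, `is_retracted`, `authorships`) are taken from our reading of the OpenAlex works documentation. The sample record is invented purely for illustration, and `summarise_work` is a hypothetical helper of ours.

```python
def summarise_work(work):
    """Extract citation, retraction and author details from a work record."""
    authors = []
    for authorship in work.get("authorships") or []:
        author = authorship.get("author") or {}
        institutions = [
            inst.get("display_name")
            for inst in authorship.get("institutions") or []
        ]
        authors.append({
            "id": author.get("id"),
            "name": author.get("display_name"),
            "institutions": institutions,
        })
    return {
        "id": work.get("id"),
        "title": work.get("display_name"),
        "cited_by_count": work.get("cited_by_count", 0),
        # Retraction data is incomplete, so a missing flag is treated as False.
        "is_retracted": bool(work.get("is_retracted", False)),
        "authors": authors,
    }


# Invented sample record: the IDs, names and counts are placeholders, not real data.
SAMPLE_WORK = {
    "id": "https://openalex.org/W0000000001",
    "display_name": "An example article",
    "cited_by_count": 12,
    "is_retracted": False,
    "authorships": [
        {
            "author": {
                "id": "https://openalex.org/A0000000001",
                "display_name": "A. Researcher",
            },
            "institutions": [{"display_name": "Example University"}],
        }
    ],
}

summary = summarise_work(SAMPLE_WORK)
```

Because every lookup falls back to a default, the same function can be run over records where some of these fields are absent.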

We’re still reviewing whether and how we could use the OpenAlex dataset in PaperStack, and we’re keeping an eye on its sustainability model: we want to be sure it will remain available before building any new functionality that relies on it. If you have any thoughts to contribute to this review, we’d love to hear from you!
