DataSalon’s secret sauce revealed

Both of DataSalon’s core services, MasterVision and PaperStack, join together data from multiple sources for complete insight into customer activity and the publishing process. The ‘sauce’ that brings together all these ‘ingredients’ and enriches the whole is automatic matching of records to a reference data set – either Ringgold (where the relevant licence is in place) or ROR (the free alternative).

Matching source data against these reference sets isn’t straightforward for various reasons:

  • Different forms of institution name: free text entry leads to many versions of the same name – different word order, different languages, different combinations of full text and acronyms, and so on.
  • Different forms of state/country name: even if these have been populated from selectable lists, there may be differences such as whether they use codes or full versions, English or the vernacular.
  • Different data structures: different data sources may have different fields containing the relevant information, or even lump it all together into one field (for example, institution name and location data all appearing in a single address field).
  • Data entry errors: there may be typos or incorrect location information.
  • Missing location data: sources don’t always have city, state, or even country fields, but location information can be needed for disambiguation purposes.

With our years of experience in this area, we’ve developed a clever set of ‘automatching’ tools to address these issues. They work their magic by combining several strategies:

  • fuzzy matching to address differences in word order
  • lists of alternative names to supplement the ones in Ringgold and ROR
  • lists of alternative names to ignore so that common acronyms shared by more than one organisation don’t lead to incorrect matches
  • lists of synonyms for typos, US/UK spellings, and so on
  • mappings of state and country codes to full text
  • inferring missing country information from the organisation name
  • matching email and web domains against the organisation URLs in Ringgold or ROR
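
To make the first few strategies more concrete, here is a minimal sketch in Python. It is an illustration only, not DataSalon’s actual implementation: the synonym list, the ignore list, the reference IDs, the similarity threshold, and the function names are all hypothetical, and the similarity measure is simply the standard library’s difflib ratio applied to alphabetically sorted words so that word order stops mattering.

```python
from difflib import SequenceMatcher

# Hypothetical synonym list: typos and US/UK spellings mapped to one form.
SYNONYMS = {"univeristy": "university", "centre": "center", "univ": "university"}

# Hypothetical ignore list: acronyms shared by more than one organisation.
AMBIGUOUS_NAMES = {"ucl", "usc"}


def normalise(name):
    """Lowercase, drop punctuation, apply synonyms, and sort the words."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    tokens = [SYNONYMS.get(tok, tok) for tok in cleaned.split()]
    return " ".join(sorted(tokens))


def best_match(source_name, reference, threshold=0.85):
    """Return the ID of the most similar reference name, or None."""
    key = normalise(source_name)
    if key in AMBIGUOUS_NAMES:
        return None  # too ambiguous to match on the name alone
    best_id, best_score = None, 0.0
    for ref_id, ref_name in reference.items():
        score = SequenceMatcher(None, key, normalise(ref_name)).ratio()
        if score > best_score:
            best_id, best_score = ref_id, score
    return best_id if best_score >= threshold else None


# Made-up reference set; real Ringgold or ROR IDs look different.
reference = {
    "ref-001": "University of Oxford",
    "ref-002": "Oxford Brookes University",
}

print(best_match("Oxford, Univeristy of", reference))  # → ref-001
print(best_match("UCL", reference))                    # → None
```

Sorting the words before comparing is one simple way to make a match insensitive to word order; a production system would likely need something more forgiving for partial names and acronyms.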

The various strategies can be applied to each different data source as appropriate, and the relevant fields for the automatching process can be specified separately for each data source.
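
As a rough sketch of what per-source configuration might look like, the Python below maps each data source to the fields it should use, then applies domain matching and a country-code lookup accordingly. Everything here is assumed for illustration: the source names, field names, reference IDs, and the example.edu domain are hypothetical, and this is not a description of how MasterVision or PaperStack are configured internally.

```python
from urllib.parse import urlparse

# Hypothetical mapping of country codes to full text.
COUNTRY_CODES = {"GB": "United Kingdom", "US": "United States", "DE": "Germany"}

# Hypothetical per-source settings: which field holds which information.
SOURCE_CONFIG = {
    "submissions": {"domain_field": "author_email", "country_field": "country_code"},
    "subscriptions": {"domain_field": "contact", "country_field": None},
}


def domain_of(value):
    """Extract a bare host name from an email address or a URL."""
    value = value.strip().lower()
    if "@" in value:
        host = value.rsplit("@", 1)[1]
    else:
        # urlparse only finds a host when the value contains "//".
        host = urlparse(value if "//" in value else "//" + value).hostname or ""
    return host[4:] if host.startswith("www.") else host


def match_by_domain(value, reference_urls):
    """Return the reference ID whose URL host equals or contains the host."""
    host = domain_of(value)
    if not host:
        return None
    for ref_id, url in reference_urls.items():
        ref_host = domain_of(url)
        if ref_host and (host == ref_host or host.endswith("." + ref_host)):
            return ref_id
    return None


def automatch(record, source, reference_urls):
    """Apply the strategies configured for this data source to one record."""
    cfg = SOURCE_CONFIG[source]
    ref_id = match_by_domain(record.get(cfg["domain_field"], ""), reference_urls)
    country_field = cfg["country_field"]
    country = COUNTRY_CODES.get(record.get(country_field, "")) if country_field else None
    return {"ref_id": ref_id, "country": country}


reference_urls = {"ref-001": "https://www.example.edu/"}
record = {"author_email": "a.smith@physics.example.edu", "country_code": "GB"}
print(automatch(record, "submissions", reference_urls))
```

Keeping the field names in a per-source table means a new data source can be added by describing its layout rather than by writing new matching logic.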

This automatching underpins all the functionality in MasterVision and PaperStack – it supports:

  • Single view of all activity by an institution – we can link up records by Ringgold or ROR ID even if they don’t include any other IDs, or if multiple customer IDs have mistakenly been assigned to the same institution.
  • Data standardisation – we use the name and location information from Ringgold or ROR, so that the format is consistent across all data sources.
  • Rolling up data – we can link up each individual record to its associated institution, so that data available at the individual level (such as article submissions or pay per view purchases) can be analysed at the institutional level.
  • Hierarchical view – we can show the relationships between records, in a family-tree style display of parent and child organisations.
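
Rolling up and the hierarchical view can be pictured with a small Python sketch. It assumes each record has already been matched to a reference ID and that parent and child relationships are available as a simple lookup; the IDs, field names, and function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical hierarchy: child organisation ID -> parent organisation ID.
PARENT_OF = {"ref-002": "ref-001", "ref-003": "ref-001"}

# Individual-level records (for example article submissions), already matched.
records = [
    {"person": "A", "ref_id": "ref-002"},
    {"person": "B", "ref_id": "ref-002"},
    {"person": "C", "ref_id": "ref-003"},
    {"person": "D", "ref_id": "ref-004"},
    {"person": "E", "ref_id": None},  # unmatched record, left out of the totals
]


def roll_up(records):
    """Count individual records per matched institution."""
    return Counter(r["ref_id"] for r in records if r["ref_id"])


def roll_up_to_top(counts, parent_of):
    """Add each institution's count to its top-level parent."""
    totals = Counter()
    for ref_id, n in counts.items():
        top, seen = ref_id, set()
        # Walk up the tree; the seen set guards against accidental cycles.
        while top in parent_of and top not in seen:
            seen.add(top)
            top = parent_of[top]
        totals[top] += n
    return totals


by_institution = roll_up(records)
by_family = roll_up_to_top(by_institution, PARENT_OF)
print(dict(by_institution))  # → {'ref-002': 2, 'ref-003': 1, 'ref-004': 1}
print(dict(by_family))       # → {'ref-001': 3, 'ref-004': 1}
```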

To find out more about how automatching can add value to your data, do get in touch to arrange a demo.