AI tools are becoming increasingly commonplace. Whatever you may think about the usefulness (or advisability) of some applications of AI, there’s no doubt that it can be a powerful tool, and it’s one with many potential applications in scholarly publishing. Publishers may use AI to improve their submissions processes, find new authors and reviewers, or spot plagiarism, reviewer fraud and potential conflicts of interest.
Unfortunately, the content these applications rely on was generally not built with AI in mind, and even the most sophisticated AI models won’t function well if there are issues with the underlying data. For optimum performance, data needs to be:
- Well-structured, to make it easier to identify patterns and relationships, and to speed up information processing
- Complete, so that patterns and insights are based on the full picture
- Consistent, to avoid misleading or even conflicting conclusions
- Relevant, to ensure that models are trained on appropriate data
The efficient processing and reduced training times enabled by good-quality data not only lead to faster results but can also significantly reduce the high energy consumption for which AI is well-known.
This is where DataSalon comes in. With our long-standing expertise in data handling, there’s lots we can do to help you get your data into shape:
- We make sure data is well-structured by analysing all new data sources to identify quality issues, then applying a range of tools to address those issues. Some data can be collected automatically via APIs, to ensure cleaner structures, and we can also ‘roll up’ data to provide summaries for quicker analysis.
- We set up a complete data infrastructure by linking up data from the various platforms used by publishers, drawing together all the information about individuals and institutions into a single repository.
- We take care of consistency by matching against reference datasets (to standardise key fields and provide persistent identifiers), normalising date fields, and aligning similar information that has been input differently in different systems.
- We ensure relevance by removing invalid data, dropping irrelevant fields, and removing duplicates.
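As a rough illustration of the kind of consistency and deduplication work described above (a minimal sketch, not DataSalon’s actual pipeline — the field names, date formats, and matching key are all assumptions), here is how normalising date fields and collapsing duplicate records might look in Python:

```python
from datetime import datetime

# Hypothetical sample records: the same person entered differently in two systems.
records = [
    {"name": "Smith, J.", "email": "J.SMITH@example.org", "joined": "03/01/2021"},
    {"name": "J. Smith",  "email": "j.smith@example.org", "joined": "2021-01-03"},
]

DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]  # assumed input formats


def normalise_date(value):
    """Align date fields from different systems to a single ISO format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # invalid date: flag rather than guess


def deduplicate(records):
    """Use the lower-cased email address as a (stand-in) persistent key."""
    seen = {}
    for rec in records:
        key = rec["email"].strip().lower()
        cleaned = {**rec, "email": key, "joined": normalise_date(rec["joined"])}
        seen.setdefault(key, cleaned)  # keep the first occurrence of each key
    return list(seen.values())


clean = deduplicate(records)
```

In practice the matching key would come from a reference dataset or persistent identifier (such as an ORCID or ROR ID) rather than a raw email address, but the principle is the same: standardise first, then deduplicate on a stable key.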
Our MasterVision and PaperStack products already offer a huge range of insights and analysis, but if you want to use AI to explore further outside these systems, then we can supply your data back to you with all the linking and tidying in place – not just as a one-off job, but after each data update, to support ongoing data governance.
To find out more, please get in touch for a chat with our Client Services Director.