Data quality for the bewildered

Over the past year, we’ve been doing a lot of research and development into the data quality challenges publishers face, and working with publishers to address their real-world data issues.

The word ‘bewildered’ feels like a pretty good fit for the current mood around data quality. And that’s not an insult! It’s a complex area: everyone knows they have issues, but it’s very hard to work out what to do about them, or what kind of benefits might be achieved.

We think we’ve now come up with a really nice solution to this challenge: enabling publishers to experiment and feel their way with data clean-up, without knowing what the answer will look like at the outset.

We’re now busy preparing this new product for launch, but in the meantime we’re keen to share the key principles which we think make it something special:

1. Publishers need a ‘toolkit’ approach to data quality, so they can get hands-on with the data and configure, check and control clean-up rules directly. The idea of outsourcing the whole thing doesn’t feel very comfortable when you’re not entirely clear what needs to be done.

2. The tools need to support being ‘bewildered’ as a starting point! They need to let users explore and analyse each table and field interactively, to start to get a feel for what might be wrong and what sort of changes might help (there’s a small profiling sketch after this list).

3. In line with that concept of ‘feeling your way’, clean-up rules need to be small and modular, so users can begin by taking small steps to apply some improvements, and then add more rules incrementally. Rules might include tidying case, removing test/dummy values, formatting dates and numbers, and so on (see the rules sketch after this list). Adding those small steps one by one gradually builds up a strong set of clean-up rules.

4. Applying changes to your customer data feels risky, and so ‘report as you go’ is essential. For each rule added, it must be possible to preview and check exactly which changes are going to be made, and make adjustments accordingly. Only with that sort of immediate feedback can publishers begin to feel confident that the right rules are being applied.

5. And once a good set of clean-up rules has been configured, the process of running those changes needs to be exactly repeatable. Source systems often cannot be changed directly, so it’s essential to plug in an automated clean-up step that can be repeated for each extract, complete with a full audit report of every change made.
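
We haven’t published implementation details yet, but to make the ‘explore and analyse’ idea in point 2 concrete, here’s a minimal sketch of the kind of per-field profiling we mean. It uses pandas, and the table and column names are purely illustrative, not our product’s actual data or API:

```python
import pandas as pd

# Hypothetical customer extract; the columns and values are illustrative only.
customers = pd.DataFrame({
    "first_name": ["alice", "BOB", "Test", "  Carol ", None],
    "signup_date": ["2023-01-05", "05/01/2023", "2023-13-01", None, "2024-02-29"],
})

def profile_field(series: pd.Series) -> dict:
    """Quick summary of a single field: enough to start getting a feel
    for what might be wrong (nulls, odd values, inconsistent formats)."""
    return {
        "non_null": int(series.notna().sum()),
        "nulls": int(series.isna().sum()),
        "distinct": int(series.nunique(dropna=True)),
        "sample_values": series.dropna().unique()[:10].tolist(),
    }

for column in customers.columns:
    print(column, profile_field(customers[column]))
```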
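
And to illustrate points 3 to 5, here’s a sketch of how small, modular rules might be previewed and then applied repeatably with an audit trail. Again, this is our own illustration of the principles, assuming pandas; the rule names and report format aren’t the product’s actual API:

```python
import pandas as pd

def tidy_case(series: pd.Series) -> pd.Series:
    """Small, modular rule: trim whitespace and normalise to Title Case."""
    return series.str.strip().str.title()

def remove_test_values(series: pd.Series) -> pd.Series:
    """Small, modular rule: blank out obvious test/dummy entries."""
    return series.mask(series.str.lower().isin({"test", "dummy"}))

RULES = [("tidy_case", tidy_case), ("remove_test_values", remove_test_values)]

def changed_mask(before: pd.Series, after: pd.Series) -> pd.Series:
    """Rows where a rule actually changes the value (ignoring rows that
    were already null and stay null)."""
    return (before != after) & ~(before.isna() & after.isna())

def preview(series: pd.Series) -> None:
    """'Report as you go': show what each rule would change,
    without touching the source data."""
    current = series
    for name, rule in RULES:
        cleaned = rule(current)
        count = int(changed_mask(current, cleaned).sum())
        print(f"{name}: {count} value(s) would change")
        current = cleaned

def apply_rules(series: pd.Series):
    """Repeatable run: apply the same rules to each new extract and
    return the cleaned field plus a full audit trail of every change."""
    audit, current = [], series
    for name, rule in RULES:
        cleaned = rule(current)
        for idx in current[changed_mask(current, cleaned)].index:
            audit.append({"row": idx, "rule": name,
                          "before": current.loc[idx], "after": cleaned.loc[idx]})
        current = cleaned
    return current, pd.DataFrame(audit)

names = pd.Series(["  alice smith", "BOB JONES", "test", None])
preview(names)                       # check the changes first
cleaned, audit = apply_rules(names)  # then run, keeping the audit report
print(audit)
```

Running the preview first and only then applying the rules reflects the ‘check before you change anything’ workflow described above.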

We already have this new approach fully working in beta, and the results are impressive. In a recent demo we managed – from scratch – to identify, configure, and test rules which made over 20,000 useful improvements to a real customer data table within an hour. We’re very excited about this new product, and we’ll continue to share progress as we now move towards a full launch.