Auditing data quality the easy way

One of the first steps on the path to data quality enlightenment is to audit the quality of your data. It’s useful to know the current state of play, to work out which data sources need your attention the most. There are two main approaches you can take to auditing:

Manual audit

You will have staff who work closely with your data, and they will have a pretty good idea of where poor data quality might impact on their job performance. This might be customer service staff who, in looking up customer records, have a good feel for the level of account duplication. It could be marketing staff, who know that their response rates take a dive if they include email contacts from a certain source in their campaigns.

The point is that your staff already have a wealth of knowledge about how poor data quality impacts on their jobs. By working backwards from there, you can start to uncover some of the underlying data quality problems.

However, no single person or group of people can possibly have an in-depth understanding of all of your data sources and the quality of each one. This is where an automated auditing process can reap rewards.

Automated audit

An automated data quality audit has a number of advantages that will help you to understand the broader picture:

  • An automated audit can cover a lot of data sources at once, highlighting quality issues across multiple data sets and potentially millions of records. Many of our clients are already seeing the benefits of this type of large-scale audit.
  • Automation can also apply consistent checks to every source, resulting in a set of metrics or KPIs that you can use to get a good understanding of your overall data quality score. We use traffic light indicators and a system that takes account of priority fields within each source to make that score as clear and meaningful as possible.

  • If you want the detail as well as the overview, automated auditing and reporting can allow you to drill right down to see problem values in individual fields. Invalid emails and junk name entries are just two of the many types of data error we report on.
  • Automated auditing can also be repeated, so that you can track your data quality profile over time. We repeat the audit each month and provide twelve months of past data as standard in MasterVision DQ.
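To make the idea of a weighted score with traffic lights a little more concrete, here is a minimal Python sketch. It is an illustration only, not how MasterVision DQ actually calculates its scores: the `FieldCheck` structure, the double weighting for priority fields, and the 90/70 thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FieldCheck:
    """Result of one consistency check on one field (hypothetical structure)."""
    name: str
    valid: int       # number of records that passed the check
    total: int       # number of records checked
    priority: bool   # assumed flag: priority fields count for more


def source_score(checks, priority_weight=2.0):
    """Weighted percentage of valid values across all checked fields."""
    weighted_valid = 0.0
    weighted_total = 0.0
    for check in checks:
        weight = priority_weight if check.priority else 1.0
        weighted_valid += weight * check.valid
        weighted_total += weight * check.total
    if weighted_total == 0:
        return 0.0
    return 100.0 * weighted_valid / weighted_total


def traffic_light(score, green=90.0, amber=70.0):
    """Map a percentage score to a traffic light (thresholds are assumptions)."""
    if score >= green:
        return "green"
    if score >= amber:
        return "amber"
    return "red"


# Example: email is treated as a priority field, name is not.
checks = [
    FieldCheck("email", valid=80, total=100, priority=True),
    FieldCheck("name", valid=95, total=100, priority=False),
]
score = source_score(checks)   # (2*80 + 95) / (2*100 + 100) = 85%
light = traffic_light(score)   # "amber" with the thresholds above
```

Running the same checks each month and storing the scores would give the kind of trend over time described above.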

It’s important not to discount the manual approach to uncovering data quality issues, but to get a truly comprehensive picture of the quality of your data, an automated audit like the one offered by MasterVision DQ is the way to go. You can find out more about MasterVision DQ by taking the tour.

The data marathon

I’m currently training for a marathon, and have been putting off sorting out the tons of running data gathered by my GPS watch. I haven’t been practising what I preach when it comes to data management, and my continued procrastination has been holding back my performance.

I have a number of training routes, all measured out in my head. However, I’ve always wondered why my 10k races were never as fast as my 10k training runs. After finally biting the bullet and working through my logged data, I found that my training runs were actually 9.1k! My training strategy was based on flawed assumptions, and not informed by hard data. A sure-fire way to get disappointing results!

No pain no gain

I put a lot of effort into my running, but – just like in any area of life or business – neglecting the detail can put that end performance in jeopardy. It can be a pain to get the preparation right, and to trawl through the detail to find the best strategies, but it’s well worth it.

Tracking

So, how am I performing? Well, I could do better! Taking time out to see what my data is saying reveals a very inconsistent picture, and highlights a need to track my progress better, and monitor what my data is telling me. Is it working effectively for me? Are things improving over time? We all assume we know everything about ourselves (and our customers), but checking and tracking that data might just throw up some surprises.

It’s a marathon, not a sprint!

We all know that digging into this level of detail can be one of those jobs that gets put off in favour of other ‘sexier’ tasks, and it can take a lot of time – but giving data the attention it deserves will give you a better understanding of your performance, and help to inform future strategies to achieve more successful results.

Introducing our new tagline…

Recently we’ve been working on a new tagline – one which sums up everything we strive to achieve at DataSalon, on behalf of the many publishers we work with. We’re now pleased to unveil the result:

Better data. Better insight. Better business. 

We think this neatly captures what we believe in, and soon it will begin to appear on our website and other materials. Here’s a little ‘behind the scenes’ summary of our thought process in choosing this:

Better data

Every publisher is awash with data about authors, subscriptions, usage, and a whole lot more. But in order to make good use of it, all that data needs to be clean, correct, and trusted. This part of the tagline references the tools and expertise we provide to help solve those difficult challenges of data quality, data cleansing, and de-duplication.

Better insight

With MasterVision we help publishers turn information into insight by connecting up customer data from many different source systems into a single view. This is particularly important for management and marketing teams, who need quick access to a complete 360° view for every individual and institution, with tools which make it easy to search, segment, and visualise all of that information.

Better business

Everything we do revolves around supporting the bottom line for our clients. ‘Better business’ refers to our track record of helping publishers to mine their customer data to drive revenue: by securing renewals, identifying strong new sales opportunities, and supporting strategic planning with accurate information about broader trends in author, customer, and usage activity.

So there you have it. We’re really passionate about this stuff, and hopefully our new tagline will help all of our clients (both present and future) to share this broader vision of what DataSalon is all about.

Our first webinar

We dipped our toes into the world of webinars recently, hosting a free session on the topic of data quality. This was the first webinar we’ve hosted and as such was a bit of an experiment and learning curve for us. So what did we learn?

On the plus side it was great to be able to address a global audience from the convenience of our office. The web truly does make the world a smaller place in some ways. Attending a webinar doesn’t represent the same commitment as attending a conference, seminar or even travelling to a meeting. Because of this we were able to attract attendees who might otherwise not have had the time to spare for a talk. We also think that the audience as a whole was more focused on the topic we wanted to talk about than might be the case at a conference where there is a variety of talks and speakers.

There were some challenges as well. As a speaker it was difficult doing a presentation over the web without the audio and visual feedback you would normally get from a ‘live’ audience. That certainly took a bit of getting used to; I couldn’t tell if anyone was laughing at my gags! We also weren’t sure how to field questions either during or at the end of the webinar, and so opted not to try – instead asking for questions to be sent through after the event. In hindsight I would like to have had the opportunity for more direct feedback and so we will consider how we might facilitate that for any future events.

Overall I enjoyed the experience and feedback from those who attended has been positive. We’ll be thinking about other topics that could make for an interesting webinar in the future, so watch this space.

Big plans for 2014

With the start of a new year inevitably comes some thinking about strategy for the next twelve months. We’re no different here at DataSalon, and in December we all sat down and had a good old chinwag about what we should focus on for the coming year. Here’s what we came up with as our ‘themes’ for 2014.

Data quality

This will come as no surprise to anyone who has been following this blog, but data quality will be a big theme for us in 2014. It’s high on our list as we’re passionate about doing our bit to raise the issue of data quality within publishing, because of the potential for it to improve the overall level of service and communication within the industry. We’re also rolling out our own data quality service MasterVision DQ to more and more of our customers, so further developing the scope and functionality of that module is something we will focus on early in 2014.

Customer identity

Customer identity is another topic that is close to our hearts. We’ve mentioned the developments underway with personal and institutional identifiers a few times on this blog, and we’re looking closely at this area in 2014. In particular we look forward to our clients making greater use of ORCIDs, and therefore feeding that data through to their MasterVision sites. We’re also looking closely at the ISNI identifier as an open and ‘bridge’ identifier for institutions. The ability of ISNIs to work in conjunction with other identifiers – and therefore potentially to link up individuals and institutions as well as connect different metadata sets – is an exciting one. Watch this space, as there are sure to be some interesting developments in this area in 2014.

Single customer view and analysis

Lastly, we don’t want to lose track of what lies at the centre of our business – providing publishers with customer insight and intelligence via a single customer view. We’re not resting on our laurels here, and have some ideas about how to make the customer data integration already done within MasterVision even more useful for our clients. We don’t want to give away too much just yet, but we’re currently hard at work on some new visual reporting within MasterVision which will provide an even better understanding of customers, segments, and trends for our publishing clients.

Whilst we can’t predict the future, we can predict that 2014 will be a busy year for us, and we’re already hard at work on these new developments.

Forthcoming Events

Our Client Director, Colin Meddings, will be speaking at some forthcoming events in 2014, so don’t forget to put these dates in your diary.

ALPSP Seminar – January

First up is a reminder about the seminar organised by the Association of Learned and Professional Society Publishers entitled ‘Data, the universe and everything’. The seminar is on Wednesday 22 January in London. As well as Colin giving an introduction to why data quality matters for publishers, topics covered on the day also include personal and institutional identifiers, data in an open access world and some case studies from publishers tackling data issues. It should be an interesting event and we look forward to hearing about many data-related topics close to our hearts.

[Update: Now that this event has happened you can read all about it on the ALPSP blog.] 

UKSG Annual Conference – April

If you miss the ALPSP seminar then there will be a second chance to hear about why data quality is important at the UKSG Annual Conference in April. The conference is always one of our annual highlights and this year Colin will be jointly presenting a breakout session on ‘Scholarly publishing’s dirty secret: why data quality matters, and what you can do about it’. The session will again provide an overview of the importance of data quality to publishers and in addition will include a practical case study from BMJ on implementing a data quality and governance initiative.

Free Webinar – February

Finally, if you aren’t able to get out of the office to attend one of the above events we will be presenting a free webinar on Thursday 13th February at 15.00 GMT. The webinar is entitled ‘Customer Data Quality for Scholarly Publishers – why you should care and what you can do’. It will cover the value of data quality to publishers and will provide an overview of our own data quality product MasterVision DQ. Details of how to sign up for the webinar can be found on our website.

Your specialist subject?

We’ve recently been analysing client data sources through our data quality module, MasterVision DQ, and one of the findings we were interested to note was the limited information available for individuals’ subject interests.

This is a key area; having access to details of a customer’s interests enables you to create more accurate lists of recipients for targeted marketing campaigns, and also helps you identify potential authors/reviewers by their areas of expertise.

It’s possible that this information is missing because customers were able to bypass the ‘interests’ field when they originally signed up – there are a few improvements to your registration forms you could consider to address this:

  • Changing the ‘interests’ field to a pre-populated list of subject areas will make it a quick and easy task for users to make a selection, with the added benefit to you of structured categories instead of free-text inconsistencies in your data.
  • Making this type of important field ‘required’ on registration forms will ensure the relevant subject info is supplied.
  • For existing users, introduce ‘progressive profiling’ to request the missing information when they next log in to your site.

Of course, changes such as these can take time to implement, but MasterVision can help in the meantime: by inferring subject interests from other sources relating to individuals who have interacted with a particular product, it’s possible to cross-populate customer records with this info.

For example, an individual has subscribed to the journal ‘Econometric Theory’, but their registration data contains no specific subject interests. Using the subject categorisation of that journal from the relevant subs data, we can infer that a subject interest for this user would be ‘Economics’.
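The inference described in this example can be sketched in Python as follows. This is a simplified illustration rather than MasterVision’s actual logic: the function name and the three dictionaries (declared interests, subscriptions, and journal subject categories) are hypothetical structures chosen for the example.

```python
def infer_interests(customers, subscriptions, journal_subjects):
    """Fill in missing subject interests from subscription data.

    customers:        customer id -> list of declared interests (may be empty)
    subscriptions:    customer id -> list of journal titles subscribed to
    journal_subjects: journal title -> subject category
    Returns customer id -> sorted list of interests.
    """
    enriched = {}
    for customer_id, declared in customers.items():
        if declared:
            # Keep what the customer told us; only fill genuine gaps.
            enriched[customer_id] = sorted(set(declared))
            continue
        inferred = {
            journal_subjects[journal]
            for journal in subscriptions.get(customer_id, [])
            if journal in journal_subjects
        }
        enriched[customer_id] = sorted(inferred)
    return enriched


# Example data (customer ids are made up for illustration).
customers = {"c1": [], "c2": ["History"]}
subscriptions = {"c1": ["Econometric Theory"]}
journal_subjects = {"Econometric Theory": "Economics"}

enriched = infer_interests(customers, subscriptions, journal_subjects)
# c1 had no declared interests, so 'Economics' is inferred from the journal;
# c2 keeps the declared interest 'History'.
```

In practice it would probably be worth keeping inferred interests distinguishable from declared ones, since the two carry different levels of certainty.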

This enriched view of your customer data will help you gain a better understanding of the subject interests of whole groups of customers you may otherwise have missed when identifying opportunities for effective marketing campaigns, or seeking prospective subject experts.

The wacky world of organisational IDs

We’ve been reading the recent CASRAI-UK Organisational ID report on institutional identifiers in the UK with interest. The report was commissioned by JISC to provide a landscape review of organisational identifiers currently used, and makes for interesting reading.

The XKCD cartoon included in the report seems particularly apt, given that it lists no fewer than 23 different institutional identifiers currently in use by various entities within scholarly research and communications. Here are some of the points we thought were of particular interest from the report:

  • Here at DataSalon we work mostly with scholarly publishers and their customer data, and hence tend to focus on the use cases that publishers have for identifiers. This report makes it clear that there are many more use cases for identifiers across the whole scholarly research community including for libraries, funders, regulators, administrators and more. It also highlights just how many different identifiers are in use. That’s worth bearing in mind when thinking about integrating identifiers for one particular purpose.
  • The report also shows the limitations of many of those identifiers in terms of worldwide coverage, with only nine listed as being global rather than UK or EU specific. It seems clear that most publishing and research is done in a global context, and so identifiers without global coverage will only ever be of limited use to the whole research and publishing community. It also suggests that a ‘linking’ ID may have a role to play in making sense of the current proliferation of different IDs.
  • The report reinforces some of the key attributes required of a successful identifier (including trust, transparency and governance), as well as making good points about how much metadata is (or isn’t) appropriate to include, and the temporal nature of institutional identities.

We see differing use cases for identifiers within MasterVision from our clients. Sometimes a simple ID that identifies that two records are actually the same institution is all that’s required. Other use cases demand a much richer set of associated metadata to identify research specialisms or other attributes.

Here, the point made by the report that ‘The authority can remain separate from the identifier (for example, it would be feasible to establish an authority list with appropriate metadata but using the ISNI as the identifier)’ is important. It means that a standard identifier could emerge that links to multiple sets of data about institutions, and hence many different metadata sources could be called upon using the same ID for varying use cases.

The current front-runner for that role as a standard linking ID is the ISNI. On behalf of our clients, we’re keeping an eye on developments such as the linking of ISNI and ORCID via a forthcoming affiliations module. If you’d like to find out more about the role and use of identifiers, it will be one of the topics covered in an ALPSP seminar entitled ‘Data, the universe and everything’ at which our very own Client Director, Colin Meddings, will be speaking in January 2014.

What is a “valid” email address?

We have recently spent a lot of time looking at data quality issues in publisher data. Email addresses are a key piece of contact info, and essential for online marketing campaigns, so it’s particularly important that these are present and correct. But what exactly is a “valid” email address?

Interestingly, the answer is not as straightforward as it might sound. As a starting point, there are the official specs (e.g. RFC 2822), which define what syntax and characters are allowed. But – would you be surprised to see that all of the following are formally valid according to those?

  • postbox@com (no dot)
  • "very.unusual.@.unusual.com"@example.com (two @ signs)
  • !#$%&'*+-/=?^_`{}|~@example.com (no alphanumeric chars in first part)
  • " "@example.com (contains a space)
  • üñîçøðé@example.com (Unicode characters in first part)

Examples courtesy of Wikipedia

In the real world, these addresses would most likely be rejected when signing up to online systems, which commonly have their own ideas about what is “valid”.

On the flip side, many “normally” formatted email addresses may in fact look suspicious on closer inspection:

  • a@example.com (single letter in first part)
  • bbb@example.com (repeated letters in first part)
  • test@example.com (test address)
  • dummy123@example.com (dummy address)

In these cases, it’s likely that users have entered a made-up address to fast-track the registration process. Alternatively, the addresses may have been submitted automatically by a “bot” creating fake accounts, which may need some further investigation.
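Checks for suspicious-looking addresses like those listed above can be sketched in a few lines of Python. This is a deliberately simple illustration and not the rule set we use in MasterVision: the function name, the flag labels, and the list of placeholder words are all assumptions made for the example.

```python
# Hypothetical list of words that suggest a throwaway address.
PLACEHOLDER_WORDS = ("test", "dummy", "fake", "example", "noreply")


def suspicion_flags(address):
    """Return a list of reasons why an address looks suspicious (empty if none)."""
    local, separator, domain = address.rpartition("@")
    if not separator or not local or not domain:
        return ["malformed"]

    local = local.lower()
    flags = []
    if len(local) == 1:
        flags.append("single-character")
    if len(local) > 1 and len(set(local)) == 1:
        flags.append("repeated-character")
    # Ignore trailing digits so that 'dummy123' is treated like 'dummy'.
    if local.rstrip("0123456789") in PLACEHOLDER_WORDS:
        flags.append("placeholder-word")
    return flags


suspicion_flags("a@example.com")           # ['single-character']
suspicion_flags("bbb@example.com")         # ['repeated-character']
suspicion_flags("dummy123@example.com")    # ['placeholder-word']
suspicion_flags("john.smith@company.com")  # []
```

A flag here only means “worth a closer look”: plenty of genuine addresses have short local parts, so rules like these are better used for prioritising review than for automatic rejection.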

As an extra complication, even an email address that looks fine at face value may not be deliverable. For example, john.smith@company.com may not be reachable because that company no longer exists. In this case, it’s possible to check whether a domain exists and accepts mail.

But what if the individual has simply changed jobs, meaning that this account no longer exists at the company? In that case, the “validity” of the address can only be tested by actually sending an email to it, which may then be returned to the sender and flagged as a “bounce”.

However, in other cases the message could appear to have been delivered OK but not received by the intended recipient. Here, looking for a history of non-opens and non-clicks can help, but isn’t foolproof – given certain user settings, messages can be opened without being tracked.

These are all issues to be aware of when using the phrase “valid” email. Since this may be interpreted differently depending on who – or which system – you are talking to, it’s important to be clear exactly what you mean.

Here at DataSalon we’ve been working on identifying “valid” email addresses for many years and have developed sophisticated rules to identify incorrect and suspect values. This is important when using email as part of a personal identifier in MasterVision, or when cleansing customer data using MasterVision DQ. You can contact us to find out more.

Hierarchical searching

Anyone who works with customer records in scholarly publishing will know that a complicating factor can be the hierarchical relationships between individuals, departments, institutions and consortia. For a long time now it has been possible to use MasterVision to search and analyse the hierarchical relationships between customer records. We are pleased to say this has now become even easier!

As we know, most institutions sit within a complex network of relationships. A search in MasterVision for the University of Oxford can be extended to include parent entities higher up the family tree (such as consortia) or subsidiary organisations (such as departments, libraries or faculties). You can also search for institutional customers and then switch to find those individual people who have a relationship with those institutions.

MasterVision’s newly enhanced ‘Find related…’ functionality allows for an easy and powerful way to search these relationships.

For example, if we were to run a search for a set of institutions, we can now follow the “Include related institutions” link to a clearly illustrated set of options (below) which allows us to define the hierarchical relationships we wish to apply to our search results and therefore include in a new search.

[Screenshot: the ‘Include related institutions’ options]

We can also use our set of institutional search results to find related individuals. Again, we can choose to find those who have a direct relationship with our institutions, or specify whether we would like to include those that may be connected to them at other levels in the hierarchy.

[Screenshot: the options for finding related individuals]

This can also be turned around so that starting from a search for individuals, we can find related institutions including parent or subsidiary organisations.
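For readers who like to see the underlying idea, hierarchical searches of this kind boil down to walking up and down a tree of parent and child records. The Python sketch below shows one way that could work; it is an illustration only, not MasterVision’s implementation, and the function names, the `parent_of` and `affiliation` mappings, and ‘Example Consortium’ are all hypothetical.

```python
from collections import defaultdict


def build_children(parent_of):
    """Invert a child -> parent mapping into parent -> list of children."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    return children


def ancestors(node, parent_of):
    """All parent entities above a record, nearest first."""
    found = []
    seen = {node}
    while node in parent_of and parent_of[node] not in seen:
        node = parent_of[node]
        seen.add(node)
        found.append(node)
    return found


def descendants(node, children):
    """All subsidiary records below a record, at any depth."""
    found = []
    seen = {node}
    stack = list(children.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against accidental loops in the data
        seen.add(current)
        found.append(current)
        stack.extend(children.get(current, []))
    return found


def related_individuals(institutions, affiliation, parent_of,
                        include_subsidiaries=False):
    """Individuals linked directly to the institutions, optionally including
    those linked to subsidiaries lower in the hierarchy."""
    targets = set(institutions)
    if include_subsidiaries:
        children = build_children(parent_of)
        for institution in institutions:
            targets.update(descendants(institution, children))
    return sorted(person for person, inst in affiliation.items()
                  if inst in targets)


# Example data (the consortium and individuals are made up).
parent_of = {
    "University of Oxford": "Example Consortium",
    "Bodleian Libraries": "University of Oxford",
    "Department of Economics": "University of Oxford",
}
affiliation = {"alice": "Department of Economics", "bob": "University of Oxford"}

related_individuals(["University of Oxford"], affiliation, parent_of)
# ['bob']
related_individuals(["University of Oxford"], affiliation, parent_of,
                    include_subsidiaries=True)
# ['alice', 'bob']
```

The same two helper functions cover the reverse direction too: starting from an individual’s institution, `ancestors` finds the parent organisations and `descendants` the subsidiaries.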

Being able to quickly and easily analyse the relationships in customer data is vital when doing market share analysis; working out who should be considered ‘sold to’, when going after new subscriptions; or relating activities, such as article submissions, back to a given institution.

We think that this is a simple, easy-to-understand way to carry out very complex hierarchical searches. If you are a MasterVision client who wants to know more about this new functionality, contact your account manager. If you aren’t yet a MasterVision client and would like to be able to analyse your customer relationships this easily, simply contact us for a demo.
