Clever techniques for comparing lists

Comparing two or more lists is a common task in data analysis, but is not always as simple as it sounds. This month we explore five techniques that can help provide insight when performing list comparisons:

1. Intersections & non-intersections. Finding ‘intersections’ or overlaps between different lists enables the joining of data from different sources. It is also the basis of all ‘AND’ searches, which identify records that share common properties. This can allow the creation of highly targeted segments – e.g. finding contacts who are subscribers AND active AND live in the UK. Non-intersections are records that are unique to each list. These can be especially helpful for finding new prospects – if a contact is on your newsletter list, but not on your list of recent purchasers, this may represent a sales opportunity. Either can be used for data deduplication, which may retain the non-intersections only, or simply remove the intersections. The classic visualisation is Venn diagrams, which represent sets of data as overlapping circles.

2. Deduplicating using generated keys. Where data contains unique identifiers, such as email addresses, finding overlapping and unique records is simple, but the issue is complicated when there are no reliable ‘key’ fields to match on. One solution can be to generate keys by normalising free text values that have the same ‘meaning’ or by joining multiple fields together in order to create uniqueness. For example, an ‘Institution’ field may contain the values ‘Oxford University’, ‘University of Oxford’, and ‘Oxford Univ.’ which could be normalised and linked to a single common key such as ‘OXFORD_UNIVERSITY’. Alternatively, where a contact does not have an email address, their first name, surname and postcode could be joined together to construct a key that would identify them uniquely within the data.

3. Fuzzy matching using ‘edit distance’. In the absence of reliable unique identifiers, inexact or ‘fuzzy’ matching can also be used to match records together. ‘Edit distance’ is a way of assessing the similarity or difference between two terms by measuring the number of changes that would be needed to transform one into the other. For example, ‘Manchester University’ and the typo ‘Mancester University’ have a lower edit distance (requiring a change to a single character – ‘Manc[h]ester’), than ‘Manchester University’ and ‘Chichester University’ (which would require 3 characters to be changed – ‘[Chi]chester’). When used in fuzzy matching, a rule is applied in which terms with an edit distance below a certain level are considered similar enough to be equal.

4. Fuzzy matching using ‘sounds like’. Fuzzy matching can also be based on terms that sound like each other when pronounced out loud, which like edit distance will ignore minor differences in spelling. The most well known implementation of this is ‘Soundex’ matching, which maps names to a four character key based on the way that they sound when spoken (e.g. ‘flower’ and ‘flour’ are both mapped to ‘F460’). This technique was first patented in 1918 and has since been improved upon by alternatives including ‘Metaphone’, which incorporates a wider range of rules and uses variable instead of fixed length keys. One drawback of using ‘sounds like’ matching is that it is language specific – both Soundex and Metaphone are based on English pronunciation, reducing accuracy when the source data contains foreign terms.

5. Frequency analysis. Frequency analysis finds the most and least commonly occurring list items, which is useful in a range of scenarios. When faced with a list of hundreds of codes to map to plain English names, knowing which occur most frequently in the data can help focus attention on the most important cases first. It is also the basis of pie and bar charts, which represent values in segments or bars of sizes proportional to their frequency. In addition, maps with hotspots and cold areas rely on frequency analysis to indicate the density of contacts within a given region.

Related