Deduplication


What is Deduplication?

Deduplication can mean different things in different contexts. Here we look at its role in identifying database records that, based on programmatic rules and algorithms, can be considered to be the same.

The deduplication of name and address data was around long before computers. One of the earliest phonetic algorithms was Soundex, which was created to index US census data. Soundex was developed by Robert Russell and Margaret Odell and patented in 1918. The Soundex code for a name consists of a letter followed by three digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Vowels are dropped, except where a vowel is the first letter of the name. Similar sounding consonants share the same digit so, for example, B, F, P and V are all encoded as 1. A similar algorithm called "Reverse Soundex" prefixes the last letter of the name instead of the first.
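The encoding described above can be sketched in a few lines of Python. This is a minimal illustration of the commonly documented American Soundex rules, not any vendor's implementation; the function name soundex and the CODES table are names chosen for this example. It assumes the usual refinements that adjacent letters sharing a digit are coded once, that H and W do not separate such letters, and that short codes are padded with zeros.

```python
# Build the consonant-to-digit table: similar sounding consonants share a digit.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """Return a four-character Soundex code (letter + three digits)."""
    # Keep only the 26 unaccented letters A-Z; everything else is ignored.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""

    first = letters[0]
    digits = []
    previous = CODES.get(first, "")  # the first letter also suppresses repeats

    for ch in letters[1:]:
        code = CODES.get(ch, "")  # vowels, H, W and Y have no digit
        if code and code != previous:
            digits.append(code)
        # H and W are transparent: they do not reset the previous digit,
        # whereas a vowel does, so a repeated consonant after it is coded again.
        if ch not in "HW":
            previous = code

    # Pad with zeros or truncate so the result is always letter + 3 digits.
    return (first + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for pair in (("Robert", "Rupert"), ("Wilson", "Wilkins"), ("Brady", "Broad")):
        print(pair, [soundex(n) for n in pair])
```

Running the script prints the code for each pair of names, which makes it easy to see which differently spelt names collapse to the same key.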

Such algorithms remove the reliance on words being spelt identically for them to be considered a match. The method used by Soundex is based on the six phonetic classifications of human speech sounds (bilabial, labiodental, dental, alveolar, velar, and glottal), which in turn are based on where you put your lips and tongue to make the sounds.

If you are considering using Soundex for a commercial system you might want to think again. Soundex is actually a pretty poor algorithm for fuzzy name comparisons and will return a high number of false positive matches (such as Wilson and Wilkins, or Brady and Broad). So much so that in 1970 New York State commissioned a study of phonetic coding and came up with a derivative of Soundex called the New York State Identification and Intelligence System (NYSIIS). Its improvement in accuracy over Soundex has been tested at an average of 2.7%. Both Soundex and NYSIIS are limited to the 26-letter Western alphabet.


Why do we use our own phonetic algorithm for deduplication?

Based on the poor performance of existing phonetic algorithms, helpIT systems decided to develop its own phonetic algorithm, soundIT. This has a distinct advantage over other methodologies as it takes account of vowel sounds and syllables in the name, and determines the stressed syllable. Apart from producing far fewer "false positives", soundIT also understands that names such as Deighton and Dayton, or Shaw and Shore, sound essentially the same. soundIT works well with UK, US and multi-national data, provided the data uses the Western character set.

A phonetically encoded representation of data is rarely sufficient on its own to give a high level of confidence when deduplicating data. helpIT systems therefore uses phonetic encoding alongside a variety of other fuzzy matching techniques in its software to provide a set of possible matches. helpIT systems then scores each pair of records compared to provide a confidence measure - the higher the score, the more likely it is that the records match.
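To make the idea of a confidence score concrete, here is a minimal sketch of one way to combine fuzzy field comparisons into a single figure. It is not helpIT systems' scoring method, which is not described here: the field names, the WEIGHTS values, the 0.9 floor for a phonetic match and the function names are all assumptions made for illustration. It uses SequenceMatcher from Python's standard difflib module for the fuzzy comparison, and accepts any phonetic key function as an optional argument.

```python
from difflib import SequenceMatcher

# Hypothetical field weights (illustrative only); they sum to 1.0.
WEIGHTS = {"name": 0.5, "address": 0.3, "postcode": 0.2}


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a: dict, rec_b: dict, phonetic_key=None) -> float:
    """Return a 0..100 score; higher means the records are more likely the same."""
    total = 0.0
    for field, weight in WEIGHTS.items():
        a, b = rec_a.get(field, ""), rec_b.get(field, "")
        if not a or not b:
            continue  # a missing value contributes nothing to the score
        field_score = similarity(a, b)
        # Assumed rule: names that share a phonetic key score at least 0.9.
        if field == "name" and phonetic_key is not None:
            if phonetic_key(a) == phonetic_key(b):
                field_score = max(field_score, 0.9)
        total += weight * field_score
    return round(100 * total, 1)


if __name__ == "__main__":
    a = {"name": "Jon Deighton", "address": "12 High Street", "postcode": "KT22 8DN"}
    b = {"name": "John Dayton", "address": "12 High St", "postcode": "KT22 8DN"}
    c = {"name": "Mary Shaw", "address": "4 Mill Lane", "postcode": "SW1A 1AA"}
    print("a vs b:", match_score(a, b))
    print("a vs c:", match_score(a, c))
```

In practice a threshold would be chosen on the score, with pairs above it treated as matches and borderline pairs sent for manual review; the right threshold depends on the data and on the relative cost of false positives and missed matches.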


How can I assess the power of deduplication using fuzzy matching?

You can try dedupeIT on your own data and assess the power of fuzzy and phonetic matching first hand. Simply click here.

If your database has more than 50,000 records, or you want to match across multiple files, then we recommend you trial matchIT by simply clicking here.



See our comprehensive range of other professional data cleansing software products at