Using Fuzzy Matching in Oracle Database 23c

Data quality issues plague even the most meticulously maintained databases. Typos, misspellings, and phonetic variations can create duplicate records, hindering analysis and decision-making. Fortunately, Oracle Database 23c delivers two powerful tools for fuzzy string matching: FUZZY_MATCH and PHONIC_ENCODE.

This article walks through both operators, with practical code examples for each.

Fuzzy Matching in Action: FUZZY_MATCH

Imagine searching for customers named “Michael” but encountering variations like “Michal” or “Micheal.” Here’s where FUZZY_MATCH shines. It calculates the similarity between two strings using the algorithm you name as its first argument and, by default, returns a normalized score from 0 to 100. Higher scores represent greater similarity. Here’s an example:

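SELECT customer_id, name, FUZZY_MATCH(LEVENSHTEIN, name, 'Michael') AS match_score
FROM customers;

OUTPUT (approximate):
customer_id | name          | match_score
----------- | ------------- | -----------
1           | Michael       | 100
2           | Michal        | 86
3           | Micheal       | 71

We used the LEVENSHTEIN algorithm, which scores strings by edit distance: the number of single-character insertions, deletions, and substitutions needed to turn one string into the other. “Michal” is one edit away from “Michael,” while the transposed “Micheal” counts as two. The scores shown are approximate and assume the default normalization of the edit distance against the length of the longer string. Note that the algorithm is passed as a keyword, not as a quoted string. Other algorithms available include DAMERAU_LEVENSHTEIN (which counts a transposition as a single edit), JARO_WINKLER, BIGRAM, TRIGRAM, WHOLE_WORD_MATCH, and LONGEST_COMMON_SUBSTRING. SOUNDEX is not a FUZZY_MATCH algorithm; it is a separate SQL function.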
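To get a feel for how the choice of algorithm changes the result, it can help to score the same pair of strings several ways in one query. The following is a minimal sketch, assuming a 23c instance where these algorithm keywords are available; the column aliases are illustrative only.

```sql
-- Score one pair of strings with three different algorithms.
-- The algorithm name is a keyword, so it is not quoted.
SELECT FUZZY_MATCH(LEVENSHTEIN,         'Michael', 'Micheal') AS lev_score,
       FUZZY_MATCH(DAMERAU_LEVENSHTEIN, 'Michael', 'Micheal') AS dl_score,
       FUZZY_MATCH(JARO_WINKLER,        'Michael', 'Micheal') AS jw_score
FROM   dual;
```

Because “Micheal” differs from “Michael” by a swapped pair of letters, the transposition-aware algorithms would be expected to rate it at least as high as plain LEVENSHTEIN does.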

Phoning it In: PHONIC_ENCODE

Sometimes, variations arise due to pronunciation differences, not spelling errors. In these cases, PHONIC_ENCODE is your ally. It converts strings into a phonetic representation, focusing on sound, not character sequence.

For instance, “Chris” and “Kris” might have different spellings but share the same phonetic code, allowing you to identify potential duplicates:

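SELECT customer_id, name, PHONIC_ENCODE(DOUBLE_METAPHONE, name) AS phonetic_code
FROM customers;

OUTPUT (illustrative):
customer_id | name          | phonetic_code
----------- | ------------- | -------------
1           | Michael       | MKL
2           | Michal        | MKL
3           | Micheal       | MKL
4           | Chris         | KRS
5           | Kris          | KRS

PHONIC_ENCODE takes the algorithm as its first argument: DOUBLE_METAPHONE returns the primary code and DOUBLE_METAPHONE_ALT the alternate code. The codes above are illustrative; actual codes can differ from one spelling to the next, since a “ch,” for example, may encode as K or X depending on the surrounding letters.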

By comparing phonetic codes, you can efficiently uncover near-duplicate records based on pronunciation similarity.
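One way to put this comparison into practice is a self-join on the phonetic code. The sketch below assumes the customers(customer_id, name) table used throughout this article and the DOUBLE_METAPHONE algorithm; the aliases are illustrative.

```sql
-- Pair up customers whose names share a phonetic code.
WITH coded AS (
  SELECT customer_id,
         name,
         PHONIC_ENCODE(DOUBLE_METAPHONE, name) AS phonetic_code
  FROM   customers
)
SELECT a.customer_id AS id_a,
       a.name        AS name_a,
       b.customer_id AS id_b,
       b.name        AS name_b,
       a.phonetic_code
FROM   coded a
JOIN   coded b
  ON   a.phonetic_code = b.phonetic_code
 AND   a.customer_id < b.customer_id   -- report each pair once, skip self-matches
ORDER  BY a.phonetic_code, a.customer_id;
```

The pairs this returns are candidates only. A shared phonetic code suggests two names sound alike, not that they refer to the same customer, so a review step is still needed before merging.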

PL/SQL Support

While FUZZY_MATCH and PHONIC_ENCODE are powerful data quality operators, direct assignment within PL/SQL blocks isn’t currently possible.

DECLARE
  l_score NUMBER;
BEGIN
  -- Attempting direct assignment (doesn't work)
  l_score := FUZZY_MATCH(LEVENSHTEIN, 'Michael', 'Michal');
END;
/

...
PLS-00201: identifier 'FUZZY_MATCH' must be declared
...

Fortunately, we can leverage the SELECT … INTO construct to retrieve the desired output from the operator and store it in a PL/SQL variable:

DECLARE
  l_score NUMBER;
BEGIN
  -- Select the match score and store it in the variable
  SELECT FUZZY_MATCH(LEVENSHTEIN, 'Michael', 'Michal') INTO l_score
  FROM DUAL;
END;
/
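If the score is needed in several places, the SELECT … INTO workaround can be wrapped in a small function. This is a minimal sketch: name_similarity and its parameter names are hypothetical, and LEVENSHTEIN is just one possible choice of algorithm.

```sql
-- Hypothetical helper that hides the SELECT ... INTO workaround.
CREATE OR REPLACE FUNCTION name_similarity (
  p_left  IN VARCHAR2,
  p_right IN VARCHAR2
) RETURN NUMBER
IS
  l_score NUMBER;
BEGIN
  -- The operator is only callable from SQL, so go through DUAL.
  SELECT FUZZY_MATCH(LEVENSHTEIN, p_left, p_right)
  INTO   l_score
  FROM   dual;

  RETURN l_score;
END name_similarity;
/
```

PL/SQL code can then call name_similarity('Michael', 'Michal') like any other function. Each call runs a SQL statement, so for large data sets it is usually better to apply FUZZY_MATCH directly in a set-based query.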

Practical Applications

  • Deduplication: Identify and merge near-duplicate customer records, product entries, or any other textual data.
  • Data cleansing: Correct typos and misspellings, improving data accuracy and consistency.
  • Fuzzy search: Enable flexible search functionalities, accommodating spelling variations in queries.
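As an illustration of the fuzzy search case, the query below ranks customers by similarity to a search term. It is a sketch under stated assumptions: :search_term is a hypothetical bind variable, the customers table is the one used earlier in this article, and the cut-off of 80 and the limit of 10 rows are arbitrary values to tune against your own data.

```sql
-- Rank customers by similarity to a user-supplied search term.
SELECT customer_id,
       name,
       FUZZY_MATCH(JARO_WINKLER, UPPER(name), UPPER(:search_term)) AS match_score
FROM   customers
-- UPPER() on both sides keeps letter case from affecting the score.
WHERE  FUZZY_MATCH(JARO_WINKLER, UPPER(name), UPPER(:search_term)) >= 80
ORDER  BY match_score DESC
FETCH FIRST 10 ROWS ONLY;
```

The same pattern applies to deduplication if the search term is replaced by a join against another row's name.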

Conclusion

By embracing FUZZY_MATCH and PHONIC_ENCODE, you empower your Oracle Database 23c to handle imperfect data with agility and precision. Explore these tools to enhance data quality, streamline data management, and gain valuable insights from your information assets.
