http://zorba.io/modules/data-cleaning/hybrid-string-similarity

Description

Before using any of the functions below please remember to import the module namespace:

import module namespace simh = "http://zorba.io/modules/data-cleaning/hybrid-string-similarity";

This library module provides hybrid string similarity functions, combining the properties of character-based string similarity functions and token-based string similarity functions.

The logic contained in this module is not specific to any particular XQuery implementation, although the module requires the trigonometric functions of XQuery 3.0 or a math extension function such as sqrt($x as numeric) for computing the square root.
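
As a minimal sketch of how the import and a function call fit together in one query, the following main module calls the Soundex variant through the simh prefix. It assumes a processor in which this module is installed (for example Zorba). The input strings are the ones used in that function's own example further down, where the documented result is 1.0.

```xquery
xquery version "3.0";

(: Bind the simh prefix to the module namespace, as described above. :)
import module namespace simh = "http://zorba.io/modules/data-cleaning/hybrid-string-similarity";

(: Tokens are split on runs of spaces (" +") and compared by Soundex key. :)
simh:soft-cosine-tokens-soundex("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +")
```

The usage examples in the function descriptions omit the prefix for brevity; in a real query each call needs the simh: prefix (or whichever prefix the import binds).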

Authors

Bruno Martins and Diogo Simões

Version Declaration

xquery version "3.0" encoding "utf-8";

Namespaces

math  http://www.w3.org/2005/xpath-functions/math
set   http://zorba.io/modules/data-cleaning/set-similarity
simc  http://zorba.io/modules/data-cleaning/character-based-string-similarity
simh  http://zorba.io/modules/data-cleaning/hybrid-string-similarity
simp  http://zorba.io/modules/data-cleaning/phonetic-string-similarity
simt  http://zorba.io/modules/data-cleaning/token-based-string-similarity
ver   http://zorba.io/options/versioning

Function Summary

soft-cosine-tokens-soundex($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings, using Soundex keys to discover token identity.

soft-cosine-tokens-metaphone($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings, using Metaphone keys to discover token identity.

soft-cosine-tokens-edit-distance($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:integer) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings, using the Edit Distance similarity function with a threshold to discover token identity.

soft-cosine-tokens-jaro($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings, using the Jaro similarity function with a threshold to discover token identity.

soft-cosine-tokens-jaro-winkler($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double, $prefix as xs:integer?, $fact as xs:double?) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings, using the Jaro-Winkler similarity function with a threshold to discover token identity.

monge-elkan-jaro-winkler($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double

Returns the Monge-Elkan similarity coefficient between two strings, using the Jaro-Winkler similarity function to discover token identity.

Functions

soft-cosine-tokens-soundex#3

declare function simh:soft-cosine-tokens-soundex(
    $s1 as xs:string,
    $s2 as xs:string,
    $r as xs:string
) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

The tokens from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval).

The Soundex phonetic similarity function is used to discover token identity, which is equivalent to saying that this function returns the cosine similarity coefficient between sets of Soundex keys.

Example usage :

 soft-cosine-tokens-soundex("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +") 

The function invocation in the example above returns :

 1.0 

Parameters

  • $s1

    The first string.

  • $s2

    The second string.

  • $r

    A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

  • xs:double

    The cosine similarity coefficient between the sets of Soundex keys extracted from the two strings.
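
To illustrate the "cosine over sets of keys" reading, here is a minimal sketch in plain XQuery 3.0. It is not the module's implementation: local:key-cosine is a hypothetical helper, and upper-case#1 merely stands in for a real phonetic encoder such as Soundex so that the example stays self-contained.

```xquery
xquery version "3.0";
declare namespace math = "http://www.w3.org/2005/xpath-functions/math";

(: Hypothetical helper: map every token to a key, then compute the
   term-frequency cosine between the two resulting key sequences. :)
declare function local:key-cosine(
    $s1 as xs:string,
    $s2 as xs:string,
    $r as xs:string,
    $key as function(xs:string) as xs:string
) as xs:double {
  let $k1 := for $t in tokenize($s1, $r) return $key($t)
  let $k2 := for $t in tokenize($s2, $r) return $key($t)
  let $vocab := distinct-values(($k1, $k2))
  (: Dot product and squared norms of the two term-frequency vectors. :)
  let $dot := sum(for $v in $vocab return count($k1[. = $v]) * count($k2[. = $v]))
  let $sq1 := sum(for $v in $vocab return count($k1[. = $v]) * count($k1[. = $v]))
  let $sq2 := sum(for $v in $vocab return count($k2[. = $v]) * count($k2[. = $v]))
  return
    if ($sq1 = 0 or $sq2 = 0) then 0.0e0
    else $dot div math:sqrt($sq1 * $sq2)
};

(: Stand-in key function: upper-casing makes tokens match regardless of case. :)
local:key-cosine("Aleksander SMITH", "ALEKSANDER smith", " +", upper-case#1)
```

With the stand-in key, both strings map to the same two keys, so the query evaluates to 1. Substituting a Soundex encoder for $key would give the behaviour described above, under the assumption that the encoder returns one key per token.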

soft-cosine-tokens-metaphone#3

declare function simh:soft-cosine-tokens-metaphone(
    $s1 as xs:string,
    $s2 as xs:string,
    $r as xs:string
) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

The tokens from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval).

The Metaphone phonetic similarity function is used to discover token identity, which is equivalent to saying that this function returns the cosine similarity coefficient between sets of Metaphone keys.

Example usage :

 soft-cosine-tokens-metaphone("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +" ) 

The function invocation in the example above returns :

 1.0 

Parameters

  • $s1

    The first string.

  • $s2

    The second string.

  • $r

    A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

  • xs:double

    The cosine similarity coefficient between the sets of Metaphone keys extracted from the two strings.

soft-cosine-tokens-edit-distance#4

declare function simh:soft-cosine-tokens-edit-distance(
    $s1 as xs:string,
    $s2 as xs:string,
    $r as xs:string,
    $t as xs:integer
) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

The tokens from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval).

The Edit Distance similarity function is used to discover token identity, and tokens having an edit distance below a given threshold are considered as matching tokens.

Example usage :

 soft-cosine-tokens-edit-distance("The FLWOR Foundation", "FLWOR Found.", " +", 0 ) 

The function invocation in the example above returns :

 0.408248290463863 

Parameters

  • $s1

    The first string.

  • $s2

    The second string.

  • $r

    A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

  • $t

    A threshold for the similarity function used to discover token identity.

Returns

  • xs:double

    The cosine similarity coefficient between the sets of tokens extracted from the two strings.
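
The term-frequency cosine underlying this family of functions can be sketched in plain XQuery 3.0. The helper local:tf-cosine below is hypothetical and is not the module's code; it treats tokens as matching only when they are identical, which is the baseline case without any soft matching.

```xquery
xquery version "3.0";
declare namespace math = "http://www.w3.org/2005/xpath-functions/math";

(: Hypothetical helper: term-frequency cosine between two token sequences,
   where two tokens match only if they are equal. :)
declare function local:tf-cosine(
    $t1 as xs:string*,
    $t2 as xs:string*
) as xs:double {
  let $vocab := distinct-values(($t1, $t2))
  (: Dot product and squared norms of the two term-frequency vectors. :)
  let $dot := sum(for $v in $vocab return count($t1[. = $v]) * count($t2[. = $v]))
  let $sq1 := sum(for $v in $vocab return count($t1[. = $v]) * count($t1[. = $v]))
  let $sq2 := sum(for $v in $vocab return count($t2[. = $v]) * count($t2[. = $v]))
  return
    if ($sq1 = 0 or $sq2 = 0) then 0.0e0
    else $dot div math:sqrt($sq1 * $sq2)
};

local:tf-cosine(
  tokenize("The FLWOR Foundation", " +"),
  tokenize("FLWOR Found.", " +")
)
```

Only "FLWOR" is shared, so the result is 1 / sqrt(3 * 2), roughly 0.408248, which agrees with the value documented above for a threshold of 0.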

soft-cosine-tokens-jaro#4

declare function simh:soft-cosine-tokens-jaro(
    $s1 as xs:string,
    $s2 as xs:string,
    $r as xs:string,
    $t as xs:double
) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

The tokens from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval).

The Jaro similarity function is used to discover token identity, and tokens having a Jaro similarity above a given threshold are considered as matching tokens.

Example usage :

 soft-cosine-tokens-jaro("The FLWOR Foundation", "FLWOR Found.", " +", 1 ) 

The function invocation in the example above returns :

 0.5 

Parameters

  • $s1

    The first string.

  • $s2

    The second string.

  • $r

    A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

  • $t

    A threshold for the similarity function used to discover token identity.

Returns

  • xs:double

    The cosine similarity coefficient between the sets of tokens extracted from the two strings.

soft-cosine-tokens-jaro-winkler#6

declare function simh:soft-cosine-tokens-jaro-winkler(
    $s1 as xs:string,
    $s2 as xs:string,
    $r as xs:string,
    $t as xs:double,
    $prefix as xs:integer?,
    $fact as xs:double?
) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

The tokens from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval).

The Jaro-Winkler similarity function is used to discover token identity, and tokens having a Jaro-Winkler similarity above a given threshold are considered as matching tokens.

Example usage :

 soft-cosine-tokens-jaro-winkler("The FLWOR Foundation", "FLWOR Found.", " +", 1, 4, 0.1 ) 

The function invocation in the example above returns :

 0.45 

Parameters

  • $s1

    The first string.

  • $s2

    The second string.

  • $r

    A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

  • $t

    A threshold for the similarity function used to discover token identity.

  • $prefix

    The number of characters to consider when testing for equal prefixes with the Jaro-Winkler metric.

  • $fact

    The weighting factor to consider when the input strings have equal prefixes with the Jaro-Winkler metric.

Returns

  • xs:double

    The cosine similarity coefficient between the sets of tokens extracted from the two strings.

monge-elkan-jaro-winkler#4

declare function simh:monge-elkan-jaro-winkler(
    $s1 as xs:string,
    $s2 as xs:string,
    $prefix as xs:integer,
    $fact as xs:double
) as xs:double

Returns the Monge-Elkan similarity coefficient between two strings, using the Jaro-Winkler similarity function to discover token identity.

Example usage :

 monge-elkan-jaro-winkler("Comput. Sci. and Eng. Dept., University of California, San Diego", "Department of Computer Scinece, Univ. Calif., San Diego", 4, 0.1) 

The function invocation in the example above returns :

 0.992 

Parameters

  • $s1

    The first string.

  • $s2

    The second string.

  • $prefix

    The number of characters to consider when testing for equal prefixes with the Jaro-Winkler metric.

  • $fact

    The weighting factor to consider when the input strings have equal prefixes with the Jaro-Winkler metric.

Returns

  • xs:double

    The Monge-Elkan similarity coefficient between the two strings.
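
For readers unfamiliar with the measure, the usual textbook definition of Monge-Elkan averages, over the tokens of the first string, the best similarity each token achieves against any token of the second string. The sketch below follows that definition in plain XQuery 3.0 and makes several assumptions: local:monge-elkan is a hypothetical helper rather than the module's code, tokens are split on whitespace (this function takes no delimiter parameter, and the module's tokenization is not stated here), and an exact-match function stands in for Jaro-Winkler.

```xquery
xquery version "3.0";

(: Hypothetical helper: average, over the tokens of $s1, of the best
   similarity that each token reaches against any token of $s2. :)
declare function local:monge-elkan(
    $s1 as xs:string,
    $s2 as xs:string,
    $sim as function(xs:string, xs:string) as xs:double
) as xs:double {
  let $t1 := tokenize(normalize-space($s1), " ")
  let $t2 := tokenize(normalize-space($s2), " ")
  return
    if (empty($t1) or empty($t2)) then 0.0e0
    else avg(for $a in $t1 return max(for $b in $t2 return $sim($a, $b)))
};

(: Stand-in token similarity: 1 for identical tokens, 0 otherwise. :)
let $exact := function($a as xs:string, $b as xs:string) as xs:double {
  if ($a eq $b) then 1.0e0 else 0.0e0
}
return (
  local:monge-elkan("San Diego", "San Diego CA", $exact),
  local:monge-elkan("San Diego CA", "San Diego", $exact)
)
```

The first call yields 1 and the second about 0.667, which shows that the measure, as defined here, is not symmetric: swapping the two arguments can change the result.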
