http://zorba.io/modules/full-text

Description

Before using any of the functions below, import the module namespace:

import module namespace ft = "http://zorba.io/modules/full-text";

This module provides an XQuery API to full-text functions. For general information about this implementation of the XQuery and XPath Full Text 1.0 specification, as well as instructions for building and installing a thesaurus, see the Full Text Thesaurus documentation.

Notes on languages

To refer to particular human languages, use either the ISO 639-1 or ISO 639-2 language codes. Note that only a subset of the complete list of language codes is supported, and not every function supports the same subset.

Most functions in this module take a language as a parameter using the xs:language XML schema data type.

Notes on stemming

The stem() functions return the stem of a word. The stem of a word itself, however, is not guaranteed to be a word. It is best to consider a stem as an opaque byte sequence. All that is guaranteed about a stem is that, for a given word, the stem of that word will always be the same byte sequence. Hence, you should never compare the result of one of the stem() functions against a non-stemmed string, for example:
  if ( ft:stem( "apples" ) eq "apple" )             ** WRONG **
 
Instead do:
  if ( ft:stem( "apples" ) eq ft:stem( "apple" ) )  ** CORRECT **
 

Notes on the thesaurus

The thesaurus-lookup() functions have "levels" and "relationship" parameters. The values for these are implementation-defined. The default implementation uses the WordNet lexical database, version 3.0.

In WordNet, the number of "levels" between two phrases is the number of hierarchical meanings that separate them. For example, "canary" is 5 levels away from "vertebrate" (canary > finch > oscine > passerine > bird > vertebrate).

When using the WordNet implementation, all of the relationships (and their abbreviations) specified by ISO 2788 and ANSI/NISO Z39.19-2005 with the exceptions of "HN" (history note) and "X SN" (see scope note for) are supported. These relationships are:
Rel.  Meaning                   WordNet Rel.
BT    broader term              hypernym
BTG   broader term generic      hypernym
BTI   broader term instance     instance hypernym
BTP   broader term partitive    part meronym
NT    narrower term             hyponym
NTG   narrower term generic     hyponym
NTI   narrower term instance    instance hyponym
NTP   narrower term partitive   part holonym
RT    related term              also see
SN    scope note                n/a
TT    top term                  hypernym
UF    non-preferred term        n/a
USE   preferred term            n/a

Note that you can specify relationships either by their abbreviation or their meaning. Relationships are case-insensitive.

In addition to the ISO 2788 and ANSI/NISO Z39.19-2005 relationships, all of the relationships offered by WordNet are also supported. These relationships are:
Relationship                  Meaning
also see                      A word that is related to another, e.g., for "varnished" (furniture) one should also see "finished."
antonym                       A word opposite in meaning to another, e.g., "light" is an antonym for "heavy."
attribute                     A noun for which adjectives express values, e.g., "weight" is an attribute for which the adjectives "light" and "heavy" express values.
cause                         A verb that causes another, e.g., "show" is a cause of "see."
derivationally related form   A word that is derived from a root word, e.g., "metric" is a derivationally related form of "meter."
derived from adjective        An adverb that is derived from an adjective, e.g., "correctly" is derived from the adjective "correct."
entailment                    A verb that presupposes another, e.g., "snoring" entails "sleeping."
hypernym                      A word with a broad meaning that more specific words fall under, e.g., "meal" is a hypernym of "breakfast."
hyponym                       A word of more specific meaning than a general term applicable to it, e.g., "breakfast" is a hyponym of "meal."
instance hypernym             A word that denotes a category of some specific instance, e.g., "author" is an instance hypernym of "Asimov."
instance hyponym              A term that denotes a specific instance of some general category, e.g., "Asimov" is an instance hyponym of "author."
member holonym                A word that denotes a collection of individuals, e.g., "faculty" is a member holonym of "professor."
member meronym                A word that denotes a member of a larger group, e.g., a "person" is a member meronym of a "crowd."
part holonym                  A word that denotes a larger whole comprised of some part, e.g., "car" is a part holonym of "engine."
part meronym                  A word that denotes a part of a larger whole, e.g., an "engine" is a part meronym of a "car."
participle of verb            An adjective that is the participle of some verb, e.g., "breaking" is the participle of the verb "break."
pertainym                     An adjective that classifies its noun, e.g., "musical" is a pertainym in "musical instrument."
similar to                    Similar, though not necessarily interchangeable, adjectives. For example, "shiny" is similar to "bright", but they have subtle differences.
substance holonym             A word that denotes a larger whole containing some constituent substance, e.g., "bread" is a substance holonym of "flour."
substance meronym             A word that denotes a constituent substance of some larger whole, e.g., "flour" is a substance meronym of "bread."
verb group                    A verb that is a member of a group of similar verbs, e.g., "live" is in the verb group of "dwell", "live", "inhabit", etc.

Notes on tokenization

For general information about the implementation of tokenization, including what constitutes a token, see the Full Text Tokenizer documentation.

Module code

Here is the actual XQuery module code.

Authors

Paul J. Lucas

Version Declaration

xquery version "3.0" encoding "utf-8";

Namespaces

err   http://www.w3.org/2005/xqt-errors
ft    http://zorba.io/modules/full-text
ver   http://zorba.io/options/versioning
zerr  http://zorba.io/errors

Variables

ft:LANG-DA as xs:language

Predeclared constant for the Danish xs:language.

ft:LANG-DE as xs:language

Predeclared constant for the German xs:language.

ft:LANG-EN as xs:language

Predeclared constant for the English xs:language.

ft:LANG-ES as xs:language

Predeclared constant for the Spanish xs:language.

ft:LANG-FI as xs:language

Predeclared constant for the Finnish xs:language.

ft:LANG-FR as xs:language

Predeclared constant for the French xs:language.

ft:LANG-HU as xs:language

Predeclared constant for the Hungarian xs:language.

ft:LANG-IT as xs:language

Predeclared constant for the Italian xs:language.

ft:LANG-NL as xs:language

Predeclared constant for the Dutch xs:language.

ft:LANG-NO as xs:language

Predeclared constant for the Norwegian xs:language.

ft:LANG-PT as xs:language

Predeclared constant for the Portuguese xs:language.

ft:LANG-RO as xs:language

Predeclared constant for the Romanian xs:language.

ft:LANG-RU as xs:language

Predeclared constant for the Russian xs:language.

ft:LANG-SV as xs:language

Predeclared constant for the Swedish xs:language.

ft:LANG-TR as xs:language

Predeclared constant for the Turkish xs:language.

Function Summary

current-compare-options() as object() external

Gets the current compare options.

current-lang() as xs:language external

Gets the current language: either the language specified by the declare ft-option using language statement (if any) or the one returned by ft:host-lang() (if none).

host-lang() as xs:language external

Gets the host's current language.

is-stem-lang-supported($lang as xs:language) as xs:boolean external

Checks whether the given language is supported for stemming.

is-stop-word($word as xs:string, $lang as xs:language) as xs:boolean external

Checks whether the given word is a stop-word.

is-stop-word($word as xs:string) as xs:boolean external

Checks whether the given word is a stop-word.

is-stop-word-lang-supported($lang as xs:language) as xs:boolean external

Checks whether the given language is supported for stop words.

is-thesaurus-lang-supported($lang as xs:language) as xs:boolean external

Checks whether the given language is supported for look-up using the default thesaurus.

is-thesaurus-lang-supported($uri as xs:string, $lang as xs:language) as xs:boolean external

Checks whether the given language is supported for look-up using the thesaurus specified by the given URI.

is-tokenizer-lang-supported($lang as xs:language) as xs:boolean external

Checks whether the given language is supported for tokenization.

stem($word as xs:string, $lang as xs:language) as xs:string external

Stems the given word.

stem($word as xs:string) as xs:string external

Stems the given word.

strip-diacritics($string as xs:string) as xs:string external

Strips all diacritical marks from all characters.

thesaurus-lookup($phrase as xs:string) as xs:string* external

Looks up the given phrase in the default thesaurus.

thesaurus-lookup($uri as xs:string, $phrase as xs:string, $lang as xs:language) as xs:string* external

Looks up the given phrase in the thesaurus specified by the given URI.

thesaurus-lookup($uri as xs:string, $phrase as xs:string) as xs:string* external

Looks up the given phrase in the thesaurus specified by the given URI.

thesaurus-lookup($uri as xs:string, $phrase as xs:string, $lang as xs:language, $relationship as xs:string) as xs:string* external

Looks up the given phrase in the thesaurus specified by the given URI.

thesaurus-lookup($uri as xs:string, $phrase as xs:string, $lang as xs:language, $relationship as xs:string, $level-least as xs:integer, $level-most as xs:integer) as xs:string* external

Looks up the given phrase in the thesaurus specified by the given URI.

tokenize-node($node as node(), $lang as xs:language) as object()* external

Tokenizes the given node and all of its descendants.

tokenize-node($node as node()) as object()* external

Tokenizes the given node and all of its descendants.

tokenize-nodes($includes as node()+, $excludes as node()*) as object()* external

Tokenizes the set of nodes comprising $includes (and all of its descendants) but excluding $excludes (and all of its descendants), if any.

tokenize-nodes($includes as node()+, $excludes as node()*, $lang as xs:language) as object()* external

Tokenizes the set of nodes comprising $includes (and all of its descendants) but excluding $excludes (and all of its descendants), if any.

tokenize-string($string as xs:string, $lang as xs:language) as xs:string* external

Tokenizes the given string.

tokenize-string($string as xs:string) as xs:string* external

Tokenizes the given string.

tokenizer-properties($lang as xs:language) as object() external

Gets properties of the tokenizer for the given language.

tokenizer-properties() as object() external

Gets properties of the tokenizer for the language returned by ft:current-lang().

Functions

current-compare-options#0

declare function ft:current-compare-options() as object() external

Gets the current compare options.

Returns

  • object()

    said compare options.

Examples
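
A minimal sketch of retrieving the compare options in effect for a query. The ft-option declaration is an illustrative assumption, there only to give the call something to report. The keys of the returned object are not listed in this document, so the sketch returns the object as-is rather than reading individual fields.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

(: Assumption: these match options are illustrative only. :)
declare ft-option using case insensitive using stemming;

(: Returns a single object describing the compare options in effect. :)
ft:current-compare-options()
```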

current-lang#0

declare function ft:current-lang() as xs:language external

Gets the current language: either the language specified by the declare ft-option using language statement (if any) or the one returned by ft:host-lang() (if none).

Returns

  • xs:language

    said language.

Examples
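
A short sketch of how the declared language takes precedence over the host language. The choice of German is an arbitrary assumption for illustration.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

declare ft-option using language "de";

(: With the declaration above, this is the declared language;
   without it, the value of ft:host-lang() would be returned. :)
ft:current-lang()
```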

host-lang#0

declare function ft:host-lang() as xs:language external

Gets the host's current language. The "host" is the computer on which the software is running. The host's current language is obtained as follows:
  • For *nix systems:
    1. If setlocale(3) returns non-null, the language corresponding to that locale is used.
    2. Else, if the LANG environment variable is set, that language is used.
    3. Otherwise, there is no default language.
  • For Windows systems, the language corresponding to the locale returned by the GetLocaleInfo() function is used.

Returns

  • xs:language

    said language.

is-stem-lang-supported#1

declare function ft:is-stem-lang-supported(
    $lang as xs:language
) as xs:boolean external

Checks whether the given language is supported for stemming.

Parameters

  • $lang

    The language to check.

Returns

  • xs:boolean

    true only if the language is supported.

Examples
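
One possible use is to guard a call to ft:stem() so that an unsupported language does not raise err:FTST0009. The word and the list of languages below are assumptions chosen for illustration; "ja" stands in for a language that may not be supported.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

let $word := "running"
for $lang in ( $ft:LANG-EN, xs:language("ja") )
return
  (: Only stem when the language is supported for stemming. :)
  if ( ft:is-stem-lang-supported( $lang ) )
  then concat( $lang, ": ", ft:stem( $word, $lang ) )
  else concat( $lang, ": stemming not supported" )
```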

is-stop-word#2

declare function ft:is-stop-word(
    $word as xs:string,
    $lang as xs:language
) as xs:boolean external

Checks whether the given word is a stop-word.

Parameters

  • $word

    The word to check.

  • $lang

    The language of $word.

Returns

  • xs:boolean

    true only if $word is a stop-word.

Errors

  • err:FTST0009

    if $lang is not supported.

Examples
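
A sketch of filtering stop words out of a tokenized string. The sample sentence is an assumption, and which tokens count as stop words depends on the stop-word list in use.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

let $text := "the quick brown fox jumps over the lazy dog"
for $token in ft:tokenize-string( $text, $ft:LANG-EN )
(: Keep only the tokens that are not stop words in English. :)
where not( ft:is-stop-word( $token, $ft:LANG-EN ) )
return $token
```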

is-stop-word#1

declare function ft:is-stop-word(
    $word as xs:string
) as xs:boolean external

Checks whether the given word is a stop-word.

Parameters

  • $word

    The word to check. The word's language is assumed to be the one returned by ft:current-lang().

Returns

  • xs:boolean

    true only if $word is a stop-word.

Errors

  • err:FTST0009

    if ft:current-lang() is not supported.

Examples

is-stop-word-lang-supported#1

declare function ft:is-stop-word-lang-supported(
    $lang as xs:language
) as xs:boolean external

Checks whether the given language is supported for stop words.

Parameters

  • $lang

    The language to check.

Returns

  • xs:boolean

    true only if the language is supported.

Examples

is-thesaurus-lang-supported#1

declare function ft:is-thesaurus-lang-supported(
    $lang as xs:language
) as xs:boolean external

Checks whether the given language is supported for look-up using the default thesaurus.

Parameters

  • $lang

    The language to check.

Returns

  • xs:boolean

    true only if the language is supported.

is-thesaurus-lang-supported#2

declare function ft:is-thesaurus-lang-supported(
    $uri as xs:string,
    $lang as xs:language
) as xs:boolean external

Checks whether the given language is supported for look-up using the thesaurus specified by the given URI.

Parameters

  • $uri

    The URI specifying the thesaurus to use.

  • $lang

    The language to check.

Returns

  • xs:boolean

    true only if the language is supported.

Errors

  • err:FTST0018

    if $uri refers to a thesaurus that is not found in the statically known thesauri.

Examples

is-tokenizer-lang-supported#1

declare function ft:is-tokenizer-lang-supported(
    $lang as xs:language
) as xs:boolean external

Checks whether the given language is supported for tokenization.

Parameters

  • $lang

    The language to check.

Returns

  • xs:boolean

    true only if the language is supported.

stem#2

declare function ft:stem(
    $word as xs:string,
    $lang as xs:language
) as xs:string external

Stems the given word.

Parameters

  • $word

    The word to stem.

  • $lang

    The language of $word.

Returns

  • xs:string

    the stem of $word.

Errors

  • err:FTST0009

    if $lang is not supported.

Examples
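
Because a stem should be treated as an opaque value, a comparison is only meaningful between two stems. The helper local:same-stem below is a hypothetical name introduced for this sketch, not part of the module.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

(: Hypothetical helper: true if both words reduce to the same stem. :)
declare function local:same-stem(
    $a as xs:string,
    $b as xs:string,
    $lang as xs:language
) as xs:boolean {
  ft:stem( $a, $lang ) eq ft:stem( $b, $lang )
};

local:same-stem( "apples", "apple", $ft:LANG-EN )
```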

stem#1

declare function ft:stem(
    $word as xs:string
) as xs:string external

Stems the given word.

Parameters

  • $word

    The word to stem. The word's language is assumed to be the one returned by ft:current-lang().

Returns

  • xs:string

    the stem of $word.

Errors

  • err:FTST0009

    if ft:current-lang() is not supported.

Examples

strip-diacritics#1

declare function ft:strip-diacritics(
    $string as xs:string
) as xs:string external

Strips all diacritical marks from all characters.

Parameters

  • $string

    The string to strip diacritical marks from.

Returns

  • xs:string

    $string with diacritical marks stripped.

Examples
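
A sketch of accent-insensitive matching, one plausible use of this function. The sample strings are assumptions for illustration.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

let $query := "creme brulee"
for $item in ( "Crème brûlée", "Tarte Tatin" )
(: Compare after stripping diacritics and lower-casing the candidate. :)
where lower-case( ft:strip-diacritics( $item ) ) eq $query
return $item
```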

thesaurus-lookup#1

declare function ft:thesaurus-lookup(
    $phrase as xs:string
) as xs:string* external

Looks up the given phrase in the default thesaurus.

Parameters

  • $phrase

    The phrase to look up. The phrase's language is assumed to be the one returned by ft:current-lang().

Returns

  • xs:string*

    the related phrases if $phrase is found in the thesaurus or the empty sequence if not.

Errors

  • err:FTST0009

    if ft:current-lang() is not supported.

  • zerr:ZXQP8401

    if the thesaurus data file's version is not supported by the currently running version of the software.

  • zerr:ZXQP8402

    if the thesaurus data file's endianness does not match that of the CPU on which the software is currently running.

  • zerr:ZXQP8403

    if there was an error reading the thesaurus data.

Examples
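
A minimal sketch, assuming a default thesaurus is installed and that the current language is supported by it. The phrase "breakfast" is an arbitrary choice; the empty sequence is handled explicitly because it signals that the phrase was not found.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

let $related := ft:thesaurus-lookup( "breakfast" )
return
  if ( empty( $related ) )
  then "no entry found"
  else string-join( $related, ", " )
```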

thesaurus-lookup#3

declare function ft:thesaurus-lookup(
    $uri as xs:string,
    $phrase as xs:string,
    $lang as xs:language
) as xs:string* external

Looks up the given phrase in the thesaurus specified by the given URI.

Parameters

  • $uri

    The URI specifying the thesaurus to use.

  • $phrase

    The phrase to look up.

  • $lang

    The language of $phrase.

Returns

  • xs:string*

    the related phrases if $phrase is found in the thesaurus or the empty sequence if not.

Errors

  • err:FTST0009

    if $lang is not supported.

  • err:FTST0018

    if $uri refers to a thesaurus that is not found in the statically known thesauri.

  • zerr:ZOSE0001

    if the thesaurus data file could not be found.

  • zerr:ZOSE0002

    if the thesaurus data file is not a plain file.

  • zerr:ZXQP8401

    if the thesaurus data file's version is not supported by the currently running version of the software.

  • zerr:ZXQP8402

    if the thesaurus data file's endianness does not match that of the CPU on which the software is currently running.

  • zerr:ZXQP8403

    if there was an error reading the thesaurus data file.

Examples

thesaurus-lookup#2

declare function ft:thesaurus-lookup(
    $uri as xs:string,
    $phrase as xs:string
) as xs:string* external

Looks up the given phrase in the thesaurus specified by the given URI.

Parameters

  • $uri

    The URI specifying the thesaurus to use.

  • $phrase

    The phrase to look up. The phrase's language is assumed to be the one returned by ft:current-lang().

Returns

  • xs:string*

    the related phrases if $phrase is found in the thesaurus or the empty sequence if not.

Errors

  • err:FTST0009

    if ft:current-lang() is unsupported.

  • err:FTST0018

    if $uri refers to a thesaurus that is not found in the statically known thesauri.

  • zerr:ZOSE0001

    if the thesaurus data file could not be found.

  • zerr:ZOSE0002

    if the thesaurus data file is not a plain file.

  • zerr:ZXQP8401

    if the thesaurus data file's version is not supported by the currently running version of the software.

  • zerr:ZXQP8402

    if the thesaurus data file's endianness does not match that of the CPU on which the software is currently running.

  • zerr:ZXQP8403

    if there was an error reading the thesaurus data file.

Examples

thesaurus-lookup#4

declare function ft:thesaurus-lookup(
    $uri as xs:string,
    $phrase as xs:string,
    $lang as xs:language,
    $relationship as xs:string
) as xs:string* external

Looks up the given phrase in the thesaurus specified by the given URI.

Parameters

  • $uri

    The URI specifying the thesaurus to use.

  • $phrase

    The phrase to look up.

  • $lang

    The language of $phrase.

  • $relationship

    The relationship the results are to have to $phrase.

Returns

  • xs:string*

    the related phrases if $phrase is found in the thesaurus or the empty sequence if not.

Errors

  • err:FTST0018

    if $uri refers to a thesaurus that is not found in the statically known thesauri.

  • err:FTST0009

    if $lang is not supported.

  • zerr:ZOSE0001

    if the thesaurus data file could not be found.

  • zerr:ZOSE0002

    if the thesaurus data file is not a plain file.

  • zerr:ZXQP8401

    if the thesaurus data file's version is not supported by the currently running version of the software.

  • zerr:ZXQP8402

    if the thesaurus data file's endianness does not match that of the CPU on which the software is currently running.

  • zerr:ZXQP8403

    if there was an error reading the thesaurus data file.

Examples

thesaurus-lookup#6

declare function ft:thesaurus-lookup(
    $uri as xs:string,
    $phrase as xs:string,
    $lang as xs:language,
    $relationship as xs:string,
    $level-least as xs:integer,
    $level-most as xs:integer
) as xs:string* external

Looks up the given phrase in the thesaurus specified by the given URI.

Parameters

  • $uri

    The URI specifying the thesaurus to use.

  • $phrase

    The phrase to look up.

  • $lang

    The language of $phrase.

  • $relationship

    The relationship the results are to have to $phrase.

  • $level-least

    The minimum number of levels within the thesaurus to be traversed.

  • $level-most

    The maximum number of levels within the thesaurus to be traversed.

Returns

  • xs:string*

    the related phrases if $phrase is found in the thesaurus or the empty sequence if not.

Errors

  • err:FOCA0003

    if $level-least or $level-most is negative or too large.

  • err:FTST0018

    if $uri refers to a thesaurus that is not found in the statically known thesauri.

  • err:FTST0009

    if $lang is not supported.

  • zerr:ZOSE0001

    if the thesaurus data file could not be found.

  • zerr:ZOSE0002

    if the thesaurus data file is not a plain file.

  • zerr:ZXQP8401

    if the thesaurus data file's version is not supported by the currently running version of the software.

  • zerr:ZXQP8402

    if the thesaurus data file's endianness does not match that of the CPU on which the software is currently running.

  • zerr:ZXQP8403

    if there was an error reading the thesaurus data file.

Examples
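
A sketch of a bounded look-up that follows "hypernym" relationships between 1 and 5 levels away, the distance the thesaurus notes give for "canary" and "vertebrate". The thesaurus URI is an assumption: it must name a thesaurus that is actually among the statically known thesauri of your installation, otherwise err:FTST0018 is raised.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

(: Assumption: a WordNet thesaurus is registered under this URI. :)
let $thesaurus := "http://wordnet.princeton.edu"
return
  ft:thesaurus-lookup(
    $thesaurus, "canary", $ft:LANG-EN,
    "hypernym",  (: relationship: a WordNet name or an abbreviation such as "BT" :)
    1,           (: $level-least :)
    5            (: $level-most :)
  )
```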

tokenize-node#2

declare function ft:tokenize-node(
    $node as node(),
    $lang as xs:language
) as object()* external

Tokenizes the given node and all of its descendants.

Parameters

  • $node

    The node to tokenize.

  • $lang

    The default language of $node.

Returns

  • object()*

    a (possibly empty) sequence of tokens.

Errors

  • err:FTST0009

    if $lang is not supported.

Examples
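
A minimal sketch that tokenizes a constructed element and counts the resulting tokens. The element is an assumption for illustration. The fields of each token object are not described in this document, so the sketch does not read them.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

let $doc := <msg>Hello, <b>world</b>!</msg>
(: Tokenizes $doc and its descendants, treating the text as English. :)
let $tokens := ft:tokenize-node( $doc, $ft:LANG-EN )
return count( $tokens )
```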

tokenize-node#1

declare function ft:tokenize-node(
    $node as node()
) as object()* external

Tokenizes the given node and all of its descendants.

Parameters

  • $node

    The node to tokenize. The node's default language is assumed to be the one returned by ft:current-lang().

Returns

  • object()*

    a (possibly empty) sequence of tokens.

Errors

  • err:FTST0009

    if ft:current-lang() is not supported.

Examples

tokenize-nodes#2

declare function ft:tokenize-nodes(
    $includes as node()+,
    $excludes as node()*
) as object()* external

Tokenizes the set of nodes comprising $includes (and all of its descendants) but excluding $excludes (and all of its descendants), if any.

Parameters

  • $includes

    The set of nodes (and its descendants) to include. The default language is assumed to be the one returned by ft:current-lang().

  • $excludes

    The set of nodes (and its descendants) to exclude.

Returns

  • object()*

    a (possibly empty) sequence of tokens.

Errors

  • err:FTST0009

    if ft:current-lang() is not supported.

Examples

tokenize-nodes#3

declare function ft:tokenize-nodes(
    $includes as node()+,
    $excludes as node()*,
    $lang as xs:language
) as object()* external

Tokenizes the set of nodes comprising $includes (and all of its descendants) but excluding $excludes (and all of its descendants), if any.

Parameters

  • $includes

    The set of nodes (and its descendants) to include.

  • $excludes

    The set of nodes (and its descendants) to exclude.

  • $lang

    The default language for nodes.

Returns

  • object()*

    a (possibly empty) sequence of tokens.

Errors

  • err:FTST0009

    if $lang is not supported.

Examples
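
A sketch of tokenizing a document while skipping one subtree. The element names are assumptions chosen for illustration. Only the number of tokens is returned, since the token object fields are not described in this document.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

let $doc :=
  <article>
    <title>Full-text search</title>
    <body>Tokens come from here.</body>
    <footer>Ignore this boilerplate.</footer>
  </article>
(: Include the whole article but exclude the footer subtree. :)
return count( ft:tokenize-nodes( $doc, $doc/footer, $ft:LANG-EN ) )
```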

tokenize-string#2

declare function ft:tokenize-string(
    $string as xs:string,
    $lang as xs:language
) as xs:string* external

Tokenizes the given string.

Parameters

  • $string

    The string to tokenize.

  • $lang

    The language of $string.

Returns

  • xs:string*

    a (possibly empty) sequence of tokens.

Errors

  • err:FTST0009

    if $lang is not supported.

Examples
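
A short sketch that lists each token with its position. The input string is an assumption; how punctuation and hyphenated words are split depends on the tokenizer for the given language.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

let $tokens := ft:tokenize-string( "Hello, full-text world!", $ft:LANG-EN )
for $t at $i in $tokens
(: Pair each token with its 1-based position in the result. :)
return concat( $i, ": ", $t )
```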

tokenize-string#1

declare function ft:tokenize-string(
    $string as xs:string
) as xs:string* external

Tokenizes the given string.

Parameters

  • $string

    The string to tokenize. The string's language is assumed to be the one returned by ft:current-lang().

Returns

  • xs:string*

    a (possibly empty) sequence of tokens.

Errors

  • err:FTST0009

    if ft:current-lang() is not supported.

Examples

tokenizer-properties#1

declare function ft:tokenizer-properties(
    $lang as xs:language
) as object() external

Gets properties of the tokenizer for the given language.

Parameters

  • $lang

    The language of the tokenizer to get the properties of.

Returns

  • object()

    said properties.

Errors

  • err:FTST0009

    if $lang is not supported for tokenization.

tokenizer-properties#0

declare function ft:tokenizer-properties() as object() external

Gets properties of the tokenizer for the language returned by ft:current-lang().

Returns

  • object()

    said properties.

Errors

  • err:FTST0009

    if ft:current-lang() is not supported.