zorba::Tokenizer

#include <zorba/tokenizer.h>

A Tokenizer breaks a string into a stream of word tokens. Each token is assigned a token, sentence, and paragraph number. A Tokenizer determines word and sentence boundaries automatically, but must be told when to increment the paragraph number.

Private Attributes

State *

state_

Public Functions

void

destroy() const =0

Destroys this Tokenizer.

void

properties(Properties *result) const =0

Gets the Properties of this Tokenizer.

State &

state()

Gets this Tokenizer's associated State.

State const &

state() const

Gets this Tokenizer's associated State.

void

tokenize_node(Item const &node, locale::iso639_1::type lang, Callback &callback)

Tokenizes the given node.

void

tokenize_string(char const *utf8_s, size_type utf8_len, locale::iso639_1::type lang, bool wildcards, Callback &callback, Item const *item=nullptr)=0

Tokenizes the given string.

Protected Functions

bool

find_lang_attribute(Item const &element, locale::iso639_1::type *lang)

Given an element, finds its xml:lang attribute, if any, and gets its value.

void

item(Item const &item, bool entering)

This member function is called whenever an item that is being tokenized is entered or exited.

void

tokenize_node_impl(Item const &node, locale::iso639_1::type lang, Callback &callback, bool tokenize_acp)

Tokenizes the given node and all of its child nodes, if any.

Tokenizer(State &state)

Constructs a Tokenizer.

~Tokenizer()=0

Destroys a Tokenizer.

Public Types

size_type

unsigned size_type

Private Attributes

state_

State * state_

Public Functions

destroy

void destroy() const =0

Destroys this Tokenizer.

This function is called by Zorba when the Tokenizer is no longer needed.

If your TokenizerProvider dynamically allocates Tokenizer objects, then the implementation can simply be (and usually is) delete this.

If your TokenizerProvider returns a pointer to a static Tokenizer object, then the implementation should do nothing.
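The two destroy() strategies described above can be sketched as follows. This is a minimal illustration, not Zorba code: the Tokenizer class here is a hypothetical stand-in reduced to destroy() and a protected destructor so that it compiles without the Zorba headers, and HeapTokenizer, StaticTokenizer, and the live counter are names invented for the example.

```cpp
// Hypothetical stand-in for zorba::Tokenizer, reduced to the members that
// matter for this example. It is NOT the real class.
class Tokenizer {
public:
  virtual void destroy() const = 0;
protected:
  virtual ~Tokenizer() = 0;   // callers must go through destroy()
};
inline Tokenizer::~Tokenizer() {}

// Case 1: the provider allocates each Tokenizer with new,
// so destroy() releases the object.
class HeapTokenizer : public Tokenizer {
public:
  explicit HeapTokenizer(int *live) : live_(live) { ++*live_; }
  void destroy() const override { delete this; }
private:
  ~HeapTokenizer() override { --*live_; }
  int *live_;   // instance counter, present only to make the effect visible
};

// Case 2: the provider hands out a pointer to one static object,
// so destroy() has nothing to release.
class StaticTokenizer : public Tokenizer {
public:
  static StaticTokenizer &instance() {
    static StaticTokenizer tokenizer;
    return tokenizer;
  }
  void destroy() const override { /* intentionally empty */ }
private:
  StaticTokenizer() {}
  ~StaticTokenizer() override {}
};

// Creates one heap-allocated tokenizer, destroys it the way Zorba would,
// and returns how many instances are still alive afterwards.
int live_after_destroy() {
  int live = 0;
  Tokenizer const *t = new HeapTokenizer(&live);
  t->destroy();
  return live;
}
```

Keeping the destructors non-public in the sketch means destroy() is the only way to dispose of a Tokenizer, which matches the protected destructor listed for this class.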

properties

void properties(Properties *result) const =0

Gets the Properties of this Tokenizer.

Parameters

result The Properties to populate.

state

State & state()

Gets this Tokenizer's associated State.

Returns

The State associated with this Tokenizer.

state

State const & state() const

Gets this Tokenizer's associated State.

Returns

The State associated with this Tokenizer.

tokenize_node

void tokenize_node(Item const &node, locale::iso639_1::type lang, Callback &callback)

Tokenizes the given node.

Parameters

node The node to tokenize.
lang The default language to use.
callback The Callback to call once per token.

tokenize_string

void tokenize_string(char const *utf8_s, size_type utf8_len, locale::iso639_1::type lang, bool wildcards, Callback &callback, Item const *item=nullptr)=0

Tokenizes the given string.

Parameters

utf8_s The UTF-8 string to tokenize. It need not be null-terminated.
utf8_len The number of bytes in the string to be tokenized.
lang The language of the string.
wildcards If true, allows XQuery wildcard syntax characters to be part of tokens.
callback The Callback to call once per token.
item The Item this string is from, if any.
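To illustrate the contract on utf8_s and utf8_len (a byte count, with no null terminator required) and the once-per-token callback, here is a minimal whitespace splitter. Everything in it is an assumption made for the example: the Callback type and its token() signature are simplified stand-ins, the lang, wildcards, and item parameters are omitted, tokens are numbered from 0 arbitrarily, and no sentence detection is attempted.

```cpp
#include <string>
#include <vector>

typedef unsigned size_type;

// Hypothetical, simplified callback: receives each token as a pointer plus
// a byte length, together with its running token number.
struct Callback {
  virtual ~Callback() {}
  virtual void token(char const *utf8_s, size_type utf8_len,
                     size_type token_no) = 0;
};

// ASCII whitespace only. Bytes >= 0x80 are never whitespace here, so
// multi-byte UTF-8 sequences are never split.
static bool is_space(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Simplified sketch of a tokenize_string() implementation. It never reads
// past utf8_s + utf8_len, so the input need not be null-terminated.
void tokenize_string(char const *utf8_s, size_type utf8_len,
                     Callback &callback) {
  size_type token_no = 0;
  size_type i = 0;
  while (i < utf8_len) {
    while (i < utf8_len && is_space(utf8_s[i])) ++i;    // skip separators
    size_type const start = i;
    while (i < utf8_len && !is_space(utf8_s[i])) ++i;   // consume one token
    if (i > start)
      callback.token(utf8_s + start, i - start, token_no++);
  }
}

// Example callback that copies every token into a vector.
struct Collector : Callback {
  std::vector<std::string> tokens;
  void token(char const *utf8_s, size_type utf8_len, size_type) override {
    tokens.push_back(std::string(utf8_s, utf8_len));
  }
};

std::vector<std::string> collect_tokens(char const *utf8_s,
                                        size_type utf8_len) {
  Collector collector;
  tokenize_string(utf8_s, utf8_len, collector);
  return collector.tokens;
}
```

Because the callback receives a pointer and a length rather than a C string, it has to copy the bytes if it wants to keep a token after the call returns.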

Protected Functions

find_lang_attribute

bool find_lang_attribute(Item const &element, locale::iso639_1::type *lang)

Given an element, finds its xml:lang attribute, if any, and gets its value.

Parameters

element The element to check.
lang A pointer to where to put the found language, if any.

Returns

Returns true only if an xml:lang attribute is found and the value is a known language.
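The return-value contract lends itself to a "keep the inherited language unless overridden" pattern. The sketch below uses hypothetical stand-ins throughout: Lang replaces locale::iso639_1::type, Element replaces Item, and the table of known languages is invented. It also assumes that *lang is left untouched when the function returns false, which the description above does not state.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for locale::iso639_1::type.
enum Lang { lang_unknown, lang_de, lang_en, lang_fr };

// Hypothetical stand-in for an element Item: just its attributes.
struct Element {
  std::map<std::string, std::string> attributes;
};

// Returns true only if xml:lang is present AND names a known language.
// Assumption: *lang is written only on success.
bool find_lang_attribute(Element const &element, Lang *lang) {
  std::map<std::string, std::string>::const_iterator const attr =
      element.attributes.find("xml:lang");
  if (attr == element.attributes.end())
    return false;                       // no xml:lang attribute

  Lang found = lang_unknown;
  if (attr->second == "de") found = lang_de;
  else if (attr->second == "en") found = lang_en;
  else if (attr->second == "fr") found = lang_fr;

  if (found == lang_unknown)
    return false;                       // attribute present, value not known

  *lang = found;
  return true;
}

// Usage: start from the inherited language and let the element override it.
Lang effective_lang(Element const &element, Lang inherited) {
  Lang lang = inherited;
  find_lang_attribute(element, &lang);
  return lang;
}
```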

item

void item(Item const &item, bool entering)

This member function is called whenever an item that is being tokenized is entered or exited.

Parameters

item The item being entered or exited.
entering If true, the item is being entered; if false, the item is being exited.

tokenize_node_impl

void tokenize_node_impl(Item const &node, locale::iso639_1::type lang, Callback &callback, bool tokenize_acp)

Tokenizes the given node and all of its child nodes, if any.

For each node, this function must call the item() member function of both this Tokenizer and the Callback twice: once when the node is entered and once when it is exited.

Parameters

node The node to tokenize.
lang The default language to use.
callback The Callback to call per token.
tokenize_acp If true, additionally tokenize all attribute, comment, and processing-instruction nodes encountered; if false, skip them.
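One way to guarantee the paired item() calls on every path out of the function is a small scope guard. The sketch below is an illustration under assumptions, not Zorba's implementation: Node and Callback are hypothetical stand-ins that only record calls, the lang and tokenize_acp parameters are omitted, and the order of the Tokenizer and Callback notifications is chosen arbitrarily.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a node: a name plus child nodes.
struct Node {
  std::string name;
  std::vector<Node> children;
};

// Hypothetical stand-in for the Callback; it only records item() calls.
struct Callback {
  std::vector<std::string> log;
  void item(Node const &node, bool entering) {
    log.push_back((entering ? "cb+" : "cb-") + node.name);
  }
};

class Tokenizer {
  // Calls item(..., true) on construction and item(..., false) on
  // destruction, so entrance and exit are always paired.
  struct ItemGuard {
    ItemGuard(Tokenizer &t, Callback &c, Node const &n)
        : t_(t), c_(c), n_(n) {
      t_.item(n_, true);
      c_.item(n_, true);
    }
    ~ItemGuard() {
      c_.item(n_, false);
      t_.item(n_, false);
    }
    Tokenizer &t_;
    Callback &c_;
    Node const &n_;
  };

  void item(Node const &node, bool entering) {
    log.push_back((entering ? "tk+" : "tk-") + node.name);
  }

public:
  std::vector<std::string> log;   // records this Tokenizer's own item() calls

  void tokenize_node_impl(Node const &node, Callback &callback) {
    ItemGuard guard(*this, callback, node);     // entrance notifications
    // A real implementation would tokenize the node's own text here.
    for (Node const &child : node.children)
      tokenize_node_impl(child, callback);      // recurse into children
  }                                             // exit notifications
};

// Walks a two-level tree and returns the calls the Callback observed.
std::vector<std::string> trace_two_level_tree() {
  Node const root{"root", {Node{"child", {}}}};
  Tokenizer tokenizer;
  Callback callback;
  tokenizer.tokenize_node_impl(root, callback);
  return callback.log;
}
```

For the two-level tree, the Callback sees the child's entrance and exit nested inside the root's, which is the bracketing the requirement above asks for.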

Tokenizer

 Tokenizer(State &state)

Constructs a Tokenizer.

Parameters

state The State to use.

~Tokenizer

 ~Tokenizer()=0

Destroys a Tokenizer.