zorba::Tokenizer
#include <zorba/tokenizer.h>
A Tokenizer breaks a string into a stream of word tokens.
Each token is assigned a token, sentence, and paragraph number. A Tokenizer determines word and sentence boundaries automatically, but must be told when to increment the paragraph number.

Public Functions

void destroy() const =0
    Destroys this Tokenizer.
void properties(Properties *result) const =0
    Gets the Properties of this Tokenizer.
State & state()
    Gets this Tokenizer's associated State.
State const & state() const
    Gets this Tokenizer's associated State.
void tokenize_node(Item const &node, locale::iso639_1::type lang, Callback &callback)
    Tokenizes the given node.
void tokenize_string(char const *utf8_s, size_type utf8_len, locale::iso639_1::type lang, bool wildcards, Callback &callback, Item const *item=nullptr)=0
    Tokenizes the given string.

Protected Functions

bool find_lang_attribute(Item const &element, locale::iso639_1::type *lang)
    Given an element, finds its xml:lang attribute, if any, and gets its value.
void item(Item const &item, bool entering)
    Called whenever an item that is being tokenized is entered or exited.
void tokenize_node_impl(Item const &node, locale::iso639_1::type lang, Callback &callback, bool tokenize_acp)
    Tokenizes the given node and all of its child nodes, if any.
Tokenizer(State &state)
    Constructs a Tokenizer.
~Tokenizer()=0
    Destroys a Tokenizer.
Public Types

size_type
    unsigned size_type

Public Functions

destroy
    void destroy() const =0
Destroys this Tokenizer.
This function is called by Zorba when the Tokenizer is no longer needed. If your TokenizerProvider dynamically allocates Tokenizer objects, then the implementation can simply be (and usually is) delete this. If your TokenizerProvider returns a pointer to a static Tokenizer object, then the implementation should do nothing.
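For example, a minimal sketch of the two destroy() strategies described above (the subclass name MyTokenizer is hypothetical, and its remaining pure virtual functions are omitted):

    #include <zorba/tokenizer.h>

    // Hypothetical Tokenizer subclass; properties(), tokenize_string(),
    // and the destructor must also be implemented but are elided here.
    class MyTokenizer : public zorba::Tokenizer {
    public:
      MyTokenizer( State &state ) : zorba::Tokenizer( state ) { }

      // Case 1: the TokenizerProvider allocates MyTokenizer dynamically,
      // so destroy() simply deletes the object.
      void destroy() const {
        delete this;
      }

      // Case 2 (alternative): the TokenizerProvider returns a pointer to
      // a static MyTokenizer, so destroy() would instead be a no-op:
      //   void destroy() const { }
    };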
properties
    void properties(Properties *result) const =0
Gets the Properties of this Tokenizer.

state
    State & state()
Gets this Tokenizer's associated State.
Returns: Said State.
state
    State const & state() const
Gets this Tokenizer's associated State.
Returns: Said State.
tokenize_node
    void tokenize_node(Item const &node, locale::iso639_1::type lang, Callback &callback)
Tokenizes the given node.
Parameters:
    node        The node to tokenize.
    lang        The default language to use.
    callback    The Callback to call once per token.
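As an illustration, a call might look like the following sketch; the node and callback are assumed to have been obtained elsewhere (from a query result and your own Callback subclass, respectively), and the helper function name is hypothetical:

    #include <zorba/item.h>
    #include <zorba/tokenizer.h>

    // Hypothetical helper: tokenize one XML node, using English as the
    // default language for any text that lacks its own xml:lang.
    void tokenize_one_node( zorba::Tokenizer &tokenizer,
                            zorba::Item const &node,
                            zorba::Tokenizer::Callback &callback ) {
      tokenizer.tokenize_node( node, zorba::locale::iso639_1::en, callback );
    }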
tokenize_string
    void tokenize_string(char const *utf8_s, size_type utf8_len, locale::iso639_1::type lang, bool wildcards, Callback &callback, Item const *item=nullptr)=0
Tokenizes the given string.
Parameters:
    utf8_s      The UTF-8 string to tokenize. It need not be null-terminated.
    utf8_len    The number of bytes in the string to be tokenized.
    lang        The language of the string.
    wildcards   If true, allows XQuery wildcard syntax characters to be part of tokens.
    callback    The Callback to call once per token.
    item        The Item this string is from, if any.
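For example, a caller might tokenize a plain UTF-8 C string as in the following sketch; the Callback subclass is assumed to exist already, and the helper function name is hypothetical:

    #include <cstring>
    #include <zorba/tokenizer.h>

    // Hypothetical helper: tokenize an English string that contains no
    // XQuery wildcard syntax and is not associated with any Item.
    void tokenize_text( zorba::Tokenizer &tokenizer,
                        zorba::Tokenizer::Callback &callback ) {
      char const *const text = "Hello, world.  How are you?";
      tokenizer.tokenize_string(
        text, std::strlen( text ),      // byte count; need not be null-terminated
        zorba::locale::iso639_1::en,    // language of the string
        false,                          // no wildcard syntax in tokens
        callback );                     // called once per token
      // The trailing 'item' argument is left at its default of nullptr.
    }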
Protected Functions

find_lang_attribute
    bool find_lang_attribute(Item const &element, locale::iso639_1::type *lang)
Given an element, finds its xml:lang attribute, if any, and gets its value.
Parameters:
    element     The element to check.
    lang        A pointer to where to put the found language, if any.
Returns: true only if an xml:lang attribute is found and the value is a known language.
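A typical use in a derived class is to let an element's xml:lang attribute override the default language, falling back to the caller-supplied language when the attribute is absent or its value is unknown. A sketch, assuming a hypothetical MyTokenizer subclass:

    // Hypothetical member function of a Tokenizer subclass: decide which
    // language to use when tokenizing a given element's text.
    zorba::locale::iso639_1::type
    MyTokenizer::element_lang( zorba::Item const &element,
                               zorba::locale::iso639_1::type default_lang ) {
      zorba::locale::iso639_1::type lang;
      if ( find_lang_attribute( element, &lang ) )
        return lang;          // element carries a usable xml:lang attribute
      return default_lang;    // otherwise keep the caller's default
    }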
item
    void item(Item const &item, bool entering)
This member function is called whenever an item that is being tokenized is entered or exited.
Parameters:
    item        The item being entered or exited.
    entering    If true, the item is being entered; if false, the item is being exited.
tokenize_node_impl
    void tokenize_node_impl(Item const &node, locale::iso639_1::type lang, Callback &callback, bool tokenize_acp)
Tokenizes the given node and all of its child nodes, if any.
For each node, it is required that this function call the item() member function of both this Tokenizer and of the Callback twice, once each for entrance and exit.
Parameters:
    node          The node to tokenize.
    lang          The default language to use.
    callback      The Callback to call per token.
    tokenize_acp  If true, additionally tokenize all attribute, comment, and processing-instruction nodes encountered; if false, skip them.
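The entrance/exit requirement above usually translates into a pattern like the following sketch; the subclass name is hypothetical, the recursion and tokenization details are elided, and the Callback's item() member is assumed here to take the same arguments as Tokenizer::item():

    // Hypothetical override in a Tokenizer subclass.
    void MyTokenizer::tokenize_node_impl( zorba::Item const &node,
                                          zorba::locale::iso639_1::type lang,
                                          zorba::Tokenizer::Callback &callback,
                                          bool tokenize_acp ) {
      item( node, true );             // tell this Tokenizer we are entering
      callback.item( node, true );    // ... and notify the Callback as well

      // ... tokenize this node's text and recurse into its child nodes,
      //     honoring tokenize_acp for attributes, comments, and PIs ...

      callback.item( node, false );   // exiting: notify the Callback
      item( node, false );            // ... and this Tokenizer
    }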
Tokenizer
    Tokenizer(State &state)
Constructs a Tokenizer.

~Tokenizer
    ~Tokenizer()=0
Destroys a Tokenizer.