7.8.10. TokenDelimit#
7.8.10.1. Summary#
TokenDelimit extracts tokens by splitting text on one or more space
characters (U+0020). For example, Hello World is tokenized into
Hello and World.
TokenDelimit is suitable for tag text. For example, you can extract
groonga, full-text-search, and http as tags from groonga
full-text-search http.
7.8.10.2. Syntax#
TokenDelimit has optional parameters.
No options (extracts tokens by splitting on one or more space characters (U+0020)):
TokenDelimit
Specify delimiter:
TokenDelimit("delimiter", "delimiter1", "delimiter", "delimiter2", ...)
Specify delimiter with regular expression:
TokenDelimit("pattern", pattern)
The delimiter option and the pattern option cannot be used at the same time.
7.8.10.3. Usage#
7.8.10.4. Simple usage#
Here is an example of TokenDelimit:
Execution example:
tokenize TokenDelimit "Groonga full-text-search HTTP" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "groonga",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "full-text-search",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "http",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
TokenDelimit can also take options.
TokenDelimit has a delimiter option and a pattern option.
The delimiter option splits tokens on a specified character.
For example, Hello,World is tokenized into Hello and World
with the delimiter option as below.
Execution example:
tokenize 'TokenDelimit("delimiter", ",")' "Hello,World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Hello",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "World",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
The pattern option splits tokens with a regular expression.
You can exclude needless spaces with the pattern option.
For example, This is a pen. This is an apple. is tokenized into
This is a pen and This is an apple with the pattern option as below.
Normally, when This is a pen. This is an apple. is split by .,
a needless space is included at the beginning of “This is an apple.”.
You can exclude that needless space with the pattern option as in the example below.
Execution example:
tokenize 'TokenDelimit("pattern", "\\\\.\\\\s*")' "This is a pen. This is an apple."
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "This is a pen",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "This is an apple",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
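The effect of consuming the trailing spaces can be illustrated outside Groonga. The sketch below uses Python's re module purely to demonstrate the regular expression; it is not part of Groonga:

```python
import re

text = "This is a pen. This is an apple."

# Splitting on "." alone leaves a needless leading space
# on the second token.
naive = [t for t in re.split(r"\.", text) if t]
# → ['This is a pen', ' This is an apple']

# "\.\s*" also consumes the spaces that follow the delimiter,
# which is what the pattern option above does.
clean = [t for t in re.split(r"\.\s*", text) if t]
# → ['This is a pen', 'This is an apple']
```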
7.8.10.5. Advanced usage#
The delimiter option can also take multiple delimiters.
For example, Hello, World is tokenized into Hello and World.
"," and " " are the delimiters in the example below.
Execution example:
tokenize 'TokenDelimit("delimiter", ",", "delimiter", " ")' "Hello, World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Hello",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "World",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
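The multiple-delimiter behavior can be approximated with a character class in Python's re module (an illustration only; Groonga matches the delimiter strings literally, while this sketch uses a regular expression):

```python
import re

# "," and " " both act as delimiters; empty tokens produced by
# adjacent delimiters (the ", " pair here) are dropped.
tokens = [t for t in re.split(r"[, ]", "Hello, World") if t]
# → ['Hello', 'World']
```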
You can extract tokens under complex conditions with the pattern option.
For example, これはペンですか!?リンゴですか?「リンゴです。」 is tokenized into これはペンですか, リンゴですか, and 「リンゴです。」 with the pattern option as below.
Execution example:
tokenize 'TokenDelimit("pattern", "([。!?]+(?![)」])|[\\r\\n]+)\\s*")' "これはペンですか!?リンゴですか?「リンゴです。」"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "これはペンですか",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "リンゴですか",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "「リンゴです。」",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
\\s* at the end of the above regular expression matches zero or more spaces after a delimiter.
[。!?]+ matches one or more occurrences of 。, !, or ?.
For example, [。!?]+ matches !? in これはペンですか!?.
(?![)」]) is a negative lookahead: it succeeds only if the next character is not ) or 」.
A negative lookahead applies to the expression just before it, so the whole pattern is interpreted as [。!?]+(?![)」]).
[。!?]+(?![)」]) therefore matches only when 。, !, or ? is not followed by ) or 」.
In other words, [。!?]+(?![)」]) matches 。 in これはペンですか。, but it doesn't match 。 in 「リンゴです。」, because 」 follows 。.
[\\r\\n]+ matches one or more newline characters.
In conclusion, ([。!?]+(?![)」])|[\\r\\n]+)\\s* uses 。, !, ?, and newline characters as delimiters. However, 。, !, and ? are not treated as delimiters when they are followed by ) or 」.
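The same pattern can be tried with Python's re module, which supports the same negative-lookahead syntax (again, an illustration only, not Groonga itself):

```python
import re

# 。, !, ? and newlines act as delimiters, except when 。/!/?
# is directly followed by ) or 」 (the negative lookahead).
# (?:...) is used instead of (...) so that re.split does not
# return the delimiters themselves.
pattern = r"(?:[。!?]+(?![)」])|[\r\n]+)\s*"
text = "これはペンですか!?リンゴですか?「リンゴです。」"
tokens = [t for t in re.split(pattern, text) if t]
# → ['これはペンですか', 'リンゴですか', '「リンゴです。」']
```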
7.8.10.6. Parameters#
7.8.10.6.1. Optional parameter#
There are two optional parameters: delimiter and pattern.
7.8.10.6.1.1. delimiter#
Splits tokens on one or more specified delimiters.
A delimiter may consist of one or more characters.
7.8.10.6.1.2. pattern#
Splits tokens with a regular expression.