7.3.70. tokenize#
7.3.70.1. Summary#
tokenize command tokenizes text with the specified tokenizer.
It is useful for debugging tokenization.
7.3.70.2. Syntax#
This command takes several parameters.
tokenizer and string are required parameters. The others are
optional:
tokenize tokenizer
string
[normalizer=null]
[flags=NONE]
[mode=ADD]
[token_filters=NONE]
[output_style=full]
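You can also pass parameters by name instead of by position. As a
sketch, the following is equivalent to the positional form (the
parameter names are the ones listed in the syntax above; NormalizerAuto
is used here only for illustration):
tokenize --tokenizer TokenBigram --string "Fulltext Search" --normalizer NormalizerAuto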
7.3.70.3. Usage#
Here is a simple example.
Execution example:
tokenize TokenBigram "Fulltext Search"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Fu",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " S",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Se",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h",
# "position": 14,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
This example uses only the required parameters. tokenizer is
TokenBigram and string is "Fulltext Search". It returns the tokens
generated by tokenizing "Fulltext Search" with the TokenBigram
tokenizer. It doesn't normalize "Fulltext Search".
7.3.70.4. Parameters#
This section describes all parameters. Parameters are categorized.
7.3.70.4.1. Required parameters#
There are two required parameters: tokenizer and string.
7.3.70.4.1.1. tokenizer#
Specifies the tokenizer name. tokenize command uses the
tokenizer with the specified name.
See Tokenizers for the built-in tokenizers.
Here is an example that uses the built-in TokenTrigram tokenizer.
Execution example:
tokenize TokenTrigram "Fulltext Search"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Ful",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ull",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "llt",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lte",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "tex",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ext",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt ",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t S",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " Se",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Sea",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ear",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "arc",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rch",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h",
# "position": 14,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
If you want to use other tokenizers, you need to register an additional
tokenizer plugin with the register command. For example, you can use
a KyTea based tokenizer by
registering tokenizers/kytea.
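Here is a minimal sketch of that flow (assuming the KyTea tokenizer
plugin is installed; TokenKytea is the tokenizer name the plugin
provides):
register tokenizers/kytea
tokenize TokenKytea "Fulltext Search"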
7.3.70.4.1.2. string#
Specifies the string that you want to tokenize.
If you want to include spaces in string, you need to quote
string with single quotes (') or double quotes (").
Here is an example that uses spaces in string.
Execution example:
tokenize TokenBigram "Groonga is a fast fulltext earch engine!"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Gr",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ro",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "oo",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "on",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ng",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ga",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "a ",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " i",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "is",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "s ",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " a",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "a ",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " f",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "fa",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "as",
# "position": 14,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "st",
# "position": 15,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 16,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " f",
# "position": 17,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "fu",
# "position": 18,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 19,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 20,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 21,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 22,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 23,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 24,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 25,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " e",
# "position": 26,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 27,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 28,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 29,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 30,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h ",
# "position": 31,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " e",
# "position": 32,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "en",
# "position": 33,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ng",
# "position": 34,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "gi",
# "position": 35,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "in",
# "position": 36,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ne",
# "position": 37,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "e!",
# "position": 38,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "!",
# "position": 39,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
7.3.70.4.2. Optional parameters#
There are optional parameters.
7.3.70.4.2.1. normalizer#
Specifies the normalizer name. tokenize command uses the
normalizer with the specified name. A normalizer is important for
N-gram family tokenizers such as TokenBigram.
The normalizer detects the character type of each character while normalizing, and N-gram family tokenizers use these character types while tokenizing.
Here is an example that doesn't use a normalizer.
Execution example:
tokenize TokenBigram "Fulltext Search"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Fu",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " S",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Se",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h",
# "position": 14,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
All alphabetic characters are tokenized two characters at a time. For
example, Fu is a token.
Here is an example that uses a normalizer.
Execution example:
tokenize TokenBigram "Fulltext Search" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "fulltext",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "search",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
A continuous run of alphabetic characters is tokenized as one token.
For example, fulltext is a token.
If you want to tokenize two characters at a time even with a
normalizer, use TokenBigramSplitSymbolAlpha.
Execution example:
tokenize TokenBigramSplitSymbolAlpha "Fulltext Search" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "fu",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "se",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
All alphabetic characters are tokenized two characters at a time, and
they are normalized to lowercase. For example, fu is a token.
7.3.70.4.2.2. flags#
Specifies tokenization customization options. You can specify
multiple options separated by "|". For example,
NONE|ENABLE_TOKENIZED_DELIMITER (see the sketch after the table below).
Here are the available flags.
| Flag | Description |
|---|---|
| NONE | Just ignored. |
| ENABLE_TOKENIZED_DELIMITER | Enables the tokenized delimiter. See Tokenizers for details about the tokenized delimiter. |
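As a sketch, you could pass both flags at once as follows; since
NONE is just ignored, this behaves the same as specifying
ENABLE_TOKENIZED_DELIMITER alone:
tokenize TokenDelimit "Fulltext Search" NormalizerAuto NONE|ENABLE_TOKENIZED_DELIMITER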
Here is an example that uses ENABLE_TOKENIZED_DELIMITER.
Execution example:
tokenize TokenDelimit "Fulltext Seacrch" NormalizerAuto ENABLE_TOKENIZED_DELIMITER
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "full",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "text sea",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "crch",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
TokenDelimit tokenizer is one of the tokenizers that support the
tokenized delimiter. ENABLE_TOKENIZED_DELIMITER enables the tokenized delimiter.
The tokenized delimiter is a special character that indicates a token
border: U+FFFE. This code point is not assigned to any
character, which means it never appears in normal
strings, so it is a good character for this purpose. If
ENABLE_TOKENIZED_DELIMITER is enabled, the target string is
treated as an already tokenized string, and the tokenizer just splits
it at each tokenized delimiter. In the example above, the input string
contains invisible U+FFFE characters after "Full" and "Sea", which is
why the resulting tokens are full, text sea and crch.
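Because U+FFFE is invisible, it is easiest to see over the HTTP
interface. A sketch (assuming the standard /d/ endpoint; %EF%BF%BE is
the percent-encoded UTF-8 form of U+FFFE):
/d/tokenize?tokenizer=TokenDelimit&string=Full%EF%BF%BEtext%20Sea%EF%BF%BEcrch&normalizer=NormalizerAuto&flags=ENABLE_TOKENIZED_DELIMITER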
7.3.70.4.2.3. mode#
Specifies the tokenize mode. If ADD is specified, the text
is tokenized by the rule used when adding a document. If
GET is specified, the text is tokenized by the rule used when
searching for a document. If the mode is omitted, the text is
tokenized in ADD mode.
The default mode is ADD.
Here is an example of the ADD mode.
Execution example:
tokenize TokenBigram "Fulltext Search" --mode ADD
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Fu",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " S",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Se",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h",
# "position": 14,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
In ADD mode, the final token is a one-character token (h).
Here is an example of the GET mode.
Execution example:
tokenize TokenBigram "Fulltext Search" --mode GET
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Fu",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " S",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Se",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
In GET mode, the final token is a two-character token (ch).
7.3.70.4.2.4. token_filters#
Specifies the token filter names. tokenize command uses the
token filters with the specified names.
See Token filters for details about token filters.
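Here is a minimal sketch (assuming the token_filters/stem plugin is
installed; it provides a token filter named TokenFilterStem that stems
each token):
register token_filters/stem
tokenize TokenBigram "There are many tokenizers" NormalizerAuto --token_filters TokenFilterStem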
7.3.70.4.2.5. output_style#
Added in version 15.0.9.
Specifies the output style of the tokenize command.
| Style | Description |
|---|---|
| full | This is the default output style. Returns an array of objects with full token attributes (value, position, flags, etc.) as usual. |
| simple | Shows only the token values for readability. Returns an array of token values such as ["Fu", "ul", ...]. |
Here is an example with the full (default) output style.
Execution example:
tokenize TokenNgram "Fulltext Search" --output_style full
# [
# [
# 0,
# 1746572920.136098,
# 0.0004820823669433594
# ],
# [
# {
# "value": "Fu",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " S",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Se",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h",
# "position": 14,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
And here is the same command with the simple output style.
Execution example:
tokenize TokenNgram "Fulltext Search" --output_style simple
# [
# [
# 0,
# 1746573056.540744,
# 0.0007045269012451172
# ],
# [
# "Fu",
# "ul",
# "ll",
# "lt",
# "te",
# "ex",
# "xt",
# "t ",
# " S",
# "Se",
# "ea",
# "ar",
# "rc",
# "ch",
# "h"
# ]
# ]
7.3.70.5. Return value#
tokenize command returns tokenized tokens. Each token has some
attributes except token itself. The attributes will be increased in
the feature:
[HEADER, tokens]
HEADER
See Output format about
HEADER.
tokens
tokensis an array of token. Token is an object that has the following attributes.
Name
Description
valueToken itself.
positionThe N-th token.