Tokenization
Tokenization-related modeling classes
class pororo.tasks.tokenization.PororoTokenizationFactory(task: str, lang: str, model: Optional[str])
Bases: pororo.tasks.utils.base.PororoFactoryBase
Tokenizes a sentence using the dictionary (tokenizer model) of your choice.
Parameters
sent (str) – sentence to be tokenized
Returns
list of tokens
Return type
List[str]
Examples
>>> from pororo import Pororo
>>> tk = Pororo(task="tokenization", lang="ko", model="bpe32k.ko")
>>> tk("하늘을 나는 새를 보았다")
["_하늘", "을", "_나는", "_새", "를", "_보", "았다"]
>>> tk = Pororo(task="tokenization", lang="en", model="roberta")
>>> tk("I love you")
['I', 'Ġlove', 'Ġyou']
>>> tk('''If the values aren’t unique, there is no unique inversion of the dictionary anyway or, with other words, inverting does not make sense.''')
['If', 'Ġthe', 'Ġvalues', 'Ġaren', 'âĢ', 'Ļ', 't', 'Ġunique', ',', 'Ġthere', 'Ġis', 'Ġno', 'Ġunique', 'Ġin', 'version', 'Ġof', 'Ġthe', 'Ġdictionary', 'Ġanyway', 'Ġor', ',', 'Ġwith', 'Ġother', 'Ġwords', ',', 'Ġinver', 'ting', 'Ġdoes', 'Ġnot', 'Ġmake', 'Ġsense', '.']
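The `Ġ`, `âĢ`, and `Ļ` pieces in the `roberta` output are not encoding errors. They look like the printable stand-ins that GPT-2 style byte-level BPE uses for raw bytes, where `Ġ` stands for a space and `âĢ` + `Ļ` for the three UTF-8 bytes of `’`. The sketch below assumes the `roberta` model uses that byte alphabet, which the output above suggests but this page does not state. It rebuilds the byte table and maps a token list back to text. `decode_tokens` is a hypothetical helper written for illustration and is not part of Pororo.

```python
from typing import Dict, List


def bytes_to_unicode() -> Dict[int, str]:
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    # Bytes that already render as visible characters map to themselves:
    # "!".."~" (0x21-0x7E), "¡".."¬" (0xA1-0xAC), "®".."ÿ" (0xAE-0xFF).
    printable = (
        list(range(0x21, 0x7F))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    table = {b: chr(b) for b in printable}
    # Every other byte (space, control bytes, ...) is shifted to code
    # points 256 and up, in increasing byte order.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


def decode_tokens(tokens: List[str]) -> str:
    """Hypothetical helper: turn byte-level BPE tokens back into text."""
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    raw = bytes(char_to_byte[c] for c in "".join(tokens))
    return raw.decode("utf-8")


print(decode_tokens(["I", "Ġlove", "Ġyou"]))      # I love you
print(decode_tokens(["Ġaren", "âĢ", "Ļ", "t"]))   # " aren’t" (leading space)
```

Because the mapping works on bytes, a single non-ASCII character such as `’` can be split across several tokens, so tokens should be joined before decoding rather than decoded one at a time.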
class pororo.tasks.tokenization.PororoTokenizerBase(config: pororo.tasks.utils.base.TaskConfig)
Bases: pororo.tasks.utils.base.PororoSimpleBase
class pororo.tasks.tokenization.PororoWordTokenizer(config)