Tokenization

Tokenization-related modeling classes

class pororo.tasks.tokenization.PororoTokenizationFactory(task: str, lang: str, model: Optional[str])[source]

Bases: pororo.tasks.utils.base.PororoFactoryBase

Tokenizes a given sentence with the tokenizer selected via the model argument.

Parameters

sent – (str) sentence to be tokenized

Returns

tokenized token list

Return type

List[str]

Examples

>>> tk = Pororo(task="tokenization", lang="ko", model="bpe32k.ko")
>>> tk("하늘을 나는 새를 보았다")
['▁하늘', '을', '▁나는', '▁새', '를', '▁보', '았다']
>>> tk = Pororo(task="tokenization", lang="en", model="roberta")
>>> tk("I love you")
['I', 'Ġlove', 'Ġyou']
>>> tk('''If the values aren’t unique, there is no unique inversion of the dictionary anyway or, with other words, inverting does not make sense.''')
['If', 'Ġthe', 'Ġvalues', 'Ġaren', 'âĢ', 'Ļ', 't', 'Ġunique', ',', 'Ġthere', 'Ġis', 'Ġno', 'Ġunique', 'Ġin', 'version', 'Ġof', 'Ġthe', 'Ġdictionary', 'Ġanyway', 'Ġor', ',', 'Ġwith', 'Ġother', 'Ġwords', ',', 'Ġinver', 'ting', 'Ġdoes', 'Ġnot', 'Ġmake', 'Ġsense', '.']
static get_available_langs()[source]
static get_available_models()[source]
load(device: str)[source]

Load user-selected task-specific model

Parameters

device (str) – device information

Returns

User-selected task-specific model

Return type

object
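
In practice the factory is rarely instantiated by hand; the Pororo entry point builds it from the task, lang, and model arguments and calls load with the target device. The direct route is sketched below, assuming the device string "cpu" is accepted by load; the comments describe expected behavior rather than captured output.

>>> from pororo import Pororo
>>> from pororo.tasks.tokenization import PororoTokenizationFactory
>>> PororoTokenizationFactory.get_available_langs()    # inspect supported language codes
>>> PororoTokenizationFactory.get_available_models()   # inspect model names available per language
>>> factory = PororoTokenizationFactory(task="tokenization", lang="ko", model="bpe32k.ko")
>>> tk = factory.load(device="cpu")                    # returns the task-specific tokenizer object
>>> tk("하늘을 나는 새를 보았다")                         # same result as the Pororo(...) call above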

class pororo.tasks.tokenization.PororoTokenizerBase(config: pororo.tasks.utils.base.TaskConfig)[source]

Bases: pororo.tasks.utils.base.PororoSimpleBase

abstract detokenize(tokens: List[str])[source]
abstract convert_tokens_to_ids(tokens: List[str])[source]
class pororo.tasks.tokenization.PororoSentTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

cj_tokenize(text: str)[source]
predict(text: str, **kwargs) → List[str][source]
class pororo.tasks.tokenization.PororoMecabKoTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
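
A usage sketch for the MeCab-backed Korean tokenizer. The model name "mecab_ko" is an assumption and should be checked against get_available_models(); the call and detokenize follow the interface documented above.

>>> tk = Pororo(task="tokenization", lang="ko", model="mecab_ko")   # model name assumed
>>> tokens = tk("하늘을 나는 새를 보았다")                             # morpheme-level tokens
>>> tk.detokenize(tokens)                                            # restores a plain-text sentence
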
class pororo.tasks.tokenization.PororoMosesTokenizer(model, detok, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
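
The Moses tokenizer wraps rule-based tokenization and detokenization for whitespace-delimited languages. A sketch, assuming the English model is registered under the name "moses":

>>> tk = Pororo(task="tokenization", lang="en", model="moses")   # model name assumed
>>> tokens = tk("I can't believe it.")                           # splits off punctuation and contractions
>>> tk.detokenize(tokens)                                        # re-attaches punctuation and spacing
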
class pororo.tasks.tokenization.PororoJiebaTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
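
The Jieba-backed tokenizer segments Chinese text, which has no whitespace word boundaries. A sketch, assuming the language code "zh" and the model name "jieba":

>>> tk = Pororo(task="tokenization", lang="zh", model="jieba")   # lang/model names assumed
>>> tokens = tk("我爱自然语言处理")                                 # word-level segments
>>> tk.detokenize(tokens)                                         # joins segments back into a sentence
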
class pororo.tasks.tokenization.PororoMecabTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
class pororo.tasks.tokenization.PororoWordTokenizer(config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str]) → str[source]

Detokenizing undoes the tokenization operation, restoring punctuation and spaces to the places people expect them to be. Ideally, detokenize(tokenize(text)) should be identical to text, except for line breaks.

predict(text: str, **kwargs) → List[str][source]
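
The round-trip property described in the docstring above can be checked directly. Both the language code "en" and the model name "word" are assumptions:

>>> tk = Pororo(task="tokenization", lang="en", model="word")   # lang/model names assumed
>>> text = "Hello, world! It's a test."
>>> tokens = tk(text)
>>> tk.detokenize(tokens) == text                               # ideally True, except for line breaks
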
class pororo.tasks.tokenization.PororoCharTokenizer(config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
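
Character-level tokenization splits the input into individual characters. The language code "ko" and the model name "char" are assumptions:

>>> tk = Pororo(task="tokenization", lang="ko", model="char")   # lang/model names assumed
>>> tokens = tk("안녕하세요")                                      # one token per character
>>> tk.detokenize(tokens)                                        # reassembles the original string
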
class pororo.tasks.tokenization.PororoJamoTokenizer(config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
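
The jamo tokenizer decomposes Hangul syllables into their constituent jamo (initial, medial, and final letters), and detokenize recomposes them. The model name "jamo" is an assumption:

>>> tk = Pororo(task="tokenization", lang="ko", model="jamo")   # model name assumed
>>> tokens = tk("하늘")                                           # syllables decomposed into jamo
>>> tk.detokenize(tokens)                                         # recomposes the original syllables
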
class pororo.tasks.tokenization.PororoJamoPairTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
class pororo.tasks.tokenization.PororoSPTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs)[source]
class pororo.tasks.tokenization.PororoMecabSPTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs)[source]
class pororo.tasks.tokenization.PororoRoBERTaTokenizer(model, vocab, inv_dict, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

convert_tokens_to_ids(tokens: List[str])[source]
predict(text: str, **kwargs)[source]
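
For the RoBERTa tokenizer, convert_tokens_to_ids maps byte-level BPE tokens to vocabulary indices, which is what downstream models consume. A sketch building on the "roberta" example above; the comment on the last line describes the expected return value rather than captured output.

>>> tk = Pororo(task="tokenization", lang="en", model="roberta")
>>> tokens = tk("I love you")                 # ['I', 'Ġlove', 'Ġyou']
>>> tk.convert_tokens_to_ids(tokens)          # expected: a list of integer ids from the RoBERTa vocabulary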