Tokenization

Tokenization-related modeling classes

class pororo.tasks.tokenization.PororoTokenizationFactory(task: str, lang: str, model: Optional[str])[source]

Bases: pororo.tasks.utils.base.PororoFactoryBase

Tokenizes a given sentence with the tokenizer selected via the model argument.

Parameters

sent – (str) sentence to be tokenized

Returns

tokenized token list

Return type

List[str]

Examples

>>> tk = Pororo(task="tokenization", lang="ko", model="bpe32k.ko")
>>> tk("하늘을 나는 새를 보았다")
['▁하늘', '을', '▁나는', '▁새', '를', '▁보', '았다']
>>> tk = Pororo(task="tokenization", lang="en", model="roberta")
>>> tk("I love you")
['I', 'Ġlove', 'Ġyou']
>>> tk('''If the values aren’t unique, there is no unique inversion of the dictionary anyway or, with other words, inverting does not make sense.''')
['If', 'Ġthe', 'Ġvalues', 'Ġaren', 'âĢ', 'Ļ', 't', 'Ġunique', ',', 'Ġthere', 'Ġis', 'Ġno', 'Ġunique', 'Ġin', 'version', 'Ġof', 'Ġthe', 'Ġdictionary', 'Ġanyway', 'Ġor', ',', 'Ġwith', 'Ġother', 'Ġwords', ',', 'Ġinver', 'ting', 'Ġdoes', 'Ġnot', 'Ġmake', 'Ġsense', '.']
static get_available_langs()[source]
static get_available_models()[source]
load(device: str)[source]

Load user-selected task-specific model

Parameters

device (str) – device information

Returns

User-selected task-specific model

Return type

object
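
In practice the factory is rarely instantiated by hand; the Pororo entry point builds it from the task, lang, and model arguments and calls load with the target device. The direct route is sketched below, assuming the device string "cpu" is accepted by load; the comments describe expected behavior rather than captured output.

>>> from pororo import Pororo
>>> from pororo.tasks.tokenization import PororoTokenizationFactory
>>> PororoTokenizationFactory.get_available_langs()    # inspect supported language codes
>>> PororoTokenizationFactory.get_available_models()   # inspect model names available per language
>>> factory = PororoTokenizationFactory(task="tokenization", lang="ko", model="bpe32k.ko")
>>> tk = factory.load(device="cpu")                    # returns the task-specific tokenizer object
>>> tk("하늘을 나는 새를 보았다")                         # same result as the Pororo(...) call above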

class pororo.tasks.tokenization.PororoTokenizerBase(config: pororo.tasks.utils.base.TaskConfig)[source]

Bases: pororo.tasks.utils.base.PororoSimpleBase

abstract detokenize(tokens: List[str])[source]
abstract convert_tokens_to_ids(tokens: List[str])[source]
class pororo.tasks.tokenization.PororoSentTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

cj_tokenize(text: str)[source]
predict(text: str, **kwargs) → List[str][source]
class pororo.tasks.tokenization.PororoMecabKoTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
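
A usage sketch for the MeCab-backed Korean tokenizer. The model name "mecab_ko" is an assumption and should be checked against get_available_models(); the call and detokenize follow the interface documented above.

>>> tk = Pororo(task="tokenization", lang="ko", model="mecab_ko")   # model name assumed
>>> tokens = tk("하늘을 나는 새를 보았다")                             # morpheme-level tokens
>>> tk.detokenize(tokens)                                            # restores a plain-text sentence
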
class pororo.tasks.tokenization.PororoMosesTokenizer(model, detok, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
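
The Moses tokenizer wraps rule-based tokenization and detokenization for whitespace-delimited languages. A sketch, assuming the English model is registered under the name "moses":

>>> tk = Pororo(task="tokenization", lang="en", model="moses")   # model name assumed
>>> tokens = tk("I can't believe it.")                           # splits off punctuation and contractions
>>> tk.detokenize(tokens)                                        # re-attaches punctuation and spacing
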
class pororo.tasks.tokenization.PororoJiebaTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
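
The Jieba-backed tokenizer segments Chinese text, which has no whitespace word boundaries. A sketch, assuming the language code "zh" and the model name "jieba":

>>> tk = Pororo(task="tokenization", lang="zh", model="jieba")   # lang/model names assumed
>>> tokens = tk("我爱自然语言处理")                                 # word-level segments
>>> tk.detokenize(tokens)                                         # joins segments back into a sentence
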
class pororo.tasks.tokenization.PororoMecabTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
class pororo.tasks.tokenization.PororoWordTokenizer(config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str]) → str[source]

Detokenizing undoes the tokenization operation, restoring punctuation and spaces to the places people expect them to be. Ideally, detokenize(tokenize(text)) should be identical to text, except for line breaks.

predict(text: str, **kwargs) → List[str][source]
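
The round-trip property described in the docstring above can be checked directly. Both the language code "en" and the model name "word" are assumptions:

>>> tk = Pororo(task="tokenization", lang="en", model="word")   # lang/model names assumed
>>> text = "Hello, world! It's a test."
>>> tokens = tk(text)
>>> tk.detokenize(tokens) == text                               # ideally True, except for line breaks
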
class pororo.tasks.tokenization.PororoCharTokenizer(config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
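
Character-level tokenization splits the input into individual characters. The language code "ko" and the model name "char" are assumptions:

>>> tk = Pororo(task="tokenization", lang="ko", model="char")   # lang/model names assumed
>>> tokens = tk("안녕하세요")                                      # one token per character
>>> tk.detokenize(tokens)                                        # reassembles the original string
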
class pororo.tasks.tokenization.PororoJamoTokenizer(config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
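
The jamo tokenizer decomposes Hangul syllables into their constituent jamo (initial, medial, and final letters), and detokenize recomposes them. The model name "jamo" is an assumption:

>>> tk = Pororo(task="tokenization", lang="ko", model="jamo")   # model name assumed
>>> tokens = tk("하늘")                                           # syllables decomposed into jamo
>>> tk.detokenize(tokens)                                         # recomposes the original syllables
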
class pororo.tasks.tokenization.PororoJamoPairTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs) → List[str][source]
class pororo.tasks.tokenization.PororoSPTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs)[source]
class pororo.tasks.tokenization.PororoMecabSPTokenizer(model, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

detokenize(tokens: List[str])[source]
predict(text: str, **kwargs)[source]
class pororo.tasks.tokenization.PororoRoBERTaTokenizer(model, vocab, inv_dict, config)[source]

Bases: pororo.tasks.tokenization.PororoTokenizerBase

convert_tokens_to_ids(tokens: List[str])[source]
predict(text: str, **kwargs)[source]
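
For the RoBERTa tokenizer, convert_tokens_to_ids maps byte-level BPE tokens to vocabulary indices, which is what downstream models consume. A sketch building on the "roberta" example above; the comment on the last line describes the expected return value rather than captured output.

>>> tk = Pororo(task="tokenization", lang="en", model="roberta")
>>> tokens = tk("I love you")                 # ['I', 'Ġlove', 'Ġyou']
>>> tk.convert_tokens_to_ids(tokens)          # expected: a list of integer ids from the RoBERTa vocabulary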