caikit.interfaces.nlp.data_model.text
Data structures for text representations
Attributes
Classes
- Token: Tokens here are the basic units of text.
- TokenizationResults: Tokenization result generated from a text.
- TokenizationStreamResult: Streaming tokenization result that indicates up to where in the stream has been processed.
- ChunkerTokenizationStreamResult: Streaming tokenization result that provides a pointer to the input chunk processed.
Module Contents
- class caikit.interfaces.nlp.data_model.text.Token
Bases: caikit.core.DataObjectBase
Tokens here are the basic units of text. Tokens can be characters, words, sub-words, or other segments of text or code, depending on the method of tokenization chosen or the task being implemented.
- start: py_to_proto.dataclass_to_proto.Annotated[int, FieldNumber(1)]
- end: py_to_proto.dataclass_to_proto.Annotated[int, FieldNumber(2)]
- text: py_to_proto.dataclass_to_proto.Annotated[str, FieldNumber(3)]
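The three fields above can be illustrated with a minimal sketch. It uses a plain Python dataclass as a stand-in rather than the caikit class itself, and it assumes that `start` and `end` are character offsets into the source text with `end` exclusive; the source does not state either convention. The `whitespace_tokenize` helper is hypothetical and exists only for this illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    """Plain stand-in mirroring the documented fields; not the caikit class."""

    start: int  # offset where the token begins (assumed character offset)
    end: int    # offset where the token ends (assumed exclusive)
    text: str   # surface text of the token


def whitespace_tokenize(text: str) -> list[Token]:
    """Hypothetical helper: one token per run of non-whitespace characters."""
    return [Token(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


source_text = "Hello caikit world"
tokens = whitespace_tokenize(source_text)
for tok in tokens:
    # Under the assumed convention, slicing the source recovers the token text.
    print(tok, source_text[tok.start:tok.end] == tok.text)
```

Under these assumptions each token can be mapped back to its exact position in the original text, which is the usual reason for carrying offsets alongside the text.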
- class caikit.interfaces.nlp.data_model.text.TokenizationResults
Bases: caikit.core.DataObjectBase
Tokenization result generated from a text.
- token_count: py_to_proto.dataclass_to_proto.Annotated[int | None, FieldNumber(4)]
- class caikit.interfaces.nlp.data_model.text.TokenizationStreamResult
Bases: TokenizationResults
Streaming tokenization result that indicates how far into the stream the text has been processed.
- processed_index: py_to_proto.dataclass_to_proto.Annotated[int, FieldNumber(2)]
- start_index: py_to_proto.dataclass_to_proto.Annotated[int, FieldNumber(3)]
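One way to picture `processed_index` and `start_index` is the sketch below. It is an interpretation, not the caikit implementation: it assumes `start_index` is the absolute offset where the processed span begins and `processed_index` is the absolute offset up to which the stream has been consumed. `StreamResult` and `stream_tokenize` are hypothetical names, and tokens are shown as plain `(start, end, text)` tuples.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class StreamResult:
    """Stand-in for a streaming result; field meanings are assumptions."""

    tokens: List[Tuple[int, int, str]]  # (start, end, text) with absolute offsets
    start_index: int      # assumed: where this processed span begins
    processed_index: int  # assumed: offset up to which the stream is processed


def stream_tokenize(chunks: Iterable[str]) -> Iterator[StreamResult]:
    """Tokenize a text stream, holding back a trailing partial word."""
    buffer = ""
    base = 0  # absolute offset of buffer[0] in the full stream
    for chunk in chunks:
        buffer += chunk
        # Only text up to the last space is safe; the tail may be a partial word.
        cut = buffer.rfind(" ") + 1
        if cut == 0:
            continue
        ready = buffer[:cut]
        tokens = [(base + m.start(), base + m.end(), m.group())
                  for m in re.finditer(r"\S+", ready)]
        yield StreamResult(tokens, start_index=base, processed_index=base + cut)
        base += cut
        buffer = buffer[cut:]
    if buffer:
        # End of stream: whatever remains is complete.
        tokens = [(base + m.start(), base + m.end(), m.group())
                  for m in re.finditer(r"\S+", buffer)]
        yield StreamResult(tokens, start_index=base, processed_index=base + len(buffer))


results = list(stream_tokenize(["Hello wo", "rld again", " done"]))
for r in results:
    print(r.start_index, r.processed_index, r.tokens)
```

In this sketch a consumer can resume from `processed_index` without re-reading earlier text, because each result's `start_index` equals the previous result's `processed_index`.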
- class caikit.interfaces.nlp.data_model.text.ChunkerTokenizationStreamResult
Bases: TokenizationStreamResult
Streaming tokenization result that provides a pointer to the input chunk that was processed.
- input_start_index: py_to_proto.dataclass_to_proto.Annotated[int, FieldNumber(20)]
- input_end_index: py_to_proto.dataclass_to_proto.Annotated[int, FieldNumber(21)]
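The sketch below shows one possible reading of `input_start_index` and `input_end_index`. It assumes they identify, by position in the input stream, the first and last input chunks that contributed to a result, with both ends inclusive; the source only says the result "provides a pointer to the input chunk processed". `ChunkPointer` and `group_until_period` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class ChunkPointer:
    """Stand-in result; index semantics are assumptions, not the caikit API."""

    text: str
    input_start_index: int  # assumed: first input chunk consumed (inclusive)
    input_end_index: int    # assumed: last input chunk consumed (inclusive)


def group_until_period(chunks: Iterable[str]) -> Iterator[ChunkPointer]:
    """Merge input chunks until one ends with a period, then emit a result."""
    parts: List[str] = []
    first = 0
    last = 0
    for i, chunk in enumerate(chunks):
        if not parts:
            first = i  # remember which input chunk opened this group
        parts.append(chunk)
        last = i
        if chunk.rstrip().endswith("."):
            yield ChunkPointer("".join(parts), first, last)
            parts = []
    if parts:
        # Stream ended mid-sentence: emit what was collected.
        yield ChunkPointer("".join(parts), first, last)


pointers = list(group_until_period(["The cat ", "sat.", " It ", "slept", "."]))
for p in pointers:
    print(p.input_start_index, p.input_end_index, repr(p.text))
```

With indices like these, a caller could tell which inputs a result covers even when input chunks and output results do not line up one to one.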