caikit.core.data_model ====================== .. py:module:: caikit.core.data_model .. autoapi-nested-parse:: Common data model containing all data structures that are passed in and out of modules. Submodules ---------- .. toctree:: :maxdepth: 1 /autoapi/caikit/core/data_model/base/index /autoapi/caikit/core/data_model/data_backends/index /autoapi/caikit/core/data_model/dataobject/index /autoapi/caikit/core/data_model/enums/index /autoapi/caikit/core/data_model/job/index /autoapi/caikit/core/data_model/json_dict/index /autoapi/caikit/core/data_model/package/index /autoapi/caikit/core/data_model/prediction_status/index /autoapi/caikit/core/data_model/producer/index /autoapi/caikit/core/data_model/protobufs/index /autoapi/caikit/core/data_model/runtime_context/index /autoapi/caikit/core/data_model/streams/index /autoapi/caikit/core/data_model/timestamp/index /autoapi/caikit/core/data_model/training_status/index Attributes ---------- .. autoapisummary:: caikit.core.data_model.CAIKIT_DATA_MODEL caikit.core.data_model.PredictionJobStatus caikit.core.data_model.PACKAGE_COMMON caikit.core.data_model.JsonDictValue caikit.core.data_model.log caikit.core.data_model.error caikit.core.data_model.T caikit.core.data_model.TrainingStatus Classes ------- .. autoapisummary:: caikit.core.data_model.DataBase caikit.core.data_model.DataObjectBase caikit.core.data_model.JobStatus caikit.core.data_model.ProducerId caikit.core.data_model.AugmentorBase caikit.core.data_model.DataStream caikit.core.data_model._UtfEncodeIOWrapper Functions --------- .. autoapisummary:: caikit.core.data_model.dataobject caikit.core.data_model.render_dataobject_protos caikit.core.data_model.import_enums caikit.core.data_model.import_enum caikit.core.data_model.is_multipart_file caikit.core.data_model.stream_multipart_file Package Contents ---------------- .. py:class:: DataBase Base class for all structures in the data model. 
Notes: All leaves in the hierarchy of derived classes should have a corresponding protobufs class defined in the interface definitions. If not, an exception will be thrown at runtime. .. py:attribute:: PROTO_CONVERSION_SPECIAL_TYPES .. py:class:: OneofFieldVal Helper struct that backends can use to return information about values in oneofs along with which of the oneofs is currently valid .. py:attribute:: val :type: Any .. py:attribute:: which_oneof :type: str .. py:method:: __setattr__(name, val) Handle attribute setting for oneofs and named fields with delegation to backends as needed .. py:method:: get_proto_class() -> Type[google.protobuf.message.Message] :classmethod: .. py:method:: get_field_defaults() -> Type[google.protobuf.message.Message] :classmethod: Get mapping of fields to default values. Mapping will not include fields without defaults .. py:method:: get_field_message_type(field_name: str) -> Optional[type] :classmethod: Get the python type for the given field. This function relies on the metaclass to fill cls._fields_to_type. This is to avoid costly computation during runtime Args: field_name (str): Field name to check (AttributeError raised if name is invalid) Returns: field_type: type The data model class type for the given field .. py:method:: from_backend(backend) :classmethod: .. py:property:: backend :type: Optional[DataModelBackendBase] .. py:method:: which_oneof(oneof_name: str) -> Optional[str] Get the name of the oneof field set for the given oneof or None if no field is set .. py:method:: _infer_which_oneof(oneof_name: str, oneof_val: Any) -> Optional[str] :classmethod: Check each candidate field within the oneof to see if it's a type match NOTE: In the case where fields within a oneof have the same type, the first field whose type matches will be used! .. py:method:: _get_which_oneof_dict() -> Dict[str, str] .. 
py:method:: _get_type_for_field(field_name: str) -> type :classmethod: Helper class method to return the type hint for a particular field .. py:method:: _is_valid_type_for_field(field_name: str, val: Any) -> bool :classmethod: Check whether the given value is valid for the given field .. py:method:: from_binary_buffer(buf) :classmethod: Builds the data model object out of the binary string Args: buf: The binary buffer containing a serialized protobufs message Returns: A data model object instantiated from the protobufs message deserialized out of `buf` .. py:method:: from_proto(proto) :classmethod: Build a DataBase from protobufs. Args: proto: A protocol buffer to deserialize from. Returns: DataBase: A DataBase object. .. py:method:: from_json(json_str, ignore_unknown_fields=False) :classmethod: Build a DataBase from a given JSON string. Uses Google's `google.protobuf.json_format` for deserialization. Args: json_str (str or dict): A stringified JSON specification/dict of the data_model ignore_unknown_fields (bool): If True, ignores unknown JSON fields Returns: caikit.core.data_model.DataBase: A DataBase object. .. py:method:: from_file(file_obj: io.IOBase) :classmethod: :abstractmethod: Build a DataBase from a given file-like object. Args: file_obj IOBase: A file object that contains some representation of the dataobject Returns: caikit.core.data_model.DataBase: A DataBase object. .. py:method:: to_proto() Return a new protobufs populated with the information in this data structure. .. py:method:: to_binary_buffer() Returns a binary buffer with a serialized protobufs message of this data model .. py:method:: fill_proto(proto) Populate a protobufs with the values from this data model object. Args: proto: A protocol buffer to be populated. Returns: protobufs: The filled protobufs. Notes: The protobufs is filled in place, so the argument and the return value are the same at the end of this call. .. py:method:: to_dict() -> dict Convert to a dictionary representation. .. 
py:method:: to_kwargs() -> dict Convert to flat dictionary representation. (Like .to_dict, but not recursive) This keeps the attribute names of any fields backed by oneofs, instead of using the internal oneof field name .. py:method:: to_json(**kwargs) -> str Convert to a json representation. .. py:method:: to_file(file_obj: io.IOBase) -> Optional[File] :abstractmethod: Export a DataBaseObject into a file-like object `file_obj`. If the DataBase object has requirements around file name or file type it can return them via the optional "File" return object Args: file_obj IOBase: a file object to be filled Returns: file_descriptor: Optional[caikit.interfaces.common.data_mode.File] .. py:method:: __repr__() Human-friendly representation. .. py:method:: _field_to_dict_element(field) Convert field into a representation that can be placed into a dictionary. Recursively calls to_dict on other data model objects. .. py:method:: get_class_for_proto(proto: Union[google.protobuf.descriptor.Descriptor, google.protobuf.descriptor.FieldDescriptor, google.protobuf.descriptor.EnumDescriptor, google.protobuf.message.Message]) -> Type[DataBase] :staticmethod: Look up the data model class corresponding to the given protobuf If no data model is found, this raises an AttributeError Args: proto (Union[Descriptor, ProtoMessageType]) The proto name or descriptor to look up against Returns: dm_class (Type[DataBase]): The data model class corresponding to the given protobuf .. py:method:: get_class_for_name(class_name: str) -> Type[DataBase] :staticmethod: Look up the data model class corresponding to the given name This lookup attempts to encode various naming conventions that might be used, but it can fail in multiple ways: 1. No class with the given name is known 2. 
Multiple classes with the same name, but different qualified parents are found A ValueError will be raised if either of the above happens Args: class_name (str) The name of the class either as a fully-qualified protobuf name or as the unqualified class name Returns: dm_class (Type[DataBase]): The data model class corresponding to the given protobuf .. py:data:: CAIKIT_DATA_MODEL :value: 'caikit_data_model' .. py:class:: DataObjectBase Bases: :py:obj:`caikit.core.data_model.base.DataBase` A DataObject is a data model class that is backed by a @dataclass. Data model classes that use the @dataobject decorator must derive from this base class. .. py:function:: dataobject(*args, **kwargs) -> Callable[[_DataObjectBaseT], _DataObjectBaseT] The @dataobject decorator can be used to define a Data Model object's schema inline with the definition of the python class rather than needing to bind to a pre-compiled protobufs class. For example: @dataobject("foo.bar") class MyDataObject(DataObjectBase): '''My Custom Data Object''' foo: str bar: int NOTE: The wrapped class must NOT inherit directly from DataBase. That inheritance will be added by this decorator, but if it is written directly, the metaclass that links protobufs to the class will be called before this decorator can auto-gen the protobufs class. The `dataobject` decorator will not provide tools with enough information to perform type completion for constructions in an IDE, or static typechecking. In order to have that, the `dataclass` decorator may optionally be added, with the slight overhead of wasted effort in creating the "standard" __init__ function which then gets re-done by @dataobject. The `dataclass` must follow the `dataobject` decorator. 
For example: @dataobject("foo.bar") @dataclass class MyDataObject(DataObjectBase): '''My Custom Data Object''' foo: str bar: int Kwargs: package: str The package name to use for the generated protobufs class Returns: decorator: Callable[[Type], Type[DataBase]] The decorator function that will wrap the given class .. py:function:: render_dataobject_protos(interfaces_dir: str) Write out protobufs files for all proto classes generated from dataobjects to the target interfaces directory Args: interfaces_dir (str): The target directory (must already exist) .. py:function:: import_enums(current_globals) Add all enums and their reverse enum mappings to a module's global symbol table. Note that we also update __all__. In general, __all__ controls the stuff that comes with a wild (*) import. Examples tend to make stuff like this easier to understand. Let's say the first name we hit is the Entity Mention Type. Then, after the first cycle through the loop below, you'll see something like: '__all__': ['import_enums', 'EntityMentionType', 'EntityMentionTypeRev'] 'EntityMentionType': { "MENTT_UNSET": 0, "MENTT_NAM": 1, ... , "MENTT_NONE": 4} 'EntityMentionTypeRev': { "0": "MENTT_UNSET", "1": "MENTT_NAM", ... , "4": "MENTT_NONE"} Since this is called explicitly below, you can thank this function for automagically syncing your enums (as importable from this file) with the data model. Args: current_globals (dict): global dictionary from your data model package __init__ file. .. py:function:: import_enum(proto_enum: google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper, enum_class: Optional[Type[enum.Enum]] = None) -> Tuple[str, str] Import a single enum into the global enum module by name Args: proto_enum (EnumTypeWrapper): The enum to import enum_class (Optional[Type[Enum]]): A pre-existing enum class that this proto enum binds to Returns: name: str The name of the enum global rev_name: str The name of the reversed enum global .. 
py:class:: JobStatus(*args, **kwds) Bases: :py:obj:`enum.Enum` Enum to track current status of a job .. py:attribute:: QUEUED :value: 1 .. py:attribute:: RUNNING :value: 2 .. py:attribute:: COMPLETED :value: 3 .. py:attribute:: CANCELED :value: 4 .. py:attribute:: ERRORED :value: 5 .. py:property:: is_terminal .. py:data:: PredictionJobStatus .. py:data:: PACKAGE_COMMON :value: 'caikit_data_model.common' .. py:class:: ProducerId Bases: :py:obj:`caikit.core.data_model.dataobject.DataObjectBase` Information about a data structure and the module that produced it. .. py:attribute:: name :type: str .. py:attribute:: version :type: str .. py:method:: __add__(other) Add two producer ids. .. py:method:: from_proto(proto) :classmethod: Overloaded implementation for efficiency vs base introspection .. py:method:: fill_proto(proto) Overloaded implementation for efficiency vs base introspection .. py:class:: AugmentorBase(random_seed, produces_none=False) .. py:attribute:: produces_none :value: False .. py:method:: augment(inp_obj) Take an object in, give an object back. Calls ._augment in the subclass. Args: inp_obj (str | caikit.core.data_model.DataBase): Object to be augmented. Returns: str | caikit.core.data_model.DataBase: Augmented object of same type as input inp_obj. .. py:method:: reset() Reset random number generation for the current augmentor. Note that this currently assumes the augmentor is using the builtin random generator leveraged by Python; if you end up using something else, you may want to override this or restructure this base class to allow resetting of random states based on seed type. .. py:data:: JsonDictValue .. py:function:: is_multipart_file(file) -> bool Returns true if the file appears to contain a multi-part form data request .. py:function:: stream_multipart_file(file) -> Iterator[Part] Returns an iterator of Parts, where each Part comes with a content type and an io reader to stream the data from. 
NB: This only yields parts which are files, not other form fields. .. py:data:: log .. py:data:: error .. py:data:: T .. py:class:: DataStream(generator_func, *args, **kwargs) Bases: :py:obj:`Generic`\ [\ :py:obj:`T`\ ] A data stream is an iterable container class that is reentrant in the sense that it can be iterated over multiple times. The items produced by a data stream may be any python object and are called data items. The data items produced by an iterator over a data stream are generated lazily (unless the `.eager` method is called) so that each data item in a series of data streams is produced as it is accessed. This allows processing datasets that are too large to fit into memory. A number of functional-style methods are provided for manipulating and munging data streams and the `.stream` method on modules can also be used to process data streams. The `DataStream` class is really just a generic wrapper around functions that produce python iterators or generators. .. py:attribute:: generator_func .. py:method:: from_iterable(data: Iterable[T]) -> DataStream[T] :classmethod: Create a new data stream from a python iterable, such as a list or tuple. This data stream produces a single data item for each element of the iterable. Args: data (iterable): A list or tuple or other python iterable used to construct a new data stream where each element of `data` becomes a single data item. Returns: DataStream: A new data stream that produces data items from the elements of `data`. Examples: >>> list_stream = DataStream.from_iterable([1, 2, 3]) >>> for data_item in list_stream: >>> print(data_item) 1 2 3 .. py:method:: _from_iterable_generator(data: Iterable[T]) -> Iterator[T] :classmethod: .. 
py:method:: from_jsonl(filename: str) -> DataStream[Dict] :classmethod: Creates a new data stream from a path to a JSON Lines file, where each line is a valid JSON object (python dict) Args: filename (str): A path to a utf8 encoded text file in JSON Lines format, where each line is a valid JSON object (python dict) Returns: DataStream: A new data stream that produces python dict items each containing a single JSON object corresponding to each line Notes: This class method returns a data stream over the JSON objects in the file, with one JSON object per line. See https://jsonlines.org/ Examples: For a JSON lines file that looks like: {"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]} {"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]} {"name": "May", "wins": []} {"name": "Deloise", "wins": [["three of a kind", "5♣"]]} >>> jsonl_data_stream = DataStream.from_jsonl('sample.jsonl') >>> for data_item in jsonl_data_stream: >>> print(data_item) {'name': 'Gilbert', 'wins': [['straight', '7♣'], ['one pair', '10♥']]} {'name': 'Alexa', 'wins': [['two pair', '4♠'], ['two pair', '9♠']]} {'name': 'May', 'wins': []} {'name': 'Deloise', 'wins': [['three of a kind', '5♣']]} .. py:method:: _from_jsonl_generator(filename) :classmethod: .. py:method:: from_json_array(filename: str) -> DataStream[Dict] :classmethod: Creates a new data stream from a path to a file containing a JSON array, where each item is a valid JSON object (python dict) Args: filename (str): A path to a utf8 encoded text file containing a JSON array, where each item is a valid JSON object (python dict) Returns: DataStream: A new data stream that produces python dict items each containing a single JSON object from the array in the file specified by `filename` Notes: This class method returns a data stream over the valid JSON objects of a single JSON array text file. 
Examples: For a JSON file that looks like: [ { "a": 1, "b": 2, "c": false }, { "a": 2, "b": 3 }, { "a": 3, "c": true } ] >>> json_data_stream = DataStream.from_json_array('sample.json') >>> for data_item in json_data_stream: >>> print(data_item) {'a': 1, 'b': 2, 'c': False} {'a': 2, 'b': 3} {'a': 3, 'c': True} .. py:method:: _from_json_array_file_generator(filename) :classmethod: .. py:method:: _from_json_array_buffer_generator(json_fh: IO, filename: str = '') :classmethod: .. py:method:: from_csv(filename: str, *args, skip=0, **kwargs) -> DataStream[List] :classmethod: Create a new data stream from a csv (comma separated value) file where each data item corresponds to a line of the csv file and consists of a list containing the comma separated values. Args: filename (str): A path to a csv file that has rows corresponding to data items and columns corresponding to the elements of each data item. skip (int): Number of lines to skip at the beginning of the csv file. This is often useful for skipping a header line. args, kwargs: Additional arguments passed to the `csv.reader` function. These can be used to specify the delimiter or other csv settings. Returns: DataStream: A data stream that produces a data item for each line of the csv file and where each element of the data item corresponds to a column in the csv file. Examples: For a sample.csv that looks like: a, b, c d, e, f >>> csv_stream = DataStream.from_csv('sample.csv') >>> for data_item in csv_stream: >>> print(data_item) ['a', 'b', 'c'] ['d', 'e', 'f'] .. py:method:: _from_csv_generator(filename, skip, *csv_args, **csv_kwargs) :classmethod: .. py:method:: from_header_csv(filename: str, *args, **kwargs) -> DataStream[Dict] :classmethod: Create a new data stream from a csv where the first row is a header and each subsequent row is an element. The yielded elements are dicts that pair the row values with the corresponding column headers. 
Args: filename (str): A path to a csv file that has rows corresponding to data items and columns corresponding to the elements of each data item. args, kwargs: Additional arguments passed to the `csv.reader` function. These can be used to specify the delimiter or other csv settings. Returns: DataStream: A data stream that produces a data item for each line of the csv file and where each element of the stream is a dict representation of the fields. Examples: For a sample.csv that looks like: foo, bar, baz a, b, c d, e, f >>> csv_stream = DataStream.from_header_csv('sample.csv') >>> for data_item in csv_stream: >>> print(data_item) {"foo": "a", "bar": "b", "baz": "c"} {"foo": "d", "bar": "e", "baz": "f"} .. py:method:: _from_header_csv_generator(filename, *csv_args, **csv_kwargs) :classmethod: .. py:method:: _from_header_csv_buffer_generator(fh: IO, *csv_args, **csv_kwargs) :classmethod: .. py:method:: from_txt(filename: str) -> DataStream[str] :classmethod: Create a new data stream from a path to a utf8 encoded text file where each data item corresponds to a single line of the file. Args: filename (str): A path to a utf8 encoded text file with each line corresponding to a data item. Returns: DataStream: A new data stream that produces string data items each containing a single line from the file specified by `filename`. Notes: This class method returns a data stream over the lines of a single text file. In order to construct a datastream over separate files, rather than lines, consider using `.from_txt_collection`. Examples: For a text file that looks like: first line second line third line >>> txt_line_stream = DataStream.from_txt('sample.txt') >>> for data_item in txt_line_stream: >>> print(data_item) first line second line third line .. py:method:: _from_txt_generator(filename) :classmethod: .. py:method:: from_file(filename: str) -> DataStream[Union[Dict, Tuple, str]] :classmethod: Loads up a DataStream from a file. 
Calls the appropriate DataStream.from_* class method based on the file extension. The data items returned in the data stream are: For JSON: dictionaries. For all other files (besides CSV for now): strings (1 per line). Args: filename (str): Name of file Returns: DataStream: Resulting datastream from file .. py:method:: _from_collection(dirname: str, extension: str, file_opener) -> DataStream[Union[Dict, Tuple, str]] :classmethod: Create a new data stream from a path containing multiple files where each data item corresponds to the entire serialized content in a single file. The `file_opener` function deserializes the individual files. Args: dirname (str): A directory path containing a number of utf8 encoded files with the given filename extension. extension (str): Extension of the file. Note that all files are read in the same utf8 encoding. file_opener (function): Function to deserialize a file on disk to memory Returns: DataStream: A new data stream that produces data items each containing the deserialized content of a single file found in `dirname`. Notes: Each data item in this data stream represents the *entire* content of a single file and is not split by line or otherwise. .. py:method:: _from_collection_generator(dirname, extension, file_opener) :classmethod: .. py:method:: from_txt_collection(dirname: str, extension='txt') -> DataStream[str] :classmethod: Create a new data stream from a path containing multiple utf8 encoded text files where each data item corresponds to the entire text contained in a single file. Args: dirname (str): A directory path containing a number of utf8 encoded text files with the `.txt` filename extension. extension: str (Optional) Optional extension of the text file. Note that all files are read in the same utf8 encoding. 
Defaults to 'txt' Returns: DataStream: A new data stream that produces string data items each containing the text contained in a single `.txt` (or specified extension) file found in `dirname`. Notes: Each data item in this data stream represents the *entire* text contained in a single file and is not split by line or otherwise. .. py:method:: from_json_collection(dirname: str, extension='json') -> DataStream[Union[Dict, Tuple, List]] :classmethod: Create a new data stream from a path containing multiple JSON files where each data item corresponds to the entire serialized JSON contained in a single file. Args: dirname (str): A directory path containing a number of utf8 encoded JSON files with the `.json` filename extension. extension: str (Optional) Optional extension of the JSON file. Note that all files are read in the same utf8 encoding. Defaults to 'json' Returns: DataStream: A new data stream that produces data items each containing the deserialized JSON content of a single `.json` (or specified extension) file found in `dirname`. Notes: Each data item in this data stream represents the *entire* JSON content of a single file and is not split by line or otherwise. .. py:method:: from_csv_collection(dirname: str) -> DataStream[Dict] :classmethod: Create a new data stream by chaining the data streams from each file in a path containing multiple csv files, where each file can have 1 or more data items. Args: dirname (str): A directory path containing a number of csv files Returns: DataStream: A new data stream that is chained from the data streams produced by reading (via from_header_csv) all `.csv` files found in `dirname`. All data items are dicts. .. py:method:: _from_csv_collection_generator(dirname) :classmethod: .. py:method:: from_jsonl_collection(dirname: str) -> DataStream[Dict] :classmethod: Create a new data stream by chaining the data streams from each file in a path containing multiple jsonl files, where each file can have 1 or more data items. 
Args: dirname (str): A directory path containing a number of jsonl files Returns: DataStream: A new data stream that is chained from the data streams produced by reading (via from_jsonl) all `.jsonl` files found in `dirname`. .. py:method:: _from_jsonl_collection_generator(dirname) :classmethod: .. py:method:: from_multipart_file(filename: str) -> DataStream[JsonDictValue] :classmethod: Loads up a DataStream from a multipart file. The data items returned in the data stream are determined by the content type for each part in the multipart file by calling the correct DataStream.from_* Args: filename (str): Name of file Returns: DataStream: Resulting datastream from file .. py:method:: train_test_split(test_split=0.25, seed=None) -> Tuple[DataStream[T], DataStream[T]] Split the current datastream into train/test substreams. Args: test_split (float): The fraction of examples to assign to the test substream, in [0, 1] seed (int | None): The seed for initializing the random assignment. If not provided, a randomly chosen seed will be used. Returns: tuple(DataStream, DataStream): Two substreams: a train set substream, and a test set substream .. py:method:: chain() -> DataStream Chain multiple data streams together sequentially. The returned data stream produces the data items from each passed data stream in turn. Args: args (tuple(DataStream)): A tuple containing the data streams to chain, passed as variadic arguments. Returns: DataStream: A new data stream that produces the data items from the provided data streams sequentially. .. py:method:: filter(func=lambda data_item: data_item, *args, **kwargs) -> DataStream[T] Skip elements in the data stream as identified by a passed function. Args: func (callable(data_item)): The function used to identify data items that will be filtered. The function takes a single data item as an argument and returns `True` in order to keep the element and `False` in order to skip it. The default filter function removes falsy values. 
Returns: DataStream: A new data stream that produces the data items from the current data stream only when `func` evaluates to true. .. py:method:: shuffle(buffer_size, seed=None) -> DataStream[T] Randomly shuffles the elements of this dataset. If buffer_size is smaller than the size of the full data stream, it is a partial random shuffle which is similar to TensorFlow's dataset shuffle. For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1,000 element buffer. Args: buffer_size (int): the size of the buffer space, should be greater than 0 seed (int | None): The seed for initializing the random assignment. If not provided, a randomly chosen seed will be used. Returns: DataStream: A new data stream that produces the data items in shuffled order. .. py:method:: eager() -> DataStream[T] Evaluate the data stream, place it into memory and return a new data stream over these static values. This is useful if your data stream can fit into memory, at least up to a certain point, and it would be inefficient to lazily, and potentially repeatedly, re-evaluate the stream each time it is iterated over. Returns: DataStream: A new data stream that iterates over the evaluated, in-memory data items in this stream. .. py:method:: map(func, *args, **kwargs) -> DataStream Apply a function to each element in the data stream. Args: func (callable(*args, **kwargs)): A function that is lazily applied to each element in the data stream. *args, **kwargs Additional arguments to pass to `func`. Returns: DataStream: A new data stream with `func` applied to each element. .. py:method:: flatten() -> DataStream Convert a 2-level nested stream into a flattened stream Returns: DataStream: A new data stream with inner stream items 'flattened' .. 
py:method:: zip() -> DataStream Combine the data items of multiple data streams together in tuples. Args: args (tuple(DataStream)): A tuple containing the data streams to be zipped, passed as variadic arguments. Returns: DataStream: A data stream that produces the zipped data items. Notes: A `ValueError` is raised if the zipped data streams do not all have the same length. Since streams are evaluated lazily, however, this error condition will only be detected and raised when the stream is being iterated over. .. py:method:: peek() -> T Returns the first element of the stream, or raises IndexError if stream is empty .. py:method:: augment(augmentor, aug_cycles, *, post_augment_func=None, augment_index=None, enforce_determinism=True) -> DataStream[T] .. py:method:: __add__(other) The addition operator for data streams is equivalent to calling `.chain` and combines this data stream with another sequentially. .. py:method:: __getitem__(idx) -> T Index or slice each data item. This is valuable for creating new data streams over the elements of a stream that produces tuples, lists, arrays, et cetera. Args: idx (int or slice): The index or slice to be applied to each data item. Returns: DataStream: A new data stream with `data_item[idx]` applied to each data item. Notes: This operation may be somewhat counterintuitive since `data_stream[0]` does not return the first element of the data stream and, instead, returns a new data stream that produces `data_item[0]` for each data item. This operation may fail with a `TypeError` if the data items in the stream are not subscriptable. .. py:method:: __iter__() Return an iterator or generator over all of the data items in this data stream. Data streams are reentrant in the sense that they can be iterated over multiple times. .. py:method:: __len__() See property method self._length .. py:property:: _length Return the number of data items contained in this data stream. 
This requires that the data stream be iterated over, which may be time-consuming. This value is then stored internally so that subsequent calls do not iterate over the data stream again. This is implemented as a cached_property so that subclasses of DataStream which implement their own __getstate__ and __setstate__ do not have to account for the existence of self._length .. py:method:: __or__(module) Feed this data stream into the `.stream` method of a module. This is syntactic sugar that allows modules to be chained like `data_stream | module1 | module2` rather than the equivalent `module2.stream(module1.stream(data_stream))`. .. py:method:: _verify_dir(dirname) :staticmethod: .. py:class:: _UtfEncodeIOWrapper(bytes_stream: IO[bytes]) Bases: :py:obj:`io.IOBase` Lil' wrapper class to convert a bytes buffer to a string buffer .. py:attribute:: bytes_stream .. py:method:: read(*args, **kwargs) .. py:method:: readline(*args, **kwargs) Read and return a line from the stream. If size is specified, at most size bytes will be read. The line terminator is always b'\n' for binary files; for text files, the newlines argument to open can be used to select the line terminator(s) recognized. .. py:method:: seek(*args, **kwargs) Change the stream position to the given byte offset. offset The stream position, relative to 'whence'. whence The relative position to seek from. The offset is interpreted relative to the position indicated by whence. Values for whence are: * os.SEEK_SET or 0 -- start of stream (the default); offset should be zero or positive * os.SEEK_CUR or 1 -- current stream position; offset may be negative * os.SEEK_END or 2 -- end of stream; offset is usually negative Return the new absolute position. .. py:data:: TrainingStatus
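
The ``DataStream`` description above says the class is a wrapper around functions that produce iterators, that it can be iterated over more than once, and that ``shuffle`` keeps a fixed-size buffer. The sketch below illustrates those three ideas using only the standard library. It is not the caikit implementation: ``MiniStream`` and all of its internals are hypothetical names chosen for illustration, and the way the leftover buffer is drained at the end of the stream is an assumption.

```python
import random
from typing import Any, Callable, Iterable, Iterator, List, Optional


class MiniStream:
    """Hypothetical stand-in for a reentrant, lazily evaluated stream."""

    def __init__(self, generator_func: Callable[..., Iterable[Any]], *args, **kwargs):
        # Keep the function and its arguments rather than an iterator, so
        # that every call to __iter__ starts a fresh pass over the data.
        self._generator_func = generator_func
        self._args = args
        self._kwargs = kwargs

    def __iter__(self) -> Iterator[Any]:
        return iter(self._generator_func(*self._args, **self._kwargs))

    @classmethod
    def from_iterable(cls, data: Iterable[Any]) -> "MiniStream":
        # Only reentrant if `data` itself can be iterated repeatedly
        # (list, tuple, range); a one-shot generator would be exhausted.
        return cls(lambda: data)

    def map(self, func: Callable[[Any], Any]) -> "MiniStream":
        # Lazy: func runs only when the resulting stream is iterated.
        return MiniStream(lambda: (func(item) for item in self))

    def filter(self, func: Callable[[Any], Any] = lambda item: item) -> "MiniStream":
        # Keep items for which func is truthy; the default drops falsy items.
        return MiniStream(lambda: (item for item in self if func(item)))

    def shuffle(self, buffer_size: int, seed: Optional[int] = None) -> "MiniStream":
        if buffer_size <= 0:
            raise ValueError("buffer_size must be greater than 0")

        def generate() -> Iterator[Any]:
            # A new generator per pass, so a fixed seed gives the same order.
            rng = random.Random(seed)
            buffer: List[Any] = []
            for item in self:
                if len(buffer) < buffer_size:
                    buffer.append(item)  # still filling the buffer
                    continue
                idx = rng.randrange(buffer_size)
                yield buffer[idx]        # emit a random buffered item
                buffer[idx] = item       # refill its slot with the next item
            # Assumption: drain whatever is left in random order.
            rng.shuffle(buffer)
            yield from buffer

        return MiniStream(generate)


squares = MiniStream.from_iterable(range(10)).map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)
first_pass = list(even_squares)
second_pass = list(even_squares)  # reentrant: a second pass sees the same items
shuffled = list(even_squares.shuffle(buffer_size=3, seed=0))
```

Because each derived stream stores a function rather than an iterator, nothing is computed until iteration starts, and a second pass simply calls the function again. This is the property that ``eager`` trades away in exchange for keeping the evaluated items in memory.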