caikit.interfaces.ts.data_model.backends._spark_backends
Core data model backends backed by pyspark.sql.DataFrame.
This module is not intended to be imported directly; it is used internally by the caikit ts data model. Importing it directly forces a hard dependency on Spark, which the package deliberately avoids.
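One common way to keep a heavy dependency optional is to defer the import until the dependency is known to be available. The sketch below illustrates that general pattern only; `load_optional_backend` is a hypothetical helper and is not claimed to be how caikit loads its Spark backends.

```python
import importlib
import importlib.util
from types import ModuleType
from typing import Optional


def load_optional_backend(module_name: str) -> Optional[ModuleType]:
    """Import `module_name` only if it can be located; otherwise return None.

    Hypothetical helper: the name and behavior are assumptions for illustration.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # A missing parent package or a module without a spec counts as "not available".
        spec = None
    if spec is None:
        return None
    return importlib.import_module(module_name)
```

A caller could then check `load_optional_backend("pyspark") is not None` before touching any Spark-specific code path, so environments without Spark keep working.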
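Attributes
- log
- error
Classes
- SparkMultiTimeSeriesBackend: Abstract base class for all backends of the central MultiTimeSeries data model type
- SparkTimeSeriesBackend: The SparkTimeSeries is responsible for managing the standard in-memory representation of a TimeSeries using a spark backend compute engine.
Functions
- ensure_spark_cached(dataframe): Will ensure that a given dataframe is cached.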
Module Contents
- caikit.interfaces.ts.data_model.backends._spark_backends.log
- caikit.interfaces.ts.data_model.backends._spark_backends.error
- caikit.interfaces.ts.data_model.backends._spark_backends.ensure_spark_cached(dataframe: pyspark.sql.DataFrame) -> pyspark.sql.DataFrame
Ensures that a given dataframe is cached. If the dataframe is already cached, this does nothing. If it is not cached, it is cached on entry and uncached again when the ensure_spark_cached context goes out of scope. Callers must use it through a with statement.
    with ensure_spark_cached(df) as _:
        # do dataframe operations on df;
        # it is guaranteed to be cached inside this block
        ...
    # That's it, you're done.
    # df remains cached if it already was, or it is no longer cached
    # if it wasn't before entering the with block above.
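The cache-on-entry, restore-on-exit behavior described above can be sketched with `contextlib.contextmanager`. This is a minimal illustration under assumptions, not the module's actual implementation: `FakeFrame` is a hypothetical stand-in that mimics only the `is_cached`, `cache()`, and `unpersist()` members of `pyspark.sql.DataFrame`, so the example runs without Spark installed, and `ensure_cached_sketch` is an invented name.

```python
from contextlib import contextmanager


class FakeFrame:
    """Hypothetical stand-in for the caching surface of pyspark.sql.DataFrame."""

    def __init__(self, cached: bool = False):
        self.is_cached = cached

    def cache(self):
        self.is_cached = True
        return self

    def unpersist(self):
        self.is_cached = False
        return self


@contextmanager
def ensure_cached_sketch(df):
    """Cache df for the duration of the block and restore its prior state on exit."""
    was_cached = df.is_cached
    if not was_cached:
        df.cache()
    try:
        yield df
    finally:
        # Only undo the caching that this context manager introduced.
        if not was_cached:
            df.unpersist()
```

Recording `was_cached` before entering the block is what lets a dataframe that the caller had already cached stay cached afterwards, while a dataframe cached only for the block is released even if the block raises.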
- class caikit.interfaces.ts.data_model.backends._spark_backends.SparkMultiTimeSeriesBackend(data_frame: pyspark.sql.DataFrame, key_column: Iterable[str] | str, timestamp_column: str = None, value_columns: Iterable[str] | None = None, ids: Iterable[int] | Iterable[str] | None = None, producer_id: Tuple[str, str] | caikit.core.data_model.ProducerId | None = None)
Bases: caikit.interfaces.ts.data_model.backends.base.MultiTimeSeriesBackendBase
Abstract base class for all backends of the central MultiTimeSeries data model type
- _pyspark_df: pyspark.sql.DataFrame
- _pyspark_pandas_df
- _key_column
- _timestamp_column = None
- _value_columns
- _ids = []
- _producer_id
- _key_columns
- get_attribute(data_model_class: Type[caikit.interfaces.ts.data_model.timeseries.TimeSeries], name: str) -> Any
A data model backend must implement this in order to provide the frontend view the functionality needed to lazily extract data.
- Args:
- data_model_class (Type[DataBase]): The frontend data model class that is accessing this attribute
- name (str): The name of the attribute to access
- Returns:
- value: Union[Any, OneofFieldVal]
The extracted attribute value or a OneofFieldVal that wraps the field val with an indicator about the oneof field that is set.
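The lazy-extraction contract described above can be illustrated with a toy frontend that forwards attribute lookups to its backend. This is only a sketch of the idea: `DictBackend`, `FrontendView`, and the field names `timestamps` and `value_labels` are hypothetical and do not reflect caikit's actual classes or fields.

```python
from typing import Any, Dict, List


class DictBackend:
    """Hypothetical backend that holds raw columns and extracts fields on demand."""

    def __init__(self, columns: Dict[str, List[Any]], timestamp_column: str):
        self._columns = columns
        self._timestamp_column = timestamp_column
        self.calls: List[str] = []  # records which attributes were requested

    def get_attribute(self, data_model_class: type, name: str) -> Any:
        self.calls.append(name)
        if name == "timestamps":
            return list(self._columns[self._timestamp_column])
        if name == "value_labels":
            return [c for c in self._columns if c != self._timestamp_column]
        raise AttributeError(f"{data_model_class.__name__} has no field {name!r}")


class FrontendView:
    """Hypothetical frontend that defers every public field lookup to its backend."""

    def __init__(self, backend: DictBackend):
        self._backend = backend

    def __getattr__(self, name: str) -> Any:
        # __getattr__ runs only when normal lookup fails; private names are
        # rejected here so a missing _backend can never cause infinite recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        return self._backend.get_attribute(type(self), name)
```

Nothing is extracted when the view is constructed; the backend does work only when a field such as `view.timestamps` is actually read.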
- as_pandas() -> Tuple[pandas.DataFrame, Iterable[str], str, Iterable[str]]
All backends must implement the ability to coerce their underlying data into a pandas DataFrame and to report which columns hold the keys, the timestamps, and the values.
- Returns:
- df: pd.DataFrame
The data frame itself
- key_source: Iterable[str]
The names of the columns holding key values
- timestamp_source: str
The column name (or None) indicating where the timestamp sequence can be found
- value_source: Iterable[str]
The names of the columns holding value sequences
- class caikit.interfaces.ts.data_model.backends._spark_backends.SparkTimeSeriesBackend(data_frame: pyspark.sql.DataFrame, timestamp_column: str | None = None, value_columns: Iterable[str] | None = None, ids: Iterable[int] | None = None)
Bases: caikit.interfaces.ts.data_model.backends.base.TimeSeriesBackendBase
The SparkTimeSeries is responsible for managing the standard in-memory representation of a TimeSeries using a spark backend compute engine.
- _pyspark_df: pyspark.sql.DataFrame
- _pyspark_pandas_df
- _pdbackend_helper
- get_attribute(data_model_class: Type[caikit.interfaces.ts.data_model._single_timeseries.SingleTimeSeries], name: str) -> Any
When fetching a data attribute from the timeseries, this aliases to the appropriate set of backend wrappers for the various fields.
- as_pandas() -> Tuple[pandas.DataFrame, str, Iterable[str]]
All backends must implement the ability to coerce their underlying data into a pandas DataFrame and to report which columns hold the timestamps and the values.
- Returns:
- df: pd.DataFrame
The data frame itself
- timestamp_source: str
The column name (or None) indicating where the timestamp sequence can be found
- value_source: Iterable[str]
The names of the columns holding value sequences