caikit.interfaces.ts.data_model.backends._spark_backends

Core data model backends backed by pyspark.sql.DataFrame.

This module is not intended to be imported directly; it is used internally by the caikit ts data model. Importing it directly forces a hard dependency on Spark, which the package deliberately avoids.
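Calling code that wants to stay free of a hard Spark dependency can check for pyspark before touching anything Spark-specific. The following is a minimal sketch of that guard using only the standard library; `spark_available` and `get_pyspark` are hypothetical helper names for illustration and are not part of caikit.

```python
import importlib
import importlib.util


def spark_available() -> bool:
    """Return True if pyspark is installed, without importing it."""
    return importlib.util.find_spec("pyspark") is not None


def get_pyspark():
    """Import pyspark on demand (hypothetical helper, not a caikit API).

    Raises ImportError with a clear message when pyspark is missing,
    so the dependency only matters to code paths that need Spark.
    """
    if not spark_available():
        raise ImportError("pyspark is required for Spark-backed time series")
    return importlib.import_module("pyspark")
```

Because the import happens inside a function, modules that define such helpers can still be imported on systems without Spark.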

Attributes

`log`
`error`

Classes

`SparkMultiTimeSeriesBackend`	Abstract base class for all backends of the central MultiTimeSeries data model
`SparkTimeSeriesBackend`	The SparkTimeSeries is responsible for managing the standard in-memory representation of a TimeSeries using a spark backend compute engine.

Functions

ensure_spark_cached(→ pyspark.sql.DataFrame)

Will ensure that a given dataframe is cached.

Module Contents

caikit.interfaces.ts.data_model.backends._spark_backends.log

caikit.interfaces.ts.data_model.backends._spark_backends.error

caikit.interfaces.ts.data_model.backends._spark_backends.ensure_spark_cached(dataframe: pyspark.sql.DataFrame) → pyspark.sql.DataFrame

Will ensure that a given dataframe is cached. If the dataframe is already cached, it does nothing. If it is not cached, it caches the dataframe on entry and uncaches it again when the ensure_spark_cached context exits. Users must access it through a with statement.

Example:

```python
with ensure_spark_cached(df) as _:
    # do dataframey sorts of things on df
    # it's guaranteed to be cached
    # inside this block
    ...

# that's it, you're done.
# df remains cached if it already was
# or it's no longer cached if it wasn't
# before entering the with block above.
```
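The cache-then-restore behavior described above can be sketched as a small context manager. This is an illustration under stated assumptions, not the caikit implementation: `cached_for_block` is a hypothetical name, and `FakeFrame` is a stand-in that mimics only the `is_cached` property and the `cache()` / `unpersist()` methods of `pyspark.sql.DataFrame`, so the example runs without Spark.

```python
from contextlib import contextmanager


@contextmanager
def cached_for_block(dataframe):
    """Cache `dataframe` for the block if it was not already cached."""
    was_cached = dataframe.is_cached
    if not was_cached:
        dataframe.cache()
    try:
        yield dataframe
    finally:
        # Restore the original state: only uncache what we cached ourselves.
        if not was_cached:
            dataframe.unpersist()


class FakeFrame:
    """Hypothetical stand-in for a Spark DataFrame (illustration only)."""

    def __init__(self, is_cached=False):
        self.is_cached = is_cached

    def cache(self):
        self.is_cached = True
        return self

    def unpersist(self):
        self.is_cached = False
        return self


# A frame that starts uncached is cached only inside the block.
df = FakeFrame()
with cached_for_block(df) as _:
    inside = df.is_cached
after = df.is_cached

# A frame that was already cached is left cached afterwards.
pre_cached = FakeFrame(is_cached=True)
with cached_for_block(pre_cached) as _:
    pass
still_cached = pre_cached.is_cached
```

Recording the initial state before caching is what lets the helper leave an already-cached dataframe untouched on exit.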

class caikit.interfaces.ts.data_model.backends._spark_backends.SparkMultiTimeSeriesBackend(data_frame: pyspark.sql.DataFrame, key_column: Iterable[str] | str, timestamp_column: str = None, value_columns: Iterable[str] | None = None, ids: Iterable[int] | Iterable[str] | None = None, producer_id: Tuple[str, str] | caikit.core.data_model.ProducerId | None = None)

Bases: caikit.interfaces.ts.data_model.backends.base.MultiTimeSeriesBackendBase

Abstract base class for all backends of the central MultiTimeSeries data model type

_pyspark_df: pyspark.sql.DataFrame

_pyspark_pandas_df

_key_column

_timestamp_column = None

_value_columns

_ids = []

_producer_id

_key_columns

get_attribute(data_model_class: Type[caikit.interfaces.ts.data_model.timeseries.TimeSeries], name: str) → Any

A data model backend must implement this in order to provide the frontend view the functionality needed to lazily extract data.

Args:

data_model_class (Type[DataBase]): The frontend data model class that is accessing this attribute

name (str): The name of the attribute to access

Returns:

value: Union[Any, OneofFieldVal]: The extracted attribute value, or a OneofFieldVal that wraps the field value with an indicator of which oneof field is set.

as_pandas() → Tuple[pandas.DataFrame, Iterable[str], str, Iterable[str]]

All backends must implement the ability to coerce their underlying data into a pandas DataFrame and report which columns hold the key, timestamp, and value source(s)

Returns:

df: pd.DataFrame: The data frame itself
key_source: Iterable[str]: The names of the columns holding key values
timestamp_source: str: The column name (or None) indicating where the timestamp sequence can be found
value_source: Iterable[str]: The names of the columns holding value sequences

class caikit.interfaces.ts.data_model.backends._spark_backends.SparkTimeSeriesBackend(data_frame: pyspark.sql.DataFrame, timestamp_column: str | None = None, value_columns: Iterable[str] | None = None, ids: Iterable[int] | None = None)

Bases: caikit.interfaces.ts.data_model.backends.base.TimeSeriesBackendBase

The SparkTimeSeries is responsible for managing the standard in-memory representation of a TimeSeries using a spark backend compute engine.

_pyspark_df: pyspark.sql.DataFrame

_pyspark_pandas_df

_pdbackend_helper

get_attribute(data_model_class: Type[caikit.interfaces.ts.data_model._single_timeseries.SingleTimeSeries], name: str) → Any

When fetching a data attribute from the timeseries, this aliases to the appropriate set of backend wrappers for the various fields.

as_pandas() → Tuple[pandas.DataFrame, str, Iterable[str]]

All backends must implement the ability to coerce their underlying data into a pandas DataFrame and report which columns hold the timestamp and value source(s)

Returns:

df: pd.DataFrame: The data frame itself
timestamp_source: str: The column name (or None) indicating where the timestamp sequence can be found
value_source: Iterable[str]: The names of the columns holding value sequences