caikit.interfaces.ts.data_model.backends._spark_backends

Core data model backends backed by pyspark.sql.DataFrame.

This module is not intended to be imported directly; it is used internally by the caikit ts data model. Importing it directly forces a hard Spark dependency, which is undesirable.
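A soft dependency of this kind is often handled by checking whether the optional package is installed before importing it. The sketch below illustrates that general pattern using only the standard library; `optional_import` and `HAVE_SPARK` are hypothetical names, and this is not necessarily how caikit itself defers the import.

```python
import importlib
import importlib.util
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Import a top-level module if it is installed, otherwise return None."""
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)


# Spark-specific code paths would only be taken when this is not None.
pyspark = optional_import("pyspark")
HAVE_SPARK = pyspark is not None
```

Calling code can then branch on `HAVE_SPARK` rather than failing at import time when Spark is absent.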

Attributes

log

error

Classes

SparkMultiTimeSeriesBackend

Abstract base class for all backends of the central MultiTimeSeries data model

SparkTimeSeriesBackend

The SparkTimeSeries is responsible for managing the standard in-memory representation of a TimeSeries using a spark backend compute engine.

Functions

ensure_spark_cached(→ pyspark.sql.DataFrame)

Will ensure that a given dataframe is cached.

Module Contents

caikit.interfaces.ts.data_model.backends._spark_backends.log
caikit.interfaces.ts.data_model.backends._spark_backends.error
caikit.interfaces.ts.data_model.backends._spark_backends.ensure_spark_cached(dataframe: pyspark.sql.DataFrame) → pyspark.sql.DataFrame

Ensures that a given dataframe is cached. If the dataframe is already cached, this does nothing. If it is not cached, it is cached on entry and uncached again when the ensure_spark_cached context goes out of scope. Users must access it through a with statement.

Example:

```python
with ensure_spark_cached(df) as _:
    # do dataframe-y sorts of things on df;
    # it's guaranteed to be cached inside this block
    ...

# That's it, you're done. df remains cached if it already was,
# or it's no longer cached if it wasn't before entering the with block above.
```
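The cache-then-restore behaviour described above can be sketched with `contextlib.contextmanager`. This is an illustration under stated assumptions, not the library's implementation: `ensure_cached_sketch` and `FakeDataFrame` are hypothetical names, and the stand-in class mimics only the `is_cached`, `cache()`, and `unpersist()` members of `pyspark.sql.DataFrame` so the example runs without Spark.

```python
from contextlib import contextmanager


class FakeDataFrame:
    """Hypothetical stand-in exposing the DataFrame members the sketch uses."""

    def __init__(self, cached: bool = False):
        self.is_cached = cached

    def cache(self):
        self.is_cached = True
        return self

    def unpersist(self):
        self.is_cached = False
        return self


@contextmanager
def ensure_cached_sketch(dataframe):
    """Keep dataframe cached inside the block, then restore its prior state."""
    was_cached = dataframe.is_cached
    if not was_cached:
        dataframe.cache()
    try:
        yield dataframe
    finally:
        # Only undo caching that this context manager introduced.
        if not was_cached:
            dataframe.unpersist()


df = FakeDataFrame(cached=False)
with ensure_cached_sketch(df):
    cached_inside = df.is_cached
cached_after = df.is_cached
```

The `try`/`finally` ensures the dataframe is uncached even if the body of the `with` block raises.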

class caikit.interfaces.ts.data_model.backends._spark_backends.SparkMultiTimeSeriesBackend(data_frame: pyspark.sql.DataFrame, key_column: Iterable[str] | str, timestamp_column: str = None, value_columns: Iterable[str] | None = None, ids: Iterable[int] | Iterable[str] | None = None, producer_id: Tuple[str, str] | caikit.core.data_model.ProducerId | None = None)

Bases: caikit.interfaces.ts.data_model.backends.base.MultiTimeSeriesBackendBase

Abstract base class for all backends of the central MultiTimeSeries data model type

_pyspark_df: pyspark.sql.DataFrame
_pyspark_pandas_df
_key_column
_timestamp_column = None
_value_columns
_ids = []
_producer_id
_key_columns
get_attribute(data_model_class: Type[caikit.interfaces.ts.data_model.timeseries.TimeSeries], name: str) → Any

A data model backend must implement this in order to provide the frontend view the functionality needed to lazily extract data.

Args:
data_model_class (Type[DataBase]): The frontend data model class that is accessing this attribute

name (str): The name of the attribute to access

Returns:
value: Union[Any, OneofFieldVal]

The extracted attribute value, or a OneofFieldVal that wraps the field value with an indicator of which oneof field is set.
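The lazy-extraction contract can be illustrated with a frontend object that asks its backend for a field only when that field is first read. The sketch below is a simplified assumption about the pattern, not caikit's actual frontend machinery; `DictBackendSketch` and `FrontendSketch` are hypothetical classes, and caching the value on the frontend is an illustrative choice.

```python
from typing import Any, Dict


class DictBackendSketch:
    """Hypothetical backend that hands out attribute values on request."""

    def __init__(self, raw: Dict[str, Any]):
        self._raw = raw
        self.calls = 0  # counts extractions, to make the laziness visible

    def get_attribute(self, data_model_class: type, name: str) -> Any:
        self.calls += 1
        return self._raw[name]


class FrontendSketch:
    """Hypothetical frontend view that defers field access to its backend."""

    def __init__(self, backend: DictBackendSketch):
        self._backend = backend

    def __getattr__(self, name: str) -> Any:
        # Invoked only when normal lookup fails, i.e. the field is not set yet.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            value = self._backend.get_attribute(type(self), name)
        except KeyError as err:
            raise AttributeError(name) from err
        setattr(self, name, value)  # store it so the backend is asked once
        return value


backend = DictBackendSketch({"timestamp_label": "ts"})
view = FrontendSketch(backend)
first = view.timestamp_label
second = view.timestamp_label
```

Here the second read is served from the frontend instance, so the backend is consulted only once.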

as_pandas() → Tuple[pandas.DataFrame, Iterable[str], str, Iterable[str]]

All backends must implement the ability to coerce their underlying data into a pandas DataFrame and provide pointers to the key source(s), timestamp source, and value source(s)

Returns:
df: pd.DataFrame

The data frame itself

key_source: Iterable[str]

The names of the columns holding key values

timestamp_source: str

The column name (or None) indicating where the timestamp sequence can be found

value_source: Iterable[str]

The names of the columns holding value sequences

class caikit.interfaces.ts.data_model.backends._spark_backends.SparkTimeSeriesBackend(data_frame: pyspark.sql.DataFrame, timestamp_column: str | None = None, value_columns: Iterable[str] | None = None, ids: Iterable[int] | None = None)

Bases: caikit.interfaces.ts.data_model.backends.base.TimeSeriesBackendBase

The SparkTimeSeries is responsible for managing the standard in-memory representation of a TimeSeries using a spark backend compute engine.

_pyspark_df: pyspark.sql.DataFrame
_pyspark_pandas_df
_pdbackend_helper
get_attribute(data_model_class: Type[caikit.interfaces.ts.data_model._single_timeseries.SingleTimeSeries], name: str) → Any

When fetching a data attribute from the timeseries, this aliases to the appropriate set of backend wrappers for the various fields.

as_pandas() → Tuple[pandas.DataFrame, str, Iterable[str]]

All backends must implement the ability to coerce their underlying data into a pandas DataFrame and provide pointers to the timestamp source and value source(s)

Returns:
df: pd.DataFrame

The data frame itself

timestamp_source: str

The column name (or None) indicating where the timestamp sequence can be found

value_source: Iterable[str]

The names of the columns holding value sequences