base

Base interface for data persistence implementations.

Classes

BulkResult

class BulkResult(file_name_column: str, cached: Optional[pd.DataFrame], misses: list[Path]):

Container for the results of a bulk_get call.

Variables

  • static cached : Optional[pd.DataFrame]
  • static file_name_column : str
  • static misses : list[Path]
  • data - Ordered DataFrame with cached data, excluding the file names.
  • hits - Ordered Series of file name hits, possibly including duplicates.

Methods


get_cached_by_filename

def get_cached_by_filename(self, file_name: str) -> Optional[pd.DataFrame]:

DataFrame with the cached data for a single file.

May contain multiple rows (e.g. for e2e files that contain several images).
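As a rough illustration of the per-file lookup described above, the sketch below filters a cached DataFrame by file name. `BulkResultSketch` is a hypothetical stand-in, not the library class, and returning None for an absent file is an assumption; the real behaviour may differ.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

import pandas as pd


@dataclass
class BulkResultSketch:
    """Hypothetical stand-in for BulkResult; not the library's implementation."""

    file_name_column: str
    cached: Optional[pd.DataFrame]
    misses: list[Path]

    def get_cached_by_filename(self, file_name: str) -> Optional[pd.DataFrame]:
        # Nothing was cached at all, so there is nothing to look up.
        if self.cached is None:
            return None
        rows = self.cached[self.cached[self.file_name_column] == file_name]
        # Assumption: a file with no cached rows yields None, not an empty frame.
        return rows if not rows.empty else None


# One e2e file contributes two rows (one per image); the png contributes one.
cached = pd.DataFrame(
    {
        "file": ["scan.e2e", "scan.e2e", "photo.png"],
        "image_index": [0, 1, 0],
    }
)
result = BulkResultSketch("file", cached, misses=[Path("missing.dcm")])
e2e_rows = result.get_cached_by_filename("scan.e2e")
```

Here `e2e_rows` holds both rows for `scan.e2e`, which mirrors the multi-row case mentioned above.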

DataPersister

class DataPersister(file_name_column: str, lock: Optional[_Lock] = None, bulk_partition_size: Optional[int] = None):

Abstract interface for data persistence/caching implementations.

Static methods


prep_data_for_caching

def prep_data_for_caching(data: pd.DataFrame, image_cols: Optional[Collection[str]] = None) -> pd.DataFrame:

Prepares data for caching.

This involves removing or replacing anything that should not be cached or that makes no sense to cache, such as image data or file paths that are only relevant while the files are actually in use.

Does not mutate the input DataFrame.
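The non-mutating preparation step can be sketched as follows. `prep_data_for_caching_sketch` is a hypothetical function, and dropping the image columns outright is an assumption; the library may instead replace their values with placeholders.

```python
from collections.abc import Collection
from typing import Optional

import pandas as pd


def prep_data_for_caching_sketch(
    data: pd.DataFrame, image_cols: Optional[Collection[str]] = None
) -> pd.DataFrame:
    """Hypothetical sketch: return a cache-ready copy of `data`."""
    # Work on a copy so the caller's DataFrame is left untouched.
    prepared = data.copy()
    if image_cols:
        # Assumption: image columns are dropped entirely before caching.
        present = [col for col in image_cols if col in prepared.columns]
        prepared = prepared.drop(columns=present)
    return prepared


raw = pd.DataFrame(
    {
        "_original_filename": ["a.png"],
        "pixel_data": [b"\x00\x01"],
        "age": [42],
    }
)
prepared = prep_data_for_caching_sketch(raw, image_cols=["pixel_data"])
```

The column name `pixel_data` is illustrative only. After the call, `raw` still contains the image column while `prepared` does not.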

Methods


bulk_get

def bulk_get(self, files: list[Union[str, Path]]) -> BulkResult:

Get the persisted data for several files.

Returns only misses if no data has been persisted, if the persisted data is out of date, or if an error was otherwise encountered.
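The split into hits and misses might look like the sketch below. `bulk_get_sketch` and the dict-backed `store` are hypothetical; a real persister would also check for stale entries and handle errors, and would return a BulkResult rather than a tuple.

```python
from pathlib import Path
from typing import Optional, Union

import pandas as pd


def bulk_get_sketch(
    store: dict[str, pd.DataFrame], files: list[Union[str, Path]]
) -> tuple[Optional[pd.DataFrame], list[Path]]:
    """Hypothetical sketch: partition requested files into cached rows and misses."""
    hits: list[pd.DataFrame] = []
    misses: list[Path] = []
    for file in files:
        frame = store.get(str(file))
        if frame is None:
            # Not persisted: record the file as a miss.
            misses.append(Path(file))
        else:
            hits.append(frame)
    # With no hits at all, only misses are returned.
    cached = pd.concat(hits, ignore_index=True) if hits else None
    return cached, misses


store = {"a.png": pd.DataFrame({"file": ["a.png"], "age": [42]})}
cached, misses = bulk_get_sketch(store, ["a.png", Path("b.png")])
empty_cached, empty_misses = bulk_get_sketch({}, ["x.png"])
```

With an empty store, `empty_cached` is None and every requested file appears in `empty_misses`, matching the "only misses" case.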

bulk_set

def bulk_set(self, data: pd.DataFrame, original_file_col: str = '_original_filename') -> None:

Sets multiple cache entries from a DataFrame.

The DataFrame must indicate the original file that each row is associated with. By default, this is the _original_filename column.
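One way to picture the role of the original-file column is to group rows by it and store each group under its file name. This is a sketch under that assumption; `bulk_set_sketch` and the dict-backed `store` are hypothetical and do not reflect how any concrete persister stores data.

```python
import pandas as pd


def bulk_set_sketch(
    store: dict[str, pd.DataFrame],
    data: pd.DataFrame,
    original_file_col: str = "_original_filename",
) -> None:
    """Hypothetical sketch: store one cache entry per original file."""
    if original_file_col not in data.columns:
        raise ValueError(f"Missing column: {original_file_col}")
    # Each group holds every row that belongs to one original file.
    for file_name, rows in data.groupby(original_file_col):
        store[str(file_name)] = rows.reset_index(drop=True)


store: dict[str, pd.DataFrame] = {}
data = pd.DataFrame(
    {
        "_original_filename": ["a.e2e", "a.e2e", "b.png"],
        "image_index": [0, 1, 0],
    }
)
bulk_set_sketch(store, data)
```

After the call, `store` has one entry per file, and the entry for `a.e2e` keeps both of its rows.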

get

def get(self, file: Union[str, Path]) -> Optional[pd.DataFrame]:

Get the persisted data for a given file.

Returns None if no data has been persisted, if the persisted data is out of date, or if an error was otherwise encountered.

set

def set(self, file: Union[str, Path], data: pd.DataFrame) -> None:

Set the persisted data for a given file.

If existing data is already set, it will be overwritten.

The data should only be the data that is related to that file.

unset

def unset(self, file: Union[str, Path]) -> None:

Deletes the persisted data for a given file.
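The get, set and unset contract described above can be exercised with a small in-memory stand-in. `InMemoryPersister` is hypothetical and does not subclass the real DataPersister; it omits locking, staleness checks and error handling, and only illustrates the overwrite and delete semantics.

```python
from pathlib import Path
from typing import Optional, Union

import pandas as pd


class InMemoryPersister:
    """Hypothetical dict-backed stand-in; not a real DataPersister subclass."""

    def __init__(self) -> None:
        self._store: dict[str, pd.DataFrame] = {}

    def get(self, file: Union[str, Path]) -> Optional[pd.DataFrame]:
        # None signals that nothing has been persisted for this file.
        frame = self._store.get(str(file))
        return None if frame is None else frame.copy()

    def set(self, file: Union[str, Path], data: pd.DataFrame) -> None:
        # Any existing entry for the file is overwritten.
        self._store[str(file)] = data.copy()

    def unset(self, file: Union[str, Path]) -> None:
        # Removing a file that was never persisted is treated as a no-op here.
        self._store.pop(str(file), None)


persister = InMemoryPersister()
persister.set("a.png", pd.DataFrame({"age": [41]}))
persister.set("a.png", pd.DataFrame({"age": [42]}))  # overwrites the first entry
stored = persister.get(Path("a.png"))
persister.unset("a.png")
after_unset = persister.get("a.png")
```

The second `set` call replaces the first entry, and `get` returns None once the entry has been removed with `unset`.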