base

Base interface for data persistence implementations.

Classes

BulkResult

class BulkResult(file_name_column: str, cached: Optional[pd.DataFrame], misses: list[Path]):

Container for the results of a bulk_get call.

Variables

static cached : Optional[pd.DataFrame]

static file_name_column : str

static misses : list[Path]

data - Ordered DataFrame with cached data, excluding the file names.

hits - Ordered Series of file name hits, possibly including duplicates.

Methods

get_cached_by_filename

def get_cached_by_filename(self, file_name: str) -> Optional[pd.DataFrame]:

Dataframe with cached data for a single file.

May contain multiple rows (e.g. for e2e files that contain several images).
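The lookup described above can be pictured with a small stand-in. This is only a sketch under stated assumptions: `BulkResultSketch` is a hypothetical class, not the library's implementation, and returning None for a file with no matching rows is an assumption rather than documented behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Optional

import pandas as pd


@dataclass
class BulkResultSketch:
    """Hypothetical stand-in for BulkResult; not the library's implementation."""

    file_name_column: str
    cached: Optional[pd.DataFrame]
    misses: list[Path]

    def get_cached_by_filename(self, file_name: str) -> Optional[pd.DataFrame]:
        # Without a cached frame there is nothing to look up.
        if self.cached is None:
            return None
        rows = self.cached[self.cached[self.file_name_column] == file_name]
        # Assumption: a file with no matching rows is reported as None.
        return rows if not rows.empty else None


# An e2e file contributes two rows (one per image); the png contributes one.
cached = pd.DataFrame(
    {
        "_original_filename": ["scan.e2e", "scan.e2e", "photo.png"],
        "image_index": [0, 1, 0],
    }
)
result = BulkResultSketch("_original_filename", cached, [Path("missing.png")])
e2e_rows = result.get_cached_by_filename("scan.e2e")
```

Here `e2e_rows` holds both rows for the multi-image file, which is why the return type is a dataframe rather than a single record.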

DataPersister

class DataPersister(file_name_column: str, lock: Optional[_Lock] = None, bulk_partition_size: Optional[int] = None):

Abstract interface for data persistence/caching implementations.

Ancestors

abc.ABC

Subclasses

SQLiteDataPersister

Static methods

prep_data_for_caching

def prep_data_for_caching(data: pd.DataFrame, image_cols: Optional[Collection[str]] = None) -> pd.DataFrame:

Prepares data ready for caching.

This involves removing or replacing anything that should not be cached or is not worth caching, such as image data or file paths that are only relevant while the files are actually in use.

Does not mutate input dataframe.
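A minimal sketch of the non-mutating preparation step follows. The function name `prep_for_caching_sketch` and the choice to drop image columns outright are assumptions made for illustration; the actual method may replace values with placeholders or handle further column types.

```python
from typing import Collection, Optional

import pandas as pd


def prep_for_caching_sketch(
    data: pd.DataFrame, image_cols: Optional[Collection[str]] = None
) -> pd.DataFrame:
    """Hypothetical illustration: strip image columns from a copy of the data."""
    # Work on a copy so the caller's dataframe is left untouched.
    prepared = data.copy()
    if image_cols:
        # Assumption: image columns are dropped; the real method may instead
        # replace their contents with a placeholder value.
        present = [col for col in image_cols if col in prepared.columns]
        prepared = prepared.drop(columns=present)
    return prepared


raw = pd.DataFrame({"age": [61, 47], "pixels": [b"\x00\x01", b"\x02\x03"]})
prepared = prep_for_caching_sketch(raw, image_cols=["pixels"])
```

After the call, `raw` still has its `pixels` column, while `prepared` contains only the cacheable columns.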

Methods

bulk_get

def bulk_get(self, files: Sequence[Union[str, Path]]) -> BulkResult:

Get the persisted data for several files.

Returns only misses if no data has been persisted, if the persisted data is out of date, or if an error was encountered.
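The split into cached rows and misses might look roughly like the sketch below. Everything here is hypothetical: `cache` is a plain dict standing in for a real backend, `bulk_get_sketch` returns a tuple rather than a BulkResult, and staleness and error handling are not modelled.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional, Sequence, Union

import pandas as pd

# Hypothetical in-memory cache keyed by file path string.
cache = {
    "a.png": pd.DataFrame({"_original_filename": ["a.png"], "label": ["cat"]}),
}


def bulk_get_sketch(
    files: Sequence[Union[str, Path]],
) -> tuple[Optional[pd.DataFrame], list[Path]]:
    """Return (cached rows, misses); a stand-in for building a BulkResult."""
    hits: list[pd.DataFrame] = []
    misses: list[Path] = []
    for file in files:
        frame = cache.get(str(file))
        if frame is None:
            # Nothing persisted for this file, so report it as a miss.
            misses.append(Path(file))
        else:
            hits.append(frame)
    # With no hits at all, only the misses carry information.
    cached = pd.concat(hits, ignore_index=True) if hits else None
    return cached, misses


cached, misses = bulk_get_sketch(["a.png", Path("b.png")])
```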

bulk_set

def bulk_set(self, data: pd.DataFrame, original_file_col: str = '_original_filename') -> None:

Set multiple cache entries from a single dataframe.

The dataframe must indicate the original file that each row is associated with. This is the _original_filename column by default.
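One way to picture the role of the original-file column is to group the dataframe by it and store one entry per file. The sketch below assumes a plain dict named `store` as the backend and a hypothetical function `bulk_set_sketch`; raising ValueError for a missing column is also an assumption.

```python
from __future__ import annotations

import pandas as pd

# Hypothetical in-memory store standing in for a real persistence backend.
store: dict[str, pd.DataFrame] = {}


def bulk_set_sketch(
    data: pd.DataFrame, original_file_col: str = "_original_filename"
) -> None:
    """Store one entry per source file, keyed by the original file name."""
    if original_file_col not in data.columns:
        # Assumption: a missing column is treated as a caller error.
        raise ValueError(f"Missing required column: {original_file_col}")
    # Rows that share a file name stay together in a single cache entry.
    for file_name, rows in data.groupby(original_file_col, sort=False):
        store[str(file_name)] = rows.reset_index(drop=True)


frame = pd.DataFrame(
    {
        "_original_filename": ["b.e2e", "a.png", "b.e2e"],
        "label": ["left", "cat", "right"],
    }
)
bulk_set_sketch(frame)
```

Both rows for `b.e2e` end up in the same entry, even though they were not adjacent in the input.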

get

def get(self, file: Union[str, Path]) -> Optional[pd.DataFrame]:

Get the persisted data for a given file.

Returns None if no data has been persisted, if the persisted data is out of date, or if an error was encountered.

set

def set(self, file: Union[str, Path], data: pd.DataFrame) -> None:

Set the persisted data for a given file.

If existing data is already set, it will be overwritten.

The data should only be the data that is related to that file.

unset

def unset(self, file: Union[str, Path]) -> None:

Deletes the persisted data for a given file.
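The get, set and unset contract can be illustrated with a dict-backed stand-in. `InMemoryPersisterSketch` is hypothetical and deliberately does not subclass DataPersister, since the abstract methods a real subclass must implement are not listed here; out-of-date detection and locking are not modelled.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional, Union

import pandas as pd


class InMemoryPersisterSketch:
    """Hypothetical dict-backed stand-in for a persistence implementation."""

    def __init__(self) -> None:
        self._store: dict[str, pd.DataFrame] = {}

    def get(self, file: Union[str, Path]) -> Optional[pd.DataFrame]:
        # None signals that nothing is persisted for this file.
        frame = self._store.get(str(file))
        return None if frame is None else frame.copy()

    def set(self, file: Union[str, Path], data: pd.DataFrame) -> None:
        # Any existing entry for the same file is overwritten.
        self._store[str(file)] = data.copy()

    def unset(self, file: Union[str, Path]) -> None:
        # Removing a file that was never persisted is a no-op in this sketch.
        self._store.pop(str(file), None)


persister = InMemoryPersisterSketch()
persister.set("a.png", pd.DataFrame({"label": ["cat"]}))
persister.set(Path("a.png"), pd.DataFrame({"label": ["dog"]}))  # overwrites
after_overwrite = persister.get("a.png")
persister.unset("a.png")
after_unset = persister.get("a.png")
```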
