base
Base interface for data persistence implementations.
Classes
BulkResult
class BulkResult(file_name_column: str, cached: Optional[pd.DataFrame], misses: list[Path], skipped: list[str] = []):
Container for the results of a bulk_get call.
Variables
- static
cached : Optional[pd.DataFrame]
- static
file_name_column : str
- static
misses : list[Path]
- static
skipped : list[str]
data: Ordered DataFrame with cached data, excluding the file names.
hits: Ordered Series of file name hits, possibly including duplicates.
Methods
get_cached_by_filename
def get_cached_by_filename(self, file_name: str) ‑> Optional[pd.DataFrame]:
Returns a DataFrame with the cached data for a single file.
May contain multiple rows (e.g. for e2e files that contain several images).
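The relationship between cached, data, hits and get_cached_by_filename can be sketched as follows. This is a minimal stand-in, not the library's implementation: the class name BulkResultSketch is hypothetical, and returning None for an unknown file name is an assumption based only on the Optional return type.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional

import pandas as pd


@dataclass
class BulkResultSketch:
    """Hypothetical stand-in mirroring the documented BulkResult fields."""

    file_name_column: str
    cached: Optional[pd.DataFrame]
    misses: list[Path]
    skipped: list[str] = field(default_factory=list)

    @property
    def hits(self) -> pd.Series:
        # File names of cached rows, in order, duplicates kept.
        if self.cached is None:
            return pd.Series(dtype=object)
        return self.cached[self.file_name_column]

    @property
    def data(self) -> pd.DataFrame:
        # Cached rows without the file name column.
        if self.cached is None:
            return pd.DataFrame()
        return self.cached.drop(columns=[self.file_name_column])

    def get_cached_by_filename(self, file_name: str) -> Optional[pd.DataFrame]:
        # Assumption: None when nothing is cached for this file.
        if self.cached is None:
            return None
        rows = self.cached[self.cached[self.file_name_column] == file_name]
        return rows if not rows.empty else None


cached = pd.DataFrame(
    {"_original_filename": ["a.e2e", "a.e2e", "b.png"], "value": [1, 2, 3]}
)
result = BulkResultSketch("_original_filename", cached, misses=[Path("c.png")])
```

Here the file a.e2e contributes two rows, which is the multi-row case the method description mentions.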
CacheClearResult
class CacheClearResult(*args, **kwargs):
Result structure for cache clearing operations.
Variables
- static
error : Optional[str]
- static
file_existed : bool
- static
file_path : Optional[str]
- static
success : bool
DataPersister
class DataPersister(file_name_column: str, lock: Optional[_Lock] = None, bulk_partition_size: Optional[int] = None):
Abstract interface for data persistence/caching implementations.
Subclasses
Static methods
prep_data_for_caching
def prep_data_for_caching(data: pd.DataFrame, image_cols: Optional[Collection[str]] = None) ‑> pd.DataFrame:
Prepares data ready for caching.
This involves removing or replacing values that should not be cached or that are pointless to cache, such as image data or file paths that only matter while the files are actually in use.
Does not mutate input dataframe.
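The non-mutating behaviour can be illustrated with a small sketch. The function name, the placeholder value and the choice to replace (rather than drop) image columns are all assumptions made for illustration; the actual method may treat these columns differently.

```python
from collections.abc import Collection
from typing import Optional

import pandas as pd


def prep_data_for_caching_sketch(
    data: pd.DataFrame,
    image_cols: Optional[Collection[str]] = None,
    placeholder: str = "<not cached>",  # hypothetical marker value
) -> pd.DataFrame:
    # Work on a copy so the caller's dataframe is left untouched.
    prepared = data.copy()
    for col in image_cols or ():
        if col in prepared.columns:
            # Assumption: image payloads are replaced with a marker.
            prepared[col] = placeholder
    return prepared


original = pd.DataFrame({"Pixel Data 0": ["raw-bytes"], "Patient ID": ["p1"]})
prepared = prep_data_for_caching_sketch(original, image_cols=["Pixel Data 0"])
```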
Methods
bulk_get
def bulk_get(self, files: Sequence[Union[str, Path]]) ‑> BulkResult:
Get the persisted data for several files.
Returns only misses if no data has been persisted, if it is out of date, or an error was otherwise encountered.
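The hit/miss split described above might look roughly like this. The helper name partition_hits_and_misses and the lookup callable are hypothetical; the sketch only shows the idea that any file without usable cached data, including one whose lookup fails, ends up in the misses.

```python
from pathlib import Path
from typing import Any, Callable, Optional, Sequence, Union


def partition_hits_and_misses(
    files: Sequence[Union[str, Path]],
    lookup: Callable[[str], Optional[Any]],
) -> tuple[dict[str, Any], list[Path]]:
    hits: dict[str, Any] = {}
    misses: list[Path] = []
    for file in files:
        try:
            value = lookup(str(file))
        except Exception:
            # A failed lookup is treated like a missing entry.
            value = None
        if value is None:
            misses.append(Path(file))
        else:
            hits[str(file)] = value
    return hits, misses


store = {"a.csv": {"rows": 3}}
hits, misses = partition_hits_and_misses(["a.csv", Path("b.csv")], store.get)
```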
bulk_set
def bulk_set(self, data: pd.DataFrame, original_file_col: str = '_original_filename') ‑> None:
Set multiple cache entries at once from a dataframe.
The dataframe must indicate the original file that each row is associated
with. This is the _original_filename column by default.
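One way to picture the per-file grouping is the sketch below, which writes into a plain dict. The function name, the dict-backed store and the ValueError for a missing column are assumptions; the real method's storage and error handling are not described here.

```python
import pandas as pd


def bulk_set_sketch(
    store: dict[str, pd.DataFrame],
    data: pd.DataFrame,
    original_file_col: str = "_original_filename",
) -> None:
    if original_file_col not in data.columns:
        # Assumption: a missing file column is reported as an error.
        raise ValueError(f"missing column: {original_file_col}")
    # One cache entry per original file; a file may own several rows.
    for file_name, rows in data.groupby(original_file_col, sort=False):
        store[str(file_name)] = rows.reset_index(drop=True)


store: dict[str, pd.DataFrame] = {}
frame = pd.DataFrame(
    {"_original_filename": ["a.e2e", "b.png", "a.e2e"], "value": [1, 2, 3]}
)
bulk_set_sketch(store, frame)
```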
clear_cache_file
def clear_cache_file(self) ‑> CacheClearResult:
Delete the cache storage completely.
Returns a CacheClearResult describing the outcome of the cache clearing operation.
get
def get(self, file: Union[str, Path]) ‑> Optional[pd.DataFrame]:
Get the persisted data for a given file.
Returns None if no data has been persisted, if it is out of date, or an error was otherwise encountered.
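The three None cases can be illustrated with a small in-memory sketch. It assumes that "out of date" means the file was modified after it was cached, which this reference does not confirm, and the class name InMemoryCacheSketch is hypothetical.

```python
from pathlib import Path
from typing import Any, Optional, Union


class InMemoryCacheSketch:
    """Hypothetical dict-backed cache keyed by file path."""

    def __init__(self) -> None:
        # path -> (file mtime when cached, cached data)
        self._entries: dict[str, tuple[float, Any]] = {}

    def set(self, file: Union[str, Path], data: Any) -> None:
        path = Path(file)
        # Any existing entry for this file is overwritten.
        self._entries[str(path)] = (path.stat().st_mtime, data)

    def get(self, file: Union[str, Path]) -> Optional[Any]:
        path = Path(file)
        try:
            cached_mtime, data = self._entries[str(path)]
            # Assumption: a newer modification time means a stale entry.
            if path.stat().st_mtime > cached_mtime:
                return None
            return data
        except (KeyError, OSError):
            # Nothing persisted, or the file could not be inspected.
            return None

    def unset(self, file: Union[str, Path]) -> None:
        self._entries.pop(str(Path(file)), None)
```

Callers of such an interface only need to handle one fallback path: a None result means the file has to be processed again.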
get_all_cached_file_paths
def get_all_cached_file_paths(self) ‑> list[str]:
Get list of all cached file paths.
Returns a list of canonical file paths (as strings) that have entries in the cache.
get_all_skipped_files
def get_all_skipped_files(self) ‑> list[str]:
Get list of all skipped file paths.
Returns a list of file paths that have been marked as skipped.
get_cached_distinct_values
def get_cached_distinct_values(self, columns: Sequence[str], file_paths: Optional[Sequence[Union[str, Path]]] = None) ‑> dict[str, list[Any]]:
Get distinct values for columns from cache, optionally scoped to files.
get_cached_dtype_sample
def get_cached_dtype_sample(self, file_paths: Optional[Sequence[Union[str, Path]]] = None, limit: int = 100) ‑> pd.DataFrame:
Get a bounded cache sample for dtype reconciliation.
get_cached_row_count
def get_cached_row_count(self, file_paths: Optional[Sequence[Union[str, Path]]] = None) ‑> int:
Get row count from cached data, optionally scoped to selected files.
get_cached_table_columns
def get_cached_table_columns(self) ‑> list[str]:
Get all column names currently present in cached data storage.
Returns an empty list if cache is not initialised or an error occurs.
get_column_for_id
def get_column_for_id(self, id_value: str, id_column: str, target_column: str) ‑> list[Any]:
Get all values of a target column for rows matching a given ID.
Queries the cached data for all entries where id_column equals
id_value and returns the corresponding values from
target_column.
Arguments
- id_value: The ID value to match against.
- id_column: The name of the column containing IDs to filter on.
- target_column: The name of the column whose values should be returned.
Returns
A list of values from target_column for all matching rows.
Returns an empty list if no matches are found, the cache is not
initialised, or an error occurs.
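As an illustration of this lookup, the sketch below runs the equivalent query against a SQLite table. The SQL backing, the table name and the function name are assumptions made for the example; the interface itself does not prescribe how implementations store data.

```python
import sqlite3
from typing import Any


def get_column_for_id_sketch(
    conn: sqlite3.Connection,
    table: str,  # hypothetical cache table name
    id_value: str,
    id_column: str,
    target_column: str,
) -> list[Any]:
    try:
        # Only query columns that exist; an unknown table yields no columns.
        known = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
        if id_column not in known or target_column not in known:
            return []
        rows = conn.execute(
            f'SELECT "{target_column}" FROM "{table}" WHERE "{id_column}" = ?',
            (id_value,),
        ).fetchall()
        return [row[0] for row in rows]
    except sqlite3.Error:
        # Mirror the documented behaviour: errors give an empty list.
        return []


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE cache ("Patient ID" TEXT, "Modality" TEXT)')
conn.executemany(
    "INSERT INTO cache VALUES (?, ?)",
    [("p1", "OCT"), ("p1", "SLO"), ("p2", "OCT")],
)
values = get_column_for_id_sketch(conn, "cache", "p1", "Patient ID", "Modality")
```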
get_column_values_for_files
def get_column_values_for_files(self, file_paths: Sequence[Union[str, Path]], columns: Sequence[str]) ‑> dict[str, dict[str, Any]]:
Get specific column values for multiple files via targeted queries.
Retrieves only the requested columns from the cache for the given
files, avoiding loading full rows into DataFrames. This is
significantly more efficient than bulk_get when only a subset of
columns is needed (e.g. during filtering).
Arguments
- file_paths: The file paths to query.
- columns: The column names to retrieve from the cached data.
Returns
A nested dict mapping file_path -> {column_name -> value}.
Files not found in the cache are omitted from the result.
Returns an empty dict if the cache is not initialised or an
error occurs.
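The shape of the returned mapping can be sketched with a targeted SQLite query that selects only the requested columns. The SQL backing, the table and file column names, and keeping only the first row seen per file are assumptions for illustration.

```python
import sqlite3
from pathlib import Path
from typing import Any, Sequence, Union


def column_values_for_files_sketch(
    conn: sqlite3.Connection,
    table: str,  # hypothetical cache table name
    file_col: str,  # hypothetical file name column
    file_paths: Sequence[Union[str, Path]],
    columns: Sequence[str],
) -> dict[str, dict[str, Any]]:
    paths = [str(p) for p in file_paths]
    if not paths or not columns:
        return {}
    try:
        known = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
        if file_col not in known or any(c not in known for c in columns):
            return {}
        col_sql = ", ".join(f'"{c}"' for c in columns)
        marks = ", ".join("?" for _ in paths)
        sql = (
            f'SELECT "{file_col}", {col_sql} FROM "{table}" '
            f'WHERE "{file_col}" IN ({marks})'
        )
        result: dict[str, dict[str, Any]] = {}
        for row in conn.execute(sql, paths):
            # Assumption: keep the first row seen for each file.
            result.setdefault(row[0], dict(zip(columns, row[1:])))
        return result
    except sqlite3.Error:
        return {}


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE cache ("file" TEXT, "Modality" TEXT, "Rows" INTEGER)')
conn.executemany(
    "INSERT INTO cache VALUES (?, ?, ?)",
    [("a.dcm", "OCT", 496), ("b.dcm", "SLO", 768)],
)
found = column_values_for_files_sketch(
    conn, "cache", "file", ["a.dcm", Path("c.dcm")], ["Modality"]
)
```

In this usage c.dcm is not in the table, so it is simply absent from the result rather than mapped to an empty dict.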
get_skip_reason_summary
def get_skip_reason_summary(self) ‑> pandas.core.frame.DataFrame:
Get aggregate statistics of skip reasons.
Returns a DataFrame with columns: reason_code, reason_description, file_count.
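The aggregation behind such a summary can be sketched with the standard library alone. The input mapping, the reason codes and the helper name are hypothetical; the rows produced use the documented column names and could be passed to a DataFrame constructor.

```python
from collections import Counter


def skip_reason_summary_rows(
    skipped: dict[str, tuple[int, str]],
) -> list[dict[str, object]]:
    # skipped maps file path -> (reason code, reason description).
    counts = Counter(skipped.values())
    return [
        {"reason_code": code, "reason_description": desc, "file_count": n}
        for (code, desc), n in sorted(counts.items())
    ]


# Hypothetical skip records; the codes are made up for this example.
skipped_files = {
    "a.dcm": (1, "unreadable file"),
    "b.dcm": (2, "missing required field"),
    "c.dcm": (1, "unreadable file"),
}
summary_rows = skip_reason_summary_rows(skipped_files)
```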
is_file_skipped
def is_file_skipped(self, file: Union[str, Path]) ‑> bool:
Check if a file has been previously skipped.
Arguments
file: The file path to check.
Returns True if the file has been marked as skipped, False otherwise.
mark_file_skipped
def mark_file_skipped(self, file: Union[str, Path], reason: FileSkipReason) ‑> None:
Mark a file as skipped with the given reason.
Wraps the underlying _mark_file_skipped implementation with error
handling so that a failure to persist the skip record (e.g. a transient
OS/network error) does not propagate up and crash the caller.
Arguments
- file: The file path that was skipped.
- reason: The reason why the file was skipped.
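The wrapping pattern described above can be sketched as a public method that delegates to a private one and swallows failures. The class names are hypothetical, and a plain string stands in for FileSkipReason; logging the failure is an assumption about what a reasonable implementation might do.

```python
import logging
from pathlib import Path
from typing import Union

logger = logging.getLogger(__name__)


class SkipTrackerSketch:
    """Hypothetical stand-in for the skip-tracking part of a persister."""

    def __init__(self) -> None:
        self._skipped: dict[str, str] = {}

    def _mark_file_skipped(self, file: Union[str, Path], reason: str) -> None:
        self._skipped[str(file)] = reason

    def mark_file_skipped(self, file: Union[str, Path], reason: str) -> None:
        try:
            self._mark_file_skipped(file, reason)
        except Exception as exc:
            # Failing to record a skip must not crash the caller.
            logger.warning("Could not record skip for %s: %s", file, exc)

    def is_file_skipped(self, file: Union[str, Path]) -> bool:
        return str(file) in self._skipped


class FailingTrackerSketch(SkipTrackerSketch):
    def _mark_file_skipped(self, file: Union[str, Path], reason: str) -> None:
        # Simulates a transient OS/network failure.
        raise OSError("disk unavailable")
```

With this structure, subclasses only implement the private method and inherit the error handling.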
set
def set(self, file: Union[str, Path], data: pd.DataFrame) ‑> None:
Set the persisted data for a given file.
If existing data is already set, it will be overwritten.
The data should only be the data that is related to that file.
touch
def touch(self, file_paths: Optional[Sequence[Union[str, Path]]] = None) ‑> None:
Mark the given cached entries as recently validated.
This signals to the cache that the entries for the given files are still current and should not be considered stale. The concrete effect depends on the implementation.
Files not present in the cache are silently ignored.
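Since the concrete effect is implementation-specific, the following is only one possible reading: a timestamp per entry that is refreshed on touch. The function name, the timestamp mapping, the now parameter and the treatment of file_paths=None as "all entries" are assumptions, not documented behaviour.

```python
import time
from pathlib import Path
from typing import Optional, Sequence, Union


def touch_sketch(
    last_validated: dict[str, float],
    file_paths: Optional[Sequence[Union[str, Path]]] = None,
    now: Optional[float] = None,
) -> None:
    stamp = time.time() if now is None else now
    # Assumption: None means every cached entry is touched.
    if file_paths is None:
        targets = list(last_validated)
    else:
        targets = [str(p) for p in file_paths]
    for key in targets:
        # Files without a cache entry are silently ignored.
        if key in last_validated:
            last_validated[key] = stamp


stamps = {"a.csv": 1.0, "b.csv": 2.0}
touch_sketch(stamps, ["a.csv", "missing.csv"], now=10.0)
```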
unset
def unset(self, file: Union[str, Path]) ‑> None:
Delete the persisted data for a given file.