utils

Utility functions concerning data sources.

Module

Functions

load_data_in_memory

def load_data_in_memory(datasource: base_source.BaseSource, **kwargs: Any) -> pandas.core.frame.DataFrame:

Load all data from a datasource into memory and return it as a single DataFrame.

Arguments

  • datasource: the datasource to load from.
  • kwargs: kwargs to pass through to the underlying yield_data() call.
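
As a rough illustration of what loading everything into memory amounts to, the sketch below concatenates the chunks produced by a datasource's yield_data() into one DataFrame. ToySource, load_all and the chunk_size keyword are hypothetical stand-ins for illustration, not the library's classes or parameters, and the real function may handle edge cases (such as an empty source) differently.

```python
from typing import Any, Iterator

import pandas as pd


class ToySource:
    """Hypothetical stand-in for a datasource; not the library's BaseSource."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = rows

    def yield_data(self, chunk_size: int = 2, **kwargs: Any) -> Iterator[pd.DataFrame]:
        # Yield the rows in fixed-size chunks, one DataFrame per chunk.
        for start in range(0, len(self._rows), chunk_size):
            yield pd.DataFrame(self._rows[start : start + chunk_size])


def load_all(source: ToySource, **kwargs: Any) -> pd.DataFrame:
    """Assumed behaviour: concatenate every yielded chunk into one DataFrame."""
    # kwargs are forwarded unchanged to yield_data(), as the docs describe.
    chunks = list(source.yield_data(**kwargs))
    if not chunks:
        return pd.DataFrame()
    return pd.concat(chunks, ignore_index=True)


source = ToySource([{"id": i, "value": i * 10} for i in range(5)])
df = load_all(source, chunk_size=2)
```

Because every chunk is held in memory at once, this pattern only suits datasets that fit comfortably in RAM.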

task_running_context_manager

def task_running_context_manager(datasource: base_source.BaseSource) -> collections.abc.Generator[BaseSource, None, None]:

A context manager to temporarily set a datasource in a "task running" context.
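
The general pattern behind a context manager like this can be sketched with contextlib. The ToySource class and its is_task_running flag are assumptions made for illustration; the source does not say how the library actually records the "task running" state.

```python
from contextlib import contextmanager
from typing import Iterator


class ToySource:
    """Hypothetical datasource with an assumed `is_task_running` flag."""

    def __init__(self) -> None:
        self.is_task_running = False


@contextmanager
def task_running(source: ToySource) -> Iterator[ToySource]:
    """Mark `source` as running a task for the duration of the with-block."""
    previous = source.is_task_running
    source.is_task_running = True
    try:
        yield source
    finally:
        # Restore the earlier state even if the block raises.
        source.is_task_running = previous


source = ToySource()
with task_running(source) as active:
    inside = active.is_task_running
after = source.is_task_running
```

Restoring the previous value in a finally clause, rather than unconditionally resetting the flag to False, keeps nested uses of the context manager consistent.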

Classes

FileSystemFilter

class FileSystemFilter(file_extension: Optional[SingleOrMulti[str]] = None, strict_file_extension: bool = False, file_creation_min_date: Optional[Union[Date, DateTD]] = None, file_modification_min_date: Optional[Union[Date, DateTD]] = None, file_creation_max_date: Optional[Union[Date, DateTD]] = None, file_modification_max_date: Optional[Union[Date, DateTD]] = None, min_file_size: Optional[float] = None, max_file_size: Optional[float] = None):

Filter files based on various criteria.

Arguments

  • file_extension: File extension(s) of the data files. If None, all files will be searched. Can either be a single file extension or a list of file extensions. Case-insensitive. Defaults to None.
  • strict_file_extension: Whether to load only files that carry the explicit file extension provided. If True, only files with that extension are loaded. Otherwise, the given path is scanned for files of the same type as the provided file extension. Only relevant if file_extension is provided. Defaults to False.
  • file_creation_min_date: The oldest possible date to consider for file creation. If None, this filter will not be applied. Defaults to None.
  • file_modification_min_date: The oldest possible date to consider for file modification. If None, this filter will not be applied. Defaults to None.
  • file_creation_max_date: The newest possible date to consider for file creation. If None, this filter will not be applied. Defaults to None.
  • file_modification_max_date: The newest possible date to consider for file modification. If None, this filter will not be applied. Defaults to None.
  • min_file_size: The minimum file size in megabytes to consider. If None, all files will be considered. Defaults to None.
  • max_file_size: The maximum file size in megabytes to consider. If None, all files will be considered. Defaults to None.

Methods


check_skip_file

def check_skip_file(self, entry: Optional[os.DirEntry] = None, path: Optional[str | os.PathLike] = None, stat: Optional[os.stat_result] = None) -> bool:

Filter files based on the criteria provided.

Check the following things in order:

  • is this a file?
  • is this an allowed type of file?
  • does this file meet the date criteria?
  • does this file meet the file size criteria?

Supply either entry or path. If path is supplied, stat may also be provided; if it is not, the stat details are read from the file system.

If both entry and path are provided, then entry will take precedence.

Arguments

  • entry: The file to check as an os.DirEntry object, as from os.scandir(). Mutually exclusive with path.
  • path: The file path to check. Mutually exclusive with entry.
  • stat: The os.stat() details associated with path. Optional, will be read directly if not provided.

Returns True if the file should be skipped, False otherwise.
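
The ordered checks listed above can be sketched using only the standard library. This is a simplified stand-in, not the library's implementation: should_skip and its parameter names are hypothetical, it covers only a minimum modification date and a maximum size, and it assumes one megabyte is 1024 * 1024 bytes, which the source does not specify.

```python
import os
import tempfile
from datetime import date, datetime
from typing import Optional

BYTES_PER_MB = 1024 * 1024  # assumption: the real class may use 10**6 instead


def should_skip(
    path: str,
    allowed_extensions: Optional[set[str]] = None,
    modification_min_date: Optional[date] = None,
    max_size_mb: Optional[float] = None,
    stat: Optional[os.stat_result] = None,
) -> bool:
    """Return True if the file at `path` should be skipped."""
    # 1. Is this a file?
    if not os.path.isfile(path):
        return True
    # 2. Is this an allowed type of file? (case-insensitive comparison)
    if allowed_extensions is not None:
        extension = os.path.splitext(path)[1].lower()
        if extension not in allowed_extensions:
            return True
    # Read the stat details only if the caller did not supply them.
    if stat is None:
        stat = os.stat(path)
    # 3. Does this file meet the date criteria?
    if modification_min_date is not None:
        modified = datetime.fromtimestamp(stat.st_mtime).date()
        if modified < modification_min_date:
            return True
    # 4. Does this file meet the file size criteria?
    if max_size_mb is not None and stat.st_size / BYTES_PER_MB > max_size_mb:
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "records.CSV")
    with open(csv_path, "w") as f:
        f.write("a,b\n1,2\n")
    keep_csv = should_skip(csv_path, allowed_extensions={".csv"})
    skip_wrong_type = should_skip(csv_path, allowed_extensions={".json"})
    skip_directory = should_skip(tmp)
    skip_too_large = should_skip(csv_path, max_size_mb=0.000001)
```

Running the cheap checks (file type, extension) before reading stat details means files rejected early never need a stat call.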

log_files_found_with_extension

def log_files_found_with_extension(self, num_found_files: int, interim: bool = True) -> None:

Log the files found with the given extension.