utils
Utility functions concerning data sources.
Module
Functions
load_data_in_memory
def load_data_in_memory( datasource: base_source.BaseSource, **kwargs: Any,) ‑> pandas.core.frame.DataFrame:
Load all data from a datasource into memory and return a single DataFrame.
Arguments
datasource
: The datasource to load from.
kwargs
: Keyword arguments to pass through to the underlying yield_data() call.
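As a rough illustration of what loading everything into memory involves, the sketch below concatenates the chunks yielded by a datasource into one DataFrame. ToySource and load_all are hypothetical stand-ins, not part of this module; only the yield_data() name comes from the description above, and its real signature and behaviour may differ.

```python
from typing import Any, Iterator

import pandas as pd


class ToySource:
    """Hypothetical stand-in for a datasource that yields data in chunks."""

    def yield_data(self, **kwargs: Any) -> Iterator[pd.DataFrame]:
        yield pd.DataFrame({"x": [1, 2]})
        yield pd.DataFrame({"x": [3]})


def load_all(source: ToySource, **kwargs: Any) -> pd.DataFrame:
    """Concatenate every yielded chunk into a single DataFrame."""
    # kwargs are forwarded untouched to the underlying yield_data() call.
    chunks = list(source.yield_data(**kwargs))
    if not chunks:
        return pd.DataFrame()
    # ignore_index gives the combined frame one continuous row index.
    return pd.concat(chunks, ignore_index=True)


df = load_all(ToySource())
```

Because every chunk is held in memory at once, this approach only suits datasets that comfortably fit in RAM.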
task_running_context_manager
def task_running_context_manager( datasource: base_source.BaseSource,) ‑> collections.abc.Generator[BaseSource, None, None]:
A context manager to temporarily set a datasource in a "task running" context.
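The general pattern of temporarily putting an object into a particular state can be sketched with contextlib. This is a minimal sketch under assumptions: ToySource, task_running, and the is_task_running attribute are hypothetical names chosen for illustration, and how the real datasource records its "task running" state is not documented here.

```python
from contextlib import contextmanager
from typing import Iterator


class ToySource:
    """Hypothetical datasource with a flag marking that a task is running."""

    def __init__(self) -> None:
        self.is_task_running = False  # assumed attribute name


@contextmanager
def task_running(source: ToySource) -> Iterator[ToySource]:
    """Set the flag on entry and restore the previous value on exit."""
    previous = source.is_task_running
    source.is_task_running = True
    try:
        yield source
    finally:
        # Restore the earlier state even if the body raised.
        source.is_task_running = previous


source = ToySource()
with task_running(source) as s:
    inside = s.is_task_running
after = source.is_task_running
```

Restoring the previous value, rather than always resetting to False, keeps nested uses of the context manager well behaved.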
Classes
FileSystemFilter
class FileSystemFilter( file_extension: Optional[SingleOrMulti[str]] = None, strict_file_extension: bool = False, file_creation_min_date: Optional[Union[Date, DateTD]] = None, file_modification_min_date: Optional[Union[Date, DateTD]] = None, file_creation_max_date: Optional[Union[Date, DateTD]] = None, file_modification_max_date: Optional[Union[Date, DateTD]] = None, min_file_size: Optional[float] = None, max_file_size: Optional[float] = None,):
Filter files based on various criteria.
Arguments
file_extension
: File extension(s) of the data files. Can be either a single file extension or a list of file extensions. Case-insensitive. If None, all files will be searched. Defaults to None.
strict_file_extension
: Whether to load only files that have exactly the file extension provided. If True, only files with that extension are loaded. Otherwise, the given path is scanned for files of the same type as the provided file extension. Only relevant if file_extension is provided. Defaults to False.
file_creation_min_date
: The oldest possible date to consider for file creation. If None, this filter will not be applied. Defaults to None.
file_modification_min_date
: The oldest possible date to consider for file modification. If None, this filter will not be applied. Defaults to None.
file_creation_max_date
: The newest possible date to consider for file creation. If None, this filter will not be applied. Defaults to None.
file_modification_max_date
: The newest possible date to consider for file modification. If None, this filter will not be applied. Defaults to None.
min_file_size
: The minimum file size in megabytes to consider. If None, no minimum size is applied. Defaults to None.
max_file_size
: The maximum file size in megabytes to consider. If None, no maximum size is applied. Defaults to None.
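To make the "single or list, case-insensitive" behaviour of file_extension concrete, here is a small standalone sketch. The helpers normalise_extensions and extension_allowed are hypothetical and not part of FileSystemFilter; in particular, storing extensions with a leading dot is an assumption, and the non-strict "same type" matching is not modelled.

```python
from pathlib import Path
from typing import Optional, Sequence, Set, Union


def normalise_extensions(
    file_extension: Optional[Union[str, Sequence[str]]],
) -> Optional[Set[str]]:
    """Return lower-cased extensions with a leading dot, or None for 'all files'."""
    if file_extension is None:
        return None
    if isinstance(file_extension, str):
        # Treat a single extension as a one-element list.
        file_extension = [file_extension]
    return {
        ext.lower() if ext.startswith(".") else f".{ext.lower()}"
        for ext in file_extension
    }


def extension_allowed(path: str, allowed: Optional[Set[str]]) -> bool:
    """Case-insensitive check of a path's suffix against the allowed set."""
    if allowed is None:
        return True
    return Path(path).suffix.lower() in allowed
```

For example, `extension_allowed("scan.JPG", normalise_extensions("jpg"))` is true under these assumptions, while a `.txt` file would be rejected.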
Methods
check_skip_file
def check_skip_file( self, entry: Optional[os.DirEntry] = None, path: Optional[str | os.PathLike] = None, stat: Optional[os.stat_result] = None,) ‑> bool:
Filter files based on the criteria provided.
Check the following things in order:
- is this a file?
- is this an allowed type of file?
- does this file meet the date criteria?
- does this file meet the file size criteria?
Either entry OR path should be supplied. If path is supplied, stat may optionally be provided as well; if it is not, it will be newly read.
If both entry and path are provided, entry takes precedence.
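The precedence rules above can be sketched as a small helper. resolve_target is a hypothetical function written for illustration, not the method's actual implementation; raising ValueError when neither argument is given is an assumption.

```python
import os
from typing import Optional, Tuple, Union


def resolve_target(
    entry: Optional[os.DirEntry] = None,
    path: Optional[Union[str, os.PathLike]] = None,
    stat: Optional[os.stat_result] = None,
) -> Tuple[str, os.stat_result]:
    """Return the path and stat details to check, preferring entry."""
    if entry is not None:
        # entry wins when both are supplied; DirEntry.stat() caches its result.
        return entry.path, entry.stat()
    if path is None:
        # Assumed behaviour: the real method may handle this case differently.
        raise ValueError("Either entry or path must be supplied.")
    # Reuse a caller-provided stat, otherwise read it now.
    return os.fspath(path), stat if stat is not None else os.stat(path)
```

Passing an existing stat result avoids a second filesystem call when the caller has already inspected the file.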
Arguments
entry
: The file to check, as an os.DirEntry object such as those returned by os.scandir(). Mutually exclusive with path.
path
: The file path to check. Mutually exclusive with entry.
stat
: The os.stat() details associated with path. Optional; will be read directly if not provided.
Returns True if the file should be skipped, False otherwise.
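The ordered checks described for this method can be approximated with the standard library alone. The sketch below is not the FileSystemFilter implementation: should_skip and its parameters are hypothetical, only one date bound and one size bound are shown, and treating a megabyte as 1024 * 1024 bytes is an assumption.

```python
import os
from datetime import date, datetime
from typing import Optional, Set

BYTES_PER_MB = 1024 * 1024  # assumption: the real unit conversion may differ


def should_skip(
    entry: os.DirEntry,
    allowed_extensions: Optional[Set[str]] = None,
    modification_min_date: Optional[date] = None,
    max_file_size_mb: Optional[float] = None,
) -> bool:
    """Return True if the entry fails any check, applied in the documented order."""
    # 1. Is this a file?
    if not entry.is_file():
        return True
    # 2. Is this an allowed type of file?
    if allowed_extensions is not None:
        suffix = os.path.splitext(entry.name)[1].lower()
        if suffix not in allowed_extensions:
            return True
    stat = entry.stat()
    # 3. Does this file meet the date criteria?
    if modification_min_date is not None:
        modified = datetime.fromtimestamp(stat.st_mtime).date()
        if modified < modification_min_date:
            return True
    # 4. Does this file meet the file size criteria?
    if max_file_size_mb is not None:
        if stat.st_size > max_file_size_mb * BYTES_PER_MB:
            return True
    return False
```

A caller could then walk a directory with `os.scandir()` and keep only the entries for which `should_skip` returns False.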
log_files_found_with_extension
def log_files_found_with_extension( self, num_found_files: int, interim: bool = True,) ‑> None:
Log the files found with the given extension.