heidelberg_source

Data source for loading ophthalmology files using private-eye.

Classes

HeidelbergCSVColumns

class HeidelbergCSVColumns(heidelberg_files_col: str = 'heidelberg_file'):

Arguments for ophthalmology columns in the csv.

Arguments

  • heidelberg_files_col: The name of the column that points to Heidelberg files in the CSV file. Defaults to 'heidelberg_file'. These files should all be in the .sdb format.

Variables

  • static heidelberg_files_col : str
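
For instance, to point the datasource at a CSV whose path column is not named 'heidelberg_file', you might construct the columns object like this (a minimal sketch; the import path is an assumption):

```python
# Minimal sketch: map a non-default CSV column to the Heidelberg file paths.
# The import path is an assumption; adjust it to wherever this module lives.
from heidelberg_source import HeidelbergCSVColumns

# The referenced files should all be in the .sdb format.
csv_columns = HeidelbergCSVColumns(heidelberg_files_col="scan_path")
```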

HeidelbergSource

class HeidelbergSource(
    private_eye_parser: Union[PrivateEyeParser, Mapping[str, PrivateEyeParser]],
    path: Union[os.PathLike, str],
    parsers: Optional[Union[PrivateEyeParser, Mapping[str, PrivateEyeParser]]] = None,
    heidelberg_csv_columns: Optional[Union[HeidelbergCSVColumns, _HeidelbergCSVColumnsTD]] = None,
    required_fields: Optional[dict[str, Any]] = None,
    ophthalmology_args: Optional[Union[OphthalmologyDataSourceArgs, _OphthalmologyDataSourceArgsTD]] = None,
    data_cache: Optional[DataPersister] = None,
    infer_class_labels_from_filepaths: bool = False,
    output_path: Optional[Union[os.PathLike, str]] = None,
    iterable: bool = True,
    fast_load: bool = True,
    cache_images: bool = False,
    filter: Optional[FileSystemFilter] = None,
    data_splitter: Optional[DatasetSplitter] = None,
    seed: Optional[int] = None,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
    modifiers: Optional[dict[str, DataPathModifiers]] = None,
    partition_size: int = 16,
):

Data source for loading Heidelberg files.

Arguments

  • **kwargs: Keyword arguments passed to the parent base classes.
  • cache_images: Whether to cache images in the file system. Defaults to False. This is ignored if fast_load is True.
  • data_cache: A DataPersister instance to use for data caching.
  • data_splitter: Deprecated argument, will be removed in a future release. Defaults to None. Not used.
  • fast_load: Whether the data will be loaded in fast mode. This is used to determine whether the data will be iterated over during set up for schema generation and splitting (where necessary). Only relevant if iterable is True, otherwise it is ignored. Defaults to True.
  • heidelberg_csv_columns: If path is a CSV file, this contains information about the specific columns that contain the path information for the Heidelberg files. If not provided, it is assumed that the CSV file contains a column named 'heidelberg_file' that contains the paths to the Heidelberg files. Defaults to None.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
  • infer_class_labels_from_filepaths: Whether class labels should be added to the data based on the file paths. Labels are inferred from the first directory level within self.path, but can come from a level deeper if a data splitter is provided with infer_data_split_labels set to True. Defaults to False.
  • iterable: Whether the data source is iterable. This is used to determine whether the data source can be used in a streaming context during a task. Defaults to True.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • ophthalmology_args: Arguments for ophthalmology modality data.
  • output_path: The path where intermediary output files are saved. Defaults to 'preprocessed/'.
  • parsers: The private eye parsers to use for the different file extensions. Only needs to be supplied if file_extension filter is non-default. Can either be a single parser to use for all file extensions or a mapping of file extensions to parser type. Defaults to appropriate parser(s) for the default file extension(s).
  • partition_size: The size of each partition when iterating over the data in a batched fashion.
  • path: The path to the directory containing the Heidelberg files or to a CSV file that includes Heidelberg files as one of its columns. If a CSV file is provided, the file extensions specified in file_extension will be ignored. If a CSV file is provided, the heidelberg_csv_columns argument should also be provided.
  • private_eye_parser: Private-eye supported machine type(s). Can either be a single parser to use for all files or a mapping of file extension to the desired parser type. If private_eye_parser is a mapping of file extensions to parsers, there must be a parser for each file extension specified. If no file extensions are specified, the mapping can take any form (warnings will be logged if an extension is encountered for which no parser is specified). If private_eye_parser is a single parser, file_extension can be anything; the parser is simply tried against every extension.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.
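
As a rough sketch of how these arguments fit together when loading from a CSV manifest (import paths and the parser placeholder are assumptions, not part of this reference):

```python
# Sketch only: import paths and the parser value are assumptions.
from heidelberg_source import HeidelbergCSVColumns, HeidelbergSource

parser = ...  # placeholder: a PrivateEyeParser, or a mapping of file extension -> parser

source = HeidelbergSource(
    private_eye_parser=parser,
    path="data/heidelberg_manifest.csv",  # CSV with a column of .sdb file paths
    heidelberg_csv_columns=HeidelbergCSVColumns(heidelberg_files_col="scan_path"),
    iterable=True,
    fast_load=True,
    partition_size=16,
)
```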

Attributes

  • seed: Random number seed. Used for setting random seed for all libraries.

Raises

  • ValueError: If the minimum DOB is greater than the maximum DOB.
  • ValueError: If the minimum number of B-scans is greater than the maximum number of B-scans.

Ancestors

  • FileSystemIterableSourceInferrable
  • BaseSource

Variables

  • file_names : list[str] - Returns a list of file names in the specified directory.

    This property accounts for files skipped at runtime by filtering them out of the list of cached file names. Files may get skipped at runtime due to errors or because they don't contain any image data and images_only is True. This allows us to skip these files again more quickly if they are still present in the directory.

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns True if a task is running.
  • path : pathlib.Path - Resolved absolute path to data.

    Provides a consistent version of the path provided by the user which should work throughout regardless of operating system and of directory structure.

  • selected_file_names : list[str] - Returns a list of selected file names.

    Selected file names are affected by the selected_file_names_override and new_file_names_only attributes.

  • selected_file_names_differ : bool - Returns True if selected_file_names will differ from default.

    In particular, returns True iff there is a selected file names override in place and/or there is filtering for new file names only present.

Static methods


get_num_workers

def get_num_workers(file_names: Sequence[str]) -> int:

Inherited from:

FileSystemIterableSourceInferrable.get_num_workers

Gets the number of workers to use for multiprocessing.

Ensures that the number of workers is at least 1 and at most MAX_NUM_MULTIPROCESSING_WORKERS. If the number of files is less than MAX_NUM_MULTIPROCESSING_WORKERS, the number of files is used as the number of workers, unless the number of machine cores is also less than MAX_NUM_MULTIPROCESSING_WORKERS, in which case the lower of the two is used.

Arguments

  • file_names: The list of file names to load.

Returns The number of workers to use for multiprocessing.
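
The rule above amounts to clamping the worker count by the number of files, the number of machine cores, and the hard cap. A hedged reimplementation of that logic (the cap value here is illustrative; the real constant is defined by the library):

```python
import os
from typing import Sequence

MAX_NUM_MULTIPROCESSING_WORKERS = 8  # illustrative value only

def get_num_workers_sketch(file_names: Sequence[str]) -> int:
    """At least 1 worker, and no more than the file count, core count, or cap."""
    cores = os.cpu_count() or 1
    return max(1, min(len(file_names), cores, MAX_NUM_MULTIPROCESSING_WORKERS))
```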

Methods


add_hook

def add_hook(self, hook: DataSourceHook) -> None:

Inherited from:

FileSystemIterableSourceInferrable.add_hook

Add a hook to the datasource.

apply_ignore_cols

def apply_ignore_cols(self, df: pd.DataFrame) -> pandas.core.frame.DataFrame:

Inherited from:

FileSystemIterableSourceInferrable.apply_ignore_cols

Apply ignored columns to dataframe, dropping columns as needed.

Returns A copy of the dataframe with ignored columns removed, or the original dataframe if this datasource does not specify any ignore columns.

apply_ignore_cols_iter

def apply_ignore_cols_iter(self, dfs: Iterator[pd.DataFrame]) -> collections.abc.Iterator[pandas.core.frame.DataFrame]:

Inherited from:

FileSystemIterableSourceInferrable.apply_ignore_cols_iter

Apply ignored columns to dataframes from iterator.

apply_modifiers

def apply_modifiers(self, df: pd.DataFrame) -> pandas.core.frame.DataFrame:

Inherited from:

FileSystemIterableSourceInferrable.apply_modifiers

Apply column modifiers to the dataframe.

If no modifiers are specified, returns the dataframe unchanged.

clear_file_names_cache

def clear_file_names_cache(self) -> None:

Inherited from:

FileSystemIterableSourceInferrable.clear_file_names_cache

Clears the list of selected file names.

This allows the datasource to pick up any new files that have been added to the directory since the last time it was cached.

file_names_iter

def file_names_iter(self, as_strs: bool = False) -> Union[collections.abc.Iterator[pathlib.Path], collections.abc.Iterator[str]]:

Inherited from:

FileSystemIterableSourceInferrable.file_names_iter

Iterate over files in a directory, yielding those that match the criteria.

Arguments

  • as_strs: By default the files yielded will be yielded as Path objects. If this is True, yield them as strings instead.

get_data

def get_data(self, data_keys: SingleOrMulti[str], *, use_cache: bool = True, **kwargs: Any) -> Optional[pandas.core.frame.DataFrame]:

Inherited from:

FileSystemIterableSourceInferrable.get_data

Get data corresponding to the provided data key(s).

Can be used to return data for a single data key or for multiple at once. If used for multiple, the order of the output dataframe must match the order of the keys provided.

Arguments

  • data_keys: Key(s) for which to get the data of. These may be things such as file names, UUIDs, etc.
  • use_cache: Whether the cache should be used to retrieve data for these keys. Note that cached data may have some elements, particularly image-related fields such as image data or file paths, replaced with placeholder values when stored in the cache. If data_cache is set on the instance, data will be set in the cache, regardless of this argument.
  • **kwargs: Additional keyword arguments.

Returns A dataframe containing the data, ordered to match the order of keys in data_keys, or None if no data for those keys was available.
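
A hedged usage sketch, assuming `source` is the datasource constructed earlier and the data keys are file names:

```python
# Fetch rows for two files; the returned dataframe preserves the key order.
df = source.get_data(["scan_001.sdb", "scan_002.sdb"], use_cache=True)
if df is None:
    print("No data available for those keys")
```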

get_project_db_sqlite_columns

def get_project_db_sqlite_columns(self) -> list[str]:

Inherited from:

FileSystemIterableSourceInferrable.get_project_db_sqlite_columns

Returns the required columns to identify a data point.

get_project_db_sqlite_create_table_query

def get_project_db_sqlite_create_table_query(self) -> str:

Inherited from:

FileSystemIterableSourceInferrable.get_project_db_sqlite_create_table_query

Returns the required columns and types to identify a data point.

The file name is used as the primary key and the last modified date is used to determine if the file has been updated since the last time it was processed. If there is a conflict on the file name, the row is replaced with the new data to ensure that the last modified date is always up to date.
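
The description above implies a table shaped roughly like the following sketch (the table and column names are assumptions; the authoritative schema is whatever this method returns):

```python
import sqlite3

# Illustrative only: file name as primary key, with conflicts replacing the row
# so the last modified date stays current. Names here are assumptions.
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE IF NOT EXISTS datapoints (
        file_name TEXT PRIMARY KEY ON CONFLICT REPLACE,
        last_modified TEXT
    )
    """
)
```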

partition

def partition(self, iterable: Iterable[_I], partition_size: int = 1) -> collections.abc.Iterable[collections.abc.Sequence[_I]]:

Inherited from:

FileSystemIterableSourceInferrable.partition

Takes an iterable and yields partitions of size partition_size.

The final partition may be less than size partition_size due to the variable length of the iterable.
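
A minimal standard-library sketch of the described behaviour (this is not the library's own implementation):

```python
from itertools import islice
from typing import Iterable, Iterator, Sequence, TypeVar

_I = TypeVar("_I")

def partition_sketch(
    iterable: Iterable[_I], partition_size: int = 1
) -> Iterator[Sequence[_I]]:
    """Yield tuples of up to partition_size items; the last may be shorter."""
    it = iter(iterable)
    while chunk := tuple(islice(it, partition_size)):
        yield chunk
```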

remove_hook

def remove_hook(self, hook: DataSourceHook) -> None:

Inherited from:

FileSystemIterableSourceInferrable.remove_hook

Remove a hook from the datasource.

use_file_multiprocessing

def use_file_multiprocessing(self, file_names: Sequence[str]) -> bool:

Inherited from:

FileSystemIterableSourceInferrable.use_file_multiprocessing

Check if file multiprocessing should be used.

Returns True if file multiprocessing has been enabled by the environment variable and the number of workers would be greater than 1, otherwise False. There is no need to use file multiprocessing if we are just going to use one worker - it would be slower than just loading the data in the main process.

Returns True if file multiprocessing should be used, otherwise False.

yield_data

def yield_data(
    self,
    data_keys: Optional[SingleOrMulti[str]] = None,
    *,
    use_cache: bool = True,
    partition_size: Optional[int] = None,
    **kwargs: Any,
) -> collections.abc.Iterator[pandas.core.frame.DataFrame]:

Inherited from:

FileSystemIterableSourceInferrable.yield_data

Yields data in batches from this source.

If data_keys is specified, only yield from that subset of the data. Otherwise, iterate through the whole datasource.

Arguments

  • data_keys: An optional list of data keys to use for yielding data. Otherwise, all data in the datasource will be considered. data_keys is always provided when this method is called from the Dataset as part of a task.
  • use_cache: Whether the cache should be used to retrieve data for these data points. Note that cached data may have some elements, particularly image-related fields such as image data or file paths, replaced with placeholder values when stored in the cache. If data_cache is set on the instance, data will be set in the cache, regardless of this argument.
  • partition_size: The number of data elements to load/yield in each iteration. If not provided, defaults to the partition size configured in the datasource.
  • **kwargs: Additional keyword arguments.
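
Putting it together, a hedged sketch of streaming the datasource in batches (assuming `source` from the construction example above):

```python
# Stream the whole datasource, eight files' worth of rows per batch.
for batch_df in source.yield_data(partition_size=8, use_cache=True):
    print(f"Batch with {len(batch_df)} rows")
```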