dicom_source

Module containing DICOMSource class.

DICOMSource class handles loading of DICOM data.

Classes

DICOMSource

class DICOMSource(    path: Union[os.PathLike[str], str],    images_only: bool = True,    data_cache: Optional[DataPersister] = None,    infer_class_labels_from_filepaths: bool = False,    output_path: Optional[Union[os.PathLike[str], str]] = None,    iterable: bool = True,    fast_load: bool = True,    cache_images: bool = False,    filter: Optional[FileSystemFilter] = None,    data_splitter: Optional[DatasetSplitter] = None,    seed: Optional[int] = None,    ignore_cols: Optional[Union[str, Sequence[str]]] = None,    modifiers: Optional[dict[str, DataPathModifiers]] = None,    partition_size: int = 16,    required_fields: Optional[dict[str, Any]] = None,    name: Optional[str] = None,):

Data source for loading DICOM files.

Arguments

**kwargs: Keyword arguments passed to the parent base classes.
cache_images: Whether to cache images in the file system. Defaults to False. This is ignored if fast_load is True.
data_cache: A DataPersister instance to use for data caching.
data_splitter: Deprecated argument, will be removed in a future release. Defaults to None. Not used.
fast_load: Whether the data will be loaded in fast mode. This is used to determine whether the data will be iterated over during set up for schema generation and splitting (where necessary). Only relevant if iterable is True, otherwise it is ignored. Defaults to True.
file_extension: The file extension of the DICOM files. Defaults to '.dcm'.
ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
images_only: If True, only dicom files containing image data will be loaded. If the file does not contain any image data, or it does but there was an error loading or saving the image(s), the whole file will be skipped. Defaults to True.
infer_class_labels_from_filepaths: Whether class labels should be added to the data based on the file path of each file. By default the label is taken from the first directory level within self.path, but it can be taken from one level deeper if a data splitter is provided with infer_data_split_labels set to True. Defaults to False.
iterable: Whether the data source is iterable. This is used to determine whether the data source can be used in a streaming context during a task. Defaults to True.
modifiers: Dictionary used for modifying paths/ extensions in the dataframe. Defaults to None.
name: The name for the datasource. Optional, defaults to None.
output_path: The path in which to save intermediate output files. Defaults to 'preprocessed/'.
partition_size: The size of each partition when iterating over the data in a batched fashion. Defaults to 16.
path: The path to the directory containing the DICOM files.
seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.

Attributes

seed: Random number seed. Used for setting random seed for all libraries.

Raises

ValueError: If iterable is False or fast_load is False or cache_images is True.

Ancestors

Subclasses

DICOMOphthalmologySource

Variables

accessibility_details : Optional[AccessibilityDetails] - Detailed accessibility status. None if accessible.

Subclasses should override this to perform a lightweight connectivity check.

Returns: None if accessible, or dict with 'error_code' and 'message' if not.

file_names : list[str] - Returns a list of file names in the specified directory.

.. deprecated:: The file_names property is deprecated and will be removed in a future release. Use file_names_iter(as_strs=True) for memory-efficient iteration, or list(file_names_iter(as_strs=True)) if you need a list.

This property accounts for files skipped at runtime by filtering them out of the list of cached file names. Files may get skipped at runtime due to errors or because they don't contain any image data and images_only is True. This allows us to skip these files again more quickly if they are still present in the directory.

is_accessible : bool - Check if datasource is currently accessible.

Returns True if accessibility_details is None (no errors). This is a convenience property that wraps accessibility_details.

is_file_iterable : bool - Returns True since this source iterates over files.

is_initialised : bool - Checks if BaseSource was initialised.

is_task_running : bool - Returns True if a task is running.

path : pathlib.Path - Resolved absolute path to data.

Provides a consistent version of the path provided by the user which should work throughout regardless of operating system and of directory structure.

selected_file_names : list[str] - Returns a list of selected file names as strings.

Selected file names are affected by the selected_file_names_override and new_file_names_only attributes.

WARNING: This property loads all file names into memory. For large datasets, consider using selected_file_names_iter() instead.

selected_file_names_differ : bool - Returns True if selected_file_names will differ from default.

In particular, returns True if and only if a selected file names override is in place and/or filtering for new file names only is active.

supports_project_db : bool - Whether the datasource supports the project database.

Each datasource needs to implement its own methods to define what its project database table should look like. If the datasource does not implement the methods to get the table creation query and columns, it does not support the project database.

task_skipped_file_names : set[str] - Return set of task-skipped filenames for set operations.

Static methods

get_num_workers

def get_num_workers(file_names: Sequence[str]) ‑> int:

Inherited from:

FileSystemIterableSourceInferrable.get_num_workers :

Gets the number of workers to use for multiprocessing.

Ensures that the number of workers is at least 1 and at most equal to MAX_NUM_MULTIPROCESSING_WORKERS. If the number of files is less than MAX_NUM_MULTIPROCESSING_WORKERS, the number of files is used as the number of workers, unless the number of machine cores is also less than MAX_NUM_MULTIPROCESSING_WORKERS, in which case the lower of the two is used.

Arguments

file_names: The list of file names to load.

Returns The number of workers to use for multiprocessing.
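The rule above can be sketched as a simple clamp. This is a minimal illustration, not the library's implementation: the function name sketch_num_workers, the num_cores and max_workers parameters, and the cap value of 5 are all assumptions made for the example, and it reads the rule as "the smallest of the cap, the file count and the core count, but never below 1".

```python
import os
from typing import Optional, Sequence

# Hypothetical cap for illustration only; the real value of
# MAX_NUM_MULTIPROCESSING_WORKERS is not stated in this page.
ASSUMED_MAX_WORKERS = 5


def sketch_num_workers(
    file_names: Sequence[str],
    num_cores: Optional[int] = None,
    max_workers: int = ASSUMED_MAX_WORKERS,
) -> int:
    """Clamp the worker count to [1, min(cap, number of files, cores)]."""
    if num_cores is None:
        # os.cpu_count() may return None, so fall back to a single core.
        num_cores = os.cpu_count() or 1
    return max(1, min(max_workers, len(file_names), num_cores))


few_files = sketch_num_workers(["a.dcm", "b.dcm"], num_cores=8)
many_files = sketch_num_workers([f"{i}.dcm" for i in range(10)], num_cores=8)
few_cores = sketch_num_workers([f"{i}.dcm" for i in range(10)], num_cores=2)
no_files = sketch_num_workers([], num_cores=8)
```

With these inputs the file count limits the first call, the cap limits the second, the core count limits the third, and the floor of 1 applies to the empty list.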

Methods

add_hook

def add_hook(self, hook: DataSourceHook) ‑> None:

Inherited from:

FileSystemIterableSourceInferrable.add_hook :

Add a hook to the datasource.

add_strategy_filter_flag_columns

def add_strategy_filter_flag_columns(    self, df: pd.DataFrame,) ‑> pandas.core.frame.DataFrame:

Inherited from:

FileSystemIterableSourceInferrable.add_strategy_filter_flag_columns :

Add task-scoped strategy filter flag columns to a report DataFrame.

Decodes the compact per-file bitmask into one boolean column per flag-only strategy. Only files present in df are decoded, keeping memory proportional to the report size rather than the total number of evaluated files.

Arguments

df: DataFrame containing at least the ORIGINAL_FILENAME_METADATA_COLUMN column.

Returns A new DataFrame with one boolean column per flag-only strategy. If no flag-only strategies are registered, returns df unchanged (same object, no copy). If df does not contain the ORIGINAL_FILENAME_METADATA_COLUMN (e.g. patient-level reports), flag injection is skipped gracefully and df is returned unchanged.
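To make the bitmask decoding concrete, here is a small sketch using plain dictionaries instead of a DataFrame. It assumes, as described for set_strategy_filter_flags, that bit position i in a file's bitmask corresponds to entry i of the meta list. The function name decode_flags, the fallback to the strategy name when no flag column name is given, and the sample file and strategy names are assumptions for illustration.

```python
from typing import Optional


def decode_flags(
    flags: dict[str, int],
    meta: list[tuple[int, str, Optional[str]]],
    filenames: list[str],
) -> dict[str, dict[str, bool]]:
    """Decode per-file bitmasks into one boolean per flag-only strategy.

    Only the requested filenames are decoded, so the work scales with the
    size of the report rather than with the number of evaluated files.
    """
    decoded: dict[str, dict[str, bool]] = {}
    for name in filenames:
        mask = flags.get(name, 0)  # files with no entry match nothing
        decoded[name] = {
            # Assumption: use the strategy name if no column name was given.
            (column or strategy): bool((mask >> bit) & 1)
            for bit, (_original_index, strategy, column) in enumerate(meta)
        }
    return decoded


flags = {"a.dcm": 0b01, "b.dcm": 0b10, "c.dcm": 0b11}
meta = [(0, "strategy_one", "flag_one"), (3, "strategy_two", None)]
decoded = decode_flags(flags, meta, ["a.dcm", "c.dcm", "z.dcm"])
```

Note that "b.dcm" is never decoded because it is not among the requested file names.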

apply_ignore_cols

def apply_ignore_cols(self, df: pd.DataFrame) ‑> pandas.core.frame.DataFrame:

Inherited from:

FileSystemIterableSourceInferrable.apply_ignore_cols :

Apply ignored columns to dataframe, dropping columns as needed.

Returns A copy of the dataframe with ignored columns removed, or the original dataframe if this datasource does not specify any ignore columns.

apply_ignore_cols_iter

def apply_ignore_cols_iter(    self, dfs: Iterator[pd.DataFrame],) ‑> collections.abc.Iterator[pandas.core.frame.DataFrame]:

Inherited from:

FileSystemIterableSourceInferrable.apply_ignore_cols_iter :

Apply ignored columns to dataframes from iterator.

apply_merged_filter_config

def apply_merged_filter_config(self, merged_filter_config: MergedFilterConfig) ‑> None:

Inherited from:

FileSystemIterableSourceInferrable.apply_merged_filter_config :

Apply the filter configuration to the datasource.

apply_modifiers

def apply_modifiers(self, df: pd.DataFrame) ‑> pandas.core.frame.DataFrame:

Inherited from:

FileSystemIterableSourceInferrable.apply_modifiers :

Apply column modifiers to the dataframe.

If no modifiers are specified, returns the dataframe unchanged.

clear_dataset_cache

def clear_dataset_cache(self) ‑> dict[str, typing.Any]:

Inherited from:

FileSystemIterableSourceInferrable.clear_dataset_cache :

Clear all dataset cache for this data source.

This clears both:

The file names cache (Python cached_property)
The dataset cache file (deletes the SQLite database file completely)

Returns Dictionary with cache clearing results.

clear_file_names_cache

def clear_file_names_cache(self) ‑> None:

Inherited from:

FileSystemIterableSourceInferrable.clear_file_names_cache :

Clears the list of selected file names.

This allows the datasource to pick up any new files that have been added to the directory since the last time it was cached.

clear_task_specific_configs

def clear_task_specific_configs(self) ‑> None:

Inherited from:

FileSystemIterableSourceInferrable.clear_task_specific_configs :

Clear task-scoped state at task boundaries.

Resets skipped files, merged filter config, and strategy filter flags so that subsequent tasks start with a clean slate.

create_empty_pixel_frames

def create_empty_pixel_frames(    self, data: Dict[str, Any], number_of_frames: int,) ‑> Dict[str, Any]:

Creates empty pixel frames with '<SKIPPED>'.

extract_file_metadata_for_telemetry

def extract_file_metadata_for_telemetry(    self, filename: str | os.PathLike[str], data: Optional[dict[str, Any]],) ‑> dict[str, typing.Any]:

Inherited from:

FileSystemIterableSourceInferrable.extract_file_metadata_for_telemetry :

Extracts file metadata for telemetry.

Only extracts metadata that is common to all file types. If data is None, returns a dictionary with generic file metadata.

Arguments

filename: The filename of the file to extract metadata from.
data: The data to extract metadata from.

Returns A dictionary of file metadata.

file_names_iter

def file_names_iter(    self, as_strs: bool = False,) ‑> Union[collections.abc.Iterator[pathlib.Path], collections.abc.Iterator[str]]:

Inherited from:

FileSystemIterableSourceInferrable.file_names_iter :

Iterate over files in a directory, yielding those that match the criteria.

Arguments

as_strs: By default the files yielded will be yielded as Path objects. If this is True, yield them as strings instead.

get_all_cached_file_paths

def get_all_cached_file_paths(self) ‑> list[str]:

Inherited from:

FileSystemIterableSourceInferrable.get_all_cached_file_paths :

Get all file paths that are currently stored in the cache.

Returns A list of file paths that have cache entries, or an empty list if there is no cache or the cache hasn't been initialized.

get_data

def get_data(    self,    data_keys: SingleOrMulti[str] | SingleOrMulti[int],    *,    use_cache: bool = True,    **kwargs: Any,) ‑> Optional[pandas.core.frame.DataFrame]:

Inherited from:

FileSystemIterableSourceInferrable.get_data :

Get data corresponding to the provided data key(s).

Can be used to return data for a single data key or for multiple at once. If used for multiple, the order of the output dataframe must match the order of the keys provided.

Arguments

data_keys: Key(s) for which to get the data. These may be things such as file names, UUIDs, etc. Can also be a list of integers if the datasource has an integer index.
use_cache: Whether the cache should be used to retrieve data for these keys. Note that cached data may have some elements, particularly image-related fields such as image data or file paths, replaced with placeholder values when stored in the cache. If data_cache is set on the instance, data will be set in the cache, regardless of this argument.
**kwargs: Additional keyword arguments.

Returns A dataframe containing the data, ordered to match the order of keys in data_keys, or None if no data for those keys was available.

get_datasource_metrics

def get_datasource_metrics(    self, use_skip_codes: bool = False, data: Optional[pd.DataFrame] = None,) ‑> DatasourceSummaryStats:

Inherited from:

FileSystemIterableSourceInferrable.get_datasource_metrics :

Get metadata about this datasource.

This can be used to store information about the datasource that may be useful for debugging or tracking purposes. The metadata will be stored in the project database.

Arguments

use_skip_codes: Whether to use the skip reason codes as the keys in the skip_reasons dictionary, rather than the existing reason descriptions.
data: The data to use for getting the metrics.

Returns A dictionary containing metadata about this datasource.

get_filter_config

def get_filter_config(self) ‑> FilterConfig:

Inherited from:

FileSystemIterableSourceInferrable.get_filter_config :

Get the filter configuration for the datasource.

get_project_db_sqlite_columns

def get_project_db_sqlite_columns(self) ‑> list[str]:

Inherited from:

FileSystemIterableSourceInferrable.get_project_db_sqlite_columns :

Returns the required columns to identify a data point.

The first value must be the filename column, and the second value must be the last modified datetime. These two are used to build the processed_file_cache for the worker execution.

get_project_db_sqlite_create_table_query

def get_project_db_sqlite_create_table_query(self) ‑> str:

Inherited from:

FileSystemIterableSourceInferrable.get_project_db_sqlite_create_table_query :

Returns the required columns and types to identify a data point.

The file name is used as the primary key and the last modified date is used to determine if the file has been updated since the last time it was processed. If there is a conflict on the file name, the row is replaced with the new data to ensure that the last modified date is always up to date.
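The replace-on-conflict behaviour described above can be reproduced with the standard sqlite3 module. This is only a sketch: the table name processed_files and the column names are hypothetical, and the actual query returned by this method may differ.

```python
import sqlite3

# Hypothetical table and column names, chosen for illustration only.
CREATE_TABLE_SKETCH = """
CREATE TABLE IF NOT EXISTS processed_files (
    file_name TEXT PRIMARY KEY ON CONFLICT REPLACE,
    last_modified TEXT NOT NULL
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(CREATE_TABLE_SKETCH)

# First sighting of the file.
conn.execute(
    "INSERT INTO processed_files (file_name, last_modified) VALUES (?, ?)",
    ("scan_001.dcm", "2024-01-01T00:00:00"),
)
# Same file name again with a newer timestamp: the conflict clause on the
# primary key replaces the existing row instead of raising an error.
conn.execute(
    "INSERT INTO processed_files (file_name, last_modified) VALUES (?, ?)",
    ("scan_001.dcm", "2024-02-01T00:00:00"),
)

rows = conn.execute(
    "SELECT file_name, last_modified FROM processed_files"
).fetchall()
conn.close()
```

After both inserts the table holds a single row for the file, carrying the newer last modified value.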

get_schema

def get_schema(self) ‑> dict[str, typing.Any]:

Inherited from:

FileSystemIterableSourceInferrable.get_schema :

Get the pre-defined schema for this datasource.

This method should be overridden by datasources that have pre-defined schemas (i.e., those with has_predefined_schema = True).

Returns The schema as a dictionary.

Raises

NotImplementedError: If the datasource doesn't have a pre-defined schema.

get_task_skip_reason_summary

def get_task_skip_reason_summary(self) ‑> dict[str, int]:

Inherited from:

FileSystemIterableSourceInferrable.get_task_skip_reason_summary :

Get aggregated skip reasons for the current task.

Combines both task-only skips (files that failed task filters) and datasource skips that occurred during the current task execution. This provides a complete picture of all files skipped during a task run.

Returns Dict mapping reason codes (as strings) to file counts.
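As a rough illustration of the aggregation, the sketch below counts files per reason code across two in-memory mappings. The mapping shape (file name to integer reason code), the function name summarise_skips, and the code values are assumptions; how the library treats a file that appears in both sets is not stated here, so this sketch simply counts it once per mapping.

```python
from collections import Counter


def summarise_skips(
    task_only_skips: dict[str, int],
    datasource_skips: dict[str, int],
) -> dict[str, int]:
    """Count skipped files per reason code, with codes rendered as strings."""
    counts = Counter(str(code) for code in task_only_skips.values())
    counts.update(str(code) for code in datasource_skips.values())
    return dict(counts)


# Hypothetical reason codes, for illustration only.
summary = summarise_skips(
    task_only_skips={"a.dcm": 101, "b.dcm": 101},
    datasource_skips={"c.dcm": 7},
)
```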

get_uncached_file_names

def get_uncached_file_names(self) ‑> list[str]:

Inherited from:

FileSystemIterableSourceInferrable.get_uncached_file_names :

Return potentially uncached files via fast raw filesystem scanning.

This fast path skips datasource filters and computes file-path set difference against cache/skipped tables. It may include files that are later filtered out during normal datasource processing.
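The fast path described above amounts to a raw directory walk followed by a set difference. The sketch below shows that idea with os.scandir. It is not the library's scantree helper, and the function names and the representation of the cache and skipped tables as plain sets of paths are assumptions.

```python
import os
import tempfile
from typing import Iterator


def scan_files(root: str) -> Iterator[str]:
    """Recursively yield file paths under root without applying any filters."""
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                yield from scan_files(entry.path)
            elif entry.is_file():
                yield entry.path


def potentially_uncached(
    root: str, cached: set[str], skipped: set[str]
) -> list[str]:
    """Paths on disk that appear in neither the cached nor the skipped set."""
    return sorted(set(scan_files(root)) - cached - skipped)


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "patient_a"))
    paths = [
        os.path.join(root, "patient_a", name)
        for name in ("1.dcm", "2.dcm", "3.dcm")
    ]
    for path in paths:
        open(path, "w").close()
    uncached = potentially_uncached(
        root, cached={paths[0]}, skipped={paths[1]}
    )
```

Because no datasource filters run, a result like this may still contain files that normal processing would later exclude.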

has_uncached_files

def has_uncached_files(self) ‑> bool:

Inherited from:

FileSystemIterableSourceInferrable.has_uncached_files :

Returns True if there are any files in the datasource not yet cached.

Uses a fast path that skips the full filter pipeline: walks the filesystem with scantree and checks each path against cache + skipped files metadata.

merge_and_validate_filters

def merge_and_validate_filters(    self, datasource_level_filters: FilterConfig, task_level_filters: list[TaskFilter],) ‑> MergedFilterConfig:

Inherited from:

FileSystemIterableSourceInferrable.merge_and_validate_filters :

Merge and validate the filters from the datasource and the task.

Returns a MergedFilterConfig with resolved filter values from both datasource and task-level filters, using intersection logic (most restrictive wins).

partition

def partition(    self, iterable: Iterable[_I], partition_size: int = 1,) ‑> collections.abc.Iterable[collections.abc.Sequence[~_I]]:

Inherited from:

FileSystemIterableSourceInferrable.partition :

Partition the iterable into chunks of the given size.
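A common way to chunk an arbitrary iterable lazily is itertools.islice, sketched below. Whether the library's implementation yields a shorter final chunk, or uses lists for each chunk, is an assumption here; partition_sketch is a stand-in name.

```python
from itertools import islice
from typing import Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def partition_sketch(
    iterable: Iterable[T], partition_size: int = 1
) -> Iterator[Sequence[T]]:
    """Yield consecutive chunks of at most partition_size items."""
    if partition_size < 1:
        raise ValueError("partition_size must be at least 1")
    iterator = iter(iterable)
    # islice consumes up to partition_size items per pass; an empty list
    # means the underlying iterator is exhausted.
    while chunk := list(islice(iterator, partition_size)):
        yield chunk


chunks = list(partition_sketch(range(5), partition_size=2))
empty = list(partition_sketch([], partition_size=3))
```

Only one chunk is held in memory at a time, which is the property that matters when iterating over large file listings.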

process_sequence_field

def process_sequence_field(    self, elem: _DICOMSequenceField, filename: str,) ‑> Optional[dict[str, typing.Any]]:

Process a sequence field.

This method is called when a sequence field is encountered. It can be overridden by plugins to process specific sequence data.

Tip: Override this method in your plugin if you want to process specific sequence data.

Arguments

elem: The DICOM data element which has its 'VR' set to 'SQ'.
filename: The filename of the DICOM file.

Returns A dictionary containing the processed sequence data or None.
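The override pattern might look like the following sketch. To keep it runnable without the library, it uses stand-ins: StandInSource plays the role of DICOMSource, and StandInSequenceElement mimics only the keyword, VR and value attributes of a sequence element. Both are hypothetical, as are the flattened key format and the assumption that the base method returns None.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StandInSequenceElement:
    """Hypothetical stand-in for a DICOM element whose VR is 'SQ'."""

    keyword: str
    VR: str
    value: list[dict[str, Any]]  # each item stands in for a nested dataset


class StandInSource:
    """Hypothetical stand-in for the base class; assumed to return None."""

    def process_sequence_field(
        self, elem: StandInSequenceElement, filename: str
    ) -> Optional[dict[str, Any]]:
        return None


class MyPluginSource(StandInSource):
    """Plugin-style subclass that handles one specific sequence."""

    def process_sequence_field(
        self, elem: StandInSequenceElement, filename: str
    ) -> Optional[dict[str, Any]]:
        if elem.VR != "SQ" or elem.keyword != "ReferencedImageSequence":
            return None  # leave every other sequence unprocessed
        # Flatten each item into keys such as "<keyword>_<index>_<field>".
        return {
            f"{elem.keyword}_{index}_{key}": val
            for index, item in enumerate(elem.value)
            for key, val in item.items()
        }


source = MyPluginSource()
handled = source.process_sequence_field(
    StandInSequenceElement(
        "ReferencedImageSequence", "SQ", [{"ReferencedSOPInstanceUID": "1.2.3"}]
    ),
    "scan_001.dcm",
)
ignored = source.process_sequence_field(
    StandInSequenceElement("OtherSequence", "SQ", []), "scan_001.dcm"
)
```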

remove_hook

def remove_hook(self, hook: DataSourceHook) ‑> None:

Inherited from:

FileSystemIterableSourceInferrable.remove_hook :

Remove a hook from the datasource.

selected_file_names_iter

def selected_file_names_iter(self) ‑> collections.abc.Iterator[str]:

Inherited from:

FileSystemIterableSourceInferrable.selected_file_names_iter :

Returns an iterator over selected file names.

Selected file names are affected by the selected_file_names_override and new_file_names_only attributes.

Returns Iterator over selected file names.

set_strategy_filter_flags

def set_strategy_filter_flags(    self, flags: dict[str, int], meta: list[tuple[int, str, Optional[str]]],) ‑> None:

Inherited from:

FileSystemIterableSourceInferrable.set_strategy_filter_flags :

Store strategy filter flag-only match bitmasks for the current task.

Arguments

flags: Mapping of filename to bitmask of matched flag-only strategies.
meta: Ordered list of (original_index, strategy_name, flag_column_name). Bit position in the bitmask corresponds to list index.

skip_file

def skip_file(    self, filename: str, reason: FileSkipReason, data: Optional[dict[str, Any]] = None,) ‑> None:

Inherited from:

FileSystemIterableSourceInferrable.skip_file :

Skip a file by updating cache and skipped_files set.

The first reason is always the one recorded in the data cache.

Arguments

filename: Path to the file being skipped
reason: Reason for skipping the file
data: Optional data dictionary containing file metadata for telemetry

task_skip_file

def task_skip_file(    self, filename: str, reason: FileSkipReason, data: Optional[dict[str, Any]] = None,) ‑> None:

Inherited from:

FileSystemIterableSourceInferrable.task_skip_file :

Skip a file due to task-level filter (not persisted to cache).

Unlike skip_file(), this does NOT persist to cache because the file may pass a different task's filters. The skip is tracked in-memory only and cleared between tasks.

Arguments

filename: Path to the file being skipped.
reason: Reason for skipping the file.
data: Optional metadata for telemetry.

use_file_multiprocessing

def use_file_multiprocessing(self, file_names: Sequence[str]) ‑> bool:

Inherited from:

FileSystemIterableSourceInferrable.use_file_multiprocessing :

Check if file multiprocessing should be used.

Returns True if file multiprocessing has been enabled by the environment variable and the number of workers would be greater than 1, otherwise False. There is no need to use file multiprocessing if we are just going to use one worker - it would be slower than just loading the data in the main process.

Returns True if file multiprocessing should be used, otherwise False.

yield_data

def yield_data(    self,    data_keys: Optional[SingleOrMulti[str] | SingleOrMulti[int]] = None,    *,    use_cache: bool = True,    partition_size: Optional[int] = None,    **kwargs: Any,) ‑> collections.abc.Iterator[pandas.core.frame.DataFrame]:

Inherited from:

FileSystemIterableSourceInferrable.yield_data :

Yields data in batches from this source.

If data_keys is specified, only yield from that subset of the data. Otherwise, iterate through the whole datasource.

Arguments

data_keys: An optional list of data keys to use for yielding data. Otherwise, all data in the datasource will be considered. data_keys is always provided when this method is called from the Dataset as part of a task. Can also be a list of integers if the datasource has an integer index.
use_cache: Whether the cache should be used to retrieve data for these data points. Note that cached data may have some elements, particularly image-related fields such as image data or file paths, replaced with placeholder values when stored in the cache. If data_cache is set on the instance, data will be set in the cache, regardless of this argument.
partition_size: The number of data elements to load/yield in each iteration. If not provided, defaults to the partition size configured in the datasource.
**kwargs: Additional keyword arguments.
