intermine_source

Module containing InterMineSource class.

InterMineSource class handles loading data stored in InterMine Templates.

InterMine is an open source biological data warehouse developed by the University of Cambridge http://intermine.org/ . Please see InterMine's tutorials for a detailed overview of the python API: https://github.com/intermine/intermine-ws-python-docs.

Classes

InterMineSource

class InterMineSource(    service_url: str,    template_name: str,    token: Optional[str] = None,    is_reconnection: bool = False,    data_splitter: Optional[DatasetSplitter] = None,    seed: Optional[int] = None,    ignore_cols: Optional[Union[str, Sequence[str]]] = None,    iterable: bool = True,    modifiers: Optional[dict[str, DataPathModifiers]] = None,    partition_size: int = 16,    required_fields: Optional[dict[str, Any]] = None,    name: Optional[str] = None,):

Data Source for loading data from InterMine Templates.

Arguments

****kwargs**: Additional keyword arguments passed to the BaseSource.
data_splitter: Deprecated argument, will be removed in a future release. Defaults to None. Not used.
ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
modifiers: Dictionary used for modifying paths/ extensions in the dataframe. Defaults to None.
name: The name for the datasource. Optional, defaults to None.
partition_size: The size of each partition when iterating over the data in a batched fashion.
seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.
service_url: Required. The URL of the InterMine service. An example service URL is "https://www.humanmine.org/humanmine/service". Omitting the "/service" suffix may also work depending on the server version.
template_name: Required. The name of the InterMine template to load.
token: Optional. The user token for accessing the InterMine service. Bitfount does not support username/password authentication (is supported by all webservices). A user token must be provided if authentication is required. This is supported on webservices version 6+.

Attributes

seed: Random number seed. Used for setting random seed for all libraries.

Raises

DataSourceError: If the connection to the InterMine service fails.
ValueError: If no value is provided for template_name. Or if the template is not found in the service or if duplicate template names are found in the service.

Ancestors

Variables

accessibility_details : Optional[AccessibilityDetails] - Check InterMine service accessibility.

Returns: None if accessible, otherwise a dict with error_code and message.

is_accessible : bool - Check if datasource is currently accessible.

Returns True if accessibility_details is None (no errors). This is a convenience property that wraps accessibility_details.

is_file_iterable : bool - Returns True if the datasource iterates over files.

Subclasses that iterate over files (e.g., FileSystemIterableSource) should override this to return True.

is_initialised : bool - Checks if BaseSource was initialised.

is_task_running : bool - Returns True if a task is running.

supports_project_db : bool - Whether the datasource supports the project database.

Each datasource needs to implement its own methods to define how what its project database table should look like. If the datasource does not implement the methods to get the table creation query and columns, it does not support the projectdatabase.

Methods

add_hook

def add_hook(self, hook: DataSourceHook) ‑> None:

Inherited from:

BaseSource.add_hook :

Add a hook to the datasource.

apply_ignore_cols

def apply_ignore_cols(self, df: pd.DataFrame) ‑> pandas.core.frame.DataFrame:

Inherited from:

BaseSource.apply_ignore_cols :

Apply ignored columns to dataframe, dropping columns as needed.

Returns A copy of the dataframe with ignored columns removed, or the original dataframe if this datasource does not specify any ignore columns.

apply_ignore_cols_iter

def apply_ignore_cols_iter(    self, dfs: Iterator[pd.DataFrame],) ‑> collections.abc.Iterator[pandas.core.frame.DataFrame]:

Inherited from:

BaseSource.apply_ignore_cols_iter :

Apply ignored columns to dataframes from iterator.

apply_modifiers

def apply_modifiers(self, df: pd.DataFrame) ‑> pandas.core.frame.DataFrame:

Inherited from:

BaseSource.apply_modifiers :

Apply column modifiers to the dataframe.

If no modifiers are specified, returns the dataframe unchanged.

get_data

def get_data(    self,    data_keys: SingleOrMulti[str] | SingleOrMulti[int],    *,    use_cache: bool = True,    **kwargs: Any,) ‑> Optional[pandas.core.frame.DataFrame]:

Get data using the set InterMine Template.

Arguments

data_keys: String or integer based indices for which rows should be returned.
use_cache: Whether to use cached data if available.
****kwargs**: Additional keyword arguments.

Returns DataFrame containing the selected rows or None if no data is returned.

get_datasource_metrics

def get_datasource_metrics(    self, use_skip_codes: bool = False, data: Optional[pd.DataFrame] = None,) ‑> DatasourceSummaryStats:

Inherited from:

BaseSource.get_datasource_metrics :

Get metadata about this datasource.

This can be used to store information about the datasource that may be useful for debugging or tracking purposes. The metadata will be stored in the project database.

Arguments

use_skip_codes: Whether to use the skip reason codes as the keys in the skip_reasons dictionary, rather than the existing reason descriptions.
data: The data to use for getting the metrics.

Returns A dictionary containing metadata about this datasource.

get_project_db_sqlite_columns

def get_project_db_sqlite_columns(self) ‑> list[str]:

Inherited from:

BaseSource.get_project_db_sqlite_columns :

Implement this method to get the required columns.

This is used by the "run on new data only" feature. This is used to add data to the task table in the project database.

get_project_db_sqlite_create_table_query

def get_project_db_sqlite_create_table_query(self) ‑> str:

Inherited from:

BaseSource.get_project_db_sqlite_create_table_query :

Implement this method to return the required columns and types.

This is used by the "run on new data only" feature. This should be in the format that can be used after a "CREATE TABLE" statement and is used to create the task table in the project database.

get_schema

def get_schema(self) ‑> dict[str, typing.Any]:

Inherited from:

BaseSource.get_schema :

Get the pre-defined schema for this datasource.

This method should be overridden by datasources that have pre-defined schemas (i.e., those with has_predefined_schema = True).

Returns The schema as a dictionary.

Raises

NotImplementedError: If the datasource doesn't have a pre-defined schema.

merge_and_validate_filters

def merge_and_validate_filters(    self, datasource_level_filters: FilterConfig, task_level_filters: list[TaskFilter],) ‑> MergedFilterConfig:

Inherited from:

BaseSource.merge_and_validate_filters :

Merge and validate the filters from the datasource and the task.

Returns a MergedFilterConfig with resolved filter values from both datasource and task-level filters, using intersection logic (most restrictive wins).

partition

def partition(    self, iterable: Iterable[_I], partition_size: int = 1,) ‑> collections.abc.Iterable[collections.abc.Sequence[~_I]]:

Inherited from:

BaseSource.partition :

Takes an iterable and yields partitions of size partition_size.

The final partition may be less than size partition_size due to the variable length of the iterable.

remove_hook

def remove_hook(self, hook: DataSourceHook) ‑> None:

Inherited from:

BaseSource.remove_hook :

Remove a hook from the datasource.

yield_data

def yield_data(    self,    data_keys: Optional[SingleOrMulti[str] | SingleOrMulti[int]] = None,    *,    use_cache: bool = True,    partition_size: Optional[int] = 10000,    **kwargs: Any,) ‑> Iterator[pandas.core.frame.DataFrame]:

Generator for providing data chunkwise from the InterMine Template query.

Arguments

data_keys: String or integer based indices for which rows should be returned.
use_cache: Whether to use cached data if available.
partition_size: Size of each partition to yield.
****kwargs**: Additional keyword arguments.

Classes​

InterMineSource​

Ancestors​

Variables​

Methods​

add_hook​

apply_ignore_cols​

apply_ignore_cols_iter​

apply_modifiers​

get_data​

get_datasource_metrics​

get_project_db_sqlite_columns​

get_project_db_sqlite_create_table_query​

get_schema​

merge_and_validate_filters​

partition​

remove_hook​

yield_data​

Classes

InterMineSource

Ancestors

Variables

Methods

add_hook

apply_ignore_cols

apply_ignore_cols_iter

apply_modifiers

get_data

get_datasource_metrics

get_project_db_sqlite_columns

get_project_db_sqlite_create_table_query

get_schema

merge_and_validate_filters

partition

remove_hook

yield_data