datastructure
Classes concerning data structures.
DataStructures provide information about the columns of a BaseSource for a specific Modelling Job.
Classes
BaseDataStructure
class BaseDataStructure():
Base DataStructure class.
Subclasses
DataStructure
class DataStructure( table: Optional[Union[str, Mapping[str, str]]] = None, query: Optional[Union[str, Mapping[str, str]]] = None, schema_types_override: Optional[Union[SchemaOverrideMapping, Mapping[str, SchemaOverrideMapping]]] = None, target: Optional[Union[str, list[str]]] = None, ignore_cols: list[str] = [], selected_cols: list[str] = [], selected_cols_prefix: Optional[str] = None, data_splitter: Optional[DatasetSplitter] = None, image_cols: Optional[list[str]] = None, image_prefix: Optional[str] = None, batch_transforms: Optional[list[dict[str, _JSONDict]]] = None, dataset_transforms: Optional[list[dict[str, _JSONDict]]] = None, auto_convert_grayscale_images: bool = True, image_prefix_batch_transforms: Optional[list[dict[str, _JSONDict]]] = None,):
Information about the columns of a BaseSource.
This component provides the desired structure of data to be used by discriminative machine learning models.
If the datastructure includes image columns, batch transformations will be applied to them.
Arguments
- table: The table in the Pod schema to be used for single-pod tasks. If executing a remote task involving multiple pods, this should be a mapping of Pod names to table names. Defaults to None.
- query: The SQL query to be applied to the data. It should be a string for single-pod tasks, or a mapping of Pod names to queries if multiple pods are involved in the task. Defaults to None.
- schema_types_override: A mapping that defines the new data types that will be returned after the SQL query is executed. For a single-pod task it is a mapping of column names to their types; for a multi-pod task it is a mapping of Pod names to the new columns and types. If a column is defined as "categorical", the mapping should include a mapping to the categories. Required if a SQL query is provided. E.g. {'Pod_id': {'categorical': [{'col1': {'value_1': 0, 'value_2': 1}}], 'continuous': ['col2']}} for multi-pod, or {'categorical': [{'col1': {'value_1': 0, 'value_2': 1}}], 'continuous': ['col2']} for single-pod. Defaults to None.
- target: The training target column or list of columns.
- ignore_cols: A list of columns to ignore when getting the data. Defaults to None.
- selected_cols: A list of columns to select when getting the data. The order of this list determines the order in which the columns are fed to the model. Defaults to None.
- selected_cols_prefix: A prefix to use for selected columns. Defaults to None.
- image_prefix: A prefix to use for image columns. Defaults to None.
- image_prefix_batch_transforms: A mapping of image prefixes to the batch transforms to apply.
- data_splitter: Approach used for splitting the data into training, test, and validation sets. Defaults to None.
- image_cols: A list of columns that will be treated as images in the data.
- batch_transforms: A list of transformations to apply to batches. Defaults to None.
- dataset_transforms: A list of transformations to apply to the whole dataset. Defaults to None.
- auto_convert_grayscale_images: Whether or not to automatically convert grayscale images to RGB. Defaults to True.
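The two shapes that schema_types_override can take, as described above, can be sketched in plain Python. The pod, column, and category names below are invented for illustration:

```python
# Illustrative shapes for schema_types_override; the pod, column, and
# category names are made up for this sketch.

# Single-pod form: semantic type -> list of column entries. Each
# categorical column maps its category values to encoded integers.
single_pod_override = {
    "categorical": [{"col1": {"value_1": 0, "value_2": 1}}],
    "continuous": ["col2"],
}

# Multi-pod form: the same mapping, nested under each Pod name.
multi_pod_override = {
    "Pod_id": single_pod_override,
}

# The categorical entry pairs a column name with its category encoding.
categories = multi_pod_override["Pod_id"]["categorical"][0]["col1"]
print(categories)  # {'value_1': 0, 'value_2': 1}
```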
Raises
- DataStructureError: If a SQL query is provided as well as either selected_cols or ignore_cols.
- DataStructureError: If both ignore_cols and selected_cols are provided.
- ValueError: If a batch transformation name is not recognised.
Ancestors
- BaseDataStructure
- bitfount.types._BaseSerializableObjectMixIn
Variables
- static auto_convert_grayscale_images : bool
- static batch_transforms : Optional[list]
- static data_splitter : Optional[DatasetSplitter]
- static dataset_transforms : Optional[list]
- static fields_dict : ClassVar[dict[str, marshmallow.fields.Field]]
- static ignore_cols : list
- static image_cols : Optional[list]
- static image_prefix : Optional[str]
- static image_prefix_batch_transforms : Optional[list]
- static nested_fields : ClassVar[dict[str, collections.abc.Mapping[str, Any]]]
- static query : Union[str, collections.abc.Mapping[str, str], ForwardRef(None)]
- static schema_types_override : Union[collections.abc.Mapping[Literal['categorical', 'continuous', 'image', 'text'], list[Union[str, collections.abc.Mapping[str, collections.abc.Mapping[str, int]]]]], collections.abc.Mapping[str, collections.abc.Mapping[Literal['categorical', 'continuous', 'image', 'text'], list[Union[str, collections.abc.Mapping[str, collections.abc.Mapping[str, int]]]]]], ForwardRef(None)]
- static selected_cols : list
- static selected_cols_prefix : Optional[str]
- static table : Union[str, collections.abc.Mapping[str, str], ForwardRef(None)]
- static target : Union[list[str], str, ForwardRef(None)]
Static methods
create_datastructure
def create_datastructure( table_config: DataStructureTableConfig, select: DataStructureSelectConfig, transform: DataStructureTransformConfig, assign: DataStructureAssignConfig, data_split: Optional[DataSplitConfig] = None, *, schema: BitfountSchema,) ‑> DataStructure:
Creates a datastructure based on the yaml config and pod schema.
Arguments
- table_config: The table in the Pod schema to be used for local data. If executing a remote task, this should be a mapping of Pod names to table names.
- select: The configuration for columns to be included/excluded from the DataStructure.
- transform: The configuration for dataset and batch transformations to be applied to the data.
- assign: The configuration for special columns in the DataStructure.
- data_split: The configuration for splitting the data into training, test, and validation sets.
- schema: The Bitfount schema of the target pod.
Returns
A DataStructure object.
Methods
apply_dataset_transformations
def apply_dataset_transformations(self, datasource: BaseSource) ‑> BaseSource:
Applies transformations to whole dataset.
Arguments
- datasource: The BaseSource object to be transformed.
Returns datasource: The transformed datasource.
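The list-of-mappings shape that dataset_transforms (and batch_transforms) take, per the Optional[list[dict[str, _JSONDict]]] annotation above, can be sketched as follows. The transformation names and options are hypothetical, not drawn from any real transformation registry:

```python
# Sketch of the list-of-mappings shape used by dataset_transforms and
# batch_transforms. The transformation names ("normalise", "resize") and
# their options are invented for illustration only.
dataset_transforms = [
    {"normalise": {"cols": ["col1", "col2"]}},
    {"resize": {"height": 224, "width": 224}},
]

# Each list entry maps one transformation name to its JSON-style options.
for entry in dataset_transforms:
    (name, options), = entry.items()
    print(name, options)
```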
get_columns_ignored_for_training
def get_columns_ignored_for_training(self, table_schema: TableSchema) ‑> list:
Adds all the extra columns that will not be used in model training.
Arguments
- table_schema: The schema of the table.
Returns ignore_cols_aux: A list of columns that will be ignored when training a model.
get_pod_identifiers
def get_pod_identifiers(self) ‑> Optional[list]:
Returns a list of pod identifiers specified in the table attribute.
These may actually be logical pods, or datasources.
If there are no pod identifiers specified, returns None.
get_table_name
def get_table_name(self, data_identifier: Optional[str] = None) ‑> str:
Returns the relevant table name of the DataStructure.
Arguments
- data_identifier: The identifier of the pod/logical pod/datasource whose table should be retrieved.
Returns
The table name of the DataStructure corresponding to the data_identifier provided, or just the local table name if running locally.
Raises
- ValueError: If the data_identifier is not provided and there are different table names for different pods.
- KeyError: If the data_identifier is not in the collection of tables specified for different pods.
get_table_schema
def get_table_schema(self, schema: BitfountSchema) ‑> TableSchema:
Returns the table schema based on the datastructure arguments.
This will return either the new schema defined by the schema_types_override if the datastructure has been initialised with a query, or the relevant table schema if the datastructure has been initialised with a table name.
Arguments
- schema: The BitfountSchema, either taken from the pod or provided by the user when defining a model.
- data_identifier: The pod/logical pod/datasource identifier on which the model will be trained. Defaults to None.
- datasource: The datasource on which the model will be trained. Defaults to None.
Raises
BitfountSchemaError
: If the table is not found.
set_columns_after_transformations
def set_columns_after_transformations( self, transforms: list[dict[str, _JSONDict]],) ‑> None:
Updates the selected/ignored columns based on the transformations applied.
It updates self.selected_cols by adding the new names of columns after transformations are applied, and removing the original columns unless explicitly specified to keep them.
Arguments
- transforms: A list of transformations to be applied to the data.
set_training_column_split_by_semantic_type
def set_training_column_split_by_semantic_type(self, schema: TableSchema) ‑> None:
Sets the column split by type from the schema.
This method splits the selected columns from the dataset based on their semantic type.
Arguments
- schema: The TableSchema for the data.
set_training_input_size
def set_training_input_size(self, schema: TableSchema) ‑> None:
Sets the input size for model training.
Arguments
- schema: The schema of the table.
- table_name: The name of the table.