schema
Classes concerning data schemas.
Classes
BitfountSchema
class BitfountSchema( name: str, description: Optional[str] = None, column_descriptions: Optional[Mapping[str, str]] = None,):
A schema that defines the tables of a BaseSource. It includes the table found in the BaseSource and its features.
Arguments
**kwargs
: Optional keyword arguments to be provided to _add_dataframe_features.
column_descriptions
: A dictionary of column names and their descriptions.
description
: The description of the datasource.
name
: The name of the datasource associated with this schema.
Ancestors
- bitfount.data.schema._BitfountSchemaMarshmallowMixIn
Variables
- hash : str
- The hash of this schema. This relates to the BaseSource(s) used in the generation of this schema, to ensure that the schema is used against compatible data sources. Returns: A sha256 hash of the _datasource_hashes.
Methods
add_dataframe_features
def add_dataframe_features( self, data: pd.DataFrame, ignore_cols: Optional[Sequence[str]] = None, force_stypes: Optional[MutableMapping[Union[_ForceStypeValue, _SemanticTypeValue], list[str]]] = None, column_descriptions: Optional[Mapping[str, str]] = None,) ‑> None:
Add the features of a dataframe to the schema.
This method is not called directly, but is used as a hook for yield_data in the BaseSource class.
add_feature
def add_feature(self, feature_name: str, semantic_type: SemanticType, dtype: Any) ‑> None:
Add a single feature to the schema.
Note that this method does not support Categorical features.
Arguments
feature_name
: The name of the feature.
semantic_type
: The semantic type of the feature.
dtype
: The dtype of the feature.
apply
def apply( self, dataframe: pd.DataFrame, keep_cols: Optional[list[str]] = None, image_cols: Optional[list[str]] = None,) ‑> pandas.core.frame.DataFrame:
Applies the schema to a dataframe and returns the transformed dataframe.
Sequentially adds missing columns to the dataframe, removes superfluous columns from the dataframe, changes the types of the columns in the dataframe and finally encodes the categorical columns in the dataframe before returning the transformed dataframe.
Arguments
dataframe
: The dataframe to transform.
keep_cols
: A list of columns to keep even if they are not part of the schema. Defaults to None.
image_cols
: The list of image columns in the dataframe. Defaults to None.
Returns The dataframe with the transformations applied.
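The sequential steps above can be sketched in plain pandas. This is an illustrative re-implementation of the described behaviour, not the library's actual code; the function name, schema representation, and column names are invented for the example:

```python
import pandas as pd


def apply_schema(df: pd.DataFrame, schema_dtypes: dict, keep_cols=None) -> pd.DataFrame:
    """Sketch of apply(): add missing columns, drop superfluous ones, cast dtypes."""
    keep_cols = keep_cols or []
    df = df.copy()
    # 1. Add columns the schema expects but the dataframe lacks.
    for col in schema_dtypes:
        if col not in df.columns:
            df[col] = pd.NA
    # 2. Drop columns that are neither in the schema nor explicitly kept.
    wanted = set(schema_dtypes) | set(keep_cols)
    df = df[[c for c in df.columns if c in wanted]].copy()
    # 3. Cast each schema column to its declared dtype.
    for col, dtype in schema_dtypes.items():
        df[col] = df[col].astype(dtype)
    return df


df = pd.DataFrame({"age": ["34", "29"], "stray": [1, 2]})
out = apply_schema(df, {"age": "int64", "height": "object"})
# "stray" is dropped, "height" is added, and "age" is cast from str to int64
```

The real method additionally encodes categorical columns; that step is omitted here for brevity.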
decode_categorical
def decode_categorical(self, feature: str, value: int) ‑> Any:
Decode label corresponding to a categorical feature in the schema.
Arguments
feature
: The name of the feature.
value
: The encoded value.
Returns The decoded feature value.
Raises
ValueError
: If the feature cannot be found in the schema.
ValueError
: If the label cannot be found in the feature encoder.
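The encode/decode relationship and the two ValueError cases can be illustrated with a minimal stand-in encoder (plain dicts; not the library's actual encoder classes):

```python
# Minimal stand-in: each categorical feature maps labels to integer codes.
encoders = {"colour": {"red": 0, "green": 1, "blue": 2}}


def decode_categorical(feature: str, value: int):
    """Return the label whose encoded code equals `value`."""
    if feature not in encoders:
        raise ValueError(f"Feature '{feature}' not found in schema.")
    for label, code in encoders[feature].items():
        if code == value:
            return label
    raise ValueError(f"Value {value} not found in encoder for '{feature}'.")


decode_categorical("colour", 1)  # returns "green"
```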
dump
def dump(self, file_path: PathLike) ‑> None:
Dumps the schema as a yaml file.
Arguments
file_path
: The path where the file should be saved.
Returns None.
dumps
def dumps(self) ‑> str:
Produces the YAML representation of the schema object.
Returns The YAML representation of the schema as a string.
generate_full_schema
def generate_full_schema( self, datasource: BaseSource, force_stypes: Optional[MutableMapping[Union[_ForceStypeValue, _SemanticTypeValue], list[str]]] = None, ignore_cols: Optional[list[str]] = None,) ‑> None:
Generate a full schema from a datasource.
generate_partial_schema
def generate_partial_schema(self, datasource: BaseSource) ‑> None:
Adds one batch of data to the schema.
get_categorical_feature_size
def get_categorical_feature_size(self, var: Union[str, list[str]]) ‑> int:
Gets the column dimensions.
Arguments
var
: A column name or a list of column names for which to get the dimensions.
Returns The number of unique values in the categorical column.
get_categorical_feature_sizes
def get_categorical_feature_sizes( self, ignore_cols: Optional[Union[str, list[str]]] = None,) ‑> list[int]:
Returns a list of categorical feature sizes.
Arguments
ignore_cols
: The column(s) to be ignored from the schema.
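Both methods reduce to counting the unique encoded values per categorical column. A rough sketch under an assumed encoder mapping (the dict-based representation is invented for illustration; in the sketch a list argument simply uses its first name):

```python
# Hypothetical encoders: feature name -> mapping of label to integer code.
encoders = {
    "colour": {"red": 0, "green": 1, "blue": 2},
    "size": {"S": 0, "M": 1},
}


def get_categorical_feature_size(var) -> int:
    """Number of unique values for one categorical column."""
    if isinstance(var, list):
        var = var[0]
    return len(encoders[var])


def get_categorical_feature_sizes(ignore_cols=None) -> list:
    """Sizes of all categorical columns, skipping any ignored ones."""
    ignore = set([ignore_cols] if isinstance(ignore_cols, str) else ignore_cols or [])
    return [len(mapping) for name, mapping in encoders.items() if name not in ignore]
```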
get_column_names
def get_column_names( self, dataframe: pd.DataFrame, ignore_cols: list[str],) ‑> collections.abc.Iterable[str]:
Get the column names of the datasource.
get_feature_names
def get_feature_names(self, semantic_type: Optional[SemanticType] = None) ‑> list[str]:
Returns the names of all the features in the schema.
Arguments
semantic_type
: If a semantic type is provided, only the feature names corresponding to that semantic type are returned. Defaults to None.
Returns A list of feature names.
get_num_categorical
def get_num_categorical(self, ignore_cols: Optional[Union[str, list[str]]] = None) ‑> int:
Get the number of (non-ignored) categorical features.
Arguments
ignore_cols
: Columns to ignore when counting categorical features.
get_num_continuous
def get_num_continuous(self, ignore_cols: Optional[Union[str, list[str]]] = None) ‑> int:
Get the number of (non-ignored) continuous features.
Arguments
ignore_cols
: Columns to ignore when counting continuous features.
initialize_dataless_schema
def initialize_dataless_schema(self, required_fields: Dict[str, Any]) ‑> None:
Initialize the schema with required fields but no data.
Arguments
required_fields
: A dictionary with field names and their types.
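A dataless schema simply records the required field names and types before any data has been seen. Conceptually (the returned structure is invented for illustration, not the library's internals):

```python
def initialize_dataless_schema(required_fields: dict) -> dict:
    """Seed a bare schema from required fields alone; no data is inspected."""
    return {
        name: {"dtype": dtype.__name__, "seen_in_data": False}
        for name, dtype in required_fields.items()
    }


schema = initialize_dataless_schema({"patient_id": str, "age": int})
```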
to_json
def to_json(self) ‑> dict[str, typing.Any]:
Turns a schema object into a JSON compatible dictionary.
Returns A simple JSON-compatible representation of the schema.
SchemaGenerationFromYieldData
class SchemaGenerationFromYieldData( schema: BitfountSchema, ignore_cols: Optional[list[str]] = None, force_stypes: "Optional[MutableMapping[Literal['categorical', 'continuous', 'image', 'text', 'image_prefix'], list[str]]]" = None,):
Custom hook to execute logic during datasource yield data.
Initialize the hook.
Arguments
schema
: The schema to update.
ignore_cols
: Columns to ignore when updating the schema.
force_stypes
: Forced semantic types for specific columns.
Ancestors
- bitfount.hooks.DataSourceHook
- bitfount.hooks.BaseHook
Methods
on_datasource_yield_data
def on_datasource_yield_data(self, data: pd.DataFrame, *args: Any, **kwargs: Any) ‑> None:
Hook method triggered when the datasource yields data.
Arguments
data
: The dataframe yielded by the datasource.
args
: Additional arguments.
kwargs
: Additional keyword arguments.
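The hook pattern is straightforward: each batch the datasource yields is passed to on_datasource_yield_data, which updates the schema incrementally. A minimal stand-in (plain Python and pandas, no bitfount imports; the method name mirrors the docs but the class body is invented):

```python
import pandas as pd


class SchemaGenerationSketch:
    """Accumulates column names and dtypes from each yielded batch."""

    def __init__(self, ignore_cols=None):
        self.features = {}
        self.ignore_cols = set(ignore_cols or [])

    def on_datasource_yield_data(self, data: pd.DataFrame, *args, **kwargs) -> None:
        # Record the dtype of every non-ignored column seen in this batch.
        for col in data.columns:
            if col not in self.ignore_cols:
                self.features[col] = str(data[col].dtype)


hook = SchemaGenerationSketch(ignore_cols=["id"])
batches = [
    pd.DataFrame({"id": [1], "x": [0.5]}),
    pd.DataFrame({"id": [2], "y": ["a"]}),
]
for batch in batches:
    hook.on_datasource_yield_data(batch)
# hook.features now covers the columns seen across all batches
```

This mirrors how generate_partial_schema can build a schema one batch at a time without ever holding the full datasource in memory.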