schema

Classes concerning data schemas.

Classes

BitfountSchema

class BitfountSchema(datasource: Optional[BaseSource] = None, table: Optional[TableSchema] = None):

A schema that defines the tables of a BaseSource.

It includes the table found in BaseSource and its features.

Arguments

  • **kwargs: Optional keyword arguments to be provided to _add_datasource_table.
  • datasource: An optional BaseSource object.

Ancestors

  • bitfount.data.schema._BitfountSchemaMarshmallowMixIn

Variables

  • hash : str - The hash of this schema.

    This relates to the BaseSource(s) that were used to generate this schema, to ensure that the schema is only used against compatible data sources.

    Returns: A sha256 hash of the _datasource_hashes.

  • table : TableSchema - Getter for the table property.

    Raises: BitfountSchemaError: If the table is None.
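The hash variable above is described as a sha256 hash of the datasource hashes. The following is a minimal sketch of that idea using only the standard library. The function name combined_schema_hash and the way the individual hashes are sorted and joined are assumptions made for illustration; the source does not say how the library combines them.

```python
import hashlib


def combined_schema_hash(datasource_hashes):
    """Return one sha256 hex digest for a collection of per-datasource hashes.

    Hypothetical helper: sorting and joining with "," are assumptions of
    this sketch, not the library's documented behaviour.
    """
    # Sort first so the result does not depend on the order of insertion.
    joined = ",".join(sorted(datasource_hashes))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


h1 = combined_schema_hash(["abc", "def"])
h2 = combined_schema_hash(["def", "abc"])
h3 = combined_schema_hash(["abc"])
```

Under these assumptions, two schemas built from the same set of datasources yield the same digest, while a schema built from a different set yields a different one.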

Methods


apply

def apply(self, dataframe: pd.DataFrame, keep_cols: Optional[list[str]] = None) -> pandas.core.frame.DataFrame:

Applies the schema to a dataframe and returns the transformed dataframe.

In order, this adds missing columns to the dataframe, removes superfluous columns, changes the column types, and finally encodes the categorical columns before returning the transformed dataframe.

Arguments

  • dataframe: The dataframe to transform.
  • keep_cols: A list of columns to keep even if they are not part of the schema. Defaults to None.

Returns The dataframe with the transformations applied.

Raises

  • BitfountSchemaError: If the schema cannot be applied to the dataframe.
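The four steps described for apply can be sketched without the library. The example below is only an illustration: it uses a plain dict of column lists as a stand-in for a pandas DataFrame, and the names apply_schema_sketch, schema_types and encoders are hypothetical. How the real method fills missing columns or handles unseen categorical labels is not stated in the source; here both become None.

```python
def apply_schema_sketch(columns, schema_types, encoders, keep_cols=None):
    """Apply a toy schema to column-oriented data.

    columns:      dict of column name -> list of values (DataFrame stand-in)
    schema_types: dict of column name -> callable used to cast values
    encoders:     dict of categorical column name -> {label: integer code}
    """
    keep_cols = keep_cols or []
    n_rows = len(next(iter(columns.values()), []))

    out = {}
    # Steps 1 and 2: keep only schema columns, adding any that are missing.
    for name in schema_types:
        out[name] = list(columns.get(name, [None] * n_rows))
    # Columns listed in keep_cols survive even though the schema lacks them.
    for name in keep_cols:
        if name in columns and name not in out:
            out[name] = list(columns[name])

    # Step 3: cast each schema column to its declared type.
    for name, cast in schema_types.items():
        out[name] = [None if v is None else cast(v) for v in out[name]]

    # Step 4: encode categorical columns as integer codes.
    for name, mapping in encoders.items():
        out[name] = [mapping.get(v) for v in out[name]]
    return out


data = {
    "age": ["31", "45"],
    "colour": ["red", "blue"],
    "row_id": [1, 2],
    "unused": ["x", "y"],
}
schema_types = {"age": int, "colour": str, "height": float}
encoders = {"colour": {"blue": 0, "red": 1}}

result = apply_schema_sketch(data, schema_types, encoders, keep_cols=["row_id"])
```

In this toy run, "height" is added as an empty column, "unused" is dropped, "row_id" is retained because of keep_cols, "age" is cast to integers, and "colour" is replaced by its integer codes.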

dump

def dump(self, file_path: PathLike) -> None:

Dumps the schema as a YAML file.

Arguments

  • file_path: The path where the file should be saved.

Returns None

dumps

def dumps(self) -> Any:

Produces the YAML representation of the schema object.

Returns str: The YAML representation of the schema.

freeze

def freeze(self) -> None:

Freezes the schema, ensuring no more datasources can be added.

If this schema was loaded from an already generated schema, this will also check that the schema is compatible with the datasources set.

get_categorical_feature_size

def get_categorical_feature_size(self, table_name: str, var: Union[str, list[str]]) -> int:

Gets the column dimensions.

Arguments

  • table_name: The name of the table to get the column dimensions from.
  • var: A column name or a list of column names for which to get the dimensions.

Returns The number of unique values in the categorical column.

get_categorical_feature_sizes

def get_categorical_feature_sizes(self, table_name: str, ignore_cols: Optional[Union[str, list[str]]] = None) -> list:

Returns a list of categorical feature sizes.

Arguments

  • table_name: The name of the table to get the categorical feature sizes.
  • ignore_cols: The column(s) to be ignored from the schema.
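As a rough illustration of what a categorical feature size is (the number of unique values in a categorical column) and of how an ignore_cols argument that accepts either a single name or a list might be handled, consider the sketch below. The function categorical_feature_sizes_sketch and the encoders mapping are hypothetical stand-ins and do not use the library.

```python
def categorical_feature_sizes_sketch(encoders, ignore_cols=None):
    """Return the number of distinct labels per categorical column.

    encoders:    dict of column name -> {label: integer code}
    ignore_cols: a single column name, a list of names, or None
    """
    # Normalise ignore_cols so a bare string is not iterated per character.
    if ignore_cols is None:
        ignored = set()
    elif isinstance(ignore_cols, str):
        ignored = {ignore_cols}
    else:
        ignored = set(ignore_cols)

    return [
        len(mapping)
        for name, mapping in encoders.items()
        if name not in ignored
    ]


encoders = {
    "colour": {"blue": 0, "green": 1, "red": 2},
    "size": {"L": 0, "S": 1},
    "target": {"no": 0, "yes": 1},
}

all_sizes = categorical_feature_sizes_sketch(encoders)
without_target = categorical_feature_sizes_sketch(encoders, ignore_cols="target")
only_colour = categorical_feature_sizes_sketch(encoders, ignore_cols=["size", "target"])
```

Sizes like these are commonly used to choose embedding dimensions for categorical inputs, which is one reason a target column is often passed in ignore_cols.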

get_feature_names

def get_feature_names(self, table_name: str, semantic_type: Optional[SemanticType] = None) -> list:

Returns the names of all the features in the schema.

Arguments

  • table_name: The name of the table to get the features from.
  • semantic_type: If a semantic type is provided, only the feature names corresponding to that semantic type are returned. Defaults to None.

Returns features: A list of feature names.
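The filtering behaviour of get_feature_names can be sketched as follows. The enum below is a local stand-in that mirrors the four feature kinds this page mentions (categorical, continuous, image and text); its member names and values are assumptions, as is the features mapping.

```python
from enum import Enum


class SemanticTypeSketch(Enum):
    """Local stand-in for the library's SemanticType (members are assumed)."""

    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"
    IMAGE = "image"
    TEXT = "text"


def get_feature_names_sketch(features, semantic_type=None):
    """Return feature names, optionally restricted to one semantic type.

    features: dict of feature name -> SemanticTypeSketch member
    """
    if semantic_type is None:
        return list(features)
    return [name for name, stype in features.items() if stype is semantic_type]


features = {
    "age": SemanticTypeSketch.CONTINUOUS,
    "colour": SemanticTypeSketch.CATEGORICAL,
    "notes": SemanticTypeSketch.TEXT,
}

all_names = get_feature_names_sketch(features)
categorical_names = get_feature_names_sketch(features, SemanticTypeSketch.CATEGORICAL)
image_names = get_feature_names_sketch(features, SemanticTypeSketch.IMAGE)
```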

to_json

def to_json(self) -> dict:

Turns a schema object into a JSON compatible dictionary.

Returns dict: A simple JSON compatible representation of the schema.

unfreeze

def unfreeze(self) -> None:

Unfreezes the schema, allowing more datasources to be added.
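The freeze and unfreeze pair can be pictured as a simple guard flag. The class below is a hypothetical sketch, not the library's implementation: FreezableSchemaSketch, SchemaFrozenErrorSketch and add_datasource are invented names, and the compatibility check that freeze performs for previously generated schemas is omitted.

```python
class SchemaFrozenErrorSketch(Exception):
    """Stand-in for the error raised when a frozen schema is modified."""


class FreezableSchemaSketch:
    """Toy schema that refuses new datasources while frozen."""

    def __init__(self):
        self._frozen = False
        self.datasource_hashes = []

    def freeze(self):
        # After this call, add_datasource raises until unfreeze is called.
        self._frozen = True

    def unfreeze(self):
        self._frozen = False

    def add_datasource(self, datasource_hash):
        if self._frozen:
            raise SchemaFrozenErrorSketch("schema is frozen")
        self.datasource_hashes.append(datasource_hash)


schema = FreezableSchemaSketch()
schema.add_datasource("hash-a")

schema.freeze()
try:
    schema.add_datasource("hash-b")
    rejected_while_frozen = False
except SchemaFrozenErrorSketch:
    rejected_while_frozen = True

schema.unfreeze()
schema.add_datasource("hash-b")
```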

TableSchema

class TableSchema(name: str, description: Optional[str] = None):

A schema that defines the features of a dataframe.

It lists all the (categorical, continuous, image, and text) features found in the dataframe.

Arguments

  • description: A description of the table.
  • name: The name of the table.

Attributes

  • description: A description of the table. Optional.
  • features: An ordered dictionary of features (column names).
  • name: The name of the table.

Ancestors

  • bitfount.data.schema._TableSchemaMarshmallowMixIn

Methods


add_datasource_features

def add_datasource_features(self, datasource: BaseSource, ignore_cols: Optional[Sequence[str]] = None, force_stype: Optional[MutableMapping[Union[_ForceStypeValue, _SemanticTypeValue], list[str]]] = None, descriptions: Optional[Mapping[str, str]] = None) -> None:

Adds datasource features to schema.

Arguments

  • datasource: The datasource whose features this method adds.
  • ignore_cols: Columns to ignore from the BaseSource. Defaults to None.
  • force_stype: Columns for which to change the semantic type. Format: semantictype: [columnnames]. Defaults to None.
  • descriptions: Descriptions of the features. Defaults to None.

Raises

  • BitfountSchemaError: If the schema is already frozen.
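To make the force_stype format concrete (a mapping from a semantic type to a list of column names), here is a small sketch of how ignore_cols and force_stype could interact with automatically inferred types. Everything in it is hypothetical: the function resolve_semantic_types_sketch, the string type names, and the rule that forced types simply override inferred ones are assumptions, not the library's documented behaviour.

```python
def resolve_semantic_types_sketch(inferred, ignore_cols=None, force_stype=None):
    """Combine inferred semantic types with ignore and override settings.

    inferred:    dict of column name -> inferred semantic type (a string here)
    ignore_cols: column names to leave out of the schema
    force_stype: dict of semantic type -> list of column names to override
    """
    ignored = set(ignore_cols or [])
    resolved = {col: stype for col, stype in inferred.items() if col not in ignored}

    # Overrides apply only to columns that are still part of the schema.
    for stype, cols in (force_stype or {}).items():
        for col in cols:
            if col in resolved:
                resolved[col] = stype
    return resolved


inferred = {
    "id": "continuous",
    "zip_code": "continuous",
    "age": "continuous",
    "colour": "categorical",
}

resolved = resolve_semantic_types_sketch(
    inferred,
    ignore_cols=["id"],
    force_stype={"categorical": ["zip_code"]},
)
```

A numeric-looking column such as a postal code is a typical candidate for an override, since its values are labels rather than quantities.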

apply

def apply(self, dataframe: pd.DataFrame, keep_cols: Optional[list[str]] = None, image_cols: Optional[list[str]] = None) -> pandas.core.frame.DataFrame:

Applies the schema to a dataframe and returns the transformed dataframe.

In order, this adds missing columns to the dataframe, removes superfluous columns, changes the column types, and finally encodes the categorical columns before returning the transformed dataframe.

Arguments

  • dataframe: The dataframe to transform.
  • keep_cols: A list of columns to keep even if they are not part of the schema. Defaults to None.
  • image_cols: The list of image columns in the dataframe. Defaults to None.

Returns The dataframe with the transformations applied.

decode_categorical

def decode_categorical(self, feature: str, value: int) -> Any:

Decodes the label corresponding to an encoded value of a categorical feature in the schema.

Arguments

  • feature: The name of the feature.
  • value: The encoded value.

Returns The decoded feature value.

Raises

  • ValueError: If the feature cannot be found in the schema.
  • ValueError: If the label cannot be found in the feature encoder.
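Decoding is the inverse of the categorical encoding step: given an integer code, look up the original label. The sketch below assumes each categorical feature stores a label-to-code mapping. The function decode_categorical_sketch and the encoders dictionary are hypothetical, although the two ValueError cases mirror the ones listed above.

```python
def decode_categorical_sketch(encoders, feature, value):
    """Return the original label for an encoded value of a feature.

    encoders: dict of feature name -> {label: integer code}
    """
    if feature not in encoders:
        raise ValueError(f"Feature {feature!r} not found in the schema")

    # Search the label-to-code mapping for the requested code.
    for label, code in encoders[feature].items():
        if code == value:
            return label
    raise ValueError(f"Value {value!r} not found in the encoder for {feature!r}")


encoders = {"colour": {"blue": 0, "red": 1}}

decoded = decode_categorical_sketch(encoders, "colour", 1)

errors = []
for feature, value in [("shape", 0), ("colour", 7)]:
    try:
        decode_categorical_sketch(encoders, feature, value)
    except ValueError as exc:
        errors.append(str(exc))
```

A linear scan keeps the sketch short; a real encoder would more likely hold an inverse mapping so that each lookup is constant time.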

get_categorical_feature_size

def get_categorical_feature_size(self, var: Union[str, list[str]]) -> int:

Gets the column dimensions.

Arguments

  • var: A column name or a list of column names for which to get the dimensions.

Returns The number of unique values in the categorical column.

get_categorical_feature_sizes

def get_categorical_feature_sizes(self, ignore_cols: Optional[Union[str, list[str]]] = None) -> list:

Returns a list of categorical feature sizes.

Arguments

  • ignore_cols: The column(s) to be ignored from the schema.

get_feature_names

def get_feature_names(self, semantic_type: Optional[SemanticType] = None) -> list:

Returns the names of all the features in the schema.

Returns features: A list of feature names.

get_num_categorical

def get_num_categorical(self, ignore_cols: Optional[Union[str, list[str]]] = None) -> int:

Gets the number of (non-ignored) categorical features.

Arguments

  • ignore_cols: Columns to ignore when counting categorical features.

get_num_continuous

def get_num_continuous(self, ignore_cols: Optional[Union[str, list[str]]] = None) -> int:

Gets the number of (non-ignored) continuous features.

Arguments

  • ignore_cols: Columns to ignore when counting continuous features.