
processor

Classes for dealing with Transformation Processing.

Classes

TransformationProcessor

class TransformationProcessor(transformations: list[Transformation], schema: Optional[BitfountSchema] = None, col_refs: Optional[set[str]] = None):

Processes Transformations on a given dataframe.

caution

The TransformationProcessor does not add any newly created columns to the schema. This must be done separately after the transformations have been processed.
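One way to find the columns that still need to be registered is to compare the column names before and after processing. The sketch below is an illustration only: it uses plain lists and a set as stand-ins for the dataframe columns and the schema, and the helper `new_columns` is hypothetical and not part of the library. How the columns are then added to the schema is not covered here.

```python
def new_columns(before, after, schema_columns):
    """Return columns created by processing that the schema does not know yet.

    `before` and `after` are column-name lists; `schema_columns` is a set of
    names standing in for the schema (an assumption made for this sketch).
    """
    created = [name for name in after if name not in before]
    return [name for name in created if name not in schema_columns]


before = ["age", "height"]
after = ["age", "height", "age_scaled"]
schema_columns = {"age", "height"}

# Columns that would have to be added to the schema in a separate step.
missing = new_columns(before, after, schema_columns)
```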

Arguments

  • transformations: The list of transformations to apply.
  • schema: The schema of the data to be transformed.
  • col_refs: The set of columns referenced in those transformations.

Attributes

  • transformations: The list of transformations to apply.
  • schema: The schema of the data to be transformed.
  • col_refs: The set of columns referenced in those transformations.

Methods


batch_transform

def batch_transform(self, data: np.ndarray, step: DataSplit) -> numpy.ndarray:

Performs batch transformations.

Arguments

  • data: The data to be transformed at batch time as a numpy array.
  • step: The step at which the data should be transformed.

Returns np.ndarray: The transformed data as a numpy array.

Raises

  • InvalidBatchTransformationError: If one of the specified transformations does not inherit from BatchTimeOperation.
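The inheritance check described under Raises can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `BatchTimeOperation` and `InvalidBatchTransformationError` are redefined locally as stand-ins, `Scale`, `Rename` and `apply_batch` are hypothetical, plain lists replace numpy arrays, and the `step` argument is left out.

```python
class BatchTimeOperation:
    """Local stand-in base class for operations allowed at batch time."""

    def apply(self, batch):
        raise NotImplementedError


class InvalidBatchTransformationError(Exception):
    """Local stand-in for the error named above."""


class Scale(BatchTimeOperation):
    """Hypothetical batch-time operation: multiply every value by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def apply(self, batch):
        return [value * self.factor for value in batch]


class Rename:
    """Hypothetical operation that does not inherit from BatchTimeOperation."""


def apply_batch(transformations, batch):
    # In this sketch, validation happens before anything runs, so an invalid
    # list never produces a partially transformed batch.
    for transformation in transformations:
        if not isinstance(transformation, BatchTimeOperation):
            raise InvalidBatchTransformationError(
                f"{type(transformation).__name__} is not a batch-time operation"
            )
    # Apply the transformations in order, feeding each result to the next.
    for transformation in transformations:
        batch = transformation.apply(batch)
    return batch


result = apply_batch([Scale(2), Scale(10)], [1, 2, 3])
```

Passing `[Rename()]` to `apply_batch` raises the stand-in `InvalidBatchTransformationError`, mirroring the behaviour documented above.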

transform

def transform(self, data: pd.DataFrame) -> pandas.core.frame.DataFrame:

Performs self.transformations on data sequentially.

Arguments to an operation are resolved in three steps. First, if the argument has a name attribute, it is treated as a reference to another transformed column. Otherwise, a regular expression checks whether it references a non-transformed column. If the regex finds no match, the argument is taken as is, for example a string or an integer. Once all transformations are complete, any columns that should not be part of the final output are removed.
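The resolution order described above can be sketched in plain Python. Everything here is an assumption made for illustration: `NamedTransformation` and `resolve_argument` are hypothetical, the `c:<column>` reference pattern is invented for this example, a dict of lists stands in for the dataframe, and `KeyError` stands in for the library's own error.

```python
import re


class NamedTransformation:
    """Hypothetical stand-in for a transformation that produced a named column."""

    def __init__(self, name):
        self.name = name


# Assumed reference syntax for this sketch only: "c:<column name>".
COLUMN_REF = re.compile(r"^c:(\w+)$")


def resolve_argument(arg, columns):
    """Resolve one operation argument against `columns` (name -> values)."""
    # 1. An object with a `name` attribute refers to a transformed column.
    name = getattr(arg, "name", None)
    if name is not None:
        if name not in columns:
            raise KeyError(f"transformed column {name!r} not found")
        return columns[name]
    # 2. A string matching the pattern refers to a non-transformed column.
    if isinstance(arg, str):
        match = COLUMN_REF.match(arg)
        if match:
            column = match.group(1)
            if column not in columns:
                raise KeyError(f"column {column!r} not found")
            return columns[column]
    # 3. Anything else is used as is (a string, an integer, and so on).
    return arg


columns = {"age": [30, 41], "age_scaled": [0.30, 0.41]}
from_transformed = resolve_argument(NamedTransformation("age_scaled"), columns)
from_raw = resolve_argument("c:age", columns)
literal = resolve_argument(10, columns)
```

The order matters in this sketch: the attribute check runs before the pattern match, so a transformation object is never mistaken for a literal value.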

Arguments

  • data: The pandas dataframe to be transformed.

Raises

  • MissingColumnReferenceError: If there is a reference to a non-existing column.
  • TypeError: If there are clashes between column names or if a transformation cannot be applied.