Bitfount Schemas
Every data source in the Bitfount ecosystem has an associated schema, whether it is a remote Pod or a local DataSource object. This BitfountSchema defines the data type of each column. For categorical variables, the schema also specifies the list of possible values and the ordering to use for categorical embeddings. For image types, it also specifies the dimensions of the image.
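To make these three kinds of information concrete, the sketch below models them with plain Python dataclasses. It is an illustration only: `ColumnSpec`, `SchemaSketch`, and their fields are hypothetical names, not part of the Bitfount API, and the real BitfountSchema may store this information differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ColumnSpec:
    """Hypothetical per-column record; not a Bitfount class."""

    dtype: str  # e.g. "continuous", "categorical" or "image"
    categories: Optional[List[str]] = None  # ordered values, categorical only
    image_shape: Optional[Tuple[int, int]] = None  # (height, width), image only

    def encode(self, value: str) -> int:
        """Map a categorical value to its position in the ordered list."""
        if self.categories is None:
            raise TypeError("encode() only applies to categorical columns")
        return self.categories.index(value)


@dataclass
class SchemaSketch:
    """Hypothetical container mapping column names to their specs."""

    columns: Dict[str, ColumnSpec] = field(default_factory=dict)


schema_sketch = SchemaSketch(
    columns={
        "age": ColumnSpec(dtype="continuous"),
        "colour": ColumnSpec(dtype="categorical", categories=["blue", "green", "red"]),
        "scan": ColumnSpec(dtype="image", image_shape=(224, 224)),
    }
)

# The stored ordering fixes which index each category maps to.
colour_index = schema_sketch.columns["colour"].encode("green")
```

Because the ordering is stored alongside the values, a given category always maps to the same index, which is what a categorical embedding needs in order to stay consistent between runs.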
A default BitfountSchema for a given DataSource can be generated by using the BitfountSchema constructor, for example:
from bitfount import BitfountSchema, CSVSource  # assumes top-level exports
datasource = CSVSource(path='file_path.csv')
schema = BitfountSchema(datasource=datasource)
When using a remote Pod, the BitfountSchema can be looked up on the hub using the get_pod_schema function:
from bitfount import get_pod_schema  # assumes top-level export
schema = get_pod_schema('user1/pod1')
When building a federated model across multiple Pods, it is important that all of the Pods use a common schema for modelling. For this purpose we also provide a helper function to join Pod schemas together:
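from bitfount import combine_pod_schemas, get_pod_schema  # assumes top-level exports
# Fetch each Pod's schema from the hub, then merge them into one common schema.
schema1 = get_pod_schema('user1/pod1')
schema2 = get_pod_schema('user2/pod2')
schema = combine_pod_schemas([schema1, schema2])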
This function currently only supports joining schemas where the names of all columns to be joined overlap. The combine_pod_schemas function ensures that the superset of all categorical values is used for categorical variables.
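The idea of taking the superset of categorical values can be sketched in plain Python. This is not Bitfount's implementation: `merge_categories` is a hypothetical helper, each schema is modelled as a simple mapping from column name to ordered values, and both the strict "same column names" check and the first-seen ordering are assumptions made for illustration.

```python
from typing import Dict, List


def merge_categories(schemas: List[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Union the categorical values per column, keeping first-seen order.

    Hypothetical helper: each schema is modelled as {column name: ordered values}.
    """
    column_sets = [set(schema) for schema in schemas]
    # Assumption: every schema must describe exactly the same columns.
    if any(cols != column_sets[0] for cols in column_sets[1:]):
        raise ValueError("all schemas must share the same column names")

    merged: Dict[str, List[str]] = {}
    for schema in schemas:
        for column, values in schema.items():
            seen = merged.setdefault(column, [])
            for value in values:
                if value not in seen:
                    seen.append(value)
    return merged


pod1 = {"colour": ["red", "green"], "size": ["S", "M"]}
pod2 = {"colour": ["green", "blue"], "size": ["M", "L"]}
combined = merge_categories([pod1, pod2])
```

Here `combined["colour"]` contains every value seen in either Pod, so a model trained against the combined schema can encode a category that appears in only one of them.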