Data Source Configuration Best Practices
In Connecting Data & Running Pods, we outlined the mechanisms for configuring Pods. In this guide, we provide per-data-source examples for each mechanism, along with best practices for each data source type. Note that the examples use particular parameter combinations and do not cover every possible configuration. For full details on Pod configuration, please refer to the API Reference.
Jump to: CSV | DataFrame | Image Sources | FAQ
CSV
A local .csv file is the most common data source for Bitfount data connections.
Best Practices
For .csv files:
- Pods based on .csv files run only as long as they are not interrupted: a Pod pointing to a local file will go offline if the machine hosting it is turned off or experiences a transient disconnection. Bitfount will attempt to bring an interrupted Pod back online, but in most cases it will need to be re-started.
- Best practice is to point to a .csv file hosted on a server not tied to a user's local machine.
- Exclude any personally identifiable information fields from the Pod configuration specification.
- Bitfount automatically generates the schema for a .csv file based on its header row. Ensure your .csv file includes a header row with the column names you wish to be reflected in the Pod's schema.
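Before configuring the Pod, it can help to double-check which columns the header row actually contains, so you know what will appear in the schema and which fields to exclude. Here is a minimal sketch using pandas; the file path and column names are placeholders, not part of the Bitfount API:

```python
import pandas as pd

# Read only the header row of the file you plan to connect (placeholder path).
columns = pd.read_csv("example_data.csv", nrows=0).columns.tolist()
print(columns)
# Any personally identifiable columns listed here (e.g. "Name", "DOB")
# should be excluded via ignore_cols in the Pod's data_config.
```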
YAML Configuration Example
The configuration YAML file needs to follow the format specified in the PodConfig class:
```yaml
name: <enter-pod-name-for-system-id>
datasource: CSVSource
pod_details_config:
  display_name: <Enter Name You'll See in the Hub>
  description: >
    This is a description of the data connected to this Pod
    with any relevant details for potential collaborators.
data_config:
  ignore_cols: ["Name", "DOB", "National Tax Number"]
  force_stypes:
    enter-your-pod-name:
      categorical: ["TARGET", "workclass", "marital-status", "occupation", "relationship", "race", "native-country", "gender", "education"]
  datasource_args:
    path: <PATH TO CSV>/<filename>.csv
    seed: 100
    data_split: 30,10
```
Bitfount Python API Configuration Example
Using the Python API is quite similar to specification via YAML. With the Python API, we configure Pods using the PodDetailsConfig and PodDataConfig classes. The former specifies the display name and description of the Pod, whilst the latter is used to customise the schema and underlying data in the data source. For more information, refer to the config_schemas reference guide. Note that Pod names cannot include underscores. Here is a .csv Python API configuration example:
```python
from bitfount import CSVSource, PercentageSplitter, Pod, PodDataConfig, PodDetailsConfig

pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=CSVSource(
        "</PATH/OR_URL/TO/YOUR/CSV_FILE.csv>",
        seed=100,
        data_splitter=PercentageSplitter(validation_percentage=30, test_percentage=10),
    ),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Pod Display Name",
        description="This is a description of the data connected to this Pod with any relevant details for potential collaborators.",
    ),
    # Specify the structure of the dataset
    data_config=PodDataConfig(
        # Specify stypes for fields (optional)
        force_stypes={
            "enter-pod-name-for-system-id": {"categorical": ["target"]},
        },
    ),
)
```
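Once the Pod object is instantiated, it can be brought online so that it is reachable from the Hub. A minimal sketch, assuming the `start()` method shown in the Bitfount tutorials:

```python
# Bring the Pod online; it will keep running until interrupted.
pod.start()
```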
DataFrame
Best Practices
- Ensure you know the structure of the DataFrame prior to Pod configuration.
YAML Configuration Example
YAML configuration is not supported for DataFrame data sources.
Bitfount Python API Configuration Example
The main difference between connecting DataFrame sources and connecting other source types is the requirement to instantiate the data source with a pd.DataFrame object. An example of how to do this is as follows:
```python
import pandas as pd

from bitfount import DataFrameSource, PercentageSplitter, Pod, PodDataConfig, PodDetailsConfig

data_structure = {"col1": [1, 2], "col2": [3, 4]}
dataframe_object = pd.DataFrame(data=data_structure)

pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=DataFrameSource(
        dataframe_object,
        data_splitter=PercentageSplitter(validation_percentage=30, test_percentage=10),
    ),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Pod Display Name",
        description="This is a description of the data connected to this Pod with any relevant details for potential collaborators.",
    ),
    # Specify the structure of the dataset
    data_config=PodDataConfig(
        # Specify stypes for fields (optional)
        force_stypes={
            "enter-pod-name-for-system-id": {"categorical": ["target"]},
        },
    ),
)
```
Image Sources
Images can be stored in any supported data source and configured as in the examples above. However, data sources must be configured to have an image reference column indicating the names of the image files.
Best Practices
- If connecting image data via a database or other non-local source, be sure to use the `force_stypes` parameter and classify image columns as `"image"` (see the sketch after this list).
- If connecting image data via files on your local machine, create a .csv reference file with a column listing the file names of all the images you wish to connect to the Pod. This column can be as simple as:

  | image_file_name |
  | --------------- |
  | 0001.png        |
  | ...             |

  You can label this column however you wish, but you must be sure to reference it when generating the PodDataConfig for your DataSource.
- If using a .csv source, we recommend taking note of the file paths for both the .csv file and the images themselves.
- Place all images in the same folder or cloud services bucket.
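Pulling these points together, here is a minimal sketch of connecting a local image reference .csv via the Python API, assuming the reference column (`image_file_name` below) holds the paths to the image files; the Pod name, file path, and column name are placeholders:

```python
from bitfount import CSVSource, Pod, PodDataConfig, PodDetailsConfig

pod = Pod(
    name="enter-image-pod-name",
    # The reference .csv lists one image file per row in its image column.
    datasource=CSVSource("<PATH TO REFERENCE CSV>/image_references.csv"),
    pod_details_config=PodDetailsConfig(
        display_name="Image Pod Display Name",
        description="Pod connected to an image reference file.",
    ),
    data_config=PodDataConfig(
        # Classify the reference column as "image" so Bitfount treats it as image data.
        force_stypes={
            "enter-image-pod-name": {"image": ["image_file_name"]},
        },
    ),
)
```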
FAQ
Don't see a data source you're hoping to use? See Using Custom Data Sources or reach out to us with your request!