Multi-Pod SQL Tasks
One of the most useful things about a federated data science platform is the ability to run tasks across multiple datasets without needing to centralise the data! Bitfount enables you to run a SQL query against multiple Pods simultaneously, so your data and your partners' data need not be in the same database for you to collaborate.
There are three quick steps to running SQL tasks on multiple Pods:
Confirming Permissions
Before running multi-Pod tasks, you need to check two items:
- You have a sufficient level of access to all of the Pods you wish to perform tasks against given the task you wish to perform.
- If you intend to use SecureAggregation as an aggregator for your task, the Pod owners of the Pods you wish to perform tasks against need to allow their Pod(s) to be used in combination with the other Pod(s).
To check whether you have a sufficient level of access for a Pod owned by someone else, go to Accessible Pods in Bitfount Hub, and find the Pod(s) you wish to use. You can see what permission level you have by toggling the "Access Granted" menu.
You must have sufficient access across all of the Pods with which you wish to interact to perform a given task. Your role does not need to be the same across Pods, but on each of them you must have been assigned a level of permissions sufficient for the task you wish to perform.
Next, if you wish to use SecureAggregation, the Data Custodians for the desired Pods need to have approved the ability to run SQL tasks against the full list of the Pods you wish to use. This includes the case where you are the Pod owner for multiple Pods and wish to query against them. This approval is given via the approved_pods parameter, set through the Bitfount Python API or the YAML configuration when creating the Pods. Work with the Data Custodian(s) of the Pods you wish to use to ensure they have included all of the required Pods in the approved_pods list.
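Before contacting Data Custodians, it can help to work out which approvals are missing. The following is a minimal sketch in plain Python, not part of the Bitfount API: the function name missing_approvals and the approved_pods values are hypothetical, and it assumes you have already collected each Pod's approved_pods list by hand. It only checks that each Pod approves the other Pods in the task, since the text above does not say whether a Pod must also list itself.

```python
from typing import Dict, List, Set


def missing_approvals(
    approved_pods: Dict[str, List[str]], task_pods: List[str]
) -> Dict[str, Set[str]]:
    """Return, for each Pod in the task, the other task Pods it has not approved."""
    missing: Dict[str, Set[str]] = {}
    for pod in task_pods:
        approved = set(approved_pods.get(pod, []))
        # Every other Pod in the task should appear in this Pod's list.
        gaps = (set(task_pods) - {pod}) - approved
        if gaps:
            missing[pod] = gaps
    return missing


# Hypothetical approved_pods lists, gathered manually from each Data Custodian.
approvals = {
    "census-income-demo": ["census-income-demo-yaml"],
    "census-income-demo-yaml": [],
}
task_pods = ["census-income-demo", "census-income-demo-yaml"]
gaps = missing_approvals(approvals, task_pods)
print(gaps)
```

An empty result means every Pod in the task approves all of the others; any entry names a Pod whose Data Custodian still needs to extend its approved_pods list.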
Verifying Data Structures
When you run a SQL query against multiple Pods, the query runs independently against each Pod, and the per-Pod results are averaged only if you supply an aggregator. As a result, every Pod specified in the task configuration must contain the columns and formatting your query relies on. This usually means that column names must match across files or databases, and that the queried columns are contained in one table within each Pod. Before attempting to perform a multi-Pod SQL query, check that each of the Pods will be compatible with your query.
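The compatibility check described above can be sketched in plain Python. This is an illustration only, not a Bitfount feature: the helper missing_columns is hypothetical, and the column lists are made-up examples that you would replace with the schemas you can see for each Pod.

```python
from typing import Dict, Iterable, Set


def missing_columns(
    pod_columns: Dict[str, Iterable[str]], required: Iterable[str]
) -> Dict[str, Set[str]]:
    """Return the required columns that each Pod lacks (Pods with none are omitted)."""
    required_set = set(required)
    report: Dict[str, Set[str]] = {}
    for pod, columns in pod_columns.items():
        absent = required_set - set(columns)
        if absent:
            report[pod] = absent
    return report


# Hypothetical schemas for two Pods; the second one names its age column differently.
schemas = {
    "census-income-demo": ["age", "occupation", "education"],
    "census-income-demo-yaml": ["Age", "occupation"],
}
report = missing_columns(schemas, required=["occupation", "age"])
print(report)
```

The comparison here is case-sensitive, so "Age" does not satisfy a query on "age"; whether the underlying data source treats column names that way is something to confirm for your own Pods.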
Running Multi-Pod Tasks
There are two possible expected outputs for running SQL queries across multiple Pods:
- No aggregator: If you do not provide an aggregator, your SQL task will run independently against all specified Pods, then return the results for each Pod.
- With aggregator: If you provide an aggregator to the execute method for the SqlQuery or PrivateSqlQuery algorithms, you will receive an average of all of the results across Pods.
Note that queries requiring joins across Pods are not yet supported.
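To make the difference between the two outputs concrete, here is a small plain-Python sketch. It does not call Bitfount: the function combine_results and the numbers are hypothetical, and it assumes the "average" is a simple unweighted mean of one scalar result per Pod, which the text above does not spell out.

```python
from statistics import mean
from typing import Dict, Union


def combine_results(
    per_pod: Dict[str, float], aggregate: bool
) -> Union[Dict[str, float], float]:
    """Mimic the two output shapes: per-Pod results, or a single averaged value."""
    if not aggregate:
        # No aggregator: each Pod's result is returned separately.
        return dict(per_pod)
    # With aggregator (assumed unweighted mean across Pods).
    return mean(list(per_pod.values()))


# Hypothetical per-Pod results of a query such as SELECT AVG(age) FROM df.
per_pod = {"census-income-demo": 40.0, "census-income-demo-yaml": 44.0}
separate = combine_results(per_pod, aggregate=False)
averaged = combine_results(per_pod, aggregate=True)
print(separate)
print(averaged)
```

Under this assumption, an average of per-Pod averages gives every Pod equal weight, so it can differ from the average you would get by pooling all rows when the Pods hold different numbers of records.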
Querying across multiple Pods is similar to the normal usage of the SqlQuery algorithm. When querying across multiple Pods, you must additionally specify the Pod identifiers for the additional Pods. For example:
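# Assumes SqlQuery has already been imported from the bitfount package.
pod_identifier_1 = "census-income-demo"
pod_identifier_2 = "census-income-demo-yaml"

# The same query runs independently against each Pod listed below.
query = SqlQuery(
    query="""
    SELECT `occupation`, AVG(`age`)
    FROM df
    GROUP BY `occupation`
    """
)
query.execute(pod_identifiers=[pod_identifier_1, pod_identifier_2])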
Note that when you are using the PrivateSqlQuery algorithm, the protections in place for federated SQL analysis on a single Pod apply in the multi-Pod context as well. When using this algorithm, Bitfount will not enable queries which allow for extraction of raw data, triangulation of a single record, or other potentially nefarious GROUP BY clauses.
Next Steps
You're now ready to run multi-Pod SQL tasks! Still have questions? Check out our Troubleshooting & FAQs guide or ask your question via support@bitfount.com.