Designing a Secure + Efficient Access Control Plan for BigQuery
For large and rapidly growing data-driven businesses, it’s important to design a manageable access control structure that fits efficiently and transparently with your dynamic data architecture.
In this blog, we’ll discuss the concept of 2-layer Role-Based Access Control and walk through how to set it up for a BigQuery project with a fresh data team. Our example team is starting to implement a dbt project, but we’ll cover several best practices that apply to a range of situations to increase security, transparency, and efficiency.
What is 2-layer Role-Based Access Control (2-layer RBAC)?
RBAC is an access control system based on roles. It can have a single layer, where roles are attached directly to object access: for example, an admin role with owner access to a specific table. This works well for simple cases and small implementations, but it doesn’t scale.
This is why businesses use a 2-layer Role Based Access Control (2-layer RBAC). It refers to a method of designing access control structures in which roles are assigned at two different levels.
The first level assigns functional roles to groups of users or service accounts, each tied to a specific business function. The second level is object access, which assigns roles to specific resources within the system. Keeping the business model and business logic (functional roles) separate from the operational reality of your data (object access roles) allows you to change one without changing the other. For example, you can add a new source of raw data or a new type of data analyst without breaking the system or exposing data.
2-layer RBAC is also the preferred pattern for Google Cloud tools, and for BigQuery specifically. BigQuery offers predefined IAM roles built from more granular permissions, such as creating a job or updating a table. The predefined IAM role is your object access role, and you combine roles in user groups to form functional roles. Optionally, you can attach predefined IAM roles directly to service accounts, ideally through something like IaC. We’ll look at this in more detail below.
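To make the two layers concrete, here is a minimal Terraform sketch, assuming a hypothetical project ID (acme-dbt-dev) and group email: the group is the functional role, and the predefined IAM roles bound to it on a project form the object access layer.

```hcl
# Layer 1 (functional role): a group managed in Google Workspace /
# Cloud Identity, e.g. analytics-engineer@example.com (hypothetical).
# Layer 2 (object access): predefined BigQuery IAM roles bound to
# that group on a specific project.

resource "google_project_iam_member" "ae_dev_user" {
  project = "acme-dbt-dev" # hypothetical project ID
  role    = "roles/bigquery.user"
  member  = "group:analytics-engineer@example.com"
}

resource "google_project_iam_member" "ae_dev_editor" {
  project = "acme-dbt-dev"
  role    = "roles/bigquery.dataEditor"
  member  = "group:analytics-engineer@example.com"
}
```

With this split, adding a new analytics engineer is a group-membership change only; no IAM bindings need to move.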
Who has access?
To get a better understanding of this design, we need to highlight the different stakeholders who will be involved in accessing BigQuery:
- The data loader(s), which will ingest data into BigQuery. They should belong to the LOADER_ROLE group. If a loader requires different permissions across two use cases, it may be better to break it out into two loaders with tool-centric roles, such as FIVETRAN_ROLE and SEGMENT_ROLE if you use Fivetran and Segment as separate ingestion tools.
- The Data Engineer(s), who will manage and keep tools running while ensuring data quality and security in BigQuery. They should have the BigQuery Admin role on each of the projects created, and teams will rely on their proficiency with this role to ensure data is handled securely and efficiently.
- The Analytics Engineer(s), who will maintain and optimize the data modeling and transformation processes with dbt. They should belong to the ANALYTICS_ENGINEER_ROLE group.
- The orchestrator(s). For example, dbt Cloud or Airflow, which will run the dbt project in BigQuery on a schedule to update the models and the data. As a single account uses this functional role, there’s no need to create a specific user group: the Orchestrator Service Account can be created in Google Cloud and roles granted directly on the service account (see the sketch after this list).
- The Data Analyst(s) or business stakeholders, who will query the data in BigQuery, either through a BI tool, or directly in BigQuery. They will belong to the ANALYST_ROLE group or ANALYST_<dataset>_ROLE for dataset-specific access.
- The BI tool(s): Looker, Sigma, etc., which will access the production data. The BI tool should belong to the ANALYST_ROLE group as well. If it requires more than Viewer access on the production project, a service account with the specific permissions needed is a better option.
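Since the orchestrator’s roles are granted directly on a service account rather than through a group, its setup is a good first example. A minimal Terraform sketch, assuming hypothetical project IDs (acme-raw, acme-dbt-prod) and the role assignments detailed later in this post:

```hcl
# Hypothetical orchestrator service account for dbt Cloud or Airflow.
# No user group is needed: roles are bound directly to the account.
resource "google_service_account" "orchestrator" {
  project      = "acme-dbt-prod"
  account_id   = "dbt-orchestrator"
  display_name = "dbt orchestrator (e.g. dbt Cloud)"
}

# Read raw data...
resource "google_project_iam_member" "orchestrator_raw_viewer" {
  project = "acme-raw"
  role    = "roles/bigquery.dataViewer"
  member  = "serviceAccount:${google_service_account.orchestrator.email}"
}

# ...and run jobs / create datasets in production.
resource "google_project_iam_member" "orchestrator_prod_user" {
  project = "acme-dbt-prod"
  role    = "roles/bigquery.user"
  member  = "serviceAccount:${google_service_account.orchestrator.email}"
}
```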
It’s important to note that data teams vary in composition and some stakeholders share responsibilities. This is why some permissions may look similar across stakeholders, and why we can group permissions and grant them to different users.
Users and service accounts
Users are the Google Account users assigned membership to groups: human users identifiable by their Google Workspace email. A user can be assigned to multiple groups, and note that group access is additive: a member of two groups always has the union of both groups’ access.
Service accounts are application users that are assigned membership to groups and authenticate with JSON credential files. Like users, service accounts can be assigned to multiple groups, and their access is additive in exactly the same way.
Service accounts are highly recommended when connecting another service to Google Cloud tools. Reusing an account with admin access across different projects can lead to data breaches.
RBAC Diagram for BigQuery
Resources
The following is a review of the different resources included in this diagram.
Projects
Projects are a Google Cloud concept that, in the context of BigQuery, are similar to databases on a platform like Snowflake. We recommend at least three projects:
- raw: this project stores raw, untransformed data and is loaded by tools like Fivetran. We recommend that the LOADER_ROLE group has write access on this project and that the ANALYTICS_ENGINEER_ROLE group and the service have read access on this project.
In the case of multiple loaders, such as Fivetran, Stitch, or Google Cloud Functions, it’s recommended to have a separate project for each loader. Why? Loaders’ permissions may differ from one another, and separate projects also make cost analysis across loaders easier.
- dbt-dev: this project stores transformed data and is transformed by dbt developers, but is not used in a production context. We recommend that ANALYTICS_ENGINEER_ROLE has read and write access on this project. This project should be a development space that has no dependencies, which allows you to easily drop your datasets and create new ones as needed.
You may want to expose this to ANALYST_ROLE to help with validation, and in some cases that may make sense, but in general, it should not be treated like dbt-prod: it does not update on a given schedule, it may include errors, and it may be dropped or changed at any point.
- dbt-prod: this project stores transformed data that has been tested and validated in the dev environment. The data is now transformed by a dbt orchestrator like dbt Cloud and analyzed by a BI tool like Looker. We recommend that the Orchestrator Service Account has read and write access on this project and that ANALYST_ROLE and any of its variants (e.g. ANALYST_FINANCE_ROLE) have read access on this project. A Terraform sketch of these project-level grants follows this list.
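As referenced above, one hedged way to encode these project-level recommendations in Terraform is a single binding map, so that adding a grant is a one-line change (project IDs and group emails are hypothetical):

```hcl
# Each entry is one (project, role, member) binding from the
# recommendations above; all names are placeholders.
locals {
  project_grants = {
    loader_on_raw     = { project = "acme-raw", role = "roles/bigquery.user", member = "group:loader-role@example.com" }
    ae_read_raw       = { project = "acme-raw", role = "roles/bigquery.dataViewer", member = "group:analytics-engineer@example.com" }
    ae_dev_user       = { project = "acme-dbt-dev", role = "roles/bigquery.user", member = "group:analytics-engineer@example.com" }
    ae_dev_editor     = { project = "acme-dbt-dev", role = "roles/bigquery.dataEditor", member = "group:analytics-engineer@example.com" }
    analyst_read_prod = { project = "acme-dbt-prod", role = "roles/bigquery.dataViewer", member = "group:analyst-role@example.com" }
    analyst_job_prod  = { project = "acme-dbt-prod", role = "roles/bigquery.jobUser", member = "group:analyst-role@example.com" }
  }
}

resource "google_project_iam_member" "grants" {
  for_each = local.project_grants
  project  = each.value.project
  role     = each.value.role
  member   = each.value.member
}
```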
Datasets
Datasets are a Google Cloud concept that, in the context of BigQuery, are similar to schemas on a platform like Snowflake. We recommend using datasets in a project-specific way:
- The raw project should have one schema per data source, prefixed with the ingestion tool if necessary (e.g., fivetran_salesforce, fivetran_shopify). This ensures that there will be no collisions of similarly or identically named tables (e.g., salesforce.users vs. shopify.users).
- The dbt-dev project should have schemas that align as closely as possible with dbt-prod, to make development and validation simple and close to production. This pattern is not always used when there are many developers, to prevent an explosion of schemas in this project. See the dbt documentation here.
- The dbt-prod project should have a schema for staging and then schemas that help with access control: staging should not be exposed but is important for testing and sometimes performance, while the other schemas (e.g. core, core_finance) should be exposed according to their contents (e.g. core to ANALYST_ROLE, core_finance to ANALYST_FINANCE_ROLE). This access can be managed with groups or within the BI tool; a sketch of the group-based option follows.
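For the dataset-level exposure described above, a hedged sketch (all names hypothetical) might look like the following. Note that the Job User grant stays at the project level, since query jobs are created against a project:

```hcl
# Finance analysts may read only the core_finance dataset in dbt-prod...
resource "google_bigquery_dataset_iam_member" "finance_viewer" {
  project    = "acme-dbt-prod"
  dataset_id = "core_finance"
  role       = "roles/bigquery.dataViewer"
  member     = "group:analyst-finance-role@example.com"
}

# ...but still need project-level permission to run query jobs.
resource "google_project_iam_member" "finance_job_user" {
  project = "acme-dbt-prod"
  role    = "roles/bigquery.jobUser"
  member  = "group:analyst-finance-role@example.com"
}
```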
Groups and role management
Groups are sets of users associated with roles that let you manage users at scale. In BigQuery access control, users, groups, and service accounts are granted predefined or custom roles that permit them to perform actions on resources. You can grant access at the following BigQuery resource levels:
- Organization or Google Cloud project level
- Dataset level
- Table or View level
When you create a group, you’ll need to assign BigQuery predefined and/or custom roles to it, and then assign users and/or service accounts to it.
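If you manage groups as code too, the Google Terraform provider exposes Cloud Identity resources for this. A hedged sketch, assuming a placeholder customer ID and emails (groups can equally be created in the Google Workspace admin console):

```hcl
# Create the ANALYST_ROLE group itself (the required
# discussion_forum label marks it as a standard Google group).
resource "google_cloud_identity_group" "analyst_role" {
  display_name = "ANALYST_ROLE"
  parent       = "customers/C0123abcd" # your Cloud Identity customer ID

  group_key {
    id = "analyst-role@example.com"
  }

  labels = {
    "cloudidentity.googleapis.com/groups.discussion_forum" = ""
  }
}

# Add a (hypothetical) analyst as a member of the group.
resource "google_cloud_identity_group_membership" "jane" {
  group = google_cloud_identity_group.analyst_role.id

  preferred_member_key {
    id = "jane@example.com"
  }

  roles {
    name = "MEMBER"
  }
}
```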
- LOADER_ROLE: this group should have the BigQuery User role on the raw project and include service accounts like Fivetran; it should rarely have human users associated with it.
In the case of multiple loaders, using groups might not be necessary and the roles can be assigned directly to the service accounts used for each loader. Each loader should have a single service account.
According to the principle of least privilege (PoLP), it is recommended to grant the most restricted access possible, which here is the BigQuery User predefined role. Some loaders might require more permissions, which is another reason assigning the right role directly to each loader’s service account is a recommended option.
- ANALYTICS_ENGINEER_ROLE: this group has the BigQuery Data Viewer role on the raw project, the BigQuery User and BigQuery Data Editor roles on the dbt-dev project, and the BigQuery Data Viewer role on the dbt-prod project (for developers). It includes all developer users and service accounts like dbt Cloud for running CI.
Why BigQuery Data Editor and BigQuery User? Data Editor carries broader permissions in a project: it can edit and delete data across the project’s datasets. The BigQuery User role, when applied to a project, provides the ability to run jobs, including queries, within the project. It also allows the creation of new datasets; the creator is granted the BigQuery Data Owner role on those new datasets.
Note: this is a first iteration of the roles for the ANALYTICS_ENGINEER_ROLE group. These permissions can hit limitations when you need further access to a dataset. For example, creating a row-level access policy on a table requires the BigQuery Data Owner role, so the ANALYTICS_ENGINEER_ROLE group and the Orchestrator Service Account would need to be granted that role to create the policy.
- Orchestrator Service Account: this service account has the BigQuery Data Viewer role on the raw project and the BigQuery User role on the dbt-prod project. It is used by orchestrators like dbt Cloud and should rarely have human users behind it.
- ANALYST_ROLE: this group has BigQuery Data Viewer role and BigQuery Job User role on the dbt-prod project.
- ANALYST_<dataset>_ROLE: this group has the BigQuery Data Viewer role on the dbt-prod.core_<dataset> dataset, plus the BigQuery Job User role on the dbt-prod project (job creation, including queries, is granted at the project level).
Default permissions should be granted at the project level first. When restricted access is required, a new group can be created with access at the dataset level.
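When even a dataset is too broad a grant, access can also be scoped to a single table or view, the third resource level listed earlier. A hedged sketch with hypothetical names:

```hcl
# Expose one production table to a narrowly-scoped group.
resource "google_bigquery_table_iam_member" "orders_viewer" {
  project    = "acme-dbt-prod"
  dataset_id = "core"
  table_id   = "orders"
  role       = "roles/bigquery.dataViewer"
  member     = "group:analyst-orders-role@example.com"
}
```

Keep in mind that per-table grants multiply quickly, which is exactly the management overhead the project-first default above is meant to avoid.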
Manage access and implement RBAC
The recommended approach to implementing RBAC is to configure it with Infrastructure as Code (IaC), most commonly with Terraform. Alternatively, custom roles can be designed and managed with Deployment Manager, Google Cloud’s own IaC tool. If IaC is not a good fit for your business, RBAC can also be implemented manually in the Google Cloud console or with gcloud. Both approaches can manage access from the organization level down to the table level.
It’s important to regularly review and update the access control plan to adapt to evolving data requirements and security needs. More granular access control (e.g., a group that needs access to a specific mart or table) requires additional management, and manual configuration can be time-consuming (e.g., creating a service account and role, then granting permissions). Overall, there’s no single best option for managing access control; best practice depends on the number of stakeholders and the maturity of the project.
Summary
- 6 stakeholder types: Data Loader, Data Engineer, Analytics Engineer, Orchestrator, Data Analyst, and BI tool
- 3 Projects: raw, dbt-dev, dbt-prod
- 4 User Groups: LOADER_ROLE, ANALYTICS_ENGINEER_ROLE, ANALYST_ROLE, ANALYST_<dataset>_ROLE
- 1 Service Account for the orchestration
Conclusion
Designing an access control plan for BigQuery is crucial for ensuring the security and efficiency of your data infrastructure. By following a 2-layer RBAC approach, assigning functional roles to groups of users or service accounts and object access roles to specific resources, you can create a manageable, transparent structure that fits your dynamic data architecture.
It’s important to regularly review and update the plan to adapt to evolving data requirements and security needs. By implementing RBAC in BigQuery, you can ensure that the right stakeholders have access to the right resources, while maintaining the principle of least privilege and minimizing the risk of data breaches or other security issues.
Datatonic, 5x Google Cloud Partner of the Year, is the leading cloud data + AI consultancy for the world’s most ambitious businesses, challenging traditions to deliver tangible innovation at the leading edge of Google Cloud. Learn about our experience with Google Cloud, and contact us to find out more.