Talk to Airflow - Build an AI Agent Using PydanticAI and Gemini 2.0
Create an AI agent with PydanticAI to interact with Airflow DAGs
In the field of data warehousing, there’s a universal truth: managing data can be costly. Like a dragon guarding its treasure, each byte stored and each query executed demands its share of gold coins. But let me give you a magical spell to appease the dragon: burn data, not money!
In this article, we will unravel the arts of BigQuery sorcery, to reduce costs while increasing efficiency, and beyond. Join as we journey through the depths of cost optimization, where every byte is a precious coin.
BigQuery is not just a tool but a package of scalable compute and storage technologies, with fast network, everything managed by Google. At its core, BigQuery is a serverless Data Warehouse for analytical purposes and built-in features like Machine Learning (BigQuery ML). BigQuery separates storage and compute with Google’s Jupiter network in-between to utilize 1 Petabit/sec of total bisection bandwidth. The storage system is using Capacitor, a proprietary columnar storage format by Google for semi-structured data and the file system underneath is Colossus, the distributed file system by Google. The compute engine is based on Dremel and it uses Borg for cluster management, running thousands of Dremel jobs across cluster(s).
BigQuery is not just a tool but a package of scalable compute and storage technologies, with fast network, everything managed by Google
The following illustration shows the basic architecture of how BigQuery is structured:
Data can be stored in Colossus, however, it is also possible to create BigQuery tables on top of data stored in Google Cloud Storage. In that case, queries are still processed using the BigQuery compute infrastructure but read data from GCS instead. Such external tables
come with some disadvantages but in some cases it can be more cost efficient to have the data stored in GCS. Also, sometimes it is not about Big Data but simply reading data from existing CSV files that are somehow ingested to GCS. For simplicity it can also be benificial to use these kind of tables.
To utilize the full potential of BigQuery, the regular case is to store data in the BigQuery storage.
The main drivers for costs are storage and compute, Google is not charging you for other parts, like the network transfer in between storage and compute.
Storage costs you $0.02 per GB - $0.04 per GB for active and $0.01 per GB - $0.02 per GB for inactive data (which means not modified in the last 90 days). If you have a table or partition that is not modified for 90 consecutive days, it is considered long term storage, and the price of storage automatically drops by 50%. Discount is applied on a per-table, per-partition basis. Modification resets the 90-day counter.
BigQuery charges for data scanned and not the runtime of the query, also transfer from storage to compute cluster is not charged. Compute costs depend on the location, the costs for europe-west3
are $8.13 per TB for example.
This means:
We want to minimize the data to be scanned for each query!
When executing a query, BigQuery is estimating the data to be processed. After entering your query in the BigQuery Studio query editor, you can see the estimate on the top right.
If it says 1.27 GB like in the screenshot above and the query is processed in the location europe-west3
, the costs can be calculated like this:
The estimate is mostly a pessimistic calculation, often the optimizer is able to use cached results, materialized views or other techniques, so that the actual bytes billed are lower than the estimate. It is still a good practice to check this estimate in order to get a rough feeling of the impact of your work.
It is also possible to set a maximum for the bytes billed for your query. If your query exceeds the limit it will fail and create no costs at all. The setting can be changed by navigating to More -> Query settings -> Advanced options -> Maximum bytes billed.
Unfortunately up until now, it is not possible to set a default value per query. It is only possible to limit the bytes billed for each day per user per project or for all bytes billed combined per day for a project.
When you start using BigQuery for the first projects, you will most likely stick with the on-demand compute pricing model. With on-demand pricing, you will generally have access to up to 2000 concurrent slots, shared among all queries in a single project, which is more than enough in most cases. A slot is like a virtual CPU working on a unit of work of your query DAG.
When reaching a certain spending per month, it is worth looking into the capacity pricing model, which gives you more predictable costs.
To reduce the costs for storage but also compute, it is very important to always use the smallest datatype possible for your columns. You can easily estimate the costs for a certain amount of rows following this overview:
Type | Size |
---|---|
ARRAY |
Sum of the size of its elements |
BIGNUMERIC |
32 logical bytes |
BOOL |
1 logical byte |
BYTES |
2 logical bytes + logical bytes in the value |
DATE |
8 logical bytes |
DATETIME |
8 logical bytes |
FLOAT64 |
8 logical bytes |
GEOGRAPHY |
16 logical bytes + 24 logical bytes * vertices in the geo type |
INT64 |
8 logical bytes |
INTERVAL |
16 logical bytes |
JSON |
Logical bytes in UTF-8 encoding of the JSON string |
NUMERIC |
16 logical bytes |
STRING |
2 logical bytes + the UTF-8 encoded string size |
STRUCT |
0 logical bytes + the size of the contained fields |
TIME |
8 logical bytes |
TIMESTAMP |
8 logical bytes |
NULL
is calculated as 0 logical bytes
Example:
With this definition and the table of datatypes, it is possible to estimate the logical size of 100,000,000 rows:
Assuming we are running a SELECT *
on this table, it would cost us 5.78 GB / 1024 = 0.0056 TB * $8.13 = $0.05 in europe-west3
.
It is a good idea to make these calculations before designing your data model, not only to optimize the datatype usage but also to get an estimate of the costs for the project that you are working on.
In the realm of database design and management, data normalization and de-normalization are fundamental concepts aimed at optimizing data structures for efficient storage, retrieval, and manipulation. Traditionally, normalization has been hailed as a best practice, emphasizing the reduction of redundancy and the preservation of data integrity. However, in the context of BigQuery and other modern data warehouses, the dynamics shift, and de-normalization often emerges as the preferred approach.
In normalized databases, data is structured into multiple tables, each representing a distinct entity or concept, and linked through relationships such as one-to-one, one-to-many, or many-to-many. This approach adheres to the principles laid out by database normalization forms, such as the First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF), among others.
This comes with the advantages of reduction of redundancy, data integrity and consequently, less storage usage.
While data normalization holds merit in traditional relational databases, the paradigm shifts when dealing with modern analytics platforms like BigQuery. BigQuery is designed for handling massive volumes of data and performing complex analytical queries at scale. In this environment, the emphasis shifts from minimizing storage space to optimizing query performance.
In BigQuery, de-normalization emerges as a preferred strategy for several reasons:
Also, storage is much cheaper than compute and that means:
With pre-joined datasets, you exchange compute for storage resources!
While de-normalization is not a one-size-fits-all solution, it should be considered for cost and performance optimization. However, there are aspects that might lead to a different, cost-efficient design.
Especially when having small tables on the right side of the JOIN
, BigQuery utilizes Broadcast Joins to broadcast the full dataset of the table to each slot which processes the larger table. That way, normalization has no negative impact on performance. Actually, the opposite is the case and due to reduced data redundancy.
When BigQuery is not using the Broadcast Join, it uses the Hash Join approach. In this case, BigQuery uses hash and shuffle operations so that matching keys are processed in the same slot in order to perform a local join. However, compared to a Broadcast Join, this can be a an expensive operation as data needs to be moved.
If you find yourself in a situation where Hash Joins are being used, there are still ways to potentially improve performance. At least aim for defining the join columns as cluster columns. This colocates data in the same columnar file, reducing the impact of shuffling.
Ultimately, the best approach depends on the specifics of your data model and the size of the normalized tables. If redundancy can be reduced with a normalized structure while keeping the size of the JOIN
tables small, so that Broadcast Joins are used, this is the better solution than enforcing a de-normalized approach. For tables bigger than 10G however, this should be evaluated with concrete benchmarks, which leads to the golden rule:
Benchmarking is key! Don’t rely solely on theory.
Test different approaches (normalized, denormalized, nested/repeated) to find the most efficient solution for your specific use case.
Partitions divide a table into segments based on one specific column. The partition column can use one of 3 approaches:
🗂️ Integer range partitioning: Partition by integer column based on range with start, end and interval
⏰ Time-unit partitioning: Partition by date, timestamp or datetime column in table with hourly, daily, monthly or yearly granularity
⏱️ Ingestion time partitioning: Automatically assign partition when inserting data based on current time with a pseudocolumn named _PARTITIONTIME
It is up to you to define the partition column but it is highly recommend to choose this wisely as it can eliminate a lot of bytes processed / billed.
Example:
In the above example you can also see how to set the partition_expiration_days
option, which will remove partitions older than X days.
Clusters sort the data within each partition based on one ore more columns. When using cluster columns in your query filter, this technique will speed up the execution since BigQuery can determine which blocks to scan. This is especially recommended to use with high cardinality columns such as the title column in the following example.
You can define up to four cluster columns.
Example:
With data de-normalization often also duplication of information is introduced. This data redundancy adds additional storage and bytes to be processed in our queries. However, there is a way to have a de-normalized table design without redundancy using nested repeated columns.
A nested column uses the type struct
and combines certain attributes to one object. A nested repeated column is an array of struct
s stored for a single row in the table. For example: if you have a table storing one row per login of a user, together with the user ID and the registration country of that user, you would have redundancy in form of the ID and country per login for each user.
Instead of storing one row per login, with a nested repeated column you can store one single row per user and in a column of type ARRAY<STRUCT<...>>
you store an array of all logins of that user. The struct holds all attributes attached to the login, like the date and device. The following illustration visualizes this example:
Example:
The above example also shows the utilization of the require_partition_filter
which will prevent any queries without filtering on the partition column.
This data modelling technique can reduce the stored and processed bytes drastically. However, it is not the silver bullet for all de-normalization or data modeling cases. The major downside is: you can’t set cluster or partition columns on attributes of structs.
That means: in the example above, if a user would filter by login_device
a full table scan is necessary and we do not have the option to optimize this with clustering. This can be an issue especially if your table is used as a data source for third party software like Excel or PowerBI. In such cases, you should carefully evaluate if the benefit of removing redundancy with nested repeated columns compensates the lack of optimizations via clustering.
By defining a search index on one or multiple columns, BigQuery can use this to speed up queries using the SEARCH
function.
A search index can be created with the CREATE SEARCH INDEX
statement:
With ALL COLUMNS
the index is automatically created for all STRING
and JSON
columns. It is also possible to be more selective and add a list of column names instead. With the SEARCH
function, the index can be utilized to search within all or specific columns:
A new feature, which is in preview state by the time writing this article, is to also utilize the index for operators such as =
, IN
, LIKE
, and STARTS_WITH
. This can be very beneficial for data structures that are directly used by end users via third party tools like PowerBI or Excel to further increase speed and reduce costs for certain filter operations.
More information about this can be found in the official search index documentation.
BigQuery offers two billing models for storage: Standard and Physical Bytes Storage Billing. Choosing the right model depends on your data access patterns and compression capabilities.
The standard model is straightforward. You pay a set price per gigabyte of data, with a slight discount for data that hasn’t been modified in 90 days. This is simple to use and doesn’t require managing different storage categories. However, it can be more expensive if your data is highly compressed or if you don’t access it very often.
Physical Bytes Storage Billing takes a different approach. Instead of paying based on how much logical data you store, you pay based on the physical space it occupies on disk, regardless of how often you access it or how well it’s compressed. This model can be significantly cheaper for highly compressed data or data you don’t access frequently. However, it requires you to manage two separate storage classes: one for frequently accessed data and another for long-term storage, which can add complexity.
So, which model should you choose? Here’s a quick guide:
Choose the standard model if:
Choose PBSB if:
You can change the billing model in the advanced option for your datasets. You can also check the logical vs. physical bytes in the table details view, which makes it easier to decide for a model.
Since July 2023, BigQuery introduced unenforced Primary Key and Foreign Key constraints. Keep in mind that BigQuery is not a classical RDBMS, even though defining a data model with this feature might give you the feeling that it is.
If the keys are not enforced and this is not a relational database as we know it, what is the point? The answer is: the query optimizer may use this information to better optimize queries, namely with the concepts of Inner Join Elimination, Outer Join Elimination and Join Reordering.
Defining constraints is similar to other SQL dialects, just that you have to specify them as NOT ENFORCED
:
Copying data from one place to another is a typical part of our daily business as Data Engineers. Let’s assume the task is to copy data from a BigQuery dataset called bronze
to another dataset called silver
within a Google Cloud Platform project called project_x
. The naive approach would be a simple SQL query like:
Even though this allows for transformation, in many cases we simply want to copy data from one place to another. The bytes billed for the query above would essentially be the amount of data we have to read from the source. However, we can also get this for free with the following query:
Alternatively, the bq
CLI tool can be used to achieve the same result:
That way, you can copy data for 0 costs.
For data ingestion Google Cloud Storage is a pragmatic way to solve the task. No matter if it is a CSV file, ORC / Parquet files from a Hadoop ecosystem or any other source. Data can easily be uploaded and stored for low costs.
It is also possible to create BigQuery tables on top of data stored in GCS. These external tables still utilize the compute infrastructure from BigQuery but do not offer some of the features and performance.
Let’s assume we upload data from a partitioned Hive table using the ORC storage format. Uploading the data can be achieved using distcp
or simply by getting the data from HDFS first and then uploading it to GCS using one of the available CLI tools to interact with Cloud Storage.
Assuming we have a partition structure including one partition called month, the files might look like the following:
When we uploaded this data to GCS, an external table definition can be created like this:
It will derive the schema from the ORC files and even detect the partition column. The naive approach to move this data from GCS to BigQuery storage might now be, to create a table in BigQuery and then follow the pragmatic INSERT INTO ... SELECT FROM
approach.
However, similar to the previous example, the bytes billed would reflect the amount of data stored in gs://project_x/ingest/some_orc_table
. There is another way, which will achieve the same result but again for 0 costs using the LOAD DATA
SQL statement.
Using this statement, we directly get a BigQuery table with the data ingested, no need to create an external table first! Also this query comes at 0 costs. The OVERWRITE
is optional, since data can also be appended instead of overwriting the table on every run.
As you can see, also the partition columns can be specified. Even though no transformation can be applied, there is one major advantage: we can already define cluster columns. That way, we can create an efficient version of the target table for further downstream processing, for free!
In certain ETL or ELT scenarios, a typical workflow is to have a table partitioned by day and then replace specific partitions based on new data coming from a staging / ingestion table.
BigQuery offers the MERGE
statement but the naive approach is to first delete the affected partitions from the target table and then insert the data.
Deleting partitions in such a scenario can be achieved like this:
Even if day
is a partition column in both cases, this operation is connected to several costs. However, again there is an alternative solution that comes at 0 costs again:
With DROP TABLE
you can actually also just drop one single partition by appending the suffix $<partition_id>
.
Of course the above example is just dropping one partition. However, with the procedual language from BigQuery, we can easily execute the statement in a loop.
Or alternatively use Airflow and/or dbt to first select the partitions and then run a certain templated query in a loop.
However, getting the distinct partitions for a partitioned table can be done like the in the examples above, but this will still cause some costs even if we only read a single column. But yet again, there is a way to get this almost for free, which we will explore in the next chapter.
In the examples above, we used the following approach to get the distinct partitions of a partitioned BigQuery table:
This is how much the query cost me in an example use-case I worked on:
BigQuery maintains a lot of valuable metadata about tables, columns and partitions. This can be accessed via the INFORMATION_SCHEMA
. We can achieve the very same result, by simply using this metadata:
And comparing it with the same use-case as I mentioned above, this is how much the query cost:
As you can see, 149GB vs 10MB is a huge difference. With this method, you can get the distinct partitions even for huge tables at almost 0 costs.
When you start using BigQuery for the first projects, you will most likely stick with the on-demand compute pricing model. With on-demand pricing, you will generally have access to up to 2000 concurrent slots, shared among all queries in a single project. But even with capacity pricing, you will have a minimum of 100 slots.
With a lot of the daily ETL / ELT workload, these slots are actually not the limitation of the performance. You can simply check this yourself by navigating to BigQuery -> Administration -> Monitoring, select the correct location and change the Chart to Slot Usage under Chart Configuration. In a lot of cases you will be surprised how little slots you are actually using.
How does that relate to saving costs? Let’s assume you have a classic fact table or some table in general, which delivers certain KPIs. This table is then used for analysis / reporting in Looker, Excel, PowerBI or other tools.
Often these tools automatically generate queries to serve the report or dashboard with the necessary data. These generated queries might not be ideal, when it comes to applying BigQuery best practices. In other words, they might end up scanning more data than necessary which increases the bytes billed.
We can avoid this, by introducing a view layer on top of our fact tables. Serving tools with data from a view rather than the actual table is a very valuable best practice, as it gives you more flexibility when it comes to schema changes but it also gives the possibility to introduce calculated measures within the view without persisting the data.
Of course this might increase the CPU usage when these measures are used but on the other hand, it can drastically reduce the total size of the underlying table.
To illustrate this principle, take the following fact table as a basis:
The basic idea is to introduce a view for stakeholders accessing this data and extend it with calculated measures:
In this example we were able to avoid persisting two INT64
values. One of these uses 8 logical bytes. If our fact table has 1,000,000,000 rows this would mean we save:
This is not a huge amount of data, but it can mean that BigQuery has to scan 15 GB less data in certain situations. In practice, there can be calculated measures that might save you much more data to be scanned.
Forget hoarding every byte like a dragon guarding its treasure. Instead, learn to burn data through smart management and optimization. By embracing this fiery approach, you’ll transform BigQuery from a cost center to a powerful engine for data exploration, allowing you to burn data, not money!
CREATE TABLE ... COPY
or bq cp
commands to copy data between tables without incurring charges.LOAD DATA
statements to directly load data from Cloud Storage into BigQuery tables, again at no cost.DROP TABLE
with partition suffixes to efficiently remove specific partitions.INFORMATION_SCHEMA
to retrieve table metadata like distinct partition values, significantly reducing costs compared to traditional queries.There are more techniques to consider regarding data modeling for cost-optimized BigQuery usage I didn’t mention in this article.
For example: BigQuery offers smart tuning via materialized views, which means queries are automatically rewritten by the engine using materialized views when certain conditions are met. That way, common aggregations, JOINs, and other operations can be pre-computed in materialized views to significantly reduce costs. The best practice here is to apply this kind of optimization after the release of the project to stakeholders to first gather data on how they utilize your model. Then, try to identify common cost-intensive patterns in the queries and create materialized views for these parts. I might cover this aspect in a future article.
By adopting these strategies, you can unlock the true potential of BigQuery, transforming it into a cost-effective engine for data exploration and analysis. Remember, in the realm of BigQuery, it’s all about burning data, not money!
Feel free to share your experiences in the comments!