This post assumes you are familiar with data lake table formats; without that background, it may not make much sense.
Branching and tagging are important features of Apache Iceberg, released in February 2023 as part of version 1.2.0. They unlock a new world of use cases and are an exciting evolution of the Iceberg snapshot. Iceberg co-creator Ryan Blue published a blog post titled Tags and Branches to introduce the features in March 2023.
In this post, I will provide an overview of Iceberg snapshots, branches, and tags. Then I will look at how to approximate these behaviors in Apache Hudi and Delta Lake.
In a previous life, I was IT Director at a payroll and tax company, where we often needed to save the state of the data (like year-end) or backtest code for new regulations or features against existing data. We’d save that data to DAT tape for offsite storage. For scenario testing, I’d written some scripts that would extract data for an arbitrary list of clients and load it into the dev copy of the database. I used this same tool and method at several jobs for years. It wasn’t perfect, but it worked well, given the technology available at the time.
How would I have done it differently on a modern data lake, specifically using Apache Iceberg? Let’s explore that. But first, a little conceptual background. This blog assumes a basic understanding of Apache Iceberg.
Understanding Apache Iceberg snapshots
Apache Iceberg creates a snapshot, which is the state of a table at a specific point in time, whenever a table operation such as AppendFiles or RewriteFiles is committed. A snapshot is the union of all its associated manifests and their data files. Snapshots enable simple time-travel queries that let you read the state of a table at a particular point in time and see how it has changed over time.
Snapshots can be explicitly removed with the ExpireSnapshots operation. There are several methods provided that allow you to specify various parameters, such as expireOlderThan and expireSnapshotId.
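To make the idea concrete, here is a minimal Python sketch of a snapshot log with time travel and expiration. It is a toy in-memory model, not the Iceberg API: the names ToySnapshotLog, append_files, as_of, and expire_older_than are my own, chosen only to echo the operations described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: frozenset  # data files visible in this table state


class ToySnapshotLog:
    """Hypothetical in-memory stand-in for a table's snapshot history."""

    def __init__(self):
        self.snapshots = []  # ordered oldest -> newest
        self._next_id = 1

    def append_files(self, timestamp_ms, new_files):
        # Each commit yields a new snapshot: the previous files plus the new ones.
        previous = self.snapshots[-1].data_files if self.snapshots else frozenset()
        snap = Snapshot(self._next_id, timestamp_ms, previous | frozenset(new_files))
        self._next_id += 1
        self.snapshots.append(snap)
        return snap

    def as_of(self, timestamp_ms):
        # Time travel: the newest snapshot committed at or before the timestamp.
        candidates = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not candidates:
            raise ValueError("no snapshot at or before that time")
        return candidates[-1]

    def expire_older_than(self, timestamp_ms):
        # Drop snapshots older than the cutoff, but always keep the current one.
        if not self.snapshots:
            return
        current = self.snapshots[-1]
        self.snapshots = [
            s for s in self.snapshots
            if s is current or s.timestamp_ms >= timestamp_ms
        ]
```

Once a snapshot has been expired in this model, a time-travel read to that point in time fails, which is the trade-off that tags and branches are meant to manage.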
That should give you enough background for what follows, as tags and branches are more advanced snapshot lifecycle management methods.
Using snapshot tags for time travel in Iceberg
At its most basic, a tag is a named snapshot that you choose to keep longer than your normal retention period and that you can configure to expire automatically. Tags are immutable labels for snapshots and really open up the possible use cases. They are very similar to tags in Git. The RETAIN clause lets the tag, and the snapshot it protects, expire automatically instead of requiring manual cleanup.
Using the example of my payroll/tax company, we wanted to preserve the state of our data at a year-end record and make it easily accessible for five years. In Iceberg, we could have done it like this:
ALTER TABLE transactions CREATE TAG 2022_eoy RETAIN 1825 DAYS;
I could then query that tag whenever I needed to, like this:
SELECT * FROM transactions VERSION AS OF '2022_eoy';
It’s that simple! Some other ideas would be to tag a set of data you want to use to exercise a test suite whenever code gets changed or maybe to backtest a machine learning application. Having a named snapshot makes it very convenient to find and use data from a particular point in time.
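Here is a rough Python sketch of how a tag with a retention period interacts with snapshot expiration. It is a simplified model under my own assumptions, not Iceberg's actual expiration logic: the Tag class and the surviving_snapshots function are hypothetical, and the model ignores details such as always keeping the current snapshot.

```python
from dataclasses import dataclass

DAY_MS = 24 * 60 * 60 * 1000


@dataclass(frozen=True)
class Tag:
    snapshot_id: int
    created_ms: int
    retain_days: int

    def is_live(self, now_ms):
        # The tag protects its snapshot only while inside the retention window.
        return now_ms - self.created_ms <= self.retain_days * DAY_MS


def surviving_snapshots(snapshot_times, tags, now_ms, expire_older_than_ms):
    """Return the snapshot ids that survive an expiration run.

    snapshot_times maps snapshot id -> commit time in ms. A snapshot is kept
    if it is recent enough, or if a still-live tag points at it.
    """
    pinned = {t.snapshot_id for t in tags.values() if t.is_live(now_ms)}
    return sorted(
        sid for sid, ts in snapshot_times.items()
        if ts >= expire_older_than_ms or sid in pinned
    )
```

With a tag such as 2022_eoy retained for 1825 days, the tagged snapshot survives routine expiration for five years and then becomes eligible for cleanup like any other snapshot.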
How branches enhance tags in Iceberg
Branches are named references to a particular snapshot, either the current snapshot or a specific version chosen at creation. The difference between a tag and a branch can be confusing initially: a tag is a fixed name for one snapshot with its own expiration, while a branch is a named, movable reference to a snapshot that can also have its own expiration. Creating a branch does not copy any data. A branch can be queried and written to; each write creates a new snapshot on the branch, and those snapshots can in turn be tagged and branched. Branches allow you to work on different versions of a table in isolation.
These features enable a number of obvious use cases:
- Point-in-time retention for compliance purposes
- Isolating data for testing without impacting production workloads
- Robust data auditing with the Write-Audit-Publish (WAP) pattern, which lets you work with staged changes
You can make branches of branches and even tag those branches if you like. Branches, like tags, let you specify a retention value so they are deleted automatically after the retention period. Branches can be written to multiple times, allowing you to exercise your tests repeatedly before dropping the branch or merging the data back. Having named references to data also makes auditing via time-travel syntax significantly clearer.
How I could have used them
Year-end processing, where we printed the tax forms for client customers, took days. Because of the system design, it was the only work that could happen while it was running, so all the other processing would queue up. If we could have branched and tagged that year-end data, we could have run our new processing concurrently and still kept a year-end snapshot to use for any reports or reruns as needed.
For testing code changes, let’s say a new tax rate is coming out, and we need to do scenario runs with the new rates to show to the customers. I know this feature has to be completed in 30 days, so I will expire the branch after 45 days, just in case. Here I want to branch from the current snapshot, so I don’t need to specify a snapshot. I could create the branch like this:
ALTER TABLE transactions CREATE BRANCH new_tax_test RETAIN 45 DAYS;
I can then perform actions against that branch by referencing with the same time-travel syntax:
SELECT * FROM transactions VERSION AS OF 'new_tax_test';
I can then insert data into that branch, which creates a new snapshot. In Spark SQL, the branch is addressed by appending branch_<name> to the table identifier:
INSERT INTO transactions.branch_new_tax_test VALUES (1, 'a'), (2, 'b');
If I needed a different version of the data, I could replace the branch so that it points to a different snapshot, without changing the code that references it. The retention time can also be changed when the branch is replaced. When I’m all done, I can remove the branch, either by waiting for the retention to expire or explicitly, like this:
ALTER TABLE transactions DROP BRANCH new_tax_test;
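The branch lifecycle above (create, write, read, drop) can be sketched with a small Python toy model. This is only an illustration of the idea that a branch is a named pointer into a chain of snapshots. It is not how Iceberg is implemented, and ToyBranchTable and its methods are names I made up for this example.

```python
class ToyBranchTable:
    """Hypothetical model: snapshots form a parent chain, refs are named pointers."""

    def __init__(self):
        self.snapshots = {}         # snapshot id -> (parent id, rows added)
        self.refs = {"main": None}  # branch name -> head snapshot id
        self._next_id = 1

    def create_branch(self, name, from_branch="main"):
        # A new branch is just another pointer to an existing snapshot.
        self.refs[name] = self.refs[from_branch]

    def insert(self, rows, branch="main"):
        # A write creates a snapshot whose parent is the branch head,
        # then moves only that branch's head.
        sid = self._next_id
        self._next_id += 1
        self.snapshots[sid] = (self.refs[branch], list(rows))
        self.refs[branch] = sid
        return sid

    def read(self, branch="main"):
        # Walk the parent chain from the branch head to collect visible rows.
        rows, sid = [], self.refs[branch]
        while sid is not None:
            parent, added = self.snapshots[sid]
            rows = added + rows
            sid = parent
        return rows

    def drop_branch(self, name):
        if name == "main":
            raise ValueError("cannot drop main in this toy model")
        del self.refs[name]
```

Writes to new_tax_test in this model never move the main pointer, which is the isolation property that makes scenario testing safe.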
Does Apache Hudi or Delta Lake support tags and branches?
Great question; let’s look at how we would do something similar in those systems.
Instead of Iceberg-style snapshots, Apache Hudi records all table actions in an event log called the Timeline. Every transaction on a Hudi table creates a new entry. This log lets you read the table as of a point in time, or over a range, by specifying begin and end times in the read options. This is similar to an Iceberg snapshot.
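As a rough illustration of reading from a timeline by begin and end time, here is a small Python sketch. It is a toy under my own assumptions (fixed-width timestamp strings as instant times, an exclusive begin and an inclusive end, and a made-up read_between function), not Hudi's reader.

```python
def read_between(timeline, begin_time, end_time):
    """Return records from commits with begin_time < instant time <= end_time.

    timeline is a list of (instant_time, action, records) tuples. Instant
    times are fixed-width timestamp strings, so string comparison matches
    chronological order.
    """
    records = []
    for instant_time, action, payload in sorted(timeline, key=lambda e: e[0]):
        # Only commits contribute data; other actions (e.g. cleaning) are skipped.
        if action == "commit" and begin_time < instant_time <= end_time:
            records.extend(payload)
    return records
```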
That said, Hudi has a savepoint operation that can be used to preserve the table state at a specific commit, but it can only be executed from the Hudi command line interface (CLI). A view of the snapshot can be named, similar to an Iceberg branch. To read that savepoint, you must restore it via the Hudi CLI. During the restore, you must bring down all your writer processes (according to the Hudi docs).
There doesn’t appear to be a concept of automatic expiration of snapshots; the only option I’m aware of is again through the Hudi CLI and executing a savepoint delete to remove a savepoint.
The closest option I could find for branching was the HoodieSnapshotExporter, which writes a copy of the selected data to a target output path in Hudi, Parquet, or JSON format. You would then need to connect to the new location to perform any queries. This option could also satisfy my example of wanting to save a backup of data at a particular point in time.
Delta Lake also keeps a transaction log alongside its data files, like Hudi. Log files are retained for 30 days by default (data files are never deleted automatically; they must be removed explicitly with a VACUUM action). The log retention can be configured with the table property delta.logRetentionDuration, which takes an interval:
delta.logRetentionDuration = "interval <interval>"
You can time travel to any version for which both the log and data files are still available. This is analogous to the notion of a snapshot in Iceberg. Delta lets you time travel either by timestamp or by version. At first glance, the version seems like it could be a named checkpoint, but it is just the commit's version number, which is also the name of its log file. These values come from the output of a DESCRIBE HISTORY table_spec command, so you need to either look them up or already know them. The two forms look like this:
SELECT * FROM transactions TIMESTAMP AS OF '2023-07-01';
SELECT * FROM delta.`/tmp/delta/transactions` VERSION AS OF 1023;
Time travel over months or years is generally not practical with Delta, because retaining and replaying that much log history to compute the table state does not scale well. In addition, time travel requires both the data files and the logs to exist, and a VACUUM removes data files that are no longer referenced and are older than the retention threshold.
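To show why both the log and the data files must exist, here is a simplified Python sketch of computing table state by replaying a transaction log. It is a toy model under my own assumptions: real Delta tables use checkpoints rather than replaying from version 0, and files_at_version and can_time_travel are hypothetical names.

```python
def files_at_version(log, version):
    """Replay add/remove actions from version 0 through `version`.

    log maps version number -> list of (action, path) tuples.
    """
    live = set()
    for v in range(version + 1):
        if v not in log:
            # The log entry has aged out of retention, so the state is unknown.
            raise FileNotFoundError(f"log entry for version {v} is not available")
        for action, path in log[v]:
            if action == "add":
                live.add(path)
            elif action == "remove":
                live.discard(path)
    return live


def can_time_travel(log, version, files_on_disk):
    """A version is readable only if its log replays and every live file exists."""
    try:
        return files_at_version(log, version) <= set(files_on_disk)
    except FileNotFoundError:
        return False
```

In this model, vacuuming away a file that an older version still needs makes that version unreadable, even though its log entries are intact.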
As for branching, you can get close to it using the Delta CLONE operation. This creates a copy of an existing Delta Lake table at a specific version. Two types of clones are available: a deep clone and a shallow clone. A deep clone copies both the data and the metadata of an existing table, while a shallow clone copies only the table metadata and keeps referencing the source table's data files.
I’ve just scratched the surface of what is possible with these features. For example, you can build more elaborate scenarios, such as tagging snapshots on a branch. These can be used to make time travel even more straightforward, facilitate machine learning training, and aid in data retention and removal for regulatory compliance (HIPAA, GDPR, CCPA, etc.). We also saw that these capabilities appear to be more involved to implement in Hudi and Delta, but granted, I’m not an expert in those formats; this is what I gleaned from reviewing the docs and blogs.
What ideas can you come up with to help solve your data challenges?
Make sure to review the docs for each format on the topic for specifics.