What the Heck is SDF?
Introduction
2023 has been quite a year for innovation, adoption, and competition. We saw HashiCorp generate significant outrage when they changed the licenses of their products to be less open sourceish. This quickly gave rise to forks of Terraform, such as OpenTofu. We’re also seeing new competition for established players like Fivetran and dbt, which you often see working together as a solution. The latter brings us to the subject of this blog post: the Rust-based alternative to dbt called SDF, which stands for Semantic Data Fabric. So, just what the heck is SDF?
SDF Overview
SDF has its roots in Meta. Given that it is written in Rust, it ships as a single tight binary with no packages to deal with. That means you are ready to go as soon as you download it. It is built on top of DataFusion, part of the Apache Arrow project. With SDF, you’ll just be working with YAML and SQL, which should be familiar to any data engineer.
Fundamentally, SDF is a compiler and build system that utilizes static analysis to examine SQL code. From that analysis, it builds a dependency graph that gives you a clearer view of your data assets. This makes it easier to discover potential problems and optimization opportunities ahead of time.
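To make the "static analysis to dependency graph" idea concrete, here is a toy sketch in Python. It is not how SDF works internally: SDF uses a real SQL compiler, while this uses a naive regular expression that only spots table names after FROM and JOIN. The model names and SQL are hypothetical, made up for illustration.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models (name -> SQL text); not taken from SDF or its docs.
MODELS = {
    "raw_orders": "SELECT 1 AS order_id, 10.0 AS amount",
    "raw_customers": "SELECT 1 AS customer_id",
    "orders_enriched": (
        "SELECT o.order_id, o.amount, c.customer_id "
        "FROM raw_orders o JOIN raw_customers c ON o.order_id = c.customer_id"
    ),
    "revenue": "SELECT SUM(amount) AS total FROM orders_enriched",
}

# Naive stand-in for a SQL parser: grab the identifier after FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def dependencies(sql: str) -> set:
    """Return the table names a query reads from, without running it."""
    return set(TABLE_REF.findall(sql))


def build_graph(models: dict) -> dict:
    """Map each model to the models it depends on, failing on unknown tables."""
    graph = {}
    for name, sql in models.items():
        refs = dependencies(sql)
        unknown = refs - set(models)
        if unknown:
            # Caught before any query executes, which is the point of static analysis.
            raise ValueError(f"{name} references unknown tables: {sorted(unknown)}")
        graph[name] = refs
    return graph


def build_order(models: dict) -> list:
    """Return an order in which every model comes after its dependencies."""
    return list(TopologicalSorter(build_graph(models)).static_order())
```

Calling build_order(MODELS) yields an order in which raw_orders and raw_customers come before orders_enriched, which comes before revenue. A typo in a table name surfaces as an error up front rather than halfway through a run.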
One of the nifty features of SDF is the ability to annotate your SQL sources with metadata ranging from simple types and classifiers (such as PII) to table visibility and privacy policies. Rules that are evaluated against this metadata are what SDF calls Checks, and borrowing from their documentation, here are some examples of the types of Checks you can perform:
- Check Data Privacy: Ensure all personally identifiable information (PII) is appropriately anonymized.
- Check Data Ownership: Guarantee every table has an owner (a staple of GDPR).
- Check Data Quality: Prevent different currency types from combining in calculations (e.g., preventing £ + $).
[Image: SDF code checks]
The resulting information schema is a fascinating extension of what I’m accustomed to. You can use SQL to explore the information schema and even write your own Checks that can be integrated into your workflows.
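To give a feel for what "a Check written in SQL against metadata" can look like, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine. Everything about the schema is an assumption for illustration: the tables column_classifiers and column_lineage and their columns are hypothetical and are not SDF's actual information schema. The check follows the "not-exists" idea that appears in the workspace example later in this post: it passes when the query returns no rows.

```python
import sqlite3

# Hypothetical check: flag any derived column whose inputs carry more than
# one distinct CURRENCY label. Table and column names are illustrative only.
CURRENCY_MISMATCH_CHECK = """
SELECT l.to_table, l.to_column, COUNT(DISTINCT c.classifier) AS currencies
FROM column_lineage AS l
JOIN column_classifiers AS c
  ON c.table_name = l.from_table AND c.column_name = l.from_column
WHERE c.classifier LIKE 'CURRENCY.%'
GROUP BY l.to_table, l.to_column
HAVING COUNT(DISTINCT c.classifier) > 1
"""


def make_catalog() -> sqlite3.Connection:
    """Build a tiny in-memory metadata catalog with made-up contents."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE column_classifiers (table_name TEXT, column_name TEXT, classifier TEXT);
        CREATE TABLE column_lineage (to_table TEXT, to_column TEXT,
                                     from_table TEXT, from_column TEXT);
    """)
    conn.executemany("INSERT INTO column_classifiers VALUES (?, ?, ?)", [
        ("lineitem", "l_price", "CURRENCY.usd"),
        ("orders_ca", "price", "CURRENCY.cad"),
    ])
    conn.executemany("INSERT INTO column_lineage VALUES (?, ?, ?, ?)", [
        ("report", "usd_total", "lineitem", "l_price"),  # usd only: fine
        ("report", "total", "lineitem", "l_price"),      # usd ...
        ("report", "total", "orders_ca", "price"),       # ... plus cad: mismatch
    ])
    return conn


def find_currency_mismatches(conn: sqlite3.Connection) -> list:
    """Return offending (table, column, currency_count) rows; empty means the check passes."""
    return conn.execute(CURRENCY_MISMATCH_CHECK).fetchall()
```

With this sample catalog, find_currency_mismatches(make_catalog()) reports report.total, because it is derived from both a usd and a cad column, while report.usd_total is left alone. The appeal of this style is that the rule is ordinary SQL over metadata, so it can run before any data is touched.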
Additional Features
The SDF engine comprises a multi-dialect SQL compiler, a static analyzer, a dependency manager, and a build cache. The following images illustrate the deploy, describe, and lineage commands.
That is all from their CLI, but a cloud version also provides an excellent interactive interface to explore the catalog and lineage. You can easily zoom in and out, click on a node, and drill down. Here is a zoomed-out image:
As previously mentioned, the metadata and configurations for SDF are written in YAML, and together they make up an SDF Workspace. While YAML is often used for configurations, SDF takes it further by allowing you to specify data asset definitions in the same files, using what they call definition blocks. These blocks can be used to define or enrich tables, functions, classifiers, etc. Once again, borrowing from their docs, here is an example SDF YAML workspace.
# - - - - - - - - - - - #
#   An SDF Workspace    #
# - - - - - - - - - - - #
workspace:
  edition: "1.1"
  name: "Example SDF Workspace"
  dialect: presto
  includes:
    - path: sources/ # Folder with SQL Sources
  code-checks:
    - name: No_Currency_Mismatch
      description: Never mix Currencies
      assert: not-exists
      path: checks/no_currency_mismatch.sql
---
classifier: # declaring a classifier
  name: CURRENCY
  labels:
    - name: usd
    - name: cad
---
table: # Attaching metadata to the lineitem table
  name: lineitem
  description: Contains sale & purchase details
  columns:
    - name: l_price
      classifiers:
        - CURRENCY.usd
The classifiers provide a convenient mechanism to apply tags to your columns and to their children (the columns derived from them), throughout a project. Between those and the available functions, you have a high degree of control and visibility over your data models. The compilation step and useful error messages make for a pretty comprehensive environment.
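The idea of a tag flowing from a column to its children can be sketched in a few lines of Python. This is only an illustration of the concept under my own assumptions, not SDF's propagation logic: the column names besides lineitem.l_price are hypothetical, and real lineage would come from the compiler rather than a hand-written dictionary.

```python
from collections import deque


def propagate(classifiers: dict, children: dict) -> dict:
    """Push every label on a column to all columns downstream of it.

    classifiers: column -> set of labels declared on that column
    children:    column -> list of columns derived directly from it
    """
    result = {col: set(labels) for col, labels in classifiers.items()}
    queue = deque(result)
    while queue:
        col = queue.popleft()
        for child in children.get(col, ()):
            labels = result.setdefault(child, set())
            before = len(labels)
            labels |= result[col]
            # Revisit a child only when it gained a label, so cycles terminate.
            if len(labels) > before:
                queue.append(child)
    return result


# Hypothetical lineage: l_price feeds orders.total, which feeds report.revenue.
declared = {"lineitem.l_price": {"CURRENCY.usd"}}
lineage = {
    "lineitem.l_price": ["orders.total"],
    "orders.total": ["report.revenue"],
}
tagged = propagate(declared, lineage)
```

Here the single declaration on lineitem.l_price ends up on orders.total and report.revenue as well, which is what makes a currency-mismatch style check possible several models downstream of where the tag was written.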
Summary
I really like how tight this project is; you can install it locally and try things out with minimal effort. It is a very different mindset than dbt, so that will be a source of resistance for some people. I’m unsure what this would look like in a large, complex environment where you want to deploy thousands of these with reusable modules. The Checks are a fascinating feature that might address that concern, however. It’s certainly a worthwhile project that will satisfy a lot of use cases. If you are frustrated with dbt, or looking to implement something like dbt, then SDF should likely also be on your list of things to try out.
You can read the other “What the heck” articles at these links:
What The Heck Is DuckDB? (I was pretty out front on this one.)
What the Heck Is Malloy? (I was out front on this one, too.)
What the Heck is PRQL? (slower, but also growing)
What the Heck is GlareDB? (growing quickly)
What the Heck is SeaTunnel? (interest is hot)
What the Heck is LanceDB? (growing quickly)