What the Heck is Proton?

Shawn Gordon
Dec 28, 2023

Introduction

This series of articles has been a lot of fun for me as I have learned about and explored new technology. It’s also fun to see what has been catching on since I first discovered it. My last article on Apache Paimon was incredibly popular, much to my surprise, but it seems I wasn’t the only one interested in what the heck it was. Thanks to that article, I ran across the open-source Apache 2.0 licensed project, Proton, sponsored by Timeplus. It’s a SQL database that accommodates both historical and streaming data. Written in C++ and powered by ClickHouse, the focus is on simplicity and performance. With a single executable, installation is simple.

A trend I’ve seen is that more and more real-time analytics applications are being built, but you don’t want to build them twice: once for streaming and once for the historical backfill. There are definite advantages to a single platform that can query in batch mode, in streaming mode, or in a hybrid mode where you join historical data to a stream of incoming data. It appears that Proton was built to do just that.

Proton Overview

In a nutshell, we have a ClickHouse database, and Timeplus has added streaming support on top of it. That gets you a Flink-like query engine and Kafka-like streaming storage alongside that ClickHouse database. So, what does that look like?

Proton Architecture

In the Proton architecture diagram, the dotted line marks where Proton comes in. I suggest reading through the architecture docs to get a good sense of what is possible.

To create a random stream of data and query it with Proton, we can do something like this:

-- Create a stream with random data.
CREATE RANDOM STREAM devices(device string default 'device'||to_string(rand()%4), temperature float default rand()%1000/10);

-- Run the long-running stream query.
SELECT device, count(*), min(temperature), max(temperature) FROM devices GROUP BY device;

┌─device──┬─count()─┬─min(temperature)─┬─max(temperature)─┐
│ device0 │    2256 │                0 │             99.6 │
│ device1 │    2260 │              0.1 │             99.7 │
│ device3 │    2259 │              0.3 │             99.9 │
│ device2 │    2225 │              0.2 │             99.8 │
└─────────┴─────────┴──────────────────┴──────────────────┘
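
To make the idea of a long-running stream query a bit more concrete, here is a minimal Python sketch of the kind of incremental, per-key aggregation such a query implies. It does not use Proton and is not a description of how Proton is implemented; `RunningStats`, `random_devices`, and `aggregate` are hypothetical names, and the generator only imitates the column defaults of the random stream above.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Aggregate state kept per device: count, min, and max."""
    count: int = 0
    min_temp: float = float("inf")
    max_temp: float = float("-inf")

    def update(self, temperature: float) -> None:
        self.count += 1
        self.min_temp = min(self.min_temp, temperature)
        self.max_temp = max(self.max_temp, temperature)


def random_devices(rng: random.Random):
    """Hypothetical stand-in for the random stream: yields (device, temperature) forever."""
    while True:
        device = f"device{rng.randrange(4)}"
        temperature = rng.randrange(1000) / 10
        yield device, temperature


def aggregate(events):
    """Fold events into per-device state, updating it in place as each event arrives."""
    stats = {}
    for device, temperature in events:
        stats.setdefault(device, RunningStats()).update(temperature)
    return stats


if __name__ == "__main__":
    rng = random.Random(42)
    # A real stream never ends; take a finite slice so the sketch terminates.
    sample = itertools.islice(random_devices(rng), 9000)
    for device, s in sorted(aggregate(sample).items()):
        print(device, s.count, s.min_temp, s.max_temp)
```

The point of the sketch is that the state per device stays small (three numbers) no matter how many events flow through, which is what lets a streaming GROUP BY keep emitting updated results without rescanning old data.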

Proton Features

Proton has many nifty features; one that struck me immediately was the ability to create a materialized view to save specific events in Proton. Borrowing from the documentation, let’s say you have a Kafka stream reporting web events, and you want to save the broken link reports so you can query them later, even with Kafka down or the events removed. It would look something like this:

create materialized view mv_broken_links as
select raw:requestedUrl as url,raw:method as method, raw:ipAddress as ip,
raw:response.statusCode as statusCode, domain(raw:headers.referrer) as referrer
from frontend_events where raw:response.statusCode<>'200';

Then, if you want to directly query the materialized view and make a bar chart from the data, it would look like this:

-- streaming query
select * from mv_broken_links;
-- historical query
select method, count() as cnt, bar(cnt,0,40,5) as bar from table(mv_broken_links)
group by method order by cnt desc;
┌─method─┬─cnt─┬─bar─┐
│ GET    │  25 │ ███ │
│ DELETE │  20 │ ██▌ │
│ HEAD   │  17 │ ██  │
│ POST   │  17 │ ██  │
│ PUT    │  17 │ ██  │
│ PATCH  │  17 │ ██  │
└────────┴─────┴─────┘
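
The difference between the continuously updated view and a later historical query over what it saved can be pictured with a small Python sketch. This is only an analogy under stated assumptions, not Proton's implementation: the class `BrokenLinkView`, its methods, the helper `text_bar`, and the sample events are all hypothetical. The view filters events as they arrive and keeps its own copy, so the later count still works even if the source is gone.

```python
from collections import Counter


class BrokenLinkView:
    """Hypothetical analogy for a materialized view that keeps non-200 events."""

    def __init__(self):
        self._rows = []  # the view's own storage, independent of the source

    def on_event(self, event: dict) -> None:
        """Called once per incoming event, like the view's continuous SELECT."""
        if event["statusCode"] != "200":
            self._rows.append({"url": event["url"], "method": event["method"]})

    def count_by_method(self):
        """Historical query over what was saved: (method, count) pairs, largest first."""
        return Counter(row["method"] for row in self._rows).most_common()


def text_bar(value: int, max_value: int, width: int) -> str:
    """Rough text bar scaled to `width` characters, whole blocks only."""
    filled = min(width, value * width // max_value)
    return "█" * filled


if __name__ == "__main__":
    view = BrokenLinkView()
    events = [
        {"url": "/a", "method": "GET", "statusCode": "404"},
        {"url": "/b", "method": "GET", "statusCode": "200"},
        {"url": "/c", "method": "POST", "statusCode": "500"},
        {"url": "/d", "method": "GET", "statusCode": "404"},
    ]
    for event in events:
        view.on_event(event)
    for method, cnt in view.count_by_method():
        print(method, cnt, text_bar(cnt, 40, 5))
```

Unlike the `bar` function in the query above, this helper draws whole blocks only, so it will not reproduce the half-block seen in the DELETE row.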

Some of this functionality reminds me of Upsolver, a company I worked at a few years ago.

Drivers are available for Java, Go, and Python. Pairing Proton with something like Redpanda would give you a minimal footprint for both streaming and historical data.

There are a lot of other features available, but this isn’t meant to be a tutorial; I just want to give a light explanation and draw attention to some features. The docs are concise and, overall, well written, certainly better than those of many open-source projects.

Summary

While I don’t personally need this kind of arrangement at the moment, I’ve certainly been at places and seen companies where this would be very, very cool to have. As cool as this guy?

Probably not, but then again, nothing is :). Frivolity aside, the Proton team has done an excellent job documenting the project and making it as simple to install and use as possible. I love these single-binary projects that don’t need a vast Java ecosystem with tons of dependencies. Make no mistake, though, Timeplus has a commercial version that gives you more capability than the stock Proton release. However, they seem to be very supportive of Proton and welcoming of the community.

Check out my other What the Heck is… articles at the links below:

What The Heck Is DuckDB?

What the Heck Is Malloy?

What the Heck is PRQL?

What the Heck is GlareDB?

What the Heck is SeaTunnel?

What the Heck is LanceDB?

What the heck is SDF?

What the Heck is Paimon?

Shawn Gordon

All things data, developer, sustainable energy enthusiast as well as prolific musician.