Every day there is about three quintillion (a 3 followed by 18 zeroes) bytes of data created in this world, and only about 20% of it is structured in conventional data repositories and available to easily process. More and more often this data is stored in a Data Lake, which you could think of as a toy chest filled with random things. Companies can get the data in, but being able to do something useful with it, in a timely fashion, is very difficult. Solving that problem is what led me to Upsolver. Upsolver calls itself “the data lake engineering platform,“ combining data lake economics with database simplicity and speed through their proprietary technology and their visual SQL IDE. Their solution is entirely cloud-based, making use of data buckets such as S3 on AWS to store in the data lake, which keeps costs down from dedicated database instances while providing maximum flexibility.
I decided to give the Upsolver Community Edition a try. You create a short profile by either filling in the form or using a Google/Microsoft account and filling in what is left of the form. After I got in by entering my information, I got a brief survey about how I want to use Upsolver. Then I was presented with some options as to what I wanted to build, and where. I went with the Sandbox version.
From this point, there are a few options available:
I went with “Load raw data. Create Data Source” to work with their sample data (and also because the little red dot is what tells you what to click as part of the product walk-through).
From this point a series of screens continue to walk you through the process, with lots of documentation and help. The interface can feel a bit daunting at first, but a lot of really great metrics are presented that can instantly give you a sense of what is going on with your data — types of values, density, values over time, and more.
I’ve really got to compliment Upsolver on how this demo works. I started the “Create” process to create the data lake and map the data sources to it. The system then has a very slick walk-through that highlights areas of the screen, giving you specific instructions as to what to click on and what to enter to get to the finished example. Without typing any code at all and mostly just clicking on fields and functions, it generated this SQL statement for me. Note that you can also just type the SQL commands yourself, or edit the generated SQL.
You are able to preview the data output and make sure everything is the way you want it to be prior to finalizing the run. The last step lets you configure the time range you want to work with, either as absolute values you type in or with a visual slider.
I do have one complaint on the UI at this point, and it’s a similar complaint I have with a lot of modern UI’s: the notion of the hidden option. You have to hover over a specific area of the screen for an option to reveal itself. I find this to be counterintuitive as I stare at the screen trying to figure out where the ‘add’ or ‘delete’ button is, and it turns out that I had to hover near the item in question to see the icon.
We are now ready to perform queries on the data we processed into the data lake. Start by creating a Worksheet, which then allows you to generate a variety of queries and views to the available tables. As you can see in this example, we counted the orders per day, but the sample data only had one day’s worth of data.
The Upsolver Catalog is on the left and provides a pretty standardized interface for your schema. Click on any of those entities, and immediately get a tab open on the right that gives you a simple SELECT statement — essentially a quick way to browse every table.
Looking at the screenshot, note on the right that CSV download is always available from a result set. At this stage, you can just have fun doing reports on your data using standard SQL statements. It’s as simple as that.
Upsolver is incredibly easy to use, I would say that with the demos provided you should be able to get comfortable in a half day. The ease of use is quite stunning, considering what it is doing, I’d like to see it run against a truly massive set of data. The platform has extensibility via Python, works with all the big Cloud providers like AWS, Azure, and GCP, git integration, and more. Setting up reproducible environments and workflows is very simple and certainly a godsend for people looking for useful insight from their data. This is the easiest-to-use solution that I’ve run across in this space. It’s worth your consideration if you are dealing with large amounts of disparate data.