I first noticed LanceDB early in 2023, and my initial thought was that it might be an attractive fit in the Apache Iceberg ecosystem, or a general replacement for the Parquet columnar file format. I was thinking too small, though. LanceDB is an in-process, multimodal, serverless vector database written in Rust that is cloud-native and open source. It should not be confused with the Lance columnar format (as I was doing), which is also written in Rust and is the more apt comparison to Parquet; Lance is the storage format underlying LanceDB. It is also Apache Arrow compatible.
So, what we have is a SQL-compatible vector database that supports vectors, images, text, and video, with full-text search. It is also said to be very fast, though you'll want to run your own tests to see how it performs on your own stack.
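To make that concrete, here is a minimal sketch of what working with the recent LanceDB Python client looks like (assuming `pip install lancedb`); the table name, column names, and sample rows are my own illustration, not anything from LanceDB's defaults:

```python
import lancedb

# In-process and serverless: "connecting" just opens a local directory
db = lancedb.connect("/tmp/lancedb-demo")

# Create a table directly from Python dicts; "vector" holds the embeddings
table = db.create_table(
    "items",
    data=[
        {"vector": [0.9, 0.1], "text": "red apple"},
        {"vector": [0.1, 0.9], "text": "blue sky"},
    ],
    mode="overwrite",
)

# Nearest-neighbor search against the stored vectors
hits = table.search([0.85, 0.15]).limit(1).to_list()
print(hits[0]["text"])  # the row whose vector is closest to the query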
Why a Vector Database?
Two letters: AI. A vector database is at the core of the data repository used to train Large Language Models (LLMs). LanceDB has gone the extra mile to provide a GitHub repository with a few vector recipes, which you can find here. For example, you can store embeddings from a machine learning model and then use them to search for images with written descriptions.
The challenge here is the "curse of dimensionality": search can be fast or accurate, and improving one tends to cost the other. LanceDB went for speed first, then did a lot of tuning to improve accuracy. This brings us to "embeddings," which are high-dimensional floating-point vector representations of a query or of the data itself. You can embed almost anything using an appropriate embedding model or function. The position of an embedding in the vector space has semantic significance that depends on the type of model and training you are using. LanceDB supports both "explicit" and "implicit" data vectorization methods.
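Here is a toy illustration of my own (not from LanceDB) of why position in the vector space matters: semantically similar items get nearby vectors, so a similarity metric such as cosine similarity can rank them. The three-dimensional "embeddings" below are made up; a real model would produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Pretend a model mapped these words into a shared 3-dimensional space
embeddings = {
    "kitten": [0.90, 0.80, 0.10],
    "cat":    [0.85, 0.75, 0.15],
    "truck":  [0.10, 0.20, 0.90],
}

query = embeddings["kitten"]
ranked = sorted(
    embeddings,
    key=lambda word: cosine_similarity(query, embeddings[word]),
    reverse=True,
)
print(ranked)  # "cat" ranks above "truck" for the "kitten" query
```

The curse of dimensionality shows up when you scale this brute-force loop to millions of high-dimensional vectors, which is why vector databases rely on approximate indexes that trade a little accuracy for a lot of speed.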
At this stage, we’re getting into some deep water concerning how this all works, and it is beyond the scope of what I’m trying to convey in this blog. I’ll share an image from the LanceDB docs that illustrates how similar entries cluster within a vector system.
The LanceDB ecosystem provides all the latest and most commonly used tools for this space to make it as convenient as possible to get started.
- LangChain JS/TS
There is a great recent blog post, "Serverless Multi-Modal search application" by Ayush Chaurasia, that I highly recommend. He gives a quick walkthrough using Next.js, LanceDB, and Roboflow's CLIP inference API. If you'd like to go deeper, it is a focused deep dive with example code.
Obviously, LanceDB isn't a general-purpose database; it has a very specific use case, and from what I can tell, it's a solid solution for that use case: an extremely fast vector database built specifically for AI applications. I can envision some really useful scenarios, such as building your own LLM around product blogs and documentation so users can ask specific questions and get more tailored responses than they would by reading through tons of blogs and docs. This could be the beginning of a tide shift in how we provide information to the public.
Finally, LanceDB Cloud is coming; at the time of this writing, in October 2023, you can sign up to be notified about it at this link. I'm excited to give that a try and run some tests on ideas that have been bouncing around in my head since ChatGPT started grabbing everyone's attention.
You can read the other “What the heck” articles at these links:
What The Heck Is DuckDB? (I was pretty out front on this one.)
What the Heck Is Malloy? (I was out front on this one, too.)
What the Heck is PRQL? (slower, but also growing)
What the Heck is GlareDB? (growing quickly)
What the Heck is SeaTunnel? (interest is hot)