You’re standing in Ancient Egypt, around 250 years before the start of our modern calendar. As your eyes adjust to the warm light shining through high windows, you find yourself in the Great Library of Alexandria. What you see is overwhelming: shelves stretch in every direction, holding nearly half a million scrolls. The air is thick with the scent of papyrus and you’re searching for writings about astronomy. You hear the quite rustling of Callimachus, the library’s most celebrated scholar, working on the solution for your problem.
He revolutionized information retrieval by creating the Pinakes - the world’s first library catalog. Instead of organizing scrolls merely by author or title, he pioneered a system that categorized works by their subject matter and content, enabling scholars to discover related works they might never have found through simple alphabetical browsing.
Illustration of Callimachus, source: generated with DALL-E 3
Over 2,000 years later, we face a similar challenge. In the digital archives of Netflix’s early days, engineers worked on their own Alexandria-scale problem. Their movie recommendation system, built on simple rating matches, struggled to capture the essence of what makes films truly similar. A comedy about a wedding might share more DNA with a romantic drama than another comedy about sports, yet traditional categorization methods - much like organizing scrolls by their physical attributes - missed these subtle connections. This challenge of semantic understanding - capturing the true meaning and similarity between items - remains at the heart of modern search and recommendation systems.
Today, we stand at an interesting crossroad. The evolution of vector search solutions, due to their prominent use in Retrieval-Augmented Generation (RAG) systems, promises to solve these semantic matching problems, but with a catch: most solutions demand complex infrastructure, significant resources, and careful maintenance. However, for some cases, there is a pragmatic answer to this complexity: DuckDB’s Vector Similarity Search (VSS) extension.
In this guide, we’ll build a movie recommendation engine that could have solved Netflix’s early challenges, using modern tools that fit in your laptop’s memory. By combining DuckDB’s VSS extension with Gemini’s embedding capabilities, we’ll create a system that understands the essence of movies, not just their metadata. Whether you’re building the next big recommendation engine or simply want to understand vector search better, this practical journey will equip you with the knowledge to tackle semantic search challenges in your own projects.
Text to Numbers: Understanding Embeddings
Before we dive into similarity search, let’s understand how we convert movie descriptions into numbers that computers can understand. This is where embeddings come in.
Embeddings work by converting text, image, and video into arrays of floating point numbers, called vectors. These vectors are designed to capture the meaning of the text, images, and videos. The length of the embedding array is called the vector’s dimensionality. For example, one passage of text might be represented by a vector containing hundreds of dimensions.
Vector Similarity Search
Once we have these numerical vectors, we need ways to measure how close or similar they are.
DuckDB introduced the ARRAY data type in v0.10.0, which stores fixed-sized lists, to complement the variable-size LIST data type.
DuckDB’s VSS extension then added support for Hierarchical Navigable Small Worlds (HNSW) indexes to accelerate vector similarity search.
HNSW: Understanding the Small World Network
Let’s get an idea how the HNSW index works. Imagine you’re in New York City trying to find a World of Warcraft player who also teaches quantum physics. Here’s how different search approaches would work:
Strategy 1: Brute Force
Stop every single person in NYC and ask if they match your criteria.
Time: Months
Accuracy: 100%
Efficiency: Checking millions unnecessarily
Strategy 2: HNSW (Smart Hierarchical Search)
Think of it like a cleverly organized social network with different levels of connection:
Level 3 (Top Level) - Global Connections
Like knowing the leaders of major gaming communities and university physics departments
Quick, broad reach: “Here are the main institutions and gaming hubs to check”
Level 2 (Mid Level) - District Connections
Like knowing local WoW guild leaders and physics department heads
More focused: “These three physics departments have active gaming communities”
Level 1 (Ground Level) - Local Connections
Direct knowledge of individual gamers and professors
Precise matching of candidates
HNSW visualization, source: generated with DALL-E 3
The search starts at the top level, quickly identifies promising areas, and drills down efficiently. For 1 million candidates, you’ll check about 20 instead of all million, while maintaining 95–99% accuracy.
Google Cloud Platform (GCP) service account with Vertex AI enabled
Setup
Fetch Movie Data
We fetch movie data from the TMDB API using httpx. Also, we allow to specify a minimum average voting score and count, to reduce the dataset to more famous movies.
Generate Embeddings
We use Gemini’s text-embedding-004 model to generate embeddings and set the vector dimensionality to 256.
Note: The dimension size (256) must match in both embedding generation and DuckDB table creation.
DuckDB VSS Setup
As a next step, we install and load the VSS extension in DuckDB, and enable persistence. This allows us to store the embeddings in a database file.
We then create the table, using the same dimensionality as before.
After inserting the embeddings, we create a HNSW index on the vector column in order to speed up vector similarity search.
We then prepare a function, that takes a movie description as input. This is the search query from the user. We also create an embedding vector based on this input. Finally, we use a DuckDB distance function to get similar movies.
With that, this is how the DuckDB VSS setup looks like:
We are not only returning the similarity search result, but also print the query plan with query.explain() to show that the HNSW index is actually used.
Using the Recommender
Putting everything together, this is the full example:
You can search for similar movies using natural language descriptions. For example:
The system will:
Convert your description to a 256-dimensional vector
Use the HNSW index to efficiently find similar movies
Return the top 3 closest matches
Movie Recommender demo, source: by author
Conclusion
Remember Callimachus and his Pinakes - sometimes the most elegant solutions are also the simplest. Often the tech industry’s answer to every problem is add more infrastructure. DuckDB VSS reminds us of a timeless truth: you don’t always need distributed systems to solve challenges effectively. Just as the ancient librarian created a groundbreaking system with fundamental principles, we too can tackle modern recommendation challenges with straightforward, efficient tools that get the job done.
Whether you’re building a movie recommendation engine or tackling other semantic search challenges, the principles and techniques demonstrated here provide a solid foundation for your projects. The power of vector similarity search, combined with the simplicity of DuckDB, opens up new possibilities for creating sophisticated search and recommendation systems or Retrieval-augmented generation (RAG) systems without the complexity of distributed architectures.
Effective AI implementation relies not only on the quantity of data but also on technical expertise. Let's explore the significance of having a skilled data team for AI projects, learn about the Small Data movement and examine how far we can go with no-code or low-code AI platforms.
Netflix Maestro and Apache Airflow - Competitors or Companions in Workflow Orchestration?
Explore how Netflix Maestro and Apache Airflow, two powerful workflow orchestration tools, can complement each other. Delve into their features, strengths, and use cases to uncover whether they are companions or competitors.