Have you ever dreamt of being a Pokémon trainer, exploring the Kanto region to capture and train incredible creatures? I certainly did when I first played Pokémon Red and Pokémon Blue on my Game Boy in the 90s.
It is amazing how popular the Pokémon video game series became over time and still is, a true rising star in the video game industry. Speaking of rising stars: DuckDB, an in-process SQL analytics engine, has become a rising star in the Data Engineering community over the past years.
Just as a Pokémon trainer selects the perfect team for battles, an experienced Data Engineer knows the importance of choosing the right tool for the job. With this article, I would like to show you why DuckDB makes a perfect tool in your toolbox.
DuckDB seamlessly integrates with larger databases, facilitating smooth transitions between in-memory and persistent storage solutions. It’s like having a versatile Pokémon that adapts to any challenge thrown its way. Need to analyze a small dataset quickly? DuckDB is your Machoke, ready to tackle the task with ease. Working with a larger dataset that requires more muscle? DuckDB can transform into a Machamp, seamlessly connecting to external databases for heavy-duty analysis.
DuckDB is easy to install, portable and open-source. It has a feature-rich SQL dialect and can import and export data in different formats like CSV, Parquet and JSON. It also integrates seamlessly with Pandas dataframes, which makes it a powerful data manipulation tool in your data wrangling scripts.
In the following chapters, we will use the Pokémon API to process data with DuckDB and showcase some of its powerful features.
The Pokémon API provides a RESTful API interface to JSON data related to the Pokémon video game franchise. Using this API, you can consume information on Pokémon, their moves, abilities, types, and more.
Since this article should primarily focus on DuckDB features, we will only use a subset of endpoints. To get an overview, let's explore the relevant endpoints using curl and jq. If you are not familiar with jq, it is a lightweight CLI JSON processor with a lot of features to filter, modify or simply format JSON in your terminal. jq can be installed via brew install jq if you are using macOS and Homebrew.
Get Pokémon
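The list endpoint returns a paginated response with count, next, previous and a results array of name/URL pairs. Here is a quick look at the first few entries (the limit parameter controls the page size):

```bash
curl -s "https://pokeapi.co/api/v2/pokemon?limit=151" | jq '.results[:3]'
```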
Get Pokémon details
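Each entry in results points to a detail endpoint with the full record for a single Pokémon, for example:

```bash
curl -s "https://pokeapi.co/api/v2/pokemon/pikachu" | jq '{id, name, height, weight}'
```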
Project setup
We start by creating a new Python project. For this, we create a new folder. Within this folder, we create a virtual environment with the built-in venv module:
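For example (the folder name is just a suggestion):

```bash
mkdir pokemon-duckdb && cd pokemon-duckdb
python3 -m venv .venv
source .venv/bin/activate
```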
With the last command, we also activated the virtual environment, meaning everything you execute in that terminal session will use the virtual environment's Python rather than your system-wide Python. This is crucial, as we want to keep the dependencies we will install next isolated within the project. The next step is to install all requirements:
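For the demos below, that boils down to DuckDB itself, requests for the UDF demo and Pandas for the dataframe integration:

```bash
pip install duckdb requests pandas
```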
From here we are ready to go. In the next chapters, we will look at some of the DuckDB features utilizing the Pokémon API. You can simply copy the code into a Python file in your project and execute it there.
Demo 1: Simple select
The first demo shows how simple it is to get started. After installing DuckDB with pip install duckdb you can directly import it and run SQL statements. No complicated database setup or other requirements. As advertised, this is your fast in-process analytical database.
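Here is the whole first demo:

```python
import duckdb

# Run a query against the default in-memory database and print the result
duckdb.sql("SELECT 42").show()
```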
Running this code, we get the expected output: a table with one column and one row containing the value 42, the answer to the ultimate question of life, the universe, and everything.
Demo 2: Read JSON from API
Things will escalate quickly now, so make sure you have your Pokéball ready. Usually, when fetching JSON data from an API, we would start with requests.get, implement some client logic and finally load the data into a database, or we would use another data wrangling framework like Pandas. The next example might surprise you with its pragmatism: we will use SQL to fetch the data and make it directly available as a table.
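Here is a minimal sketch. Reading directly from an https URL assumes a DuckDB version that autoloads the httpfs extension; the limit parameter simply restricts the result to the 151 original Kanto Pokémon:

```python
import duckdb

# Let DuckDB fetch and parse the JSON response of the list endpoint
duckdb.sql("""
    SELECT *
    FROM read_json_auto('https://pokeapi.co/api/v2/pokemon?limit=151')
""").show()
```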
With read_json_auto we use the simplest method of loading JSON data. It automatically configures the JSON reader and derives the column types from the data.
The code above gives us a table with four columns matching the respective response from the API, namely: count, next, previous and results, where results is a list of structs and each struct is a Pokémon with its name and the URL to fetch further details.
Flexibility is where DuckDB truly shines. DuckDB is your zippy Pikachu - and loading JSON data is just one example. You can also read data from CSV files or even Parquet files with Hive partitioning:
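```python
import duckdb

# Hypothetical local CSV file, schema is detected automatically
duckdb.sql("SELECT * FROM read_csv_auto('pokemon.csv')").show()

# Hypothetical Hive-partitioned Parquet files, e.g. data/generation=1/part-0.parquet
duckdb.sql("""
    SELECT *
    FROM read_parquet('data/*/*.parquet', hive_partitioning = true)
""").show()
```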
Or directly read data from a Pandas dataframe:
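```python
import duckdb
import pandas as pd

# A small hand-made dataframe just for illustration
df = pd.DataFrame({"name": ["bulbasaur", "charmander", "squirtle"]})

# DuckDB can reference the local dataframe by its variable name
duckdb.sql("SELECT * FROM df").show()
```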
But let’s stick with Pokémon for now.
Demo 3: Unnest
In the previous example, the results, i.e. the actual Pokémon, all end up within one row and column in the form of a list of structs. However, we want one row per Pokémon for further processing. We can use the unnest function for this purpose.
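A sketch of the query; the recursive := true flag not only unnests the list but also unpacks each struct into separate name and url columns:

```python
import duckdb

duckdb.sql("""
    SELECT unnest(results, recursive := true)
    FROM read_json_auto('https://pokeapi.co/api/v2/pokemon?limit=151')
""").show()
```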
Which gives us one row per Pokémon with its name and URL to fetch further details.
Demo 4: UDF to get details
Can we level up our Pikachu so that it evolves into Raichu? Sure thing! With the result from the previous example, we now have a column called url that holds a URL for each Pokémon, which we have to request in order to get more details. Functions like read_json_auto are table-level functions, so unfortunately we can't apply them to each row.
DuckDB integrates seamlessly into Python, which means there is a way to simply call a Python function for each of our url values in the column. This is similar to User Defined Functions (UDFs) that you might know from other database systems.
In order to fetch the data from the API, we define a Python function:
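```python
import requests


def get(url: str) -> str:
    # Fetch the given URL and return the raw JSON response as text
    return requests.get(url).text
```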
And then register it with DuckDB:
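```python
import duckdb
from duckdb.typing import VARCHAR

# Register the Python function as a scalar UDF named "get",
# taking one VARCHAR parameter and returning VARCHAR
duckdb.create_function("get", get, [VARCHAR], VARCHAR)
```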
When using create_function we need to give the UDF a name to reference it in our SQL script. Also, we need to pass the actual Python function to be called together with a list of parameter types and the return type.
We can then use it in SQL:
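```sql
SELECT pokemon.name AS name, json(get(pokemon.url)) AS details
FROM (
    SELECT unnest(results) AS pokemon
    FROM read_json_auto('https://pokeapi.co/api/v2/pokemon?limit=151')
)
```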
In other words: we call the Python function get for each pokemon.url from our unnested list of Pokémon. What we get is the JSON response as text, and we parse that with the json function to get a properly typed result.
This is the final code to try yourself:
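```python
import duckdb
import requests
from duckdb.typing import VARCHAR


def get(url: str) -> str:
    # Fetch the given URL and return the raw JSON response as text
    return requests.get(url).text


# Register the Python function as a scalar UDF named "get"
duckdb.create_function("get", get, [VARCHAR], VARCHAR)

sql = """
SELECT pokemon.name AS name, json(get(pokemon.url)) AS details
FROM (
    SELECT unnest(results) AS pokemon
    FROM read_json_auto('https://pokeapi.co/api/v2/pokemon?limit=151')
)
"""

duckdb.sql(sql).show()
```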
The execution might take a few seconds as we call the API for each of the Pokémon. The result is a table with the name and all details as JSON.
Demo 5: List comprehension
List comprehension in Python is one of my favourite features. You can also use list comprehension in DuckDB! Our goal is to further process the details, by extracting them into individual columns. But not all details, just the ID, the name, the height and the weight of the Pokémon.
We also want to reduce the abilities to a simple list, in a separate column, containing just the names of the abilities the Pokémon can use.
To make the SQL more readable, we will also use common table expressions (CTEs).
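One way to write this, reusing the get UDF from the previous demo; the CTE names other than pokemon_details are just illustrative:

```python
import duckdb
import requests
from duckdb.typing import VARCHAR


def get(url: str) -> str:
    # Fetch the given URL and return the raw JSON response as text
    return requests.get(url).text


duckdb.create_function("get", get, [VARCHAR], VARCHAR)

sql = """
WITH pokemon_list AS (
    SELECT unnest(results) AS pokemon
    FROM read_json_auto('https://pokeapi.co/api/v2/pokemon?limit=151')
),
raw_details AS (
    SELECT json(get(pokemon.url)) AS details
    FROM pokemon_list
),
pokemon_details AS (
    SELECT
        CAST(details->>'id' AS INTEGER)     AS id,
        details->>'name'                    AS name,
        CAST(details->>'height' AS INTEGER) AS height,
        CAST(details->>'weight' AS INTEGER) AS weight,
        CAST(details->'abilities' AS STRUCT(
            ability STRUCT(name VARCHAR, url VARCHAR),
            is_hidden BOOLEAN,
            slot INTEGER
        )[]) AS abilities
    FROM raw_details
)
SELECT
    id,
    name,
    height,
    weight,
    [a.ability.name FOR a IN abilities] AS abilities
FROM pokemon_details
"""

duckdb.sql(sql).show()
```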
As you can see, the CTE pokemon_details extracts the required details. But there is one more hidden feature in it: so far, abilities as part of the details column is of type JSON, but list comprehension requires an actual list type. With a statement along these lines:
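```sql
-- excerpt from the pokemon_details CTE above
CAST(details->'abilities' AS STRUCT(
    ability STRUCT(name VARCHAR, url VARCHAR),
    is_hidden BOOLEAN,
    slot INTEGER
)[]) AS abilities
```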
we essentially convert the type of abilities to a list of structs. And each struct contains another struct with the ability name and url for more details, as well as the is_hidden flag and the slot number.
Now that abilities is a list, we can apply list comprehension in SQL. This basically works the same as list comprehension in Python, so with the following SQL code:
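```sql
-- excerpt from the final SELECT above
[a.ability.name FOR a IN abilities] AS abilities
```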
we create a new list containing only the name of each ability that the Pokémon can use.
Demo 6: DuckDB to Pandas and back
By now it is clear that DuckDB is not only a portable analytical database but a versatile data manipulation tool.
At its core, DuckDB offers a seamless integration between SQL-based operations and other data processing tools like Pandas. This unique feature allows you to effortlessly switch between different technologies within your data processing scripts.
Instead of implementing data wrangling fully in SQL or within your Python script using typical libraries like Pandas or NumPy, you can switch between these environments as needed, without having to set up a complex database integration.
With our example so far, let’s assume we would like to perform further processing using Pandas. With DuckDB you can easily export the result of your SQL queries to a dataframe using the .df() function.
It also works the other way around: with DuckDB, you can directly query data from a Pandas dataframe!
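Assuming the query from the previous demo is still stored in the sql variable, exporting the result is a one-liner:

```python
# Export the SQL result into a Pandas dataframe for further processing
df = duckdb.sql(sql).df()
print(df.head())
```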
Running the code above, we get the SQL result as a Pandas dataframe with the ID, name, height, weight and abilities of each Pokémon. We can then apply further transformations in plain Pandas, for example a simple aggregation like the average weight per height:
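```python
# Example aggregation in plain Pandas: average weight per height
df_agg = df.groupby("height", as_index=False)["weight"].mean()
```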
And with the dataframe stored in the df_agg variable, you can just use it in SQL, which surprised me a lot when I saw it the first time.
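Like this:

```python
# Query the Pandas dataframe df_agg directly from SQL
duckdb.sql("SELECT * FROM df_agg ORDER BY height").show()
```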
That makes DuckDB a great tool in your data wrangling toolbox, since it does not add any glue code.
Demo 7: Persist and load data
Calling the API on every run is not the most efficient solution. Sure, it gives you the opportunity to grab a nice cup of coffee while waiting for results, but DuckDB can also persist data, for example by serializing it into a file on your filesystem.
Let’s extend the example above with the following code:
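```python
import duckdb

# Open (or create) the database file pokemon.db on disk
con = duckdb.connect("pokemon.db")

# Persist the dataframe from the previous demo; the table name "pokemon" is arbitrary
con.sql("CREATE OR REPLACE TABLE pokemon AS SELECT * FROM df")
con.close()
```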
We open a connection to a file called pokemon.db and persist the data.
In another script, you can then load the data and access the pre-processed data:
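```python
import duckdb

# Re-open the database file and read the persisted table
con = duckdb.connect("pokemon.db")
con.sql("SELECT * FROM pokemon").show()
con.close()
```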
Conclusion
DuckDB has emerged as a robust companion for data wrangling challenges, offering seamless integration with Pandas dataframes and advanced analytical SQL capabilities. Its lightweight nature and efficient performance make it well suited for analytical workloads, even in resource-constrained environments.