You can think of a database as a collection of tables (or dataframes). In general, the benefit of databases is that they can handle very large amounts of data - data that would not fit into your computer's RAM. As such, most databases are stored on a hard disk, from which data is read and to which it is written.
There are many different forms of databases, and several so-called database management systems (DBMS) that allow us to interact with each of them. Unhelpfully, each flavor of database speaks a slightly different dialect of what is called the structured query language, SQL for short. Helpfully, on the other hand, there are several packages out there that keep the effort manageable by translating our queries for us. We will see a few of those here, in different driving languages (i.e., R, Python and Julia).
One final important feature of different databases is how (and where) they actually run. The more traditional solutions follow a client-server model: on some machine (it could be your own) you start a database server, which interacts with persistent storage (e.g. your hard disk) to read and write data, and alongside it a client process, in which you formulate so-called queries and receive the results from the server. In some sense you formulate a question for the database in the client, submit it to the server, and get back an answer. Of course those server processes can run on dedicated, powerful machines that you reach via a network, or directly in the cloud, where they are used to work with arbitrarily huge amounts of data.
We will focus on a relatively recent addition to the database family called duckdb. The main benefit (beyond its excellent performance) is that it is very easy to run a database without the need to set up the client-server structure. We will come to that in due course.
What is SQL?
As already mentioned, this is the main database language. Here is an example:
SELECT carat, cut, clarity, color, price FROM diamonds WHERE price > 15000
This reads:
From the table called diamonds take the columns carat, cut, clarity, color, price, but keep only rows where the column price is larger than the value 15000
Not too bad, right?
Let’s get this example running now on our computers. We need to create a duckdb database first, read some data into it, and then confront it with that SQL query. For simplicity we will do this only in R to get us started; then we will move on to a more elaborate big-data example, which will use R, Python and Julia.
Simple Intro Example with R
library(DBI)
library(duckdb)
library(tibble)
con = DBI::dbConnect(duckdb::duckdb(), shutdown = TRUE) # will get erased after shutdown
# con = DBI::dbConnect(duckdb::duckdb(), dbdir = "db.duck") # would save on disk
The con object is our live connection to the database. Think of it like a telephone receiver - you can talk into it, and it will relay results back to you. As such, we have to refer to con in everything we want to do with the database.
Let us first read some data from the ggplot2 package, and add it to our database. Remember, a database is a collection of tables, and now we are adding two of those:
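Here is a minimal sketch of that step using DBI::dbWriteTable (the exact call used originally may differ slightly):

# copy the diamonds and mpg data frames from the ggplot2 package into the database as tables
dbWriteTable(con, "diamonds", ggplot2::diamonds)
dbWriteTable(con, "mpg", ggplot2::mpg)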
Ok, let’s see what tables we have in the database now:
dbListTables(con)
[1] "diamonds" "mpg"
Great. Now how do we get data out of it? Here goes the mpg dataset:
con |> dbReadTable("mpg") |> as_tibble()
# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
Back to our SQL query from above. We can submit it to the database listening at con:
query1 ="SELECT carat, cut, clarity, color, price FROM diamonds WHERE price > 15000"head(dbGetQuery(con, query1))
carat cut clarity color price
1 1.54 Premium VS2 E 15002
2 1.19 Ideal VVS1 F 15005
3 2.10 Premium SI1 I 15007
4 1.69 Ideal SI1 D 15011
5 1.50 Very Good VVS2 G 15013
6 1.73 Very Good VS1 G 15014
or of course as before with a pipeline such as
con |> dbGetQuery(query1) |> as_tibble()
# A tibble: 1,655 × 5
carat cut clarity color price
<dbl> <fct> <fct> <fct> <int>
1 1.54 Premium VS2 E 15002
2 1.19 Ideal VVS1 F 15005
3 2.1 Premium SI1 I 15007
4 1.69 Ideal SI1 D 15011
5 1.5 Very Good VVS2 G 15013
6 1.73 Very Good VS1 G 15014
7 2.02 Premium SI2 G 15014
8 2.05 Very Good SI2 F 15017
9 1.5 Very Good VS1 F 15022
10 1.82 Very Good SI1 G 15025
# ℹ 1,645 more rows
That’s it for starters. There is now a whole world out there to learn in terms of SQL queries, database joins, pivots (i.e. reshape operations) etc. I leave you in the hands of Hadley Wickham and his dedicated chapter on this topic if you want to dive deeper. It will be almost unavoidable to learn some simple SQL, but hopefully it won’t be so bad. You will see in the associated pages of this tutorial that there are some very nice translation tools for R. Grant’s intro to databases is another valuable resource here. Now, however, let’s get cracking with the real show, i.e. our big data set on NYC taxi rides.
# setup python from R with reticulate package
library(reticulate)
Sys.setenv(RETICULATE_PYTHON = "/Users/floswald/.pyenv/versions/3.13.5/bin/python")
import duckdb  # needed for the connection below

con = duckdb.connect(database = ':memory:', read_only = False)
Digression: In-memory versus on-disk
The fact that our connection lives “in memory” is a killer feature of DuckDB (one that it inherits from SQLite). We don’t need to connect to some complicated, existing database infrastructure to harness all of DuckDB’s power. Instead we can just spin up an ephemeral database that interacts directly with our R (or Python, or Julia, etc.) client.
However, it’s worth noting that you can create a persistent, disk-backed database simply by providing a database file path argument as part of your connection, e.g.
## Uncomment and run the next line if you'd like to create a persistent,
## disk-backed database instead.
# con = dbConnect(duckdb(), dbdir = "nyc.duck")
## Uncomment and run the next line if you'd like to create a persistent,
## disk-backed database instead.
# con = duckdb.connect(database = 'nyc.duck', read_only = False)
(Note that the ".duck" file extension above is optional. You could also use ".db", ".dbb", or whatever you want really.)
Important: Bigger-than-RAM data?
One really important benefit of creating a persistent disk-backed database is that it enables out-of-core computation for bigger than RAM data. See here for more details and performance considerations (which are still great).
Just How Big is This?
Before we get started with some analysis tasks, here’s a simple question to ask our database: how many rows are we dealing with here, actually?
nrows = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT COUNT(passenger_count) AS nrows
  "
)
# con.query(
#   '''
#   FROM 'nyc-taxi/**/*.parquet'
#   SELECT COUNT(passenger_count) AS nrows
#   '''
# )
Ok, well that is 570,009,660. 😬
First example
We’ll start with a simple aggregation query to get situated. I’ll also use this example to highlight some general features of DuckDB SQL and the underlying query engine.
Okay, first query. Let’s say we want to know: What is the average tip per passenger count? A typical SQL job for this question might look as follows:
SELECT
  passenger_count,
  AVG(tip_amount) AS mean_tip
FROM 'nyc-taxi/**/*.parquet'
GROUP BY passenger_count
ORDER BY passenger_count
(Where the last ORDER BY statement is optional. Note that ordering (i.e., sorting) is a potentially expensive operation but we’ll get back to that later.)
This is perfectly valid DuckDB SQL too. However, we can rewrite it with slightly nicer syntax thanks to DuckDB’s “friendly SQL”. The key changes for this simple query are going to be: (1) the FROM statement comes first, and (2) we can use the GROUP BY ALL and ORDER BY ALL statements to avoid writing out the “passenger_count” grouping column multiple times.2
FROM 'nyc-taxi/**/*.parquet'
SELECT
  passenger_count,
  AVG(tip_amount) AS mean_tip
GROUP BY ALL
ORDER BY ALL
Tip: DuckDB’s “friendly SQL”
One of the under-appreciated (IMHO) features of DuckDB is that it supports many syntax enhancements over traditional SQL dialects, which they collectively dub “friendly SQL”. Together these syntax enhancements allow you to write much more ergonomic SQL queries that cut down on duplication and logical inconsistencies.
To run this operation from our R or Python client, simply pass the SQL query as a string to our connection. Let’s use this as a chance to save the result and time our query too.
tic = Sys.time()
dat1 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  "
)
toc = Sys.time()
dat1
import time  # needed for the timing below

tic = time.time()
dat1 = (
  con
  .query(
    '''
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      AVG(tip_amount) AS mean_tip
    GROUP BY ALL
    ORDER BY ALL
    '''
  )
)
toc = time.time()
dat1
# print(f"Time difference of {toc - tic} seconds")## Timing will be misleading for this rendered Quarto doc, since we're calling## Python from R (via the reticulate package).
Note that we actually get a polars DataFrame as a return object. Click the callout box below to learn more.
Note: Result conversion in Python (click to expand)
By default, the con.query method that we are using here will return a polars DataFrame object that Python understands “natively” (i.e., has a print method for and so on). Behind the scenes, this duckdb to polars integration relies on the pyarrow library being available to our Python environment, which we have already installed for this workshop.
It’s also possible to return other types of Python objects. For example, you can use the .df() method to coerce to a pandas DataFrame instead, among several other formats like numpy arrays. (Details here.) Given the focus of this workshop, it won’t surprise you to hear that I’m going to stick with the default polars conversion.
So that only took 0.68 seconds in this rendered Quarto doc (and will likely be even faster when you try in an interactive session). To underscore just how crazy impressive this is, recall that this includes the time that it takes to read the data from disk. I can almost guarantee that the read + serialization time alone for traditional data wrangling workflows would take several minutes, and would probably exhaust my laptop’s RAM. Don’t forget that our full dataset is more than half a billion rows deep and 30 columns wide.
Aside: We clearly have a few outlier typos in our dataset. 254 passengers in a single taxi trip? I don’t think so. We’d probably want to filter these out with a WHERE statement if we were doing serious analysis, but I’m just going to leave them in for this tutorial.
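If we did want to clean this up, a hedged sketch of such a filter might look as follows (the cutoff of 9 passengers is purely illustrative and not part of the original analysis):

dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT passenger_count, AVG(tip_amount) AS mean_tip
  WHERE passenger_count <= 9   -- illustrative cutoff to drop data-entry typos
  GROUP BY ALL
  ORDER BY ALL
  "
)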
Aggregation
Let’s try out some more aggregation queries. How about a slight variation on our first example query, where we (a) add “month” as a second grouping variable, and (b) subset to only the first three months of the year.
tic = Sys.time()
dat2 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    month,
    passenger_count,
    AVG(tip_amount) AS mean_tip
  WHERE month::int <= 3
  GROUP BY ALL
  "
)
toc = Sys.time()
head(dat2)
tic = time.time()
dat2 = (
  con
  .query(
    '''
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      month,
      passenger_count,
      AVG(tip_amount) AS mean_tip
    WHERE month::int <= 3
    GROUP BY ALL
    '''
  )
)
toc = time.time()
dat2
# print(f"Time difference of {toc - tic} seconds")## Timing will be misleading for this rendered Quarto doc, since we're calling## Python from R (via the reticulate package).
Note that this query completed even faster than the first one, even though we added another grouping variable. Reason: Subsetting along our Hive-partitioned parquet dataset allows DuckDB to take shortcuts. We can see this directly by prepending an EXPLAIN statement to our query to reveal the optimized query plan.
dbGetQuery(
  con,
  "
  EXPLAIN
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    month,
    passenger_count,
    AVG(tip_amount) AS mean_tip
  WHERE month::int <= 3
  GROUP BY ALL
  "
)
tl;dr DuckDB is able to exploit the month partition of our dataset, so subsetting means that it can avoid unnecessary data ingestion. Similarly, it only reads in a select group of columns; that’s what the “PROJECTION” part of the plan denotes. If nothing else, the take-home message is that DuckDB only does what it needs to. Laziness as a virtue!
Here’s a final aggregation example, this time including a high-dimensional grouping column (i.e., “trip_distance”) and some additional aggregations.
tic = Sys.time()
dat3 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    trip_distance,
    AVG(tip_amount) AS mean_tip,
    AVG(fare_amount) AS mean_fare
  GROUP BY ALL
  "
)
toc = Sys.time()
nrow(dat3)
tic = time.time()
dat3 = (
  con
  .query(
    '''
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
    '''
  )
)
toc = time.time()
len(dat3)
# print(f"Time difference of {toc - tic} seconds")## Timing will be misleading for this rendered Quarto doc, since we're calling## Python from R (via the reticulate package).
Pivot (reshape)
Let’s explore some pivot (reshape) examples, by building off the previous query.
UNPIVOT: wide => long
PIVOT: long => wide
Here I’ll use a Common Table Expression (CTE) to define a temporary table tmp_table, before unpivoting—i.e., reshaping long—at the end.
dat_long = dbGetQuery(
  con,
  "
  WITH tmp_table AS (
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
  )
  UNPIVOT tmp_table
  ON mean_tip, mean_fare
  INTO
    NAME variable
    VALUE amount
  "
)
head(dat_long)
dat_long = (
  con
  .query(
    '''
    WITH tmp_table AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT
        passenger_count,
        trip_distance,
        AVG(tip_amount) AS mean_tip,
        AVG(fare_amount) AS mean_fare
      GROUP BY ALL
    )
    UNPIVOT tmp_table
    ON mean_tip, mean_fare
    INTO
      NAME variable
      VALUE amount
    '''
  )
)
dat_long
Another option would have been to create a new table in memory and then pivot over that, which segues nicely to…
Digression: Create new tables
CTEs are a very common, and often efficient, way to implement multi-table operations in SQL. But, for the record, we can create new tables in DuckDB’s memory cache pretty easily using the CREATE TABLE statement.
Instead of DBI::dbGetQuery, we must now use DBI::dbExecute.
dbExecute(
  con,
  "
  CREATE TABLE taxi2 AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
  "
)
[1] 34490
dbListTables(con)
[1] "taxi2"
FWIW, you can always remove a table with dbRemoveTable().
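For example, a usage sketch (left commented out here, since we still want taxi2 around for the next steps):

# dbRemoveTable(con, "taxi2")  # drops the taxi2 table from the database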
Instead of con.query(), we must now use con.execute().
con.execute(
  '''
  CREATE TABLE taxi2 AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
  '''
)
<_duckdb.DuckDBPyConnection object at 0x1265f5530>
# https://stackoverflow.com/q/75727685
con.query(
  '''
  SELECT table_name, estimated_size AS nrows, column_count AS ncols
  FROM duckdb_tables;
  '''
)
With our new taxi2 table in hand, let’s redo the previous unpivot query directly on this new table. This makes the actual (un)pivot statement a bit clearer… and also separates out the execution time.
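Here is a minimal sketch of that redone unpivot against taxi2, mirroring the CTE version above (the dat_long2 name is just illustrative):

dat_long2 = dbGetQuery(
  con,
  "
  UNPIVOT taxi2
  ON mean_tip, mean_fare
  INTO
    NAME variable
    VALUE amount
  "
)
head(dat_long2)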
(Note how crazy fast pivoting in DuckDB actually is.)
Joins (merges)
It’s a bit hard to demonstrate a join with only a single main table. But here is a contrived example, where we calculate the mean monthly tips and the mean monthly fares as separate sub-queries (CTEs), before joining them together by month.
dbGetQuery(
  con,
  "
  WITH
    mean_tips AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT month, AVG(tip_amount) AS mean_tip
      GROUP BY month
    ),
    mean_fares AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT month, AVG(fare_amount) AS mean_fare
      GROUP BY month
    )
  FROM mean_tips
  LEFT JOIN mean_fares USING (month)
  SELECT *
  ORDER BY mean_tips.month
  "
)
con.query(
  '''
  WITH
    mean_tips AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT month, AVG(tip_amount) AS mean_tip
      GROUP BY month
    ),
    mean_fares AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT month, AVG(fare_amount) AS mean_fare
      GROUP BY month
    )
  FROM mean_tips
  LEFT JOIN mean_fares USING (month)
  SELECT *
  ORDER BY mean_tips.month
  '''
)
Redo the above join but, rather than using CTEs, use tables that you first create in DuckDB’s memory bank. Again, this will simplify the actual join operation and also emphasise how crazy fast joins are in DuckDB.
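If you would like a hint, here is one possible sketch in R (the mean_tips and mean_fares table names are illustrative, not prescribed):

dbExecute(
  con,
  "
  CREATE TABLE mean_tips AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT month, AVG(tip_amount) AS mean_tip
    GROUP BY month
  "
)
dbExecute(
  con,
  "
  CREATE TABLE mean_fares AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT month, AVG(fare_amount) AS mean_fare
    GROUP BY month
  "
)
# with the aggregates materialized, the join itself is trivial (and very fast)
dbGetQuery(
  con,
  "
  FROM mean_tips
  LEFT JOIN mean_fares USING (month)
  SELECT *
  ORDER BY month
  "
)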
Windowing
One last example: Binning “trip_distance” into deciles and then calculating the mean “tip_amount” within each decile. This is an example of a window function and query pattern that I use all the time in my own work. I find it extremely useful for quickly pulling out descriptive patterns from large datasets, from which I can then develop a better intuition of my data. In turn, this shapes the hypotheses and modeling choices that I make in the subsequent analysis stage.
Warning: Sorting and sampling
I’m using a 1% random sample of my data here (see the USING SAMPLE 1% statement). Why? Because calculating deciles requires ranking your data and this is expensive! To rank data, we first have to sort it (ORDER) and this requires evaluating/comparing every single row in your dataset. In turn, this means that it’s very hard to take shortcuts. (This is one reason why DuckDB’s optimized query plan will always delay sorting until as late as possible; to only sort on a smaller subset/aggregation of the data if possible.) FWIW, DuckDB’s sorting algorithm is still crazy fast. But for data of this size, and where sorting on the full dataset is unavoidable, I strongly recommend sampling first. Your general insights will almost certainly remain intact.
dbGetQuery(
  con,
  "
  WITH trip_deciles AS (
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      tip_amount,
      trip_distance,
      NTILE(10) OVER (ORDER BY trip_distance) AS decile
    USING SAMPLE 1%
  )
  FROM trip_deciles
  SELECT
    decile,
    AVG(trip_distance) AS mean_distance,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  "
)
con.query(
  '''
  WITH trip_deciles AS (
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      tip_amount,
      trip_distance,
      NTILE(10) OVER (ORDER BY trip_distance) AS decile
    USING SAMPLE 1%
  )
  FROM trip_deciles
  SELECT
    decile,
    AVG(trip_distance) AS mean_distance,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  '''
)
Again, this step isn’t strictly necessary since we instantiated our connection with the shutdown = TRUE argument. But it’s worth seeing in case you want to be explicit.
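In R, the explicit equivalent would be along these lines (a sketch using the DBI/duckdb connection from above):

dbDisconnect(con, shutdown = TRUE)  # close the connection and shut down the DuckDB instance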
con.close()
Footnotes
Aside: The shutdown = TRUE argument is a convenience feature that ensures our connection is automatically terminated when our R session ends (i.e., even if we forget to do it manually.) I’m not aware of a similar convenience argument for Python; please let me know if I am missing something.↩︎
I’ll admit that the benefits don’t seem so great for this simple example. But trust me: they make a big difference once you start having lots of grouping columns and complex sub-queries.↩︎