library(DBI)
library(duckdb)
library(tibble)
con = DBI::dbConnect(duckdb::duckdb(), shutdown = TRUE) # will get erased after shutdown
# con = DBI::dbConnect(duckdb::duckdb(), dbdir = "db.duck") # would save on disk
DuckDB SQL
Use the same queries from R, Python, Julia…
What is a Database?
You can think of a database as a collection of tables (or dataframes). In general, the benefit of databases is that they can handle very large amounts of data - data which would not fit in your computer's RAM. As such, most databases are stored on a hard disk, from which data is read and to which it is written.
There are many different forms of databases. We have several so-called database management systems (DBMS), which allow us to interact with each of them. Unhelpfully, each flavor of database has a slightly different dialect of what is called the structured query language, SQL for short. Helpfully, on the other hand, there are several packages out there which help us keep the effort manageable by translating our queries. We will see a few of those here, and in different driving languages (i.e. R, Python and Julia).
One final important feature of different databases is how (and where) they actually run. The more traditional solutions follow a server-client model: you start a database server on some machine (which could be your own), which interacts with persistent storage (e.g. your hard disk) to read and write data, and a client process, in which you formulate so-called queries and get back the results from the server. In some sense you formulate a question for the database in the client, you submit it to the database, and you get back an answer. Of course those server processes can run on dedicated and powerful machines which you can reach via a network, or directly in the cloud, where they are used to work with arbitrarily huge amounts of data.
We will focus on a relatively recent addition to the database family called duckdb
. The main benefit (beyond its excellent performance) is that it is very easy to run a database without the need to set up the client-server structure. We will come to that in due course.
What is SQL?
As already mentioned, this is the main database language. Here is an example:
SELECT carat, cut, clarity, color, price
FROM diamonds
WHERE price > 15000
This reads:
From the table called diamonds, take the columns carat, cut, clarity, color, price, but keep only rows where the column price is larger than the value 15000.
Not too bad, right?
Let's get this example running now on our computers. We need to create a duckdb database first, read some data into it, and then confront it with that SQL query. For simplicity we will do this only in R to get us started; then we will move on to a more elaborate example with big data, which will use R, Python and Julia.
Simple Intro Example with R
The con
object is our live connection to the database. Think of it like a telephone receiver - you can talk into it, and it will relay results back to you. As such, we have to refer to con
in everything we want to do with the database.
Let us first read some data from the ggplot2
package, and add it to our database. Remember, a database is a collection of tables, and now we are adding two of those:
dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)
Ok, let’s see what tables we have in the database now:
dbListTables(con)
[1] "diamonds" "mpg"
Great. Now how do we get data out of it? Here goes the mpg
dataset:
con |>
  dbReadTable("mpg") |>
  as_tibble()
# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
Back to our SQL query from above. We can submit it to the database listening at con
:
= "SELECT carat, cut, clarity, color, price
query1 FROM diamonds
WHERE price > 15000"
head(dbGetQuery(con, query1))
carat cut clarity color price
1 1.54 Premium VS2 E 15002
2 1.19 Ideal VVS1 F 15005
3 2.10 Premium SI1 I 15007
4 1.69 Ideal SI1 D 15011
5 1.50 Very Good VVS2 G 15013
6 1.73 Very Good VS1 G 15014
or of course as before with a pipeline such as
con |>
  dbGetQuery(query1) |>
  as_tibble()
# A tibble: 1,655 × 5
carat cut clarity color price
<dbl> <fct> <fct> <fct> <int>
1 1.54 Premium VS2 E 15002
2 1.19 Ideal VVS1 F 15005
3 2.1 Premium SI1 I 15007
4 1.69 Ideal SI1 D 15011
5 1.5 Very Good VVS2 G 15013
6 1.73 Very Good VS1 G 15014
7 2.02 Premium SI2 G 15014
8 2.05 Very Good SI2 F 15017
9 1.5 Very Good VS1 F 15022
10 1.82 Very Good SI1 G 15025
# ℹ 1,645 more rows
That’s it for starters. There is now a whole world out there to learn in terms of SQL queries, database joins, pivots (i.e. reshape operations) etc. I leave you in the hands of Hadley Wickham and his dedicated chapter on this topic if you want to dive deeper. It will be almost unavoidable to learn some simple SQL, but hopefully it won’t be so bad. You will see in the associated pages of this tutorial that there are some very nice translation tools for R
. Grant’s intro to databases is another valuable resource here. Now, however, let’s get cracking with the real show, i.e. our big data set on NYC taxi rides.
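Before we do, a quick bit of housekeeping: the intro example opened its own in-memory connection, and we may as well close it before starting afresh. (As discussed later, this isn't strictly necessary for a throwaway in-memory database, but it's good practice.)
dbDisconnect(con) # close the intro connection; the next section opens a new one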
# setup python from R with reticulate package
library(reticulate)
Sys.setenv(RETICULATE_PYTHON=here::here(".venv/bin/python"))
Load libraries
library(duckdb) # loaded already
Loading required package: DBI
import duckdb
import time # just for timing some queries
Create a database connection
The first thing we need to do is instantiate a connection with an in-memory database.1
con = dbConnect(duckdb(), shutdown = TRUE)
con = duckdb.connect(database = ':memory:', read_only = False)
Digression: In-memory versus on-disk
The fact that our connection lives “in memory” is a killer feature of DuckDB (one that it inherits from SQLite). We don’t need to connect to some complicated, existing database infrastructure to harness all of DuckDB’s power. Instead we can just spin up an ephemeral database that interacts directly with our R (or Python, or Julia, etc.) client.
However, it’s worth noting that you can create a persistent, disk-backed database simply by providing a database file path argument as part of your connection, e.g.
## Uncomment and run the next line if you'd like to create a persistent,
## disk-backed database instead.
# con = dbConnect(duckdb(), dbdir = "nyc.duck")
## Uncomment and run the next line if you'd like to create a persistent,
## disk-backed database instead.
# con = duckdb.connect(database = 'nyc.duck', read_only = False)
(Note that the ".duck"
file extension above is optional. You could also use ".db"
, ".dbb"
, or whatever you want really.)
One really important benefit of creating a persistent disk-backed database is that it enables out-of-core computation for bigger than RAM data. See here for more details and performance considerations (which are still great).
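As a rough, hypothetical sketch of what that looks like in practice, you could combine a disk-backed connection with DuckDB's memory_limit setting, so that queries which outgrow RAM can spill to disk instead of failing. (The file name and the 4GB cap below are just illustrative choices.)
## Illustrative sketch only: disk-backed database plus an explicit memory cap,
## which lets DuckDB spill intermediate results to disk (out-of-core).
# con_disk = dbConnect(duckdb(), dbdir = "nyc.duck")
# dbExecute(con_disk, "SET memory_limit = '4GB'")
# dbDisconnect(con_disk)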
Just How Big is This?
Before we get started with some analysis tasks, here’s a simple question to ask our database: how many rows are we dealing with here, actually?
nrows = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    COUNT(passenger_count) as nrows
  "
)
# con.query('''
#   FROM 'nyc-taxi/**/*.parquet'
#   SELECT
#     COUNT(passenger_count) as nrows
# ''')
Ok, well that is 178,544,324. 😬
First example
We’ll start with a simple aggregation query to get situated. I’ll also use this example to highlight some general features of DuckDB SQL and the underlying query engine.
Okay, first query. Let’s say we want to know: What is the average tip per passenger count? A typical SQL job for this question might look as follows:
SELECT
  passenger_count,
  AVG(tip_amount) AS mean_tip
FROM 'nyc-taxi/**/*.parquet'
GROUP BY passenger_count
ORDER BY passenger_count
(Where the last ORDER BY
statement is optional. Note that ordering (i.e., sorting) is a potentially expensive operation but we’ll get back to that later.)
This is perfectly valid DuckDB SQL too. However, we can rewrite it with slightly nicer syntax thanks to DuckDB's "friendly SQL". The key changes for this simple query are going to be: (1) the FROM
statement comes first, and (2) we can use the GROUP BY ALL
and ORDER BY ALL
statements to avoid writing out the “passenger_count” grouping column multiple times.2
FROM 'nyc-taxi/**/*.parquet'
SELECT
  passenger_count,
  AVG(tip_amount) AS mean_tip
GROUP BY ALL
ORDER BY ALL
One of the under-appreciated (IMHO) features of DuckDB is that it supports many syntax enhancements over traditional SQL dialects, which they collectively dub "friendly SQL". Together these syntax enhancements allow you to write much more ergonomic SQL queries that cut down on duplication and logical inconsistencies.
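As a small aside, another friendly-SQL convenience is SELECT * EXCLUDE (...), which returns every column except the ones you list. Here is a quick sketch against our dataset (the excluded columns are just an arbitrary choice for illustration); after that, back to our aggregation query.
# Sketch only: star-select everything except a couple of named columns
dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT * EXCLUDE (tip_amount, fare_amount)
  LIMIT 5
  "
)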
To run this operation from our R or Python client, simply pass the SQL query as a string to our connection. Let’s use this as a chance to save the result and time our query too.
tic = Sys.time()
dat1 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  "
)
toc = Sys.time()
dat1
passenger_count mean_tip
1 0 0.8620988
2 1 1.1510110
3 2 1.0815798
4 3 0.9629494
5 4 0.8445190
6 5 1.1027325
7 6 1.1283649
8 7 0.5441176
9 8 0.3507692
10 9 0.8068000
11 10 0.0000000
12 65 0.0000000
13 66 1.5000000
14 177 1.0000000
15 208 0.0000000
16 247 2.3000000
17 249 0.0000000
18 254 0.0000000
toc - tic
Time difference of 2.233316 secs
tic = time.time()
dat1 = (
  con
  .query('''
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  ''')
)
toc = time.time()
dat1
┌─────────────────┬─────────────────────┐
│ passenger_count │ mean_tip │
│ int64 │ double │
├─────────────────┼─────────────────────┤
│ 0 │ 0.862098814142438 │
│ 1 │ 1.1510109615467636 │
│ 2 │ 1.0815798424002565 │
│ 3 │ 0.962949365789299 │
│ 4 │ 0.8445189789660231 │
│ 5 │ 1.1027324532618157 │
│ 6 │ 1.1283649236954085 │
│ 7 │ 0.5441176470588235 │
│ 8 │ 0.35076923076923083 │
│ 9 │ 0.8068 │
│ 10 │ 0.0 │
│ 65 │ 0.0 │
│ 66 │ 1.5 │
│ 177 │ 1.0 │
│ 208 │ 0.0 │
│ 247 │ 2.3 │
│ 249 │ 0.0 │
│ 254 │ 0.0 │
├─────────────────┴─────────────────────┤
│ 18 rows 2 columns │
└───────────────────────────────────────┘
# print(f"Time difference of {toc - tic} seconds")
## Timing will be misleading for this rendered Quarto doc, since we're calling
## Python from R (via the reticulate package).
Note that we actually get a polars DataFrame as a return object. Click the callout box below to learn more.
By default, the con.query
method that we are using here will return a polars DataFrame object that Python understands "natively" (i.e., has a print method for and so on). Behind the scenes, this duckdb to polars integration relies on the pyarrow library being available to our Python environment, which we have already installed for this workshop.
It's also possible to return other types of Python objects. For example, you can use the .df()
method to coerce to a pandas DataFrame instead, among several other formats like numpy arrays. (Details here.) Given the focus of this workshop, it won’t surprise you to hear that I’m going to stick with the default polars conversion.
So that only took 2.23 seconds in this rendered Quarto doc (and will likely be even faster when you try in an interactive session). To underscore just how crazy impressive this is, recall that this includes the time that it takes to read the data from disk. I can almost guarantee that the read + serialization time alone for traditional data wrangling workflows would take several minutes, and probably crash my laptop RAM. Don’t forget that our full dataset is nearly 200 million rows deep and 30 columns wide.
Aside: We clearly have a few outlier typos in our dataset. 254 passengers in a single taxi trip? I don’t think so. We’d probably want to filter these out with a WHERE
statement if we were doing serious analysis, but I’m just going to leave them in for this tutorial.
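For the record, here's a hedged sketch of what such a filter could look like; the cutoff of 9 passengers is an arbitrary choice for illustration.
# Sketch: same aggregation as before, but drop implausible passenger counts
dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    AVG(tip_amount) AS mean_tip
  WHERE passenger_count <= 9
  GROUP BY ALL
  ORDER BY ALL
  "
)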
Aggregation
Let's try out some more aggregation queries. How about a slight variation on our first example query, where we (a) add "month" as a second grouping variable, and (b) subset to only the first three months of the year.
tic = Sys.time()
dat2 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    month,
    passenger_count,
    AVG(tip_amount) AS mean_tip
  WHERE month <= 3
  GROUP BY ALL
  "
)
toc = Sys.time()
head(dat2)
month passenger_count mean_tip
1 1 3 0.8752660
2 1 6 1.0175694
3 2 3 0.8948977
4 2 6 1.0218459
5 3 6 1.0515659
6 3 3 0.9121819
toc - tic
Time difference of 0.585984 secs
tic = time.time()
dat2 = (
  con
  .query('''
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    month,
    passenger_count,
    AVG(tip_amount) AS mean_tip
  WHERE month <= 3
  GROUP BY ALL
  ''')
)
toc = time.time()
dat2
┌───────┬─────────────────┬────────────────────┐
│ month │ passenger_count │ mean_tip │
│ int64 │ int64 │ double │
├───────┼─────────────────┼────────────────────┤
│ 1 │ 3 │ 0.8752659692390745 │
│ 1 │ 6 │ 1.0175694433120157 │
│ 2 │ 3 │ 0.8948976704752726 │
│ 2 │ 6 │ 1.0218459360559868 │
│ 3 │ 3 │ 0.9121818858010095 │
│ 3 │ 6 │ 1.0515659390082799 │
│ 1 │ 0 │ 0.8417178636744063 │
│ 2 │ 0 │ 0.8766365326038179 │
│ 3 │ 0 │ 0.8776747542704466 │
│ 1 │ 5 │ 1.0011984858733018 │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ 1 │ 7 │ 0.0 │
│ 1 │ 8 │ 0.0 │
│ 1 │ 1 │ 1.036862814231083 │
│ 2 │ 1 │ 1.0684897865838967 │
│ 3 │ 1 │ 1.0892047410487358 │
│ 1 │ 2 │ 0.9647090920018082 │
│ 1 │ 208 │ 0.0 │
│ 2 │ 2 │ 0.9908003546925659 │
│ 3 │ 2 │ 1.0096468528132239 │
│ 3 │ 208 │ 0.0 │
├───────┴─────────────────┴────────────────────┤
│ 29 rows (20 shown) 3 columns │
└──────────────────────────────────────────────┘
# print(f"Time difference of {toc - tic} seconds")
## Timing will be misleading for this rendered Quarto doc, since we're calling
## Python from R (via the reticulate package).
Note that this query completed even faster than the first one, even though we added another grouping variable. Reason: Subsetting along our Hive-partitioned parquet dataset allows DuckDB to take shortcuts. We can see this directly by prepending an EXPLAIN
statement to our query to reveal the optimized query plan.
dbGetQuery(
  con,
  "
  EXPLAIN
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    month,
    passenger_count,
    AVG(tip_amount) AS mean_tip
  WHERE month <= 3
  GROUP BY ALL
  "
)
physical_plan
┌───────────────────────────┐
│ HASH_GROUP_BY │
│ ──────────────────── │
│ Groups: │
│ #0 │
│ #1 │
│ │
│ Aggregates: avg(#2) │
│ │
│ ~22453698 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ month │
│ passenger_count │
│ tip_amount │
│ │
│ ~44907396 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PARQUET_SCAN │
│ ──────────────────── │
│ Function: │
│ PARQUET_SCAN │
│ │
│ Projections: │
│ month │
│ passenger_count │
│ tip_amount │
│ │
│ File Filters: │
│ (month <= 3) │
│ │
│ Scanning Files: 3/12 │
│ │
│ ~44907396 Rows │
└───────────────────────────┘
con.query('''
EXPLAIN
FROM 'nyc-taxi/**/*.parquet'
SELECT
month,
passenger_count,
AVG(tip_amount) AS mean_tip
WHERE month <= 3
GROUP BY ALL
'''
)
┌───────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ explain_key │ explain_value │
│ varchar │ varchar │
├───────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ physical_plan │ ┌───────────────────────────┐\n│ HASH_GROUP_BY │\n│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │\n│ … │
└───────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘
tl;dr DuckDB is able to exploit the month partition of our dataset, so subsetting means that it can avoid unnecessary data ingestion. Similarly, it only reads in a select group of columns; that's what the "PROJECTION" part of the plan denotes. If nothing else, the take-home message is that DuckDB only does what it needs to. Laziness as a virtue!
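Relatedly, if you want actual execution times per operator rather than just the plan, DuckDB also supports EXPLAIN ANALYZE, which runs the query while profiling it. A quick sketch:
# Sketch: EXPLAIN ANALYZE executes the query and profiles each operator
dbGetQuery(
  con,
  "
  EXPLAIN ANALYZE
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    month,
    AVG(tip_amount) AS mean_tip
  WHERE month <= 3
  GROUP BY ALL
  "
)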
Here’s a final aggregation example, this time including a high-dimensional grouping column (i.e., “trip_distance”) and some additional aggregations.
tic = Sys.time()
dat3 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    trip_distance,
    AVG(tip_amount) AS mean_tip,
    AVG(fare_amount) AS mean_fare
  GROUP BY ALL
  "
)
toc = Sys.time()
nrow(dat3)
[1] 25569
head(dat3)
passenger_count trip_distance mean_tip mean_fare
1 2 2.6 0.9732562 10.619201
2 1 2.6 1.1313478 10.543915
3 1 6.4 2.0454613 19.462782
4 0 4.1 1.3378420 13.334745
5 0 1.3 0.6467107 6.559703
6 1 9.5 3.4773251 26.061723
toc - tic
Time difference of 4.049236 secs
tic = time.time()
dat3 = (
  con
  .query('''
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    trip_distance,
    AVG(tip_amount) AS mean_tip,
    AVG(fare_amount) AS mean_fare
  GROUP BY ALL
  ''')
)
toc = time.time()
len(dat3)
25569
dat3
┌─────────────────┬───────────────┬─────────────────────┬────────────────────┐
│ passenger_count │ trip_distance │ mean_tip │ mean_fare │
│ int64 │ double │ double │ double │
├─────────────────┼───────────────┼─────────────────────┼────────────────────┤
│ 1 │ 10.6 │ 3.798417230590116 │ 28.70962220766925 │
│ 4 │ 0.8 │ 0.31864535306162817 │ 5.521868287205346 │
│ 1 │ 3.1 │ 1.2789624509439401 │ 11.872071130994996 │
│ 1 │ 5.3 │ 1.812836031545094 │ 17.073445743534876 │
│ 2 │ 16.1 │ 4.069615695792877 │ 45.16434061488674 │
│ 1 │ 2.4 │ 1.0694859422619258 │ 10.005680611735821 │
│ 2 │ 11.4 │ 3.1219094019637015 │ 31.17485569770902 │
│ 4 │ 1.0 │ 0.35807810956799535 │ 6.19363893529085 │
│ 3 │ 5.3 │ 1.3933198870512298 │ 17.3719846712384 │
│ 2 │ 3.1 │ 1.115457631079296 │ 11.991438118884266 │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ 2 │ 3.49 │ 1.2619583410823876 │ 12.730611865352428 │
│ 6 │ 16.97 │ 4.9274531835205995 │ 47.52022471910112 │
│ 2 │ 11.62 │ 3.246236842105263 │ 31.140263157894736 │
│ 4 │ 10.7 │ 2.0272682201289043 │ 29.590927119484384 │
│ 1 │ 20.72 │ 5.406076260762608 │ 49.67367773677737 │
│ 4 │ 1.02 │ 0.512334250343879 │ 5.957606602475937 │
│ 6 │ 7.31 │ 2.157258741258741 │ 21.743776223776223 │
│ 3 │ 11.8 │ 2.6827715064227324 │ 32.187855196574546 │
│ 1 │ 14.27 │ 3.765272887323943 │ 39.136839788732395 │
│ 2 │ 5.26 │ 1.6736970837253058 │ 16.93365475070555 │
├─────────────────┴───────────────┴─────────────────────┴────────────────────┤
│ ? rows (>9999 rows, 20 shown) 4 columns │
└────────────────────────────────────────────────────────────────────────────┘
# print(f"Time difference of {toc - tic} seconds")
## Timing will be misleading for this rendered Quarto doc, since we're calling
## Python from R (via the reticulate package).
Pivot (reshape)
Let’s explore some pivot (reshape) examples, by building off the previous query.
UNPIVOT: wide => long
PIVOT: long => wide
Here I’ll use a Common Table Expression (CTE) to define a temporary table tmp_table
, before unpivoting—i.e., reshaping long—at the end.
dat_long = dbGetQuery(
  con,
  "
  WITH tmp_table AS (
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
  )
  UNPIVOT tmp_table
  ON mean_tip, mean_fare
  INTO
    NAME variable
    VALUE amount
  "
)
head(dat_long)
passenger_count trip_distance variable amount
1 1 0.8 mean_tip 0.5347135
2 1 0.8 mean_fare 5.3604431
3 1 1.0 mean_tip 0.6155292
4 1 1.0 mean_fare 6.0121158
5 1 0.4 mean_tip 0.3612548
6 1 0.4 mean_fare 4.1737650
dat_long = (
  con
  .query('''
  WITH tmp_table AS (
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
  )
  UNPIVOT tmp_table
  ON mean_tip, mean_fare
  INTO
    NAME variable
    VALUE amount
  ''')
)
dat_long
┌─────────────────┬───────────────┬───────────┬────────────────────┐
│ passenger_count │ trip_distance │ variable │ amount │
│ int64 │ double │ varchar │ double │
├─────────────────┼───────────────┼───────────┼────────────────────┤
│ 2 │ 1.8 │ mean_tip │ 0.7537931462210244 │
│ 2 │ 1.8 │ mean_fare │ 8.450205623706095 │
│ 1 │ 0.2 │ mean_tip │ 0.3864478915075298 │
│ 1 │ 0.2 │ mean_fare │ 4.67119257714553 │
│ 1 │ 3.47 │ mean_tip │ 1.3073502150223362 │
│ 1 │ 3.47 │ mean_fare │ 12.75832115569286 │
│ 1 │ 4.7 │ mean_tip │ 1.6794478064449927 │
│ 1 │ 4.7 │ mean_fare │ 15.728522049844548 │
│ 1 │ 2.7 │ mean_tip │ 1.1625732397772206 │
│ 1 │ 2.7 │ mean_fare │ 10.811487937005014 │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ 6 │ 6.06 │ mean_tip │ 1.7787407407407405 │
│ 6 │ 6.06 │ mean_fare │ 18.787736625514402 │
│ 3 │ 8.82 │ mean_tip │ 2.5983870967741933 │
│ 3 │ 8.82 │ mean_fare │ 24.434677419354845 │
│ 4 │ 8.14 │ mean_tip │ 2.4210869565217386 │
│ 4 │ 8.14 │ mean_fare │ 23.169130434782605 │
│ 2 │ 8.58 │ mean_tip │ 2.58382030679328 │
│ 2 │ 8.58 │ mean_fare │ 23.940832724616502 │
│ 5 │ 6.58 │ mean_tip │ 1.9479097682119209 │
│ 5 │ 6.58 │ mean_fare │ 19.83091887417219 │
├─────────────────┴───────────────┴───────────┴────────────────────┤
│ ? rows (>9999 rows, 20 shown) 4 columns │
└──────────────────────────────────────────────────────────────────┘
Another option would have been to create a new table in memory and then pivot over that, which segues nicely to…
Digression: Create new tables
CTEs are a very common, and often efficient, way to implement multi-table operations in SQL. But, for the record, we can create new tables in DuckDB’s memory cache pretty easily using the CREATE TABLE
statement.
Instead of DBI::dbGetQuery
, we must now use DBI::dbExecute
.
dbExecute(
  con,
  "
  CREATE TABLE taxi2 AS
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    trip_distance,
    AVG(tip_amount) AS mean_tip,
    AVG(fare_amount) AS mean_fare
  GROUP BY ALL
  "
)
[1] 25569
dbListTables(con)
[1] "taxi2"
FWIW, you can always remove a table with dbRemoveTable()
.
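For example (commented out here, since we'll keep using taxi2 below):
# dbRemoveTable(con, "taxi2") # would drop the table again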
Instead of con.query()
, we must now use con.execute()
.
con.execute('''
CREATE TABLE taxi2 AS
FROM 'nyc-taxi/**/*.parquet'
SELECT
passenger_count,
trip_distance,
AVG(tip_amount) AS mean_tip,
AVG(fare_amount) AS mean_fare
GROUP BY ALL
'''
)
<duckdb.duckdb.DuckDBPyConnection object at 0x127f52e30>
# https://stackoverflow.com/q/75727685
con.query('''
SELECT table_name, estimated_size AS nrows, column_count AS ncols
FROM duckdb_tables;
'''
)
┌────────────┬───────┬───────┐
│ table_name │ nrows │ ncols │
│ varchar │ int64 │ int64 │
├────────────┼───────┼───────┤
│ taxi2 │ 25569 │ 4 │
└────────────┴───────┴───────┘
Back to reshaping
With our new taxi2
table in hand, let’s redo the previous unpivot query directly on this new table. This makes the actual (un)pivot statement a bit clearer… and also separates out the execution time.
dbGetQuery(
  con,
  "
  UNPIVOT taxi2
  ON mean_tip, mean_fare
  INTO
    NAME variable
    VALUE amount
  LIMIT 5
  "
)
passenger_count trip_distance variable amount
1 1 2.0 mean_tip 0.9423356
2 1 2.0 mean_fare 8.9324987
3 1 10.8 mean_tip 3.8087776
4 1 10.8 mean_fare 29.2385560
5 1 3.0 mean_tip 1.2506863
con.query('''
UNPIVOT taxi2
ON mean_tip, mean_fare
INTO
NAME variable
VALUE amount
LIMIT 5
'''
)
┌─────────────────┬───────────────┬───────────┬────────────────────┐
│ passenger_count │ trip_distance │ variable │ amount │
│ int64 │ double │ varchar │ double │
├─────────────────┼───────────────┼───────────┼────────────────────┤
│ 1 │ 3.7 │ mean_tip │ 1.4340712759977325 │
│ 1 │ 3.7 │ mean_fare │ 13.385660725559163 │
│ 1 │ 2.9 │ mean_tip │ 1.2233798516044125 │
│ 1 │ 2.9 │ mean_fare │ 11.352554762180501 │
│ 2 │ 4.0 │ mean_tip │ 1.3716844336946659 │
└─────────────────┴───────────────┴───────────┴────────────────────┘
(Note how crazy fast pivoting in DuckDB actually is.)
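For the reverse direction (long => wide), here is a hedged sketch using DuckDB's simplified PIVOT syntax. I first write our earlier long-format result back into the database as a new table (taxi_long is a name I'm making up purely for this illustration), and then pivot it out to wide format again.
# Sketch: write the long-format result back to DuckDB, then PIVOT it to wide
dbWriteTable(con, "taxi_long", dat_long)
head(dbGetQuery(con, "PIVOT taxi_long ON variable USING first(amount)"))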
Joins (merges)
It’s a bit hard to demonstrate a join with only a single main table. But here is a contrived example, where we calculate the mean monthly tips and the mean monthly fares as separate sub-queries (CTEs), before joining them together by month.
dbGetQuery(
  con,
  "
  WITH
    mean_tips AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT
        month,
        AVG(tip_amount) AS mean_tip
      GROUP BY month
    ),
    mean_fares AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT
        month,
        AVG(fare_amount) AS mean_fare
      GROUP BY month
    )
  FROM mean_tips
  LEFT JOIN mean_fares
    USING (month)
  SELECT *
  ORDER BY mean_tips.month
  "
)
month mean_tip mean_fare
1 1 1.007817 9.813488
2 2 1.036874 9.942640
3 3 1.056353 10.223107
4 4 1.043167 10.335490
5 5 1.078014 10.585157
6 6 1.091082 10.548651
7 7 1.059312 10.379943
8 8 1.079521 10.492650
9 9 1.254601 12.391198
10 10 1.281239 12.501252
11 11 1.250903 12.270138
12 12 1.237651 12.313953
con.query('''
WITH
mean_tips AS (
FROM 'nyc-taxi/**/*.parquet'
SELECT
month,
AVG(tip_amount) AS mean_tip
GROUP BY month
),
mean_fares AS (
FROM 'nyc-taxi/**/*.parquet'
SELECT
month,
AVG(fare_amount) AS mean_fare
GROUP BY month
)
FROM mean_tips
LEFT JOIN mean_fares
USING (month)
SELECT *
ORDER BY mean_tips.month
'''
)
┌───────┬────────────────────┬────────────────────┐
│ month │ mean_tip │ mean_fare │
│ int64 │ double │ double │
├───────┼────────────────────┼────────────────────┤
│ 1 │ 1.0078165246989736 │ 9.813487671828481 │
│ 2 │ 1.0368737381554407 │ 9.942640301299889 │
│ 3 │ 1.0563527428724868 │ 10.22310721615329 │
│ 4 │ 1.043167490141166 │ 10.335489610548963 │
│ 5 │ 1.0780143169836582 │ 10.585156844133927 │
│ 6 │ 1.0910820093813216 │ 10.548651231531705 │
│ 7 │ 1.0593122394563554 │ 10.379943069577578 │
│ 8 │ 1.079520899122755 │ 10.492650001889869 │
│ 9 │ 1.2546008978996 │ 12.391197540031683 │
│ 10 │ 1.28123927968838 │ 12.50125248419417 │
│ 11 │ 1.2509031985271484 │ 12.27013751494445 │
│ 12 │ 1.2376507362291627 │ 12.31395285761324 │
├───────┴────────────────────┴────────────────────┤
│ 12 rows 3 columns │
└─────────────────────────────────────────────────┘
Redo the above join but, rather than using CTEs, use tables that you first create in DuckDB’s memory bank. Again, this will simplify the actual join operation and also emphasise how crazy fast joins are in DuckDB.
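If you want to check your answer, here is one possible sketch (the table names mean_tips and mean_fares are just illustrative choices, mirroring the CTE names above):
# Sketch: materialise the two monthly aggregates as tables, then join them
dbExecute(
  con,
  "
  CREATE TABLE mean_tips AS
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    month,
    AVG(tip_amount) AS mean_tip
  GROUP BY month
  "
)
dbExecute(
  con,
  "
  CREATE TABLE mean_fares AS
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    month,
    AVG(fare_amount) AS mean_fare
  GROUP BY month
  "
)
dbGetQuery(
  con,
  "
  FROM mean_tips
  LEFT JOIN mean_fares
    USING (month)
  SELECT *
  ORDER BY month
  "
)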
Windowing
One last example: Binning "trip_distance" into deciles and then calculating the mean "tip_amount" within each decile. This is an example of a window function and query pattern that I use all the time in my own work. I find it extremely useful for quickly pulling out descriptive patterns from large datasets, from which I can then develop a better intuition of my data. In turn, this shapes the hypotheses and modeling choices that I make in the subsequent analysis stage.
I'm using a 1% random sample of my data here (see the USING SAMPLE 1% statement). Why? Because calculating deciles requires ranking your data and this is expensive! To rank data, we first have to sort it (ORDER) and this requires evaluating/comparing every single row in your dataset. In turn, this means that it's very hard to take shortcuts. (This is one reason why DuckDB's optimized query plan will always delay sorting until as late as possible, so that it only sorts on a smaller subset/aggregation of the data if possible.) FWIW, DuckDB's sorting algorithm is still crazy fast. But for data of this size, where sorting on the full dataset is unavoidable, I strongly recommend sampling first. Your general insights will almost certainly remain intact.
dbGetQuery(
  con,
  "
  WITH trip_deciles AS (
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      tip_amount,
      trip_distance,
      NTILE(10) OVER (ORDER BY trip_distance) AS decile
    USING SAMPLE 1%
  )
  FROM trip_deciles
  SELECT
    decile,
    AVG(trip_distance) AS mean_distance,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  "
)
decile mean_distance mean_tip
1 1 0.4447737 0.5654472
2 2 0.8035106 0.5019584
3 3 1.0565594 0.5995832
4 4 1.3158020 0.6832499
5 5 1.6161425 0.7753690
6 6 1.9806189 0.8786339
7 7 2.4748273 1.0192063
8 8 3.2292295 1.2240061
9 9 4.7245936 1.5924244
10 10 11.0015967 3.2038042
con.query('''
WITH trip_deciles AS (
FROM 'nyc-taxi/**/*.parquet'
SELECT
tip_amount,
trip_distance,
NTILE(10) OVER (ORDER BY trip_distance) AS decile
USING SAMPLE 1%
)
FROM trip_deciles
SELECT
decile,
AVG(trip_distance) AS mean_distance,
AVG(tip_amount) AS mean_tip
GROUP BY ALL
ORDER BY ALL
'''
)
┌────────┬────────────────────┬────────────────────┐
│ decile │ mean_distance │ mean_tip │
│ int64 │ double │ double │
├────────┼────────────────────┼────────────────────┤
│ 1 │ 0.4462739191673977 │ 0.5716254740492032 │
│ 2 │ 0.8055297973770881 │ 0.5000708635821985 │
│ 3 │ 1.0607604873791494 │ 0.5942867281760226 │
│ 4 │ 1.321170068101389 │ 0.6758381506021942 │
│ 5 │ 1.6243709739488024 │ 0.7691185887885554 │
│ 6 │ 1.9912377356013096 │ 0.8836751743156347 │
│ 7 │ 2.4894762676127664 │ 1.0274908304845243 │
│ 8 │ 3.2452311476394033 │ 1.2313065408307358 │
│ 9 │ 4.738938720005852 │ 1.5975793020874476 │
│ 10 │ 11.033802816138595 │ 3.2242121801505546 │
├────────┴────────────────────┴────────────────────┤
│ 10 rows 3 columns │
└──────────────────────────────────────────────────┘
Close connection
dbDisconnect(con)
Again, this step isn’t strictly necessary since we instantiated our connection with the shutdown = TRUE
argument. But it’s worth seeing in case you want to be explicit.
con.close()
Footnotes
1. Aside: The shutdown = TRUE argument is a convenience feature that ensures our connection is automatically terminated when our R session ends (i.e., even if we forget to do it manually). I'm not aware of a similar convenience argument for Python; please let me know if I am missing something.↩︎
2. I'll admit that the benefits don't seem so great for this simple example. But trust me: they make a big difference once you start having lots of grouping columns and complex sub-queries.↩︎