You can think of a database as a collection of tables (or dataframes). In general, the benefit of databases is that they can handle very large amounts of data - data that would not fit into your computer's RAM. As such, most databases are stored on a hard disk, from which data is read and to which it is written.
There are many different forms of databases, and several so-called database management systems (DBMS) that allow us to interact with each of them. Unhelpfully, each flavor of database speaks a slightly different dialect of what is called the structured query language, SQL for short. Helpfully, on the other hand, there are several packages out there that keep the effort manageable by translating our queries for us. We will see a few of those here, in different driving languages (i.e., R, Python and Julia).
One final important feature of different databases is how (and where) they actually run. The more traditional solutions follow a client-server model: on some machine (it could be your own) you start a database server, which interacts with persistent storage (e.g. your hard disk) to read and write data, and alongside it a client process, in which you formulate so-called queries and receive the results from the server. In some sense you formulate a question for the database in the client, submit it to the server, and get back an answer. Of course those server processes can run on dedicated, powerful machines that you reach via a network, or directly in the cloud, where they are used to work with arbitrarily huge amounts of data.
We will focus on a relatively recent addition to the database family called duckdb. The main benefit (beyond its excellent performance) is that it is very easy to run a database without the need to set up the client-server structure. We will come to that in due course.
What is SQL?
As already mentioned, this is the main database language. Here is an example:
SELECT carat, cut, clarity, color, price FROM diamonds WHERE price > 15000
This reads:
From the table called diamonds take the columns carat, cut, clarity, color, price, but keep only rows where the column price is larger than the value 15000
Not too bad, right?
Let’s get this example running now on our computers. We need to create a duckdb database first, read some data into it, and then confront it with that SQL query. For simplicity we will do this only in R to get us started; then we will move on to a more elaborate big-data example, which will use R, Python and Julia.
Simple Intro Example with R
library(DBI)
library(duckdb)
library(tibble)
con = DBI::dbConnect(duckdb::duckdb(), shutdown = TRUE) # will get erased after shutdown
# con = DBI::dbConnect(duckdb::duckdb(), dbdir = "db.duck") # would save on disk
The con object is our live connection to the database. Think of it like a telephone receiver - you can talk into it, and it will relay results back to you. As such, we have to refer to con in everything we want to do with the database.
Let us first read some data from the ggplot2 package, and add it to our database. Remember, a database is a collection of tables, and now we are adding two of those:
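Here is a minimal sketch of that step using DBI::dbWriteTable (the exact call used originally may differ slightly):

# copy the diamonds and mpg data frames from the ggplot2 package into the database as tables
dbWriteTable(con, "diamonds", ggplot2::diamonds)
dbWriteTable(con, "mpg", ggplot2::mpg)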
Ok, let’s see what tables we have in the database now:
dbListTables(con)
[1] "diamonds" "mpg"
Great. Now how do we get data out of it? Here goes the mpg dataset:
con |> dbReadTable("mpg") |> as_tibble()
# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
Back to our SQL query from above. We can submit it to the database listening at con:
query1 ="SELECT carat, cut, clarity, color, price FROM diamonds WHERE price > 15000"head(dbGetQuery(con, query1))
carat cut clarity color price
1 1.54 Premium VS2 E 15002
2 1.19 Ideal VVS1 F 15005
3 2.10 Premium SI1 I 15007
4 1.69 Ideal SI1 D 15011
5 1.50 Very Good VVS2 G 15013
6 1.73 Very Good VS1 G 15014
or of course as before with a pipeline such as
con |> dbGetQuery(query1) |> as_tibble()
# A tibble: 1,655 × 5
carat cut clarity color price
<dbl> <fct> <fct> <fct> <int>
1 1.54 Premium VS2 E 15002
2 1.19 Ideal VVS1 F 15005
3 2.1 Premium SI1 I 15007
4 1.69 Ideal SI1 D 15011
5 1.5 Very Good VVS2 G 15013
6 1.73 Very Good VS1 G 15014
7 2.02 Premium SI2 G 15014
8 2.05 Very Good SI2 F 15017
9 1.5 Very Good VS1 F 15022
10 1.82 Very Good SI1 G 15025
# ℹ 1,645 more rows
That’s it for starters. There is now a whole world out there to learn in terms of SQL queries, database joins, pivots (i.e. reshape operations) etc. I leave you in the hands of Hadley Wickham and his dedicated chapter on this topic if you want to dive deeper. It will be almost unavoidable to learn some simple SQL, but hopefully it won’t be so bad. You will see in the associated pages of this tutorial that there are some very nice translation tools for R. Grant’s intro to databases is another valuable resource here. Now, however, let’s get cracking with the real show, i.e. our big data set on NYC taxi rides.
# setup python from R with reticulate package
library(reticulate)
Sys.setenv(RETICULATE_PYTHON = "/Users/floswald/.pyenv/versions/3.13.5/bin/python")
import duckdb  # needed for the connection below

con = duckdb.connect(database = ':memory:', read_only = False)
Digression: In-memory versus on-disk
The fact that our connection lives “in memory” is a killer feature of DuckDB (one that it inherits from SQLite). We don’t need to connect to some complicated, existing database infrastructure to harness all of DuckDB’s power. Instead we can just spin up an ephemeral database that interacts directly with our R (or Python, or Julia, etc.) client.
However, it’s worth noting that you can create a persistent, disk-backed database simply by providing a database file path argument as part of your connection, e.g.
## Uncomment and run the next line if you'd like to create a persistent,
## disk-backed database instead.
# con = dbConnect(duckdb(), dbdir = "nyc.duck")
## Uncomment and run the next line if you'd like to create a persistent,
## disk-backed database instead.
# con = duckdb.connect(database = 'nyc.duck', read_only = False)
(Note that the ".duck" file extension above is optional. You could also use ".db", ".dbb", or whatever you want really.)
Important: Bigger-than-RAM data?
One really important benefit of creating a persistent disk-backed database is that it enables out-of-core computation for bigger than RAM data. See here for more details and performance considerations (which are still great).
Just How Big is This?
Before we get started with some analysis tasks, here’s a simple question to ask our database: how many rows are we dealing with here, actually?
nrows = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT COUNT(passenger_count) AS nrows
  "
)
# con.query(
#   '''
#   FROM 'nyc-taxi/**/*.parquet'
#   SELECT COUNT(passenger_count) AS nrows
#   '''
# )
Ok, well that is 570,009,660. 😬
First example
We’ll start with a simple aggregation query to get situated. I’ll also use this example to highlight some general features of DuckDB SQL and the underlying query engine.
Okay, first query. Let’s say we want to know: What is the average tip per passenger count? A typical SQL job for this question might look as follows:
SELECT
  passenger_count,
  AVG(tip_amount) AS mean_tip
FROM 'nyc-taxi/**/*.parquet'
GROUP BY passenger_count
ORDER BY passenger_count
(Where the last ORDER BY statement is optional. Note that ordering (i.e., sorting) is a potentially expensive operation but we’ll get back to that later.)
This is perfectly valid DuckDB SQL too. However, we can rewrite it with slightly nicer syntax thanks to DuckDB’s “friendly SQL”. The key changes for this simple query are going to be: (1) the FROM statement comes first, and (2) we can use the GROUP BY ALL and ORDER BY ALL statements to avoid writing out the “passenger_count” grouping column multiple times.2
FROM 'nyc-taxi/**/*.parquet'
SELECT
  passenger_count,
  AVG(tip_amount) AS mean_tip
GROUP BY ALL
ORDER BY ALL
Tip: DuckDB’s “friendly SQL”
One of the under-appreciated (IMHO) features of DuckDB is that it supports many syntax enhancements over traditional SQL dialects, which they collectively dub “friendly SQL”. Together these syntax enhancements allow you to write much more ergonomic SQL queries that cut down on duplication and logical inconsistencies.
To run this operation from our R or Python client, simply pass the SQL query as a string to our connection. Let’s use this as a chance to save the result and time our query too.
tic = Sys.time()
dat1 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  "
)
toc = Sys.time()
dat1
import time  # needed for the timing below

tic = time.time()
dat1 = (
  con
  .query(
    '''
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      AVG(tip_amount) AS mean_tip
    GROUP BY ALL
    ORDER BY ALL
    '''
  )
)
toc = time.time()
dat1
# print(f"Time difference of {toc - tic} seconds")## Timing will be misleading for this rendered Quarto doc, since we're calling## Python from R (via the reticulate package).
Note that we actually get a polars DataFrame as a return object. Click the callout box below to learn more.
Note: Result conversion in Python (click to expand)
By default, the con.query method that we are using here will return a polars DataFrame object that Python understands “natively” (i.e., has a print method for and so on). Behind the scenes, this duckdb to polars integration relies on the pyarrow library being available to our Python environment, which we have already installed for this workshop.
It’s also possible to return other types of Python objects. For example, you can use the .df() method to coerce to a pandas DataFrame instead, among several other formats like numpy arrays. (Details here.) Given the focus of this workshop, it won’t surprise you to hear that I’m going to stick with the default polars conversion.
So that only took 0.68 seconds in this rendered Quarto doc (and will likely be even faster when you try in an interactive session). To underscore just how crazy impressive this is, recall that this includes the time that it takes to read the data from disk. I can almost guarantee that the read + serialization time alone for traditional data wrangling workflows would take several minutes, and would probably exhaust my laptop’s RAM. Don’t forget that our full dataset is more than half a billion rows deep and 30 columns wide.
Aside: We clearly have a few outlier typos in our dataset. 254 passengers in a single taxi trip? I don’t think so. We’d probably want to filter these out with a WHERE statement if we were doing serious analysis, but I’m just going to leave them in for this tutorial.
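If we did want to clean this up, a hedged sketch of such a filter might look as follows (the cutoff of 9 passengers is purely illustrative and not part of the original analysis):

dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT passenger_count, AVG(tip_amount) AS mean_tip
  WHERE passenger_count <= 9   -- illustrative cutoff to drop data-entry typos
  GROUP BY ALL
  ORDER BY ALL
  "
)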
Aggregation
Let’s try out some more aggregation queries. How about a slight variation on our first example query, where we (a) add “month” as a second grouping variable, and (b) subset to only the first three months of the year.
tic = Sys.time()
dat2 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    month,
    passenger_count,
    AVG(tip_amount) AS mean_tip
  WHERE month::int <= 3
  GROUP BY ALL
  "
)
toc = Sys.time()
head(dat2)
tic = time.time()
dat2 = (
  con
  .query(
    '''
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      month,
      passenger_count,
      AVG(tip_amount) AS mean_tip
    WHERE month::int <= 3
    GROUP BY ALL
    '''
  )
)
toc = time.time()
dat2
# print(f"Time difference of {toc - tic} seconds")## Timing will be misleading for this rendered Quarto doc, since we're calling## Python from R (via the reticulate package).
Note that this query completed even faster than the first one, even though we added another grouping variable. Reason: Subsetting along our Hive-partitioned parquet dataset allows DuckDB to take shortcuts. We can see this directly by prepending an EXPLAIN statement to our query to reveal the optimized query plan.
dbGetQuery(
  con,
  "
  EXPLAIN
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    month,
    passenger_count,
    AVG(tip_amount) AS mean_tip
  WHERE month::int <= 3
  GROUP BY ALL
  "
)
tl;dr DuckDB is able to exploit the month partition of our dataset, so subsetting means that it can avoid unnecessary data ingestion. Similarly, it only reads in a select group of columns; that’s what the “PROJECTION” part of the plan denotes. If nothing else, the take-home message is that DuckDB only does what it needs to. Laziness as a virtue!
Here’s a final aggregation example, this time including a high-dimensional grouping column (i.e., “trip_distance”) and some additional aggregations.
tic = Sys.time()
dat3 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    trip_distance,
    AVG(tip_amount) AS mean_tip,
    AVG(fare_amount) AS mean_fare
  GROUP BY ALL
  "
)
toc = Sys.time()
nrow(dat3)
tic = time.time()
dat3 = (
  con
  .query(
    '''
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
    '''
  )
)
toc = time.time()
len(dat3)
# print(f"Time difference of {toc - tic} seconds")## Timing will be misleading for this rendered Quarto doc, since we're calling## Python from R (via the reticulate package).
Pivot (reshape)
Let’s explore some pivot (reshape) examples, by building off the previous query.
UNPIVOT: wide => long
PIVOT: long => wide
Here I’ll use a Common Table Expression (CTE) to define a temporary table tmp_table, before unpivoting—i.e., reshaping long—at the end.
dat_long = dbGetQuery(
  con,
  "
  WITH tmp_table AS (
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
  )
  UNPIVOT tmp_table
  ON mean_tip, mean_fare
  INTO
    NAME variable
    VALUE amount
  "
)
head(dat_long)
dat_long = (
  con
  .query(
    '''
    WITH tmp_table AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT
        passenger_count,
        trip_distance,
        AVG(tip_amount) AS mean_tip,
        AVG(fare_amount) AS mean_fare
      GROUP BY ALL
    )
    UNPIVOT tmp_table
    ON mean_tip, mean_fare
    INTO
      NAME variable
      VALUE amount
    '''
  )
)
dat_long
Another option would have been to create a new table in memory and then pivot over that, which segues nicely to…
Digression: Create new tables
CTEs are a very common, and often efficient, way to implement multi-table operations in SQL. But, for the record, we can create new tables in DuckDB’s memory cache pretty easily using the CREATE TABLE statement.
Instead of DBI::dbGetQuery, we must now use DBI::dbExecute.
dbExecute(
  con,
  "
  CREATE TABLE taxi2 AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
  "
)
[1] 34490
dbListTables(con)
[1] "taxi2"
FWIW, you can always remove a table with dbRemoveTable().
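For example, a usage sketch (left commented out here, since we still want taxi2 around for the next steps):

# dbRemoveTable(con, "taxi2")  # drops the taxi2 table from the database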
Instead of con.query(), we must now use con.execute().
con.execute(
  '''
  CREATE TABLE taxi2 AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
  '''
)
<_duckdb.DuckDBPyConnection object at 0x1265f5530>
# https://stackoverflow.com/q/75727685
con.query(
  '''
  SELECT table_name, estimated_size AS nrows, column_count AS ncols
  FROM duckdb_tables;
  '''
)
With our new taxi2 table in hand, let’s redo the previous unpivot query directly on this new table. This makes the actual (un)pivot statement a bit clearer… and also separates out the execution time.
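Here is a minimal sketch of that redone unpivot against taxi2, mirroring the CTE version above (the dat_long2 name is just illustrative):

dat_long2 = dbGetQuery(
  con,
  "
  UNPIVOT taxi2
  ON mean_tip, mean_fare
  INTO
    NAME variable
    VALUE amount
  "
)
head(dat_long2)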
(Note how crazy fast pivoting in DuckDB actually is.)
Joins (merges)
It’s a bit hard to demonstrate a join with only a single main table. But here is a contrived example, where we calculate the mean monthly tips and the mean monthly fares as separate sub-queries (CTEs), before joining them together by month.
dbGetQuery(
  con,
  "
  WITH
    mean_tips AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT month, AVG(tip_amount) AS mean_tip
      GROUP BY month
    ),
    mean_fares AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT month, AVG(fare_amount) AS mean_fare
      GROUP BY month
    )
  FROM mean_tips
  LEFT JOIN mean_fares USING (month)
  SELECT *
  ORDER BY mean_tips.month
  "
)
con.query(
  '''
  WITH
    mean_tips AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT month, AVG(tip_amount) AS mean_tip
      GROUP BY month
    ),
    mean_fares AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT month, AVG(fare_amount) AS mean_fare
      GROUP BY month
    )
  FROM mean_tips
  LEFT JOIN mean_fares USING (month)
  SELECT *
  ORDER BY mean_tips.month
  '''
)
Redo the above join but, rather than using CTEs, use tables that you first create in DuckDB’s memory bank. Again, this will simplify the actual join operation and also emphasise how crazy fast joins are in DuckDB.
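If you would like a hint, here is one possible sketch in R (the mean_tips and mean_fares table names are illustrative, not prescribed):

dbExecute(
  con,
  "
  CREATE TABLE mean_tips AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT month, AVG(tip_amount) AS mean_tip
    GROUP BY month
  "
)
dbExecute(
  con,
  "
  CREATE TABLE mean_fares AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT month, AVG(fare_amount) AS mean_fare
    GROUP BY month
  "
)
# with the aggregates materialized, the join itself is trivial (and very fast)
dbGetQuery(
  con,
  "
  FROM mean_tips
  LEFT JOIN mean_fares USING (month)
  SELECT *
  ORDER BY month
  "
)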
Windowing
One last example: Binning “trip_distance” into deciles and then calculating the mean “tip_amount” within each decile. This is an example of a window function and query pattern that I use all the time in my own work. I find it extremely useful for quickly pulling out descriptive patterns from large datasets, from which I can then develop a better intuition of my data. In turn, this shapes the hypotheses and modeling choices that I make in the subsequent analysis stage.
Warning: Sorting and sampling
I’m using a 1% random sample of my data here (see the USING SAMPLE 1% statement). Why? Because calculating deciles requires ranking your data and this is expensive! To rank data, we first have to sort it (ORDER) and this requires evaluating/comparing every single row in your dataset. In turn, this means that it’s very hard to take shortcuts. (This is one reason why DuckDB’s optimized query plan will always delay sorting until as late as possible; to only sort on a smaller subset/aggregation of the data if possible.) FWIW, DuckDB’s sorting algorithm is still crazy fast. But for data of this size, and where sorting on the full dataset is unavoidable, I strongly recommend sampling first. Your general insights will almost certainly remain intact.
dbGetQuery(
  con,
  "
  WITH trip_deciles AS (
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      tip_amount,
      trip_distance,
      NTILE(10) OVER (ORDER BY trip_distance) AS decile
    USING SAMPLE 1%
  )
  FROM trip_deciles
  SELECT
    decile,
    AVG(trip_distance) AS mean_distance,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  "
)
con.query(
  '''
  WITH trip_deciles AS (
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      tip_amount,
      trip_distance,
      NTILE(10) OVER (ORDER BY trip_distance) AS decile
    USING SAMPLE 1%
  )
  FROM trip_deciles
  SELECT
    decile,
    AVG(trip_distance) AS mean_distance,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  '''
)
Again, this step isn’t strictly necessary since we instantiated our connection with the shutdown = TRUE argument. But it’s worth seeing in case you want to be explicit.
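In R, the explicit equivalent would be along these lines (a sketch using the DBI/duckdb connection from above):

dbDisconnect(con, shutdown = TRUE)  # close the connection and shut down the DuckDB instance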
con.close()
Footnotes
Aside: The shutdown = TRUE argument is a convenience feature that ensures our connection is automatically terminated when our R session ends (i.e., even if we forget to do it manually.) I’m not aware of a similar convenience argument for Python; please let me know if I am missing something.↩︎
I’ll admit that the benefits don’t seem so great for this simple example. But trust me: they make a big difference once you start having lots of grouping columns and complex sub-queries.↩︎