DuckDB in Julia

Use the same queries from Julia

Author
Affiliation

University of Turin

Published

October 30, 2024

Load libraries

We have a julia package as well for duckdb. 1

Create a database connection

We proceed as before with the other languages:

using DuckDB
using DataFrames
using Dates
using Chain
con = DBInterface.connect(DuckDB.DB)
DuckDB.DB(":memory:")

Let’s run the first simple query again to get the number of rows, as before with R:

nrowsdb = @chain con begin
    DBInterface.execute(
        """
        FROM 'nyc-taxi/**/*.parquet'
        SELECT COUNT(passenger_count) as nrows
        """
    )
end

ncolsdb = DBInterface.execute(con,
        """
        FROM 'nyc-taxi/**/*.parquet'
         SELECT * LIMIT 1
         """
         ) |> DataFrame |> names |> length

(rows = nrowsdb, cols = ncolsdb)
(rows = (nrows = [178544324],), cols = 24)

… and we know that this is a lot of data:

run(`du -h -d 0 nyc-taxi`)
8.6G    nyc-taxi
Process(`du -h -d 0 nyc-taxi`, ProcessExited(0))

First example

Same examples as previously:

SELECT
  passenger_count,
  AVG(tip_amount) AS mean_tip
FROM 'nyc-taxi/**/*.parquet'
GROUP BY passenger_count
ORDER BY passenger_count
time1 = @elapsed dat1 = DBInterface.execute(
    con,
    """
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
        passenger_count,
        AVG(tip_amount) AS mean_tip
    GROUP BY ALL
    ORDER BY ALL
    """
    ) |> DataFrame
2.151580334

That took 2.15 seconds. Here is the result:

first(dat1,5)
5×2 DataFrame
Row passenger_count mean_tip
Int64 Float64
1 0 0.862099
2 1 1.15101
3 2 1.08158
4 3 0.962949
5 4 0.844519

More Aggregation

Let’s try out some more aggregation queries. How about a slightly variation on a our first example query, where we (a) add “month” as a second grouping variable, and (b) subset to only the first three months of the year.

time2 = @elapsed dat1 = DBInterface.execute(
    con,
    """
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
        passenger_count,
        AVG(tip_amount) AS mean_tip
    WHERE month <= 3
    GROUP BY ALL
    """
    )
0.554095084

This time we clocked up 0.55 seconds - keep in mind that this is a lot of data to go through each time.

Julia Specifics: register_data_frame

One cool thing about the julia package is that you can register a local dataframe into the database, which means that there is no copy oepration performed. Duckdb will directly read that julia dataframe instead:

DuckDB.register_data_frame(con, dat1, "dat1_on_db")

# let's read the first row back out just to check
results = DBInterface.execute(con, 
    """
    SELECT * FROM dat1_on_db
    LIMIT 1
    """
    )
DataFrame(results)
1×2 DataFrame
Row passenger_count mean_tip
Int64 Float64
1 4 0.795176

Julia Specifics: Appender

The appender is an advanced duckdb feature which is optimized for fast row insertion into a database table. This feature is not available in neither R nor python packages. Here is an example. Let’s first create a new dataframe in your julia session and then read it row by row into an empty database table.

ndf = 100_000
df = DataFrame(
        id        = collect(1:ndf),
        value     = rand(Float32,ndf),
        timestamp = Dates.now() + Dates.Second.(1:ndf),
        date      = Dates.today() + Dates.Day.(1:ndf)
    )
first(df,5)
5×4 DataFrame
Row id value timestamp date
Int64 Float32 DateTime Date
1 1 0.689282 2024-10-30T22:06:49.044 2024-10-31
2 2 0.620124 2024-10-30T22:06:50.044 2024-11-01
3 3 0.17992 2024-10-30T22:06:51.044 2024-11-02
4 4 0.21845 2024-10-30T22:06:52.044 2024-11-03
5 5 0.586971 2024-10-30T22:06:53.044 2024-11-04

create a new database with an empty table called test_table1:

db = DuckDB.DB()

DBInterface.execute(db,
    """
    CREATE OR REPLACE TABLE 
    test_table1(id INTEGER PRIMARY KEY, value FLOAT, timestamp TIMESTAMP, date DATE)
    """)
(Count = Int64[],)

Now let us prepare a SQL statement, into which we will repeatedly insert the row for each data. This makes use of the positional parameters ?:

stmt = DBInterface.prepare(db, 
    """
    INSERT INTO test_table1 VALUES(?,?,?,?)
    """
    )
DuckDB.Stmt(DuckDB.Connection(":memory:"), Ptr{Nothing} @0x0000000110ff7420, "INSERT INTO test_table1 VALUES(?,?,?,?)\n", DuckDB.MaterializedResult)

We are ready to time this now.

e = @elapsed res = 
    begin
        for r in eachrow(df)
            DBInterface.execute(stmt, 
                (r.id, r.value, r.timestamp, r.date)
            )
        end
        DBInterface.close!(stmt)  # close statement when done
    end
33.035374417

On my machine this took 33.04 seconds. This is probably not the most efficient way to get data into your database, but it could happen (if you create the data yourself from some modeling task, for example).

Now, let’s try with the Appender API. We create an identical table, and attach an Appender object to it:

DBInterface.execute(db,
    "CREATE OR REPLACE TABLE 
    test_table2(id INTEGER PRIMARY KEY, value FLOAT, timestamp TIMESTAMP, date DATE)")

# create an appender on the second table
appender = DuckDB.Appender(db, "test_table2")
DuckDB.Appender(Ptr{Nothing} @0x00000001045e4340)

let’s see how fast that insert operation is now:

e2 = @elapsed begin
    for i in eachrow(df)
        for j in i
            DuckDB.append(appender, j)
        end
    DuckDB.end_row(appender)
    end
    DuckDB.close(appender)  # done now
end
0.179672166

Now only 0.18 seconds only! Holy smokes.

Meta Info

For the purpose of working with this notebook, we can currently only have a single quarto engine per notebook. While it’s perfectly possible to call both R and python code from a running julia session, one needs to annotate the code with some additional markup, which is not great for a learner. So I prefered to separate the julia version from the others. It’s most useful to create a local environment. To achieve this, you can open julia in the root of this document and do

] activate .  # activate current dir in Pkg mode
add DuckDB
add DataFrames
add Dates
add Chain

and quarto will pick up this env when it sees engine: julia in the frontmatter.

Footnotes

  1. Note that there is currently a bug in the QuartoRunner setup. So, to run the quarto notebook, you have to use julia 1.10. For just executing the code, the latest julia version should work↩︎