# Use the same queries from Julia

October 30, 2024

We have a Julia package for DuckDB as well. We proceed as before with the other languages:

```julia
using DuckDB
using DataFrames
using Dates
using Chain

con = DBInterface.connect(DuckDB.DB)
```

```
DuckDB.DB(":memory:")
```
Let’s run the first simple query again to get the number of rows, as before with R:
```julia
nrowsdb = @chain con begin
    DBInterface.execute(
        """
        FROM 'nyc-taxi/**/*.parquet'
        SELECT COUNT(passenger_count) AS nrows
        """
    )
end
```
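If the `@chain` macro is new to you: it simply threads `con` in as the first argument of the next call, so the above is equivalent to the plain version:

```julia
# same query without Chain.jl: con is passed explicitly as the first argument
nrowsdb = DBInterface.execute(
    con,
    """
    FROM 'nyc-taxi/**/*.parquet'
    SELECT COUNT(passenger_count) AS nrows
    """
)
```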
```julia
ncolsdb = DBInterface.execute(
    con,
    """
    FROM 'nyc-taxi/**/*.parquet'
    SELECT * LIMIT 1
    """
) |> DataFrame |> names |> length

(rows = nrowsdb, cols = ncolsdb)
```

```
(rows = (nrows = [178544324],), cols = 24)
```
… and we know that this is a lot of data: over 178 million rows.

Same examples as previously:
```sql
SELECT
    passenger_count,
    AVG(tip_amount) AS mean_tip
FROM 'nyc-taxi/**/*.parquet'
GROUP BY passenger_count
ORDER BY passenger_count
```
```julia
time1 = @elapsed dat1 = DBInterface.execute(
    con,
    """
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
        passenger_count,
        AVG(tip_amount) AS mean_tip
    GROUP BY ALL
    ORDER BY ALL
    """
) |> DataFrame
```

```
2.151580334
```
That took 2.15 seconds.
Let’s try out some more aggregation queries. How about a slight variation on our first example query, where we (a) add “month” as a second grouping variable, and (b) subset to only the first three months of the year.
```julia
time2 = @elapsed dat2 = DBInterface.execute(
    con,
    """
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
        passenger_count,
        AVG(tip_amount) AS mean_tip
    WHERE month <= 3
    GROUP BY ALL
    """
)
```

```
0.554095084
```
This time we clocked in at 0.55 seconds. Keep in mind that this is a lot of data to go through each time.
## register_data_frame

One cool thing about the Julia package is that you can register a local DataFrame with the database, which means that no copy operation is performed. DuckDB will directly read that Julia DataFrame instead:
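A minimal sketch (the small frame and the view name `my_df` are made up for illustration):

```julia
small_df = DataFrame(a = 1:3, b = ["x", "y", "z"])

# register the DataFrame under a name; DuckDB reads it in place, no copy
DuckDB.register_data_frame(con, small_df, "my_df")

# query it like any other table
DBInterface.execute(con, "SELECT COUNT(*) AS n FROM my_df") |> DataFrame
```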
## Appender

The appender is an advanced DuckDB feature which is optimized for fast row insertion into a database table. This feature is available in neither the R nor the Python packages. Here is an example. Let’s first create a new DataFrame in our Julia session and then read it row by row into an empty database table.
```julia
ndf = 100_000
df = DataFrame(
    id = collect(1:ndf),
    value = rand(Float32, ndf),
    timestamp = Dates.now() .+ Dates.Second.(1:ndf),
    date = Dates.today() .+ Dates.Day.(1:ndf)
)
first(df, 5)
```
| Row | id (Int64) | value (Float32) | timestamp (DateTime) | date (Date) |
|----:|-----------:|----------------:|:-------------------------|:-----------|
| 1 | 1 | 0.689282 | 2024-10-30T22:06:49.044 | 2024-10-31 |
| 2 | 2 | 0.620124 | 2024-10-30T22:06:50.044 | 2024-11-01 |
| 3 | 3 | 0.17992 | 2024-10-30T22:06:51.044 | 2024-11-02 |
| 4 | 4 | 0.21845 | 2024-10-30T22:06:52.044 | 2024-11-03 |
| 5 | 5 | 0.586971 | 2024-10-30T22:06:53.044 | 2024-11-04 |
Create a new database with an empty table called `test_table1`:
```julia
db = DuckDB.DB()
DBInterface.execute(db,
    """
    CREATE OR REPLACE TABLE
    test_table1(id INTEGER PRIMARY KEY, value FLOAT, timestamp TIMESTAMP, date DATE)
    """)
```

```
(Count = Int64[],)
```
Now let us prepare a SQL statement, into which we will repeatedly insert each row of data. This makes use of positional parameters `?`:
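```julia
# prepare the parameterized INSERT statement once, reuse it for every row
stmt = DBInterface.prepare(db, "INSERT INTO test_table1 VALUES(?,?,?,?)")
```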
```
DuckDB.Stmt(DuckDB.Connection(":memory:"), Ptr{Nothing} @0x0000000110ff7420, "INSERT INTO test_table1 VALUES(?,?,?,?)\n", DuckDB.MaterializedResult)
```
We are ready to time this now.
```julia
e = @elapsed res = begin
    for r in eachrow(df)
        DBInterface.execute(stmt,
            (r.id, r.value, r.timestamp, r.date)
        )
    end
    DBInterface.close!(stmt)  # close statement when done
end
```

```
33.035374417
```
On my machine this took 33.04 seconds. This is probably not the most efficient way to get data into your database, but the situation could arise (if you create the data yourself from some modeling task, for example).
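One likely faster bulk route, building on `register_data_frame` from above, is to let DuckDB ingest the whole DataFrame in a single statement (a sketch; the view and table names are illustrative):

```julia
# register df as a view (no copy), then bulk-create a table from it
DuckDB.register_data_frame(db, df, "df_view")
DBInterface.execute(db, "CREATE OR REPLACE TABLE test_table1b AS SELECT * FROM df_view")
```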
Now, let’s try the `Appender` API. We create an identical table and attach an `Appender` object to it:

```julia
DBInterface.execute(db,
    "CREATE OR REPLACE TABLE
    test_table2(id INTEGER PRIMARY KEY, value FLOAT, timestamp TIMESTAMP, date DATE)")

# create an appender on the second table
appender = DuckDB.Appender(db, "test_table2")
```

```
DuckDB.Appender(Ptr{Nothing} @0x00000001045e4340)
```
Let’s see how fast that insert operation is now:
```julia
e2 = @elapsed begin
    for i in eachrow(df)
        for j in i
            DuckDB.append(appender, j)
        end
        DuckDB.end_row(appender)
    end
    DuckDB.close(appender)  # done now
end
```

```
0.179672166
```
Now only 0.18 seconds! Holy smokes.
For the purpose of working with this notebook, we can currently only have a single quarto engine per notebook. While it’s perfectly possible to call both R and Python code from a running Julia session, one needs to annotate the code with some additional markup, which is not great for a learner. So I preferred to separate the Julia version from the others. It’s most useful to create a local environment. To achieve this, you can open Julia in the root of this document and do the following:
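A sketch of the usual steps (the package list is illustrative; add whatever your notebook uses):

```julia
using Pkg
Pkg.activate(".")                           # create/activate a local environment here
Pkg.add(["DuckDB", "DataFrames", "Chain"])  # install the notebook's dependencies
```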
Quarto will then pick up this environment when it sees `engine: julia` in the frontmatter.