# Use the same queries from Julia

October 30, 2024

We have a Julia package for DuckDB as well. We proceed as before with the other languages:

```julia
using DuckDB
using DataFrames
using Dates
using Chain

con = DBInterface.connect(DuckDB.DB)
```

```
DuckDB.DB(":memory:")
```
Let’s run the first simple query again to get the number of rows, as before with R:
```julia
nrowsdb = @chain con begin
    DBInterface.execute(
        """
        FROM 'nyc-taxi/**/*.parquet'
        SELECT COUNT(passenger_count) AS nrows
        """
    )
end
```
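If the `@chain` macro is new to you: it simply threads `con` in as the first argument of the next call, so the above is equivalent to the plain version:

```julia
# same query without Chain.jl: con is passed explicitly as the first argument
nrowsdb = DBInterface.execute(
    con,
    """
    FROM 'nyc-taxi/**/*.parquet'
    SELECT COUNT(passenger_count) AS nrows
    """
)
```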
```julia
ncolsdb = DBInterface.execute(
    con,
    """
    FROM 'nyc-taxi/**/*.parquet'
    SELECT * LIMIT 1
    """
) |> DataFrame |> names |> length

(rows = nrowsdb, cols = ncolsdb)
```

```
(rows = (nrows = [178544324],), cols = 24)
```
… and we know that this is a lot of data: over 178 million rows.

Same examples as previously:
```sql
SELECT
    passenger_count,
    AVG(tip_amount) AS mean_tip
FROM 'nyc-taxi/**/*.parquet'
GROUP BY passenger_count
ORDER BY passenger_count
```
```julia
time1 = @elapsed dat1 = DBInterface.execute(
    con,
    """
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
        passenger_count,
        AVG(tip_amount) AS mean_tip
    GROUP BY ALL
    ORDER BY ALL
    """
) |> DataFrame
```

```
2.151580334
```
That took 2.15 seconds.
Let’s try out some more aggregation queries. How about a slight variation on our first example query, where we (a) add “month” as a second grouping variable, and (b) subset to only the first three months of the year.
```julia
time2 = @elapsed dat2 = DBInterface.execute(
    con,
    """
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
        passenger_count,
        AVG(tip_amount) AS mean_tip
    WHERE month <= 3
    GROUP BY ALL
    """
)
```

```
0.554095084
```
This time we clocked in at 0.55 seconds. Keep in mind that this is a lot of data to go through each time.
## register_data_frame

One cool thing about the Julia package is that you can register a local DataFrame with the database, which means that no copy operation is performed. DuckDB will directly read that Julia DataFrame instead:
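A minimal sketch (the small frame and the view name `my_df` are made up for illustration):

```julia
small_df = DataFrame(a = 1:3, b = ["x", "y", "z"])

# register the DataFrame under a name; DuckDB reads it in place, no copy
DuckDB.register_data_frame(con, small_df, "my_df")

# query it like any other table
DBInterface.execute(con, "SELECT COUNT(*) AS n FROM my_df") |> DataFrame
```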
## Appender

The appender is an advanced DuckDB feature which is optimized for fast row insertion into a database table. This feature is available in neither the R nor the Python packages. Here is an example. Let’s first create a new DataFrame in our Julia session and then read it row by row into an empty database table.
```julia
ndf = 100_000
df = DataFrame(
    id = collect(1:ndf),
    value = rand(Float32, ndf),
    timestamp = Dates.now() .+ Dates.Second.(1:ndf),
    date = Dates.today() .+ Dates.Day.(1:ndf)
)
first(df, 5)
```
| Row | id (Int64) | value (Float32) | timestamp (DateTime) | date (Date) |
|----:|-----------:|----------------:|:-------------------------|:-----------|
| 1 | 1 | 0.689282 | 2024-10-30T22:06:49.044 | 2024-10-31 |
| 2 | 2 | 0.620124 | 2024-10-30T22:06:50.044 | 2024-11-01 |
| 3 | 3 | 0.17992 | 2024-10-30T22:06:51.044 | 2024-11-02 |
| 4 | 4 | 0.21845 | 2024-10-30T22:06:52.044 | 2024-11-03 |
| 5 | 5 | 0.586971 | 2024-10-30T22:06:53.044 | 2024-11-04 |
Create a new database with an empty table called `test_table1`:
```julia
db = DuckDB.DB()
DBInterface.execute(db,
    """
    CREATE OR REPLACE TABLE
    test_table1(id INTEGER PRIMARY KEY, value FLOAT, timestamp TIMESTAMP, date DATE)
    """)
```

```
(Count = Int64[],)
```
Now let us prepare a SQL statement, into which we will repeatedly insert each row of data. This makes use of positional parameters `?`:
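```julia
# prepare the parameterized INSERT statement once, reuse it for every row
stmt = DBInterface.prepare(db, "INSERT INTO test_table1 VALUES(?,?,?,?)")
```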
```
DuckDB.Stmt(DuckDB.Connection(":memory:"), Ptr{Nothing} @0x0000000110ff7420, "INSERT INTO test_table1 VALUES(?,?,?,?)\n", DuckDB.MaterializedResult)
```
We are ready to time this now.
```julia
e = @elapsed res = begin
    for r in eachrow(df)
        DBInterface.execute(stmt,
            (r.id, r.value, r.timestamp, r.date)
        )
    end
    DBInterface.close!(stmt)  # close statement when done
end
```

```
33.035374417
```
On my machine this took 33.04 seconds. This is probably not the most efficient way to get data into your database, but the situation could arise (if you create the data yourself from some modeling task, for example).
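One likely faster bulk route, building on `register_data_frame` from above, is to let DuckDB ingest the whole DataFrame in a single statement (a sketch; the view and table names are illustrative):

```julia
# register df as a view (no copy), then bulk-create a table from it
DuckDB.register_data_frame(db, df, "df_view")
DBInterface.execute(db, "CREATE OR REPLACE TABLE test_table1b AS SELECT * FROM df_view")
```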
Now, let’s try the `Appender` API. We create an identical table and attach an `Appender` object to it:

```julia
DBInterface.execute(db,
    "CREATE OR REPLACE TABLE
    test_table2(id INTEGER PRIMARY KEY, value FLOAT, timestamp TIMESTAMP, date DATE)")

# create an appender on the second table
appender = DuckDB.Appender(db, "test_table2")
```

```
DuckDB.Appender(Ptr{Nothing} @0x00000001045e4340)
```
Let’s see how fast that insert operation is now:
```julia
e2 = @elapsed begin
    for i in eachrow(df)
        for j in i
            DuckDB.append(appender, j)
        end
        DuckDB.end_row(appender)
    end
    DuckDB.close(appender)  # done now
end
```

```
0.179672166
```
Now only 0.18 seconds! Holy smokes.
For the purpose of working with this notebook, we can currently only have a single quarto engine per notebook. While it’s perfectly possible to call both R and Python code from a running Julia session, one needs to annotate the code with some additional markup, which is not great for a learner. So I preferred to separate the Julia version from the others. It’s most useful to create a local environment. To achieve this, you can open Julia in the root of this document and do the following:
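A sketch of the usual steps (the package list is illustrative; add whatever your notebook uses):

```julia
using Pkg
Pkg.activate(".")                           # create/activate a local environment here
Pkg.add(["DuckDB", "DataFrames", "Chain"])  # install the notebook's dependencies
```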
Quarto will then pick up this environment when it sees `engine: julia` in the frontmatter.