[1] "570,009,660"
Time difference of 1.434131 secs
With examples in R, Python and Julia
2024-05-02
Credits
Grant did not coauthor these slides with me, as one might assume from the first slide. Rather, he did most of the work for his 2024 workshop; I just keep making updates and additions. Thanks, Grant.
As clearly indicated on the landing page of this website:
Note: this material was originally developed by Grant McDermott for the Workshops for Ukraine series. Check out the original material on his website: https://grantmcdermott.com/duckdb-polars
These sparse slides are mostly intended to serve as a rough roadmap.
Note: All of the material for today’s workshop is available on my website:
Important: Before continuing, please make sure that you have completed the requirements listed on the workshop website.
The data download step can take 15-20 minutes, depending on your internet connection.
It’s a trope, but “big data” is everywhere. This is true whether you work in tech (like I do now) or in academic research (like I used to).
OTOH, many of the datasets that I find myself working with aren’t at the scale of truly huge data that would warrant a Spark cluster.
Another factor is working in polyglot teams. It would be great to repurpose similar syntax and libraries across languages…
[1] "570,009,660"
Time difference of 1.434131 secs
We just read a ~570 million row dataset (from disk!) and did a group-by aggregation on it.
In about 1.5 seconds.
On a laptop.
🤯
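As a rough sketch of what such a query can look like (this is not the workshop’s actual code; the Parquet path nyc-taxi/**/*.parquet and the passenger_count / tip_amount columns are assumptions), here is DuckDB in R counting and aggregating a directory of Parquet files straight from disk:

```r
# A minimal sketch, assuming a directory of Parquet files at 'nyc-taxi/'
# with passenger_count and tip_amount columns (not the workshop's code).
library(DBI)
library(duckdb)

con = dbConnect(duckdb())

tic = Sys.time()

# Formatted row count (the "570,009,660"-style number above)
nrows = dbGetQuery(
  con,
  "SELECT COUNT(*) AS n FROM read_parquet('nyc-taxi/**/*.parquet')"
)$n
print(format(nrows, big.mark = ","))

# Grouped aggregation, streamed from disk rather than loaded into RAM
dbGetQuery(
  con,
  "SELECT passenger_count, AVG(tip_amount) AS mean_tip
   FROM read_parquet('nyc-taxi/**/*.parquet')
   GROUP BY passenger_count"
)

toc = Sys.time()
print(toc - tic)  # prints e.g. "Time difference of ... secs"

dbDisconnect(con, shutdown = TRUE)
```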
Let’s do a quick horserace comparison (similar grouped aggregation, but on a slightly smaller dataset)…
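One hedged way to set up a horserace like this in R (synthetic data and assumed packages, not the workshop’s actual benchmark) is to run the same grouped mean through dplyr, data.table, and DuckDB and compare timings with bench::mark():

```r
# A sketch only: same grouped mean on a ~10-million-row synthetic data frame,
# timed across three backends. Assumes the bench package is installed.
library(DBI)
library(duckdb)
library(dplyr)
library(data.table)

n = 1e7
dat = data.frame(
  grp = sample(letters, n, replace = TRUE),
  val = rnorm(n)
)
DT = as.data.table(dat)

con = dbConnect(duckdb())
duckdb_register(con, "dat", dat)  # expose the data frame to DuckDB without copying

bench::mark(
  dplyr      = dat |> group_by(grp) |> summarise(mean_val = mean(val)),
  data.table = DT[, .(mean_val = mean(val)), by = grp],
  duckdb     = dbGetQuery(con, "SELECT grp, AVG(val) AS mean_val FROM dat GROUP BY grp"),
  check = FALSE  # backends return differently ordered/classed results
)

dbDisconnect(con, shutdown = TRUE)
```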
Two coinciding (r)evolutions enable faster, smarter computation:
Question: Do these benchmarks hold and scale more generally? Answer: Yes. See Database-like ops benchmark.
Moreover—and I think this is key—these kinds of benchmarks normally exclude the data I/O component… and the associated benefits of not having to hold the whole dataset in RAM.
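To illustrate that I/O point (with a hypothetical nyc-taxi directory and assumed column names, not the workshop’s code): an Arrow dataset is scanned lazily, so the full table never has to fit in memory; only the small aggregated result is collected.

```r
# A hedged illustration: open_dataset() reads only metadata, the query is
# pushed down to the on-disk Parquet files, and collect() pulls in just the
# handful of aggregated rows. Path and columns are assumptions.
library(arrow)
library(dplyr)

nyc = open_dataset("nyc-taxi")  # hypothetical directory of Parquet files

nyc |>
  group_by(passenger_count) |>
  summarise(mean_tip = mean(tip_amount)) |>
  collect()  # only the aggregated rows ever land in RAM
```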
Let’s head back to the website to work through some notebooks.