(Pretty) big data wrangling with DuckDB and Polars
With examples in R, Python and Julia
Disclaimer: This is a clone made by Florian Oswald.
Grant did not co-author these slides with me, as the first slide might suggest. Rather, he did most of the work for his 2024 workshop; I keep making updates and additions. Thanks, Grant.
Note that this material was originally developed by Grant McDermott for the Workshops for Ukraine series. Check out the original material here: https://grantmcdermott.com/duckdb-polars
Clone of Original Repository!
This repo is a clone of https://github.com/grantmcdermott/duckdb-polars. Since some time has passed, a few of the APIs have changed: I updated several function calls that no longer worked, and the data is no longer available at its old location. You can see the result of my updates on this website. The bulk of the work is still Grant's effort. I added the Julia version of the DuckDB examples, along with the relevant Julia setup in the requirements. Grant has not vetted those additions in any way, so any additional errors were introduced without his knowledge. Please check out the source code for his original work on GitHub, and make a fork of it if you want to use it. Thanks, Grant, as ever, for sharing these great resources with the rest of the world!
Description
This workshop will introduce you to DuckDB and Polars, two data wrangling libraries at the frontier of high-performance computation. (See benchmarks.) In addition to being extremely fast and portable, both DuckDB and Polars provide user-friendly implementations across multiple languages. This makes them very well suited to production and applied research settings, without the overhead of tools like Spark. We will provide a variety of real-life examples in R, Python, and Julia, with the aim of getting participants up and running as quickly as possible. We will learn how to wrangle datasets extending over several hundred million observations in a matter of seconds or less, using only our laptops. And we will learn how to scale to even larger contexts where the data exceeds our computers’ RAM capacity. Finally, we will also discuss some complementary tools and how these can be integrated for an efficient end-to-end workflow (data I/O -> wrangling -> analysis).
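To give a flavor of what the workshop covers, here is a minimal sketch in Python (not part of the original material): it runs the same aggregation once with DuckDB's SQL interface and once with Polars' lazy API against a local Parquet file, so only the columns and rows needed for the result are ever read into memory. The file name and column names below are hypothetical placeholders.

```python
import duckdb
import polars as pl

# DuckDB: query the Parquet file directly with SQL; no need to load it first.
con = duckdb.connect()
avg_by_group = con.sql("""
    SELECT passenger_count, AVG(trip_distance) AS avg_dist
    FROM 'taxi.parquet'
    GROUP BY passenger_count
""").df()

# Polars: build a lazy query plan over the same file and collect the result;
# the optimizer prunes unneeded columns, and streaming execution can handle
# data larger than RAM.
lazy_plan = (
    pl.scan_parquet("taxi.parquet")
      .group_by("passenger_count")
      .agg(pl.col("trip_distance").mean().alias("avg_dist"))
)
avg_by_group_pl = lazy_plan.collect()
```

Both snippets express the same "group by, then average" logic; the workshop works through many such examples (and their R and Julia equivalents) on much larger datasets.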