Polars from Python and R

Pro-tip: Just swap . (Python) for $ (R), or vice versa

Load libraries

import polars as pl
import time
import matplotlib
library(polars)
Warning: package 'polars' was built under R version 4.5.2

Scan data

nyc = pl.scan_parquet("nyc-taxi/**/*.parquet", hive_partitioning=True)
nyc
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

Parquet SCAN [nyc-taxi/2009/month=01/data.parquet, ... 39 other sources]
  PROJECT */19 COLUMNS
  ESTIMATED ROWS: 563696520
nyc = pl$scan_parquet("nyc-taxi/**/*.parquet", hive_partitioning=TRUE)
nyc
<polars_lazy_frame>
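
Since scan_parquet() only registers a lazy query, nothing has been read into memory yet. If you want a quick sanity check of the underlying data, one option (a minimal Python sketch; the R version just swaps . for $) is to collect a small preview:

# Grab the first few rows only; the full dataset is never materialized
nyc.head(5).collect()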

First example

Polars operations on a LazyFrame are registered as queries; no computation is performed until they are collected.

q1 = (
    nyc
    .group_by(["passenger_count"])
    .agg([
            pl.mean("tip_amount")#.alias("mean_tip") ## alias is optional
        ])
    .sort("passenger_count")
)
q1
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

SORT BY [col("passenger_count")]
  AGGREGATE[maintain_order: false]
    [col("tip_amount").mean()] BY [col("passenger_count")]
    FROM
    Parquet SCAN [nyc-taxi/2009/month=01/data.parquet, ... 39 other sources]
    PROJECT */19 COLUMNS
    ESTIMATED ROWS: 563696520
q1 = (
    nyc
    $group_by("passenger_count")
    $agg(
        pl$col("tip_amount")$mean()
    )
    $sort("passenger_count")
)
q1 
<polars_lazy_frame>
Note: R-polars multiline syntax

Polars-style x$method1()$method2()... chaining may seem a little odd to R users, especially for multiline queries. Here I have adopted the same general styling as Python: by enclosing the full query in parentheses (), we can start each $method() on a new line. If this isn’t to your liking, you could also rewrite these multiline queries as follows:

nyc$group_by(
    "passenger_count"
)$agg(
    pl$col("tip_amount")$mean()$alias("mean_tip")
)$sort("passenger_count")

(Note: this is the naive query plan, not the optimized query that polars will actually implement for us. We’ll come back to this idea shortly.)

Calling collect() triggers the actual computation.

tic = time.time()
dat1 = q1.collect()
toc = time.time()

dat1
shape: (53, 2)
┌─────────────────┬────────────┐
│ passenger_count ┆ tip_amount │
│ ---             ┆ ---        │
│ i8              ┆ f32        │
╞═════════════════╪════════════╡
│ -127            ┆ 0.5        │
│ -123            ┆ 0.0        │
│ -122            ┆ 5.0        │
│ -119            ┆ 9.0        │
│ -115            ┆ 2.0        │
│ …               ┆ …          │
│ 70              ┆ 0.0        │
│ 84              ┆ 5.0        │
│ 97              ┆ 2.0        │
│ 113             ┆ 0.0        │
│ 125             ┆ 2.0        │
└─────────────────┴────────────┘
# print(f"Time difference of {toc - tic} seconds")
tic = Sys.time()
dat1 = q1$collect()
toc = Sys.time()

dat1
shape: (53, 2)
┌─────────────────┬────────────┐
│ passenger_count ┆ tip_amount │
│ ---             ┆ ---        │
│ i8              ┆ f32        │
╞═════════════════╪════════════╡
│ -127            ┆ 0.5        │
│ -123            ┆ 0.0        │
│ -122            ┆ 5.0        │
│ -119            ┆ 9.0        │
│ -115            ┆ 2.0        │
│ …               ┆ …          │
│ 70              ┆ 0.0        │
│ 84              ┆ 5.0        │
│ 97              ┆ 2.0        │
│ 113             ┆ 0.0        │
│ 125             ┆ 2.0        │
└─────────────────┴────────────┘
toc - tic
Time difference of 0.872997 secs

Aggregation

Subsetting along the partition dimensions (here: month) allows for even more efficiency gains, since polars only needs to scan the relevant subset of files.

q2 = (
    nyc
    .filter(pl.col("month") <= 3)
    .group_by(["month", "passenger_count"])
    .agg([pl.mean("tip_amount").alias("mean_tip")])
    .sort("passenger_count")
)
q2 = (
    nyc
    $filter(pl$col("month") <= 3)
    $group_by("month", "passenger_count")
    $agg(pl$col("tip_amount")$mean()$alias("mean_tip"))
    $sort("passenger_count")
) 

Let’s take a look at the optimized query that Polars will implement for us.

# q2             # naive
# q2.show_graph()  # optimized
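# print(q2.explain())  # optimized plan printed as text (cf. LazyFrame.explain)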
# q2              # naive
cat(q2$explain()) # optimized
SORT BY [col("passenger_count")]
  AGGREGATE[maintain_order: false]
    [col("tip_amount").mean().alias("mean_tip")] BY [col("month"), col("passenger_count")]
    FROM
    Parquet SCAN [nyc-taxi/2009/month=01/data.parquet, ... 10 other sources]
    PROJECT 3/19 COLUMNS
    SELECTION: [(col("month").cast(Float64)) <= (3.0)]
    ESTIMATED ROWS: 155016543

Now, let’s run the query and collect the results.

tic = time.time()
dat2 = q2.collect()
toc = time.time()

dat2
shape: (45, 3)
┌───────┬─────────────────┬──────────┐
│ month ┆ passenger_count ┆ mean_tip │
│ ---   ┆ ---             ┆ ---      │
│ i64   ┆ i8              ┆ f32      │
╞═══════╪═════════════════╪══════════╡
│ 2     ┆ -127            ┆ 0.5      │
│ 1     ┆ -48             ┆ 0.0      │
│ 2     ┆ -48             ┆ 0.0      │
│ 3     ┆ -48             ┆ 0.496447 │
│ 1     ┆ -45             ┆ 2.0      │
│ …     ┆ …               ┆ …        │
│ 1     ┆ 49              ┆ 0.0      │
│ 3     ┆ 61              ┆ 4.34     │
│ 1     ┆ 65              ┆ 0.0      │
│ 2     ┆ 97              ┆ 2.0      │
│ 1     ┆ 113             ┆ 0.0      │
└───────┴─────────────────┴──────────┘
# print(f"Time difference of {toc - tic} seconds")
tic = Sys.time()
dat2 = q2$collect()
toc = Sys.time()

dat2
shape: (45, 3)
┌───────┬─────────────────┬──────────┐
│ month ┆ passenger_count ┆ mean_tip │
│ ---   ┆ ---             ┆ ---      │
│ i64   ┆ i8              ┆ f32      │
╞═══════╪═════════════════╪══════════╡
│ 2     ┆ -127            ┆ 0.5      │
│ 1     ┆ -48             ┆ 0.0      │
│ 2     ┆ -48             ┆ 0.0      │
│ 3     ┆ -48             ┆ 0.496447 │
│ 1     ┆ -45             ┆ 2.0      │
│ …     ┆ …               ┆ …        │
│ 1     ┆ 49              ┆ 0.0      │
│ 3     ┆ 61              ┆ 4.34     │
│ 1     ┆ 65              ┆ 0.0      │
│ 2     ┆ 97              ┆ 2.0      │
│ 1     ┆ 113             ┆ 0.0      │
└───────┴─────────────────┴──────────┘
toc - tic
Time difference of 0.5320392 secs

A high-dimensional grouping example. This is one query where polars is noticeably slower than DuckDB.

q3 = (
    nyc
    .group_by(["passenger_count", "trip_distance"])
    .agg([
        pl.mean("tip_amount").alias("mean_tip"),
        pl.mean("fare_amount").alias("mean_fare"),
        ])
    .sort(["passenger_count", "trip_distance"])
)

tic = time.time()
dat3 = q3.collect()
toc = time.time()

dat3
shape: (34_490, 4)
┌─────────────────┬───────────────┬──────────┬───────────┐
│ passenger_count ┆ trip_distance ┆ mean_tip ┆ mean_fare │
│ ---             ┆ ---           ┆ ---      ┆ ---       │
│ i8              ┆ f32           ┆ f32      ┆ f32       │
╞═════════════════╪═══════════════╪══════════╪═══════════╡
│ -127            ┆ 0.34          ┆ 0.0      ┆ 5.7       │
│ -127            ┆ 0.71          ┆ 0.0      ┆ 4.5       │
│ -127            ┆ 1.22          ┆ 1.0      ┆ 5.7       │
│ -127            ┆ 1.37          ┆ 0.0      ┆ 9.3       │
│ -127            ┆ 2.55          ┆ 0.0      ┆ 8.5       │
│ …               ┆ …             ┆ …        ┆ …         │
│ 70              ┆ 3.06          ┆ 0.0      ┆ 9.7       │
│ 84              ┆ 13.98         ┆ 5.0      ┆ 33.700001 │
│ 97              ┆ 1.87          ┆ 2.0      ┆ 6.9       │
│ 113             ┆ 0.0           ┆ 0.0      ┆ 13.3      │
│ 125             ┆ 3.83          ┆ 2.0      ┆ 14.1      │
└─────────────────┴───────────────┴──────────┴───────────┘
# print(f"Time difference of {toc - tic} seconds")
q3 = (
    nyc
    $group_by("passenger_count", "trip_distance")
    $agg(
        pl$col("tip_amount")$mean()$alias("mean_tip"),
        pl$col("tip_amount")$mean()$alias("mean_fare")
        )
    $sort("passenger_count", "trip_distance")
)

tic = Sys.time()
dat3 = q3$collect()
toc = Sys.time()
 
dat3
shape: (34_490, 4)
┌─────────────────┬───────────────┬──────────┬───────────┐
│ passenger_count ┆ trip_distance ┆ mean_tip ┆ mean_fare │
│ ---             ┆ ---           ┆ ---      ┆ ---       │
│ i8              ┆ f32           ┆ f32      ┆ f32       │
╞═════════════════╪═══════════════╪══════════╪═══════════╡
│ -127            ┆ 0.34          ┆ 0.0      ┆ 5.7       │
│ -127            ┆ 0.71          ┆ 0.0      ┆ 4.5       │
│ -127            ┆ 1.22          ┆ 1.0      ┆ 5.7       │
│ -127            ┆ 1.37          ┆ 0.0      ┆ 9.3       │
│ -127            ┆ 2.55          ┆ 0.0      ┆ 8.5       │
│ …               ┆ …             ┆ …        ┆ …         │
│ 70              ┆ 3.06          ┆ 0.0      ┆ 9.7       │
│ 84              ┆ 13.98         ┆ 5.0      ┆ 33.700001 │
│ 97              ┆ 1.87          ┆ 2.0      ┆ 6.9       │
│ 113             ┆ 0.0           ┆ 0.0      ┆ 13.3      │
│ 125             ┆ 3.83          ┆ 2.0      ┆ 14.1      │
└─────────────────┴───────────────┴──────────┴───────────┘
toc - tic
Time difference of 23.48908 secs
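
Aside: one possible mitigation for heavy group-bys like this, offered only as a hedged suggestion since the exact argument has changed across polars releases, is to collect with the streaming engine, which processes the data in batches and can ease memory pressure:

# Assumption: your polars version exposes the streaming engine via collect();
# older releases spell this q3.collect(streaming=True) instead.
dat3 = q3.collect(engine="streaming")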

As an aside, if we didn’t care about column aliases (or sorting), then the previous query could be shortened to:

(
    nyc
    .group_by(["passenger_count", "trip_distance"])
    .agg(pl.col(["tip_amount", "fare_amount"]).mean())
    .collect()
)
(
    nyc
    $group_by("passenger_count", "trip_distance")
    $agg(pl$col("tip_amount", "fare_amount")$mean())
    $collect()
)

Pivot (reshape)

In polars, we have two distinct reshape methods:

  • pivot: long to wide
  • unpivot: wide to long

Here we’ll unpivot to go from wide to long and use the eager execution engine (i.e., on the dat3 DataFrame object that we’ve already computed) for expediency.

dat3.unpivot(index = ["passenger_count", "trip_distance"])
shape: (68_980, 4)
┌─────────────────┬───────────────┬─────────────┬───────────┐
│ passenger_count ┆ trip_distance ┆ variable    ┆ value     │
│ ---             ┆ ---           ┆ ---         ┆ ---       │
│ i8              ┆ f32           ┆ str         ┆ f32       │
╞═════════════════╪═══════════════╪═════════════╪═══════════╡
│ -127            ┆ 0.34          ┆ "mean_tip"  ┆ 0.0       │
│ -127            ┆ 0.71          ┆ "mean_tip"  ┆ 0.0       │
│ -127            ┆ 1.22          ┆ "mean_tip"  ┆ 1.0       │
│ -127            ┆ 1.37          ┆ "mean_tip"  ┆ 0.0       │
│ -127            ┆ 2.55          ┆ "mean_tip"  ┆ 0.0       │
│ …               ┆ …             ┆ …           ┆ …         │
│ 70              ┆ 3.06          ┆ "mean_fare" ┆ 9.7       │
│ 84              ┆ 13.98         ┆ "mean_fare" ┆ 33.700001 │
│ 97              ┆ 1.87          ┆ "mean_fare" ┆ 6.9       │
│ 113             ┆ 0.0           ┆ "mean_fare" ┆ 13.3      │
│ 125             ┆ 3.83          ┆ "mean_fare" ┆ 14.1      │
└─────────────────┴───────────────┴─────────────┴───────────┘
dat3$unpivot(index = c("passenger_count", "trip_distance"))
shape: (68_980, 4)
┌─────────────────┬───────────────┬─────────────┬───────────┐
│ passenger_count ┆ trip_distance ┆ variable    ┆ value     │
│ ---             ┆ ---           ┆ ---         ┆ ---       │
│ i8              ┆ f32           ┆ str         ┆ f32       │
╞═════════════════╪═══════════════╪═════════════╪═══════════╡
│ -127            ┆ 0.34          ┆ "mean_tip"  ┆ 0.0       │
│ -127            ┆ 0.71          ┆ "mean_tip"  ┆ 0.0       │
│ -127            ┆ 1.22          ┆ "mean_tip"  ┆ 1.0       │
│ -127            ┆ 1.37          ┆ "mean_tip"  ┆ 0.0       │
│ -127            ┆ 2.55          ┆ "mean_tip"  ┆ 0.0       │
│ …               ┆ …             ┆ …           ┆ …         │
│ 70              ┆ 3.06          ┆ "mean_fare" ┆ 9.7       │
│ 84              ┆ 13.98         ┆ "mean_fare" ┆ 33.700001 │
│ 97              ┆ 1.87          ┆ "mean_fare" ┆ 6.9       │
│ 113             ┆ 0.0           ┆ "mean_fare" ┆ 13.3      │
│ 125             ┆ 3.83          ┆ "mean_fare" ┆ 14.1      │
└─────────────────┴───────────────┴─────────────┴───────────┘
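
For completeness, here is a sketch of the reverse operation, i.e. pivoting the long table back to wide format. (This assumes a recent polars release where pivot takes on, index, and values arguments; older versions used slightly different argument names.)

# Long format, as computed above
long = dat3.unpivot(index=["passenger_count", "trip_distance"])

# Back to one column per variable ("mean_tip" and "mean_fare")
wide = long.pivot(
    on="variable",
    index=["passenger_count", "trip_distance"],
    values="value",
)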

Joins (merges)

mean_tips  = nyc.group_by("month").agg(pl.col("tip_amount").mean())
mean_fares = nyc.group_by("month").agg(pl.col("fare_amount").mean())
(
    mean_tips
    .join(
        mean_fares,
        on = "month",
        how = "left" # default is inner join
    )
    .collect()
)
shape: (12, 3)
┌───────┬────────────┬─────────────┐
│ month ┆ tip_amount ┆ fare_amount │
│ ---   ┆ ---        ┆ ---         │
│ i64   ┆ f32        ┆ f32         │
╞═══════╪════════════╪═════════════╡
│ 4     ┆ 0.814566   ┆ 10.054076   │
│ 12    ┆ 0.819398   ┆ 10.186188   │
│ 7     ┆ 0.778278   ┆ 10.076624   │
│ 8     ┆ 0.783877   ┆ 10.118335   │
│ 2     ┆ 0.828546   ┆ 9.804122    │
│ …     ┆ …          ┆ …           │
│ 6     ┆ 0.7862     ┆ 10.154768   │
│ 9     ┆ 0.824009   ┆ 10.337792   │
│ 5     ┆ 0.840874   ┆ 10.272066   │
│ 3     ┆ 0.853446   ┆ 10.045835   │
│ 10    ┆ 0.821033   ┆ 10.198154   │
└───────┴────────────┴─────────────┘
mean_tips  = nyc$group_by("month")$agg(pl$col("tip_amount")$mean())
mean_fares = nyc$group_by("month")$agg(pl$col("fare_amount")$mean())
(
    mean_tips
    $join(
        mean_fares,
        on = "month",
        how = "left"  # default is inner join
    )
    $collect()
)
shape: (12, 3)
┌───────┬────────────┬─────────────┐
│ month ┆ tip_amount ┆ fare_amount │
│ ---   ┆ ---        ┆ ---         │
│ i64   ┆ f32        ┆ f32         │
╞═══════╪════════════╪═════════════╡
│ 8     ┆ 0.783877   ┆ 10.118335   │
│ 6     ┆ 0.7862     ┆ 10.154768   │
│ 1     ┆ 0.753412   ┆ 9.621945    │
│ 4     ┆ 0.814566   ┆ 10.054076   │
│ 3     ┆ 0.853446   ┆ 10.045835   │
│ …     ┆ …          ┆ …           │
│ 5     ┆ 0.840874   ┆ 10.272066   │
│ 10    ┆ 0.821033   ┆ 10.198154   │
│ 2     ┆ 0.828546   ┆ 9.804122    │
│ 11    ┆ 0.834024   ┆ 10.209248   │
│ 12    ┆ 0.819398   ┆ 10.186188   │
└───────┴────────────┴─────────────┘

Appendix: Alternate interfaces

The native polars API is not the only way to interface with the underlying computation engine. Here are two alternate approaches that you may prefer, especially if you don’t want to learn a new syntax.

Ibis (Python)

The great advantage of Ibis (like dbplyr) is that it supports multiple backends through an identical frontend. So all of the syntax, logic, and workflow from the Ibis+DuckDB section carry over to an equivalent Ibis+Polars workflow too. All you need to do is change the connection type. For example:

import ibis
import ibis.selectors as s
from ibis import _

##! This next line is the only thing that's changed !##
con = ibis.polars.connect()

con.read_parquet("nyc-taxi/**/*.parquet", table_name = "nyc")
DatabaseTable: nyc
  vendor_id          string
  pickup_at          timestamp(9)
  dropoff_at         timestamp(9)
  passenger_count    int8
  trip_distance      float32
  pickup_longitude   float32
  pickup_latitude    float32
  rate_code_id       null
  store_and_fwd_flag string
  dropoff_longitude  float32
  dropoff_latitude   float32
  payment_type       string
  fare_amount        float32
  extra              float32
  mta_tax            float32
  tip_amount         float32
  tolls_amount       float32
  total_amount       float32
nyc = con.table("nyc")

(
  nyc
  .group_by(["passenger_count"])
  .agg(mean_tip = _.tip_amount.mean())
  .to_polars()
)
shape: (53, 2)
┌─────────────────┬───────────┐
│ passenger_count ┆ mean_tip  │
│ ---             ┆ ---       │
│ i8              ┆ f64       │
╞═════════════════╪═══════════╡
│ 8               ┆ 0.15      │
│ 66              ┆ 1.5       │
│ 15              ┆ 2.0       │
│ -96             ┆ 2.0       │
│ -48             ┆ 0.433162  │
│ …               ┆ …         │
│ 47              ┆ 0.0       │
│ 65              ┆ 0.0       │
│ 69              ┆ 0.0       │
│ -101            ┆ 46.599998 │
│ 84              ┆ 5.0       │
└─────────────────┴───────────┘

tidypolars (R)

The R package tidypolars (link) provides the “tidyverse” syntax while using polars as the backend. The syntax and workflow should thus be immediately familiar to R users.

It’s important to note that tidypolars is solely focused on the translation work. This means that you still need to load the main polars library alongside it for the actual computation, as well as dplyr (and potentially tidyr) for function generics.

library(polars) ## Already loaded
library(tidypolars)
Warning: package 'tidypolars' was built under R version 4.5.2
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)

nyc = scan_parquet_polars("nyc-taxi/**/*.parquet")

nyc |> 
    summarise(mean_tip = mean(tip_amount), .by = passenger_count) |>
    compute()
shape: (53, 2)
┌─────────────────┬──────────┐
│ passenger_count ┆ mean_tip │
│ ---             ┆ ---      │
│ i8              ┆ f32      │
╞═════════════════╪══════════╡
│ 6               ┆ 0.923139 │
│ -122            ┆ 5.0      │
│ 33              ┆ 1.0      │
│ 36              ┆ 11.25    │
│ 15              ┆ 2.0      │
│ …               ┆ …        │
│ -45             ┆ 2.0      │
│ 125             ┆ 2.0      │
│ 113             ┆ 0.0      │
│ -6              ┆ 0.333333 │
│ -33             ┆ 1.5      │
└─────────────────┴──────────┘

Aside: Use collect() instead of compute() at the end if you would prefer to return a standard R data.frame instead of a Polars DataFrame.
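
For example, a small sketch reusing the same query as above:

nyc |> 
    summarise(mean_tip = mean(tip_amount), .by = passenger_count) |>
    collect()  # returns a regular R data.frame instead of a Polars DataFrame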

See also polarssql (link) if you would like yet another “tidyverse”-esque alternative that works through DBI/d(b)plyr.