import ibis
import ibis.selectors as s
from ibis import _
# ibis.options.interactive = True # enforce eager execution of queriesDuckDB + Ibis (Python)
Use a dplyr-esque Python frontend
Load libraries
Connect and register
## Instantiate an in-memory DuckDB connection from Ibis
con = ibis.duckdb.connect()
## Register our parquet dataset as a table called "nyc" in our connection
con.read_parquet("nyc-taxi/**/*.parquet", table_name = "nyc")DatabaseTable: nyc vendor_id string pickup_at timestamp(6) dropoff_at timestamp(6) passenger_count int8 trip_distance float32 pickup_longitude float32 pickup_latitude float32 rate_code_id int32 store_and_fwd_flag string dropoff_longitude float32 dropoff_latitude float32 payment_type string fare_amount float32 extra float32 mta_tax float32 tip_amount float32 tolls_amount float32 total_amount float32 month string
Aside: Remember that you can create a persistent, disk-backed database by giving it an appropriate name/path. This also enables out-of-core computation for bigger than RAM data.
# con = ibis.duckdb.connect("nyc.dbb")
# con.register("nyc-taxi/**/*.parquet", "nyc")
# etc.Reference the table from Python. We’ll call this reference object nyc too for consistency, but you could call it whatever you want (e.g., you could call it nyc_ibis to avoid potential ambiguity with the “nyc” table in our actual DuckDB connection). Printing the object to screen will give us a lazy preview.
# con.list_tables() # Optional: confirm that our table is available
nyc = con.table("nyc")
nycDatabaseTable: nyc vendor_id string pickup_at timestamp(6) dropoff_at timestamp(6) passenger_count int8 trip_distance float32 pickup_longitude float32 pickup_latitude float32 rate_code_id int32 store_and_fwd_flag string dropoff_longitude float32 dropoff_latitude float32 payment_type string fare_amount float32 extra float32 mta_tax float32 tip_amount float32 tolls_amount float32 total_amount float32 month string
First example
q1 = (
nyc
.group_by(["passenger_count"])
.agg(mean_tip = _.tip_amount.mean())
)To see the underlying SQL translation, use ibis.to_sql()
ibis.to_sql(q1)SELECT
"t0"."passenger_count",
AVG("t0"."tip_amount") AS "mean_tip"
FROM "nyc" AS "t0"
GROUP BY
1To actually execute the query and bring the result into Python, we can use the execute() method. By default this will coerce to a pandas DataFrame.
dat1 = q1.execute()
dat1| passenger_count | mean_tip | |
|---|---|---|
| 0 | -127 | 0.500000 |
| 1 | -123 | 0.000000 |
| 2 | -122 | 5.000000 |
| 3 | -119 | 9.000000 |
| 4 | -115 | 2.000000 |
| 5 | -101 | 46.599998 |
| 6 | -98 | 2.000000 |
| 7 | -96 | 2.000000 |
| 8 | -93 | 1.000000 |
| 9 | -92 | 8.000000 |
| 10 | -91 | 0.900000 |
| 11 | -63 | 0.000000 |
| 12 | -48 | 0.433162 |
| 13 | -45 | 2.000000 |
| 14 | -43 | 0.000000 |
| 15 | -33 | 1.500000 |
| 16 | -31 | 3.000000 |
| 17 | -6 | 0.333333 |
| 18 | -1 | 4.780000 |
| 19 | 0 | 0.808656 |
| 20 | 1 | 0.838095 |
| 21 | 2 | 0.760080 |
| 22 | 3 | 0.672228 |
| 23 | 4 | 0.560272 |
| 24 | 5 | 0.820787 |
| 25 | 6 | 0.923139 |
| 26 | 7 | 0.530769 |
| 27 | 8 | 0.150000 |
| 28 | 9 | 0.483529 |
| 29 | 10 | 3.215625 |
| 30 | 13 | 2.500000 |
| 31 | 15 | 2.000000 |
| 32 | 17 | 4.260000 |
| 33 | 19 | 0.000000 |
| 34 | 25 | 1.200000 |
| 35 | 33 | 1.000000 |
| 36 | 34 | 2.000000 |
| 37 | 36 | 11.250000 |
| 38 | 37 | 0.000000 |
| 39 | 38 | 0.000000 |
| 40 | 47 | 0.000000 |
| 41 | 49 | 0.000000 |
| 42 | 53 | 0.000000 |
| 43 | 58 | 4.000000 |
| 44 | 61 | 4.340000 |
| 45 | 65 | 0.000000 |
| 46 | 66 | 1.500000 |
| 47 | 69 | 0.000000 |
| 48 | 70 | 0.000000 |
| 49 | 84 | 5.000000 |
| 50 | 97 | 2.000000 |
| 51 | 113 | 0.000000 |
| 52 | 125 | 2.000000 |
The q1.execute() method above is equivalent calling q1.to_pandas(). A q1.to_polars() equivalent has been added to the dev version of Ibis, but is not available with the latest offical release (8.0.0 at the time of writing).
Digression: Interactive Ibis use and eager execution
At the very top of this document, I commented out the ibis.options.interactive option as part of my Ibis configuration. This was because I wanted to demonstrate the default deferred (i.e., lazy) behaviour of Ibis, which is just the same as d(b)plyr in R. If you are building data wrangling pipelines, or writing scripts with potentially complex queries, you probably want to preserve this deferred behaviour and avoid eager execution.
However, there are times when you may want to default into eager execution. For example, if your dataset is of manageable size, or you are trying to iterate through different query operations… Or, you might just want to enable it so that you automatically get a nice print return object for your workshop materials. I’ll adopt the latter view, so that I can quickly demonstrate some Ibis syntax and results for the rest of this document.
ibis.options.interactive = TrueOkay, let’s speed through some of the same basic queries that we’ve already seen in the DuckDB SQL and R (dplyr) pages. I won’t bother to explain them in depth. Just consider them for demonstration purposes.
Aggregation
(
nyc
.group_by(["passenger_count", "trip_distance"])
.aggregate(
mean_tip = _.tip_amount.mean(),
mean_fare = _.fare_amount.mean()
)
)┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ passenger_count ┃ trip_distance ┃ mean_tip ┃ mean_fare ┃ ┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩ │ int8 │ float32 │ float64 │ float64 │ ├─────────────────┼───────────────┼──────────┼───────────┤ │ 2 │ 1.1 │ 0.372256 │ 6.070068 │ │ 1 │ 7.2 │ 1.664833 │ 19.794590 │ │ 2 │ 2.5 │ 0.644415 │ 9.722223 │ │ 1 │ 21.4 │ 4.348084 │ 46.363091 │ │ 2 │ 3.6 │ 0.872763 │ 12.422476 │ │ 1 │ 8.9 │ 2.439180 │ 22.956922 │ │ 2 │ 9.0 │ 2.001371 │ 23.424684 │ │ 1 │ 4.2 │ 1.143775 │ 13.616594 │ │ 4 │ 1.5 │ 0.296758 │ 7.278823 │ │ 1 │ 10.0 │ 2.770640 │ 25.357209 │ │ … │ … │ … │ … │ └─────────────────┴───────────────┴──────────┴───────────┘
Note that, even though we have enabled the interactive print mode, we still get lazy evalation if we assign a chain of query steps to an object (here: q3)…
q3 = (
nyc
.group_by(["passenger_count", "trip_distance"])
.agg(
mean_tip = _.tip_amount.mean(),
mean_fare = _.fare_amount.mean()
)
)… but printing the query to screen enforces computation.
q3┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ passenger_count ┃ trip_distance ┃ mean_tip ┃ mean_fare ┃ ┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩ │ int8 │ float32 │ float64 │ float64 │ ├─────────────────┼───────────────┼──────────┼───────────┤ │ 2 │ 1.1 │ 0.372256 │ 6.070068 │ │ 1 │ 7.2 │ 1.664833 │ 19.794590 │ │ 2 │ 2.5 │ 0.644415 │ 9.722223 │ │ 1 │ 21.4 │ 4.348084 │ 46.363091 │ │ 2 │ 3.6 │ 0.872763 │ 12.422476 │ │ 1 │ 8.9 │ 2.439180 │ 22.956922 │ │ 2 │ 9.0 │ 2.001371 │ 23.424684 │ │ 1 │ 4.2 │ 1.143775 │ 13.616594 │ │ 4 │ 1.5 │ 0.296758 │ 7.278823 │ │ 1 │ 10.0 │ 2.770640 │ 25.357209 │ │ … │ … │ … │ … │ └─────────────────┴───────────────┴──────────┴───────────┘
Pivot (reshape)
# now chain on pivoting (and enforce computation via printing)
long = q3.pivot_longer(
["mean_tip", "mean_fare"], # columns to pivot
names_to="metric",
values_to="value",
)
long┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ passenger_count ┃ trip_distance ┃ metric ┃ value ┃ ┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩ │ int8 │ float32 │ string │ float64 │ ├─────────────────┼───────────────┼───────────┼───────────┤ │ 1 │ 6.0 │ mean_tip │ 1.422905 │ │ 1 │ 6.0 │ mean_fare │ 17.371023 │ │ 2 │ 1.1 │ mean_tip │ 0.372256 │ │ 2 │ 1.1 │ mean_fare │ 6.070068 │ │ 3 │ 9.3 │ mean_tip │ 1.757103 │ │ 3 │ 9.3 │ mean_fare │ 24.362922 │ │ 1 │ 19.0 │ mean_tip │ 4.455674 │ │ 1 │ 19.0 │ mean_fare │ 45.151622 │ │ 2 │ 5.0 │ mean_tip │ 1.128359 │ │ 2 │ 5.0 │ mean_fare │ 15.494622 │ │ … │ … │ … │ … │ └─────────────────┴───────────────┴───────────┴───────────┘
Joins (merges)
(As we did in the dplyr code, we’ll break this contrived join example into two steps)
mean_tips = nyc.group_by("month").agg(mean_tip = _.tip_amount.mean())
mean_fares = nyc.group_by("month").agg(mean_fare = _.fare_amount.mean())(
mean_tips
.left_join(mean_fares, "month")
)┏━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ month ┃ mean_tip ┃ month_right ┃ mean_fare ┃ ┡━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ string │ float64 │ string │ float64 │ ├────────┼──────────┼─────────────┼───────────┤ │ 12 │ 0.819398 │ 12 │ 10.186188 │ │ 02 │ 0.828546 │ 02 │ 9.804122 │ │ 07 │ 0.778278 │ 07 │ 10.076624 │ │ 09 │ 0.824009 │ 09 │ 10.337792 │ │ 11 │ 0.834024 │ 11 │ 10.209248 │ │ 04 │ 0.814566 │ 04 │ 10.054076 │ │ 10 │ 0.821033 │ 10 │ 10.198155 │ │ 08 │ 0.783877 │ 08 │ 10.118335 │ │ 05 │ 0.840875 │ 05 │ 10.272067 │ │ 03 │ 0.853446 │ 03 │ 10.045834 │ │ … │ … │ … │ … │ └────────┴──────────┴─────────────┴───────────┘