How to install `ogr2ogr` in a Databricks notebook (incl. Parquet support)

💡
This was tested on a Serverless notebook (Environment version 2), as well as on a classic notebook with DBR 15.4 LTS.

`ogr2ogr` is a geospatial file conversion tool that ships as part of GDAL. For example, you can use it to read a directory of GML (geo XML) files and write them out as GeoPackage (`.gpkg`), or even GeoParquet.
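As a taste of what that looks like, here is a minimal sketch that shells out to `ogr2ogr` from Python to convert a single GML file to GeoPackage. The file names are hypothetical placeholders, and the call is skipped when `ogr2ogr` is not on `PATH`:

```python
# Sketch: convert a GML file to GeoPackage with ogr2ogr.
# "input.gml" and "output.gpkg" are hypothetical placeholders.
import shutil
import subprocess

def gml_to_gpkg(src, dst):
    """Build the ogr2ogr command; run it only when ogr2ogr is on PATH."""
    # -f selects the output driver by its GDAL short name;
    # the destination comes before the source in ogr2ogr's argument order.
    cmd = ["ogr2ogr", "-f", "GPKG", dst, src]
    if shutil.which("ogr2ogr"):
        subprocess.run(cmd, check=True)
    return cmd

print(gml_to_gpkg("input.gml", "output.gpkg"))
```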

The `libgdal-arrow-parquet` extension package that we need is distributed via conda-forge. So let's first install Miniforge, a conda distribution preconfigured for conda-forge [1] (and update `$PATH` [2]):


  1. The curl download link comes from conda-forge's installation instructions on GitHub. If it triggers a malicious-content warning, navigate to it from https://github.com/conda-forge/miniforge instead.
  2. We edit the environment variable in a Python cell, not in a shell cell, so that the change persists across cells in the notebook.
import os
!curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
!bash Miniforge3-$(uname)-$(uname -m).sh -b -p ~/miniforge

# Note: a literal "~" in PATH is not tilde-expanded when the shell looks up
# executables, so expand it explicitly here.
os.environ["PATH"] = f"{os.path.expanduser('~')}/miniforge/bin:" + os.environ["PATH"]

Now we can add arrow/parquet support:

!conda install libgdal-arrow-parquet -y

os.environ["PROJ_LIB"] = f"{os.path.expanduser('~')}/miniforge/share/proj"
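The `PROJ_LIB` variable tells GDAL's PROJ dependency where to find its data files, notably `proj.db`, which coordinate transformations rely on; without it, reprojection tends to fail with "cannot find proj.db" style errors. A quick sanity check (the `~/miniforge` path matches the install location used above):

```python
# Sanity check (sketch): PROJ_LIB should point at a directory containing proj.db.
import os

proj_dir = os.path.join(os.path.expanduser("~"), "miniforge", "share", "proj")
print(os.path.isfile(os.path.join(proj_dir, "proj.db")))
```

This should print `True` once the conda install above has completed.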

And that's it:

!ogr2ogr --formats | grep parquet
# Returns:
#   Parquet -vector- (rw+v): (Geo)Parquet (*.parquet)
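With the Parquet driver in place, writing GeoParquet follows the same pattern as any other format. A hedged sketch (file names are hypothetical; the call is skipped when `ogr2ogr` is not on `PATH`):

```python
# Sketch: write GeoParquet now that the Parquet driver is available.
# "input.gml" and "output.parquet" are hypothetical placeholders.
import shutil
import subprocess

# "Parquet" is the GDAL driver short name reported by `ogr2ogr --formats`.
cmd = ["ogr2ogr", "-f", "Parquet", "output.parquet", "input.gml"]
if shutil.which("ogr2ogr"):
    subprocess.run(cmd, check=True)
print(" ".join(cmd))
```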