How to install `ogr2ogr` in a Databricks notebook (incl. Parquet support)
💡 This was tested on a Serverless notebook (environment version 2) as well as on a classic notebook with DBR 15.4 LTS.
`ogr2ogr` is a geospatial file conversion tool, part of GDAL. For example, you can use it to read a directory of GML (geo XML) files and write them out to GeoPackage (`.gpkg`), or even GeoParquet.
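As a rough illustration of the GML-to-GeoPackage conversion mentioned above, the sketch below builds one `ogr2ogr` command per input file. The helper name `build_gml_to_gpkg_commands` and any paths you pass in are hypothetical; `-f GPKG` and `-append` are standard `ogr2ogr` options, but whether collecting every file into a single GeoPackage suits your data is an assumption.

```python
from pathlib import Path


def build_gml_to_gpkg_commands(gml_dir, gpkg_path):
    """Return one ogr2ogr argument list per .gml file in gml_dir (hypothetical helper)."""
    commands = []
    for gml_file in sorted(Path(gml_dir).glob("*.gml")):
        # -f GPKG selects the GeoPackage driver; -append adds to the output
        # dataset instead of recreating it for every input file.
        commands.append(
            ["ogr2ogr", "-f", "GPKG", "-append", str(gpkg_path), str(gml_file)]
        )
    return commands
```

Once `ogr2ogr` is installed, each command could be executed with `subprocess.run(cmd, check=True)`.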
The `libgdal-arrow-parquet` extension package that we need can be installed via conda-forge. So let's first install Miniforge, the conda-forge distribution of conda [1] (and update `$PATH` [2]):
- [1] The `curl` download link comes from conda-forge and their installation instructions on GitHub. If the link triggers a malicious-content warning, navigate to it from https://github.com/conda-forge/miniforge instead.
- [2] We edit environment variables in a Python cell rather than a shell cell so that the change persists across cells in the notebook. Each `!` command runs in its own subshell, so an `export` there would be lost.
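import os

# Download the Miniforge installer for this OS and CPU architecture
!curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
# Install non-interactively (-b) into the prefix ~/miniforge (-p)
!bash Miniforge3-$(uname)-$(uname -m).sh -b -p ~/miniforge
# Use an absolute path: a literal "~" in PATH is not expanded by every program
os.environ["PATH"] = os.path.expanduser("~/miniforge/bin") + os.pathsep + os.environ["PATH"]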
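Because notebook cells are often re-run, it can help to make the `PATH` update idempotent. The following is only a sketch: `prepend_to_path` is a hypothetical helper, and `~/miniforge/bin` assumes the install prefix used above.

```python
import os
import shutil


def prepend_to_path(directory, env=os.environ):
    """Prepend directory to env["PATH"] once, with "~" expanded (hypothetical helper)."""
    directory = os.path.expanduser(directory)
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


prepend_to_path("~/miniforge/bin")
# shutil.which returns None if conda cannot be found on the updated PATH
print(shutil.which("conda"))
```

Running the cell twice leaves `PATH` unchanged the second time, so the variable does not grow with each re-run.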
Now we can add Arrow/Parquet support:
!conda install libgdal-arrow-parquet -y
os.environ["PROJ_LIB"] = f"{os.path.expanduser('~')}/miniforge/share/proj"
And that's it:
!ogr2ogr --formats | grep parquet
# Returns:
# Parquet -vector- (rw+v): (Geo)Parquet (*.parquet)
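The same check can be done from Python if you want the notebook to fail fast when the driver is missing. The sketch below parses the text printed by `ogr2ogr --formats`; `has_driver` is a hypothetical helper, and the parsing assumes each driver line starts with the short driver name followed by ` -`, as in the output shown above.

```python
def has_driver(formats_output, driver_name):
    """Return True if driver_name is listed as a short driver name in the given output."""
    for line in formats_output.splitlines():
        # Driver lines look like: "  Parquet -vector- (rw+v): (Geo)Parquet (*.parquet)"
        short_name = line.strip().split(" -", 1)[0]
        if short_name.lower() == driver_name.lower():
            return True
    return False


sample = "  Parquet -vector- (rw+v): (Geo)Parquet (*.parquet)"
print(has_driver(sample, "Parquet"))
```

In a notebook, the real output could be captured with `subprocess.run(["ogr2ogr", "--formats"], capture_output=True, text=True).stdout` and passed to the helper in place of `sample`.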