Scandinavian temperature anomaly on a map (July 2018).

Geospatial data processing platform

Post-processing of massive geospatial data, productionised — Dask, Python packages, OpenShift, and OGC services for operational use.

Context and problem

A Finnish public authority produces and processes massive volumes of geospatial data. Post-processing requires parallel computation, geographic context, and reliable execution in a production environment. The need was for a solution that is both an operational tool for data scientists and a maintainable software platform — not scattered scripts and not separate infrastructure for every workload.

What was done

I built a processing platform based on the Dask ecosystem, together with a Python package around it and the infrastructure surrounding execution. The data flow relies on Xarray and Pandas; parallelisation carries the resource-intensive workloads. The platform covers a local development environment and two clusters — versioning, release code, and environment-specific configuration were designed for multi-environment use from the start.

I built and maintained the execution environments for the workloads on OpenShift: templates and deployment configuration for local, development, and production targets. CI/CD and containerisation enable repeatable releases of the code. Part of the work was developing targeted data analyses.

I packaged an AI-assisted toolkit for data scientists into a managed compute environment — ready-made contexts and tools on top of the operational platform. I investigated and worked with legacy C++ systems while building the data pipelines that feed or extend existing platforms. I built OGC-compliant interface services for spatial data retrieval.

Key technologies: Python, Dask, Xarray, Pandas, NumPy, OpenShift, Kubernetes, Podman, CI/CD, GIS, OGC, REST, software packaging.

Outcome

The platform is in operational use: data scientists post-process data at scale, and the same setup supports development, testing, and production through a consistent pipeline. The solution combines modern data processing, cluster execution, working alongside legacy systems, and standardised geodata access — as a production-ready whole in a demanding public-sector environment. It democratised the compute platform: data scientists can implement their own compute needs without ticketing features and waiting months.

← Back to assignments

Image: Scandinavia temperature anomaly 2018 — NASA, public domain.

social