Row Data in a Data Scientist's Daily Work
Row data is a defining feature of a data scientist's daily work. Excel spreadsheets, CSV files, database tables – they all follow the same principle: rows and columns. One observation per row, variables in columns. And this data is typically processed with pandas.
Row Data vs Multidimensional Data
I've been working with weather data lately: regular gridded fields produced by various weather models, as well as observational data. Processing it as row data is extremely difficult. A single temperature measurement is not just a number – it's a number at a specific location, at a specific altitude, at a specific point in time. It lives in four dimensions:
- Latitude
- Longitude
- Altitude (or hybrid pressure levels or height)
- Time
The same applies to both model data (forecasts, reanalyses) and observational data (satellites, radars, weather stations). The natural structure of the data is a multidimensional grid, not a row table.
Weather data can technically be forced into row format – by putting latitude, longitude, altitude, and time as variables on each row. But processing it as row data is extremely impractical.
The Difficulty of Multidimensional Data as Row Data
Consider a situation where multidimensional data is forcibly processed as row data:
Indexing is cumbersome. If you want the temperature at point (60°N, 25°E, 850 hPa, 2024-01-15T12:00) from row data, in pandas you would write:
df[(df.lat == 60) & (df.lon == 25) & (df.level == 850) & (df.time == '2024-01-15T12:00')]
Yuck!
Every lookup requires a full table scan. Row data has no dimension-aware structure, so when you look up a single point, pandas scans the entire table. If you want indexed lookups, you have to build and maintain a multi-index yourself.
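A minimal sketch of that workaround, assuming the column names from the example above and time stored as strings:

# Build a MultiIndex once so point lookups no longer scan the whole table.
# You are responsible for constructing, sorting, and maintaining it.
df = df.set_index(['time', 'level', 'lat', 'lon']).sort_index()
df.loc[('2024-01-15T12:00', 850, 60, 25)]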
Dimensional redundancy wastes memory! Coordinates are repeated on every row. A billion measurement points means a billion rows, each carrying its own copy of lat/lon/level/time values, even though the coordinates themselves form a small, regular grid.
Spatiotemporal operations are horrible. "Get neighboring points", "calculate gradient", and "interpolate" require complex self-joins.
Dimensional aggregations require groupby acrobatics. "Calculate the time average for each point" is a groupby operation in row data.
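For example, the time average for each grid point looks roughly like this in row data (a sketch, assuming the same column names as above plus a temperature column):

# Time average per grid point: group by everything that is *not* time,
# then average over the remaining rows.
point_means = df.groupby(['lat', 'lon', 'level'])['temperature'].mean()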
Xarray Saves the Data Scientist
Xarray solves the problems described above with named dimensions and coordinates:
Indexing is natural. The same lookup as above:
ds.temperature.sel(time='2024-01-15T12:00', level=850, lat=60, lon=25)
Compact. No boolean masks, no mysterious index numbers.
Lookups are fast. Xarray knows the data structure and finds the right cell directly based on coordinates.
Memory usage is efficient. Coordinates are stored once, and the data itself stays in a compact n-dimensional array.
Spatiotemporal operations are built-in. Interpolation, neighbor lookups, and gradients work naturally across dimensions.
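A few examples of what this looks like, as a sketch using the same dataset ds as above (the point and window sizes are illustrative):

# Interpolate to a point that is not on the grid
ds.temperature.interp(lat=60.2, lon=24.9)
# Differentiate along a dimension, e.g. a meridional temperature gradient
ds.temperature.differentiate('lat')
# Rolling window along time, e.g. a 24-step running mean
ds.temperature.rolling(time=24).mean()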
Dimensional aggregations are simple:
ds.temperature.mean(dim='time') # Time average
ds.temperature.mean(dim=['lat', 'lon']) # Spatial average
The code is readable and understandable a month or a year later.
But What About Data and ML Engineers?!
Let's shift perspective from data science to data engineering: the things that interest data engineers, MLOps people, and the like.
There are several working solutions for row data processing and scaling. For example, Databricks and Spark are a mature ecosystem: they separate data storage and compute. So you can store massive amounts of data without breaking the bank. And when you need to compute something, you can allocate a powerful cluster that calculates the requested values from terabytes of source data in minutes. You only pay for the time the cluster needs to complete the computation.
But what about multidimensional data? What tools and methods exist for processing and scaling multidimensional data?
Databricks and Spark are row data tools, and the problems data scientists hit with multidimensional data only multiply for the data engineers running the compute platforms. Despite the hype, Databricks and Spark are not directly suited to multidimensional data processing.
How Dask Serves Data Engineers
The answer to the scaling problems of multidimensional data processing is Dask. It's about parallelizing computation and leveraging clusters.
Think of a familiar NumPy array. With small data, there's one array that fits in a single machine's memory and is processed locally. Dask extends this to a cluster: large data is split into many NumPy arrays that together form the complete dataset. Cluster machines process the pieces in parallel. Dask combines the partial results into a whole. And throughout the computation, the user sees the data as a single array.
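The idea in a nutshell, as a minimal sketch with plain dask.array (the array and chunk sizes are made up for illustration):

import dask.array as da
# A 100000 x 100000 array of doubles (~80 GB) would not fit in memory as one
# NumPy array, but as 1000 x 1000 chunks it is just many small NumPy arrays.
x = da.random.random((100000, 100000), chunks=(1000, 1000))
# Looks and behaves like a single array; the mean is computed chunk by chunk
# and the partial results are combined.
x.mean().compute()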
This solves the data engineer's scaling problems:
- Indexing and lookups: Dask preserves named dimensions and coordinate-based lookups even with terabytes of data
- Memory problem: Dask doesn't load everything into memory but processes the data piece by piece
- Spatiotemporal operations: Work with the same syntax as with small data
- Dimensional aggregations: ds.temperature.mean(dim='time') works even with massive data, say 40 years' worth (a sketch follows right after this list)
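As a sketch of that last point, assuming the data sits in a pile of NetCDF files (the file pattern and chunk size are made up for illustration):

import xarray as xr
# Open 40 years of files lazily as one Dask-backed dataset
ds = xr.open_mfdataset('temperature_*.nc', chunks={'time': 744})
# Same xarray syntax as with small data; nothing is loaded yet
climatology = ds.temperature.mean(dim='time')
# Trigger the actual computation
climatology.compute()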
Chunking and Parallelization
Imagine your data: latitude × longitude × altitude × time. Dask divides this hypercube into small chunks – each chunk is one NumPy array, for example 100×100 points for one day. Each chunk is small enough to be processed on a single core.
Then Dask parallelizes the same operation across all chunks. If you ask for the time average of a variable, Dask calculates it for each NumPy array separately and combines the results.
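In xarray this shows up as the chunks argument. A sketch (the file name and chunk sizes are illustrative, and choosing them well is part of the manual work discussed later):

import xarray as xr
# Each chunk: one day of time steps and a 100 x 100 horizontal tile,
# i.e. one small NumPy array per chunk.
ds = xr.open_dataset('weather.nc', chunks={'time': 24, 'lat': 100, 'lon': 100})
print(ds.temperature.data.chunksize)  # shape of a single chunk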
Computation Graph
In traditional imperative programming, each line is executed immediately. Dask works differently: it builds a computation graph that only describes the dependencies between operations. Only when you request the result is the computation defined in the graph executed.
# Lazy evaluation - doesn't compute anything yet, just builds the graph
result = ds.temperature.mean(dim='time')
# Compute triggers the computation across the entire cluster
result.compute()
The computation graph enables smart optimizations: the system sees the entire pipeline at once and can plan execution optimally. The Databricks/Spark ecosystem has invested heavily in this kind of optimization; Dask's graph optimization still has a long way to go.
Tool Maturity
Xarray is a mature, stable, and production-ready tool. Documentation is comprehensive, the community is active, edge cases are known.
Dask is promising and functional, but less mature. It still lacks what Databricks brought with Delta Lake: a complete, managed, and optimized ecosystem. There's work to do in cluster management, chunking strategies, and memory management. Automatic cluster scaling, smarter chunking, transactionality – the community is building these piece by piece.
Lack of Optimizations
Databricks and Spark aggressively leverage automatic optimizations enabled by the computation graph: pruning unneeded columns, skipping data via metadata filtering, and pushing filters toward the start of the graph (predicate pushdown). These dramatically reduce the amount of data to be processed.
This was one of my pain points, because I was used to the idea that the compute platform is smart enough to optimize its operations heuristically. But in Dask, data filtering, caching, and similar operations must be explicitly written into the computation code. The responsibility is still on the user.
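In practice that means doing the "pushdown" yourself. A sketch of the kind of thing I mean, assuming ds is a Dask-backed dataset as before (the region and resampling frequency are illustrative):

# Filter first, so the cluster never touches data outside the region of interest
subset = ds.temperature.sel(lat=slice(55, 70), lon=slice(20, 32))
# Explicitly keep an intermediate result in cluster memory for reuse
subset = subset.persist()
# Downstream computations now start from the cached subset
daily = subset.resample(time='1D').mean()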
In practice, this means Dask requires more manual work and understanding. But once you get it working, the results are impressive: you can run the same code on a laptop and on hundreds of cores in the cloud, just by changing the configuration.
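The "just change the configuration" part mostly boils down to which Dask scheduler the client connects to. A sketch, with a placeholder scheduler address and ds assumed to be the Dask-backed dataset from earlier:

from dask.distributed import Client

# On a laptop: a local cluster using the machine's own cores
client = Client()

# In the cloud: point the same code at a remote scheduler instead
# client = Client('tcp://dask-scheduler:8786')

result = ds.temperature.mean(dim='time').compute()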
Conclusion
If you work with multidimensional data – weather observations, satellite imagery, climate models – Xarray and Dask offer the best currently available toolchain (let me know if I'm wrong!). Xarray makes code readable and maintainable. Dask enables scaling to terabytes.
The path isn't always smooth. With Dask, you still need to tinker more than with Databricks. But the community is growing, tools are evolving, and for many organizations, this is the only realistic way to handle the data volumes of modern climate science and remote sensing.
And most importantly: the data stays in the form it should be – multidimensional, indexable, understandable.