Lazy loading: Making it easier to access vast datasets of weather & satellite data

AI has developed at a stunning pace over the last few years in fields such as natural language processing and image recognition. Given all this progress, it's surprising that we haven't seen similarly rapid advancement in applying AI to electricity forecasting, despite its critical role in people's lives and in the transition to net zero carbon emissions.

What's blocking progress? Yes, excellent folks are working on energy forecasting. But why aren't thousands of researchers experimenting with energy forecasting, just as there are countless researchers training AI to excel in computer vision and natural language?

We have observed that one of the main bottlenecks is a lack of access to data. Today, it can take years to gather, clean, and prepare the data required for energy forecasting at the standard required by industry. This is mostly weather data, which is often free but very difficult to access and process. This barrier to entry is dramatically slowing progress in energy forecasting.

Better energy forecasts should help electricity grids schedule "dispatchable" generators and absorb more renewable generation. This, in turn, reduces CO2 emissions and energy costs. (Which is what we're passionate about at Open Climate Fix!)

Natural language processing, computer vision, protein folding, and many other domains experienced "big bang" moments when ML models were trained on vast amounts of data. Indeed, a big breakthrough in applying AI for weather forecasting occurred when a large dataset, ERA-5, was made easier to access by Google Research.

We believe that making it easy to train ML models on vast quantities of energy and weather data will catalyse a similar "big bang" in energy forecasting (and other domains which rely on weather forecasting).

To give an analogy: Imagine you're helping to build a new village from scratch, in the middle of nowhere. You have very high ambitions for this village. But the project is taking far longer than anticipated because there are no roads to your new village so all building materials have to be brought to site by donkey. This is very slow. So you decide to invest time in building good quality roads. This will dramatically speed up your ability to build your part of the village. And will help all the other builders, too. It's a no-brainer, right?

The "vision"

Wouldn't it be great if anyone with an Internet connection could lazily access petabytes of weather and satellite data with just a single line of Python like this (inspired by our friends at dynamical.org):

dataset = xarray.open_dataset("https://data-provider.org/dataset-xyz")

This line of Python would execute in a fraction of a second because it only loads the dataset's metadata (which is like the table of contents for a book: it tells you where to find the data you want). When you ask for, say, a satellite image for a specific geographical location and time, only that data is loaded from disk. (This approach of only loading what you need, when you need it, is called "lazy" data loading).
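
To make the laziness concrete, here's a minimal sketch of what that workflow could look like. The URL is the one from the example above; the variable name, coordinates and crop sizes are made up for illustration:

import xarray as xr

# Loads only the metadata (the "table of contents"); returns almost instantly.
dataset = xr.open_dataset("https://data-provider.org/dataset-xyz")

# Pick one satellite channel at one timestamp and a small spatial crop.
# Nothing is read from remote storage yet; xarray just records what we asked for.
crop = dataset["infrared_brightness_temperature"].sel(time="2024-06-01T12:00")
crop = crop.isel(y=slice(0, 256), x=slice(0, 256))

# Only now are the bytes for this small crop actually fetched.
crop = crop.compute()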

It'd be game-changing if all the world's numerical weather predictions (NWPs) and public satellite data were this easy to work with. No need to spend months waiting for data archives to be retrieved from tape and converted into a form that performs well for your use-case. No low-bandwidth home-brew APIs to deal with.

The time required to start working with weather and satellite data would drop from months to minutes!

To be specific, this "dream world" would have the following features:

  • Each dataset should have multiple years of historical data and should be updated in near-real-time.
  • Data should be available in a data structure that supports lazy loading, using standard software like the popular xarray.
  • Opening free datasets should be as easy as xr.open_dataset(URL). No sign-up procedures. No logins.
  • Users may have to pay for some datasets. For example, some live European NWPs cost money.
  • Variable names and physical units should be consistent across datasets. Why? Because adding a new NWP to your project should only require a one-line change to your code, not months of work! Comparing NWPs from two providers should be as simple as nwp1 - nwp2 (see the sketch after this list).
  • Data must be capable of being read at gigabytes per second to each virtual machine. Why such high bandwidth? To train large machine learning models directly from the datasets. Or to perform analytics. Even at 10 gigabytes per second, it still takes over a day to read 1 petabyte.
  • At a stretch, it would be amazing to have tools capable of on-the-fly processing at gigabytes per second, per VM. This processing might include geospatial reprojection, normalisation, downsampling, etc. On-the-fly processing is important so users can avoid having to create their own, processed versions of huge datasets.
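
As a sketch of the "nwp1 - nwp2" point: assuming two providers that expose the same variable names, units and grid (the URLs and variable names below are made up), comparing their forecasts really would take just a couple of lines:

import xarray as xr

# Hypothetical URLs; assumes both providers use the same variable names,
# physical units and grid, so the datasets align without manual wrangling.
nwp1 = xr.open_dataset("https://provider-a.org/global-forecast")
nwp2 = xr.open_dataset("https://provider-b.org/global-forecast")

# Compare 2 m temperature forecasts from the two providers.
difference = nwp1["temperature_2m"] - nwp2["temperature_2m"]

# Summarise the disagreement over space.
mean_abs_difference = abs(difference).mean(dim=["latitude", "longitude"]).compute()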

We're by no means the first to suggest this "vision". The Pangeo Forge project (Stern et al., 2022) aims "to make it easy to extract data from traditional data repositories and deposit in cloud object storage in analysis-ready, cloud-optimized (ARCO) format." And Daniel Rothenberg talked about Kerchunking petabytes of legacy files in his January 2024 talk at the 104th Annual Meeting of the American Meteorological Society (AMS). And Carbon Plan published a great blog post summarising their experiences of using Kerchunk on a year of the CMIP6 climate dataset.

In the rest of this post, we'll discuss the remaining challenges, and the work that's still to be done.

Today's challenges of NWP & satellite data

The reality today is a long way from the "vision" outlined above. Today, it can take a small team of skilled data engineers years to collect and process all the satellite imagery and NWPs they require. This dramatically slows progress, and severely limits the number of people who can innovate in this area.

Broadly speaking, there are (at least) five challenges when working with NWP and satellite data:

  1. The datasets are huge! An ensemble NWP from a single provider can occupy over a petabyte per year. As of mid-2024, NOAA's Open Data Dissemination (NODD) program has shared 59 petabytes of data. This is far more data than most organisations can store. When people talk about "big data" they often mean "more data than can fit on a laptop". Weather data is on a whole different level! You'd need 100,000 laptops to store NODD's datasets! And NODD's data archive is growing rapidly!
  2. Data is hard to access. Some NWP providers don't provide any historical data. Others provide historical data but you have to wait a long time for data to be retrieved from tape. Or perhaps they provide historical data over a home-brew API with harsh rate-limits.
  3. Data is hard to interpret. NWPs and satellite data often use ancient data formats that are alien to most software engineers, and require the use of software packages designed for a human analyst to view a handful of images per day, not for pumping tens of thousands of images per second through an ML model.
  4. Slow random access. When training a large ML model on satellite and NWP data, each training example might be a small crop of data with a random start time and random geographical location (see the sketch after this list). Yet the data formats and software tools used for NWPs and satellite data are not optimised for random access, so it's often necessary to convert legacy file formats to a modern format optimised for random access. This conversion can take months and doubles the storage requirements.
  5. Most organisations aren't incentivised to build "community data plumbing". What we're proposing isn't sexy. Tech influencers aren't going to be raving about this on YouTube. No one is going to get rich. So, venture capitalists probably aren't keen to invest in this "community data plumbing". And academics probably can't justify spending much effort on this, either, because this isn't cutting-edge research. For-profit companies looking for a quick PR win also wouldn't be interested because if this stuff works then it will be invisible. It's the sort of thing that, when you see it working, it will appear obvious. As if things were always this way.
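
To illustrate the random-access pattern from challenge 4 (the dataset URL, dimension names and crop size below are all hypothetical):

import numpy as np
import xarray as xr

# Lazily open a hypothetical NWP archive with dimensions (init_time, y, x).
dataset = xr.open_dataset("https://data-provider.org/nwp-xyz")

rng = np.random.default_rng(seed=42)

def random_crop(ds: xr.Dataset, crop_size: int = 64) -> xr.Dataset:
    """Select a random spatial crop at a random forecast initialisation time."""
    t = int(rng.integers(ds.sizes["init_time"]))
    y = int(rng.integers(ds.sizes["y"] - crop_size))
    x = int(rng.integers(ds.sizes["x"] - crop_size))
    return ds.isel(init_time=t, y=slice(y, y + crop_size), x=slice(x, x + crop_size))

# Each training example touches a tiny, essentially random slice of the archive,
# which is exactly the access pattern that legacy formats and tools handle poorly.
example = random_crop(dataset).compute()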

(These challenges are also described in the excellent "Taxonomy of Data Gaps" written by Climate Change AI & Google DeepMind in 2024.)

How the community is already working towards the vision

Data in legacy formats to cloud object storage

NOAA's Open Data Dissemination (NODD) program has shared 59 petabytes of data so far on cloud object storage! (See David Steube's breakdown of the size of each NODD NWP.) And some European data providers are moving in a similar direction. For example, the European Centre for Medium-Range Weather Forecasts (ECMWF) already publishes lots of data openly on Google Cloud Storage, and plans to "achieve a fully open data status by the end of 2026" (see ECMWF's Open Data Roadmap). And the UK Met Office recently announced a two-year rolling archive of two of their NWPs on the Amazon Sustainability Data Initiative (ASDI). The vast majority of this data is in "legacy" file formats like GRIB2 and NetCDF.

Data converted to analysis-ready, cloud-optimised (ARCO) formats

There is also great work being done to convert legacy data into ARCO formats like Zarr. For example:

These datasets can already be opened lazily and read randomly using xarray and Zarr. The main outstanding challenge is performance. In its off-the-shelf configuration, Zarr-Python version 2 doesn't yet deliver gigabytes per second of throughput when reading Zarrs stored in cloud object storage.
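
For instance, opening one of these Zarr stores lazily looks something like this (the bucket path and variable name below are made up; real ARCO datasets follow the same pattern):

import xarray as xr

# Read the consolidated metadata in one request and keep everything lazy
# (dask-backed); no chunk data is fetched yet.
ds = xr.open_zarr("gs://some-bucket/some-arco-dataset.zarr", consolidated=True, chunks={})

# Only the chunks that overlap this selection are downloaded from object storage.
subset = ds["temperature_2m"].sel(time=slice("2020-01-01", "2020-01-31")).compute()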

Software engineering

The community is doing some awesome software engineering. To name just a few projects: Zarr-Python version 3.0 is rapidly nearing its first release and should offer substantial speed-ups compared to version 2. Kerchunk and VirtualiZarr make it easy to lazily access legacy files. Gribberish is a new GRIB reader written in Rust. And, of course, the mighty xarray continues to develop as a powerhouse for working with NWP and satellite data.

What still needs to be done (OCF can help!) 

1) Write software tools to lazily open legacy datasets

By "legacy file format" we mean file formats like GRIB and NetCDF.

There's close to 100 petabytes of legacy data available on cloud object storage. It would be too expensive and energy-intensive to convert all this to ARCO (analysis-ready, cloud-optimised) formats. Instead, we could first try our best to build software that can read from legacy file formats as quickly as possible.

Kerchunk and VirtualiZarr already provide many of these features, although users have to create their own kerchunk manifests.
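
As a rough sketch of that workflow today, using kerchunk (the bucket paths, dimension names and anonymous-access options below are made up, and the right arguments vary from dataset to dataset):

import fsspec
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.grib2 import scan_grib

grib_urls = [
    "s3://some-bucket/forecast_2024060100.grib2",
    "s3://some-bucket/forecast_2024060106.grib2",
]

# 1. Scan each GRIB file to build reference sets (the "manifest"). scan_grib
#    returns one reference dict per GRIB message group, so flatten the results.
refs = [
    ref
    for url in grib_urls
    for ref in scan_grib(url, storage_options={"anon": True})
]

# 2. Combine the references into a single virtual dataset along the time axis.
combined = MultiZarrToZarr(
    refs, concat_dims=["time"], identical_dims=["latitude", "longitude"]
).translate()

# 3. Open the virtual dataset lazily. Byte ranges are fetched from the original
#    GRIB files only when data is actually requested.
fs = fsspec.filesystem(
    "reference", fo=combined, remote_protocol="s3", remote_options={"anon": True}
)
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False})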

But it may be possible to go much faster than kerchunk (for example, by only reading parts of each GRIB message). To explore how fast we can go, we have started work on an experimental tool called hypergrib (written in Rust) to lazily load petabytes of GRIB files into xarray. This could be a game changer (if it works!). We'll borrow ideas from recent computer science research, such as the awesome "AnyBlob" paper (Durner, Leis & Neumann, 2023).

The ultimate aim is that, from the user's perspective, opening a huge dataset of legacy files should feel exactly like opening an ARCO format. For example, users should only have to write xr.open_dataset(URL) to lazily open petabytes of GRIB files and read data at gigabytes per second of throughput to each VM, even when they request random crops of the dataset. This high-performance networking also has to be computationally efficient, leaving enough CPU cycles for the computation the user actually wants to perform: if you're transferring huge volumes of data, presumably it's because you want to do something with that data! And users shouldn't have to pay for huge VMs to get this performance.

Reading directly from legacy file formats will be sufficient for a lot of use-cases but not all. So it will still be necessary to convert some datasets to analysis-ready, cloud-optimised (ARCO) formats like Zarr.

2) Maintain public manifests of legacy files

Lazily opening legacy files requires a "manifest" which acts as a table of contents for the legacy files. When the user runs xr.open_dataset(URL), xarray would just load this "manifest". Open Climate Fix could run data processing pipelines to keep manifests up-to-date for a range of existing public datasets. Perhaps these manifests would be accessed through dynamical.org.
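
If such manifests were published, using one could look roughly like this (the manifest URL is made up; the "reference://" pattern is how kerchunk-style manifests are opened today):

import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            # The manifest itself: a small JSON "table of contents".
            "fo": "https://manifests.example.org/some-nwp/latest.json",
            # Where the underlying GRIB/NetCDF bytes actually live.
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)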

3) Regularly convert legacy file formats to analysis-ready, cloud-optimised (ARCO) formats

As mentioned above, we already maintain a public Zarr archive of the ICON NWP and we have published a one-off archive of EUMETSAT data in Zarr. But there's lots more we could do (in collaboration with our friends at dynamical.org). For example, we can't currently afford to regularly update these archives. If we could afford it, we could keep these datasets up to date by regularly running our data conversion pipelines.

4) Write software to process data on-the-fly

The "dream" is to move to a world where data scientists rarely (if ever) have to create their own local copies of NWP and satellite datasets. Instead users would stream data directly from cloud object storage. But data scientists will still want to transform the data. The problem is that existing software tools can be quite inefficient and slow, and so would struggle to transform data fast enough to keep up with modern network interface cards (gigabytes per second). So there's space for more computationally efficient and fast software tools. For example, wouldn't it be great to be able to stream an NWP and reproject it on-the-fly, all at a few gigabytes per second?! Perhaps a good first step would be to build software tools to coerce existing datasets to use the same physical units and variable names.

5) Publish more data on cloud object storage, in legacy file formats

The focus could be on datasets for which there are no public long-duration archives. At OCF, we already maintain a public Zarr archive of the ICON NWP. But we could convert data more frequently, and there are lots more datasets we could archive.

6) Visualise multiple datasets in a web browser

Once the foundations are in place to lazily read from huge NWP and satellite datasets, it would be great if tools existed to let users visually explore these datasets without writing a single line of code. For example, it might be interesting to see how well a bunch of different NWPs agree with data from weather sensors on the ground. Perhaps a tool like rerun could be repurposed for this?

Conclusion

We'd love to live in a world where it's super-easy to work with huge NWP and satellite datasets. We are already dedicating significant time and effort towards this, and would love to hear from people who support this vision and who can help! Please email [email protected].

Appendix

File formats used by NWPs