Streaming millions of TESSERA tiles over HTTP with Zarr v3
How we restructured TESSERA's geospatial embeddings from millions of individual numpy files into sharded Zarr v3 stores for efficient HTTP streaming, enabling everything from single-pixel mobile lookups to regional-scale analysis with just a couple of range requests.

In recent years, the demand for efficient geospatial data management and analysis has surged, driven by advancements in machine learning and the growing need for real-time spatial insights. One project that exemplifies this trend is TESSERA, a system designed to handle vast geospatial embeddings. Initially, TESSERA stored its data as millions of individual numpy files, a structure that became unwieldy as the dataset grew. The team faced challenges in scaling this setup, particularly when it came to serving data over HTTP for applications ranging from mobile devices to regional-scale analyses.
To address these issues, the TESSERA team embarked on a significant restructuring effort. They opted to transition from millions of individual numpy files to a more efficient solution: sharded Zarr v3 stores. Zarr, an open-source project, provides a scalable and efficient way to store large datasets, making it an ideal choice for handling TESSERA's geospatial embeddings. By leveraging Zarr's capabilities, the team aimed to enable efficient HTTP streaming, allowing users to access data ranging from single-pixel lookups to large-scale analyses with minimal latency.
The restructuring process involved several key steps. First, the team had to understand the unique requirements of TESSERA's data. The geospatial embeddings were highly structured, with each tile representing a specific location on the Earth's surface. These tiles were originally stored as numpy arrays, which, while efficient for computation, posed challenges when it came to serving data over a network.
The decision to use Zarr v3 was driven by its support for sharding. Zarr splits an array into chunks, the smallest units that can be read independently. The v3 sharding codec then packs many chunks into a single storage object, called a shard, together with an index that records each chunk's byte offset and length. This keeps the number of stored objects manageable while still letting a client fetch a single chunk with an HTTP range request, a critical feature for geospatial data. Because only the chunks covering the requested area are transferred, bandwidth usage drops and response times improve.
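To make the chunk and shard relationship concrete, here is a minimal sketch of how a pixel coordinate maps to a shard object, a chunk inside that shard, and a position inside that chunk. The chunk shape of 256 x 256 pixels and the 8 x 8 chunks per shard are illustrative assumptions, not TESSERA's actual configuration, and `locate_pixel` is a hypothetical helper name.

```python
from typing import NamedTuple


class ChunkAddress(NamedTuple):
    shard: tuple[int, int]        # (row, col) of the shard object in the shard grid
    inner_chunk: tuple[int, int]  # (row, col) of the chunk inside that shard
    offset: tuple[int, int]       # (row, col) of the pixel inside that chunk


def locate_pixel(row: int, col: int,
                 chunk_shape=(256, 256),      # assumed, for illustration only
                 chunks_per_shard=(8, 8)) -> ChunkAddress:
    """Map a global pixel coordinate to its shard, inner chunk, and offset."""
    shard, inner, offset = [], [], []
    for pixel, chunk_len, per_shard in zip((row, col), chunk_shape, chunks_per_shard):
        # Which chunk along this axis, and where inside it?
        chunk_index, within = divmod(pixel, chunk_len)
        # Which shard holds that chunk, and which slot inside the shard?
        shard_index, inner_index = divmod(chunk_index, per_shard)
        shard.append(shard_index)
        inner.append(inner_index)
        offset.append(within)
    return ChunkAddress(tuple(shard), tuple(inner), tuple(offset))


if __name__ == "__main__":
    print(locate_pixel(5000, 300))
```

A single-pixel lookup therefore touches one shard object and, within it, one chunk; everything else in the shard stays on the server.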
Implementing Zarr v3 required careful planning. The team had to determine the optimal shard size, which involves a trade-off: larger shards mean fewer objects to store and list, but each shard carries a larger index, while smaller chunks inside a shard reduce how much unneeded data a small request pulls in. After extensive testing, they settled on a shard size that allowed for efficient streaming while preserving the spatial layout of the tiles.
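The trade-off can be explored with some simple arithmetic. The sketch below is a hypothetical helper (`shard_layout`) that reports how many shard objects a layout produces, how large an uncompressed chunk and shard would be, and how many bytes the shard index occupies. It assumes the sharding codec's default index encoding, where each inner chunk contributes two 64-bit integers (offset and length) and a 4-byte checksum follows. The array shape and bytes per pixel in the example call are made-up values, not TESSERA's.

```python
import math


def shard_layout(array_shape, chunk_shape, chunks_per_shard, bytes_per_pixel):
    """Summarise object counts and sizes for a candidate sharding layout."""
    # A shard spans chunk_shape * chunks_per_shard pixels along each axis.
    shard_shape = tuple(c * n for c, n in zip(chunk_shape, chunks_per_shard))
    # Partial shards at the array edge still count as one object each.
    shard_objects = math.prod(
        math.ceil(a / s) for a, s in zip(array_shape, shard_shape)
    )
    chunk_bytes = math.prod(chunk_shape) * bytes_per_pixel
    inner_chunks = math.prod(chunks_per_shard)
    return {
        "shard_shape": shard_shape,
        "shard_objects": shard_objects,
        "chunk_bytes": chunk_bytes,                     # uncompressed
        "max_shard_bytes": chunk_bytes * inner_chunks,  # uncompressed upper bound
        "index_bytes": 16 * inner_chunks + 4,           # assumes default index codecs
    }


if __name__ == "__main__":
    # Illustrative numbers only.
    for key, value in shard_layout((10_000, 10_000), (100, 100), (10, 10), 128).items():
        print(f"{key}: {value}")
```

Running this for a few candidate layouts shows the tension directly: growing `chunks_per_shard` cuts the object count but enlarges the index a client must read before it can locate any chunk.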
Once the sharding strategy was finalized, the team began the process of migrating the existing numpy files into Zarr stores. This involved converting each numpy file into a Zarr dataset, ensuring that the spatial relationships between tiles were preserved. The migration process was meticulous, with the team verifying that the data integrity was maintained throughout the transition.
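As a rough illustration of the two concerns in this step, placing each tile correctly and checking that nothing changed in transit, consider the sketch below. It assumes tiles are identified by a (row, column) position in a regular grid and that every tile has the same pixel shape. The source does not describe TESSERA's actual tile keying or verification method, so `tile_window`, `digest`, and `verify_tile` are hypothetical names and the 1000 x 1000 tile shape is a placeholder.

```python
import hashlib


def tile_window(tile_row: int, tile_col: int, tile_shape=(1000, 1000)):
    """Return the (row, col) slices a tile occupies in the global array.

    Assumes a regular grid of equally sized tiles (an illustrative assumption).
    """
    height, width = tile_shape
    return (
        slice(tile_row * height, (tile_row + 1) * height),
        slice(tile_col * width, (tile_col + 1) * width),
    )


def digest(raw: bytes) -> str:
    """Fingerprint the raw bytes of a tile."""
    return hashlib.sha256(raw).hexdigest()


def verify_tile(source_bytes: bytes, readback_bytes: bytes) -> bool:
    """Compare the original tile with what was read back from the new store."""
    return digest(source_bytes) == digest(readback_bytes)


if __name__ == "__main__":
    print(tile_window(2, 3))
    print(verify_tile(b"tile-data", b"tile-data"))
```

In a real migration the slices would index into the destination array, and the read-back bytes would come from the decoded region of the new store rather than from a literal.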
With the data now stored in Zarr v3 stores, the next step was to enable HTTP streaming. The team implemented a custom server that could efficiently serve range requests. This server was designed to handle the specific needs of TESSERA's geospatial data, optimizing performance for both single-pixel and regional-scale queries. Because each shard carries an index of the chunks it contains, the bytes for a requested area can be located and fetched with HTTP range requests instead of whole-file downloads, minimizing latency and improving user experience.
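The "couple of range requests" pattern can be sketched as follows: one request for the shard index, then one for the chunk it points to. This sketch assumes the sharding codec's defaults, namely an index stored at the end of the shard, encoded as little-endian 64-bit (offset, length) pairs in row-major chunk order, followed by a 4-byte checksum. It only builds the `Range` header values and does not perform any network I/O; the function names are hypothetical, and this is not a description of TESSERA's server.

```python
import struct

EMPTY = 2**64 - 1   # offset and length are both this value when a chunk is absent
CHECKSUM_BYTES = 4  # assumes the default checksum appended to the index


def index_range_header(inner_chunks: int) -> str:
    """Range header for request 1: the last N bytes of the shard (its index)."""
    size = 16 * inner_chunks + CHECKSUM_BYTES
    return f"bytes=-{size}"


def chunk_range_header(index_bytes: bytes, chunk_number: int):
    """Range header for request 2: the bytes of one inner chunk.

    chunk_number is the row-major position of the chunk inside the shard.
    Returns None when the index marks the chunk as empty.
    """
    offset, nbytes = struct.unpack_from("<QQ", index_bytes, 16 * chunk_number)
    if offset == EMPTY and nbytes == EMPTY:
        return None
    # HTTP byte ranges are inclusive at both ends.
    return f"bytes={offset}-{offset + nbytes - 1}"


if __name__ == "__main__":
    # A fake two-chunk index: chunk 0 at bytes 0..99, chunk 1 empty.
    fake_index = struct.pack("<QQQQ", 0, 100, EMPTY, EMPTY) + b"\x00" * CHECKSUM_BYTES
    print(index_range_header(2))
    print(chunk_range_header(fake_index, 0))
    print(chunk_range_header(fake_index, 1))
```

A client that caches the index for a shard it has already visited can skip the first request on later lookups in the same area.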
The benefits of this restructuring are significant. For mobile applications, the ability to retrieve single-pixel data with minimal latency has transformed how users interact with geospatial information. Applications can now offer real-time insights without the need for extensive local storage. For regional-scale analyses, the efficient range queries enable users to process large areas of data with ease, opening up new possibilities for spatial analysis and modeling.
Moreover, the shift to Zarr v3 has improved the scalability of the TESSERA system. As the dataset grows, new data lands in new shards within the Zarr stores, so the cost of reading any one location stays roughly constant and the system remains performant and responsive. This scalability is crucial as the volume of geospatial data continues to grow.
In conclusion, the TESSERA team's decision to restructure their geospatial embeddings using Zarr v3 represents a significant advancement in efficient data management and serving. By transitioning from millions of individual numpy files to sharded Zarr stores, they have enabled a range of use cases, from mobile lookups to regional analyses, all with the speed and efficiency required by today's data-driven applications. This restructuring not only addresses current challenges but also sets a foundation for future growth and innovation in geospatial data handling.









