Streaming millions of TESSERA tiles over HTTP with Zarr v3
How we restructured TESSERA's geospatial embeddings from millions of individual numpy files into sharded Zarr v3 stores for efficient HTTP streaming, enabling everything from single-pixel mobile lookups to regional-scale analysis with just a couple of range requests.

Demand for fast access to geospatial data has grown alongside machine learning workloads that need spatial lookups in near real time. TESSERA, a project focused on geospatial embeddings, stored its output as millions of individual numpy files, which made both storage and retrieval awkward. To address this, the team consolidated those files into sharded Zarr v3 stores. Zarr v3 is a chunked array storage format, and its sharding support packs many small chunks into a few large objects that can be read piecemeal over HTTP. The shift made storage more manageable and retrieval faster, supporting applications that range from single-pixel lookups on mobile devices to regional-scale analysis.
Originally, TESSERA's geospatial embeddings were stored as individual numpy files, each representing a small tile of spatial data. While this approach was straightforward, it became increasingly difficult to manage as the dataset grew. Retrieving data for specific locations or regions required accessing numerous individual files, which was inefficient and slow. Moreover, scaling this system to handle larger datasets or higher traffic became a significant challenge.
The decision to migrate to Zarr v3 was driven by its ability to handle large array data efficiently. Zarr divides an array into regular chunks, and Zarr v3 adds a sharding codec that packs many small chunks into a single larger object, called a shard, together with an index that records each chunk's byte offset and length. A reader fetches the index, then requests only the byte ranges of the chunks it needs. This keeps the number of stored objects low while still transferring only a small amount of data per lookup.
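To make the chunk-and-shard addressing concrete, here is a minimal sketch of how a pixel coordinate maps to a shard, to an inner chunk within that shard, and to an offset within the chunk. The chunk and shard sizes and the `locate` helper are illustrative assumptions, not TESSERA's actual configuration.

```python
# Illustrative sizes only (assumptions, not TESSERA's real layout).
CHUNK = (32, 32)      # pixels per inner chunk (rows, cols)
SHARD = (1024, 1024)  # pixels per shard; a whole multiple of CHUNK


def locate(row, col):
    """Map a pixel to (shard coords, chunk coords inside the shard, pixel offset inside the chunk)."""
    shard = (row // SHARD[0], col // SHARD[1])
    in_shard = (row % SHARD[0], col % SHARD[1])
    chunk = (in_shard[0] // CHUNK[0], in_shard[1] // CHUNK[1])
    offset = (in_shard[0] % CHUNK[0], in_shard[1] % CHUNK[1])
    return shard, chunk, offset


print(locate(2500, 70))  # ((2, 0), (14, 2), (4, 6))
```

The shard coordinates identify which object to request, and the chunk coordinates identify which entry of that shard's index to consult.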
The restructuring process involved several key steps. First, the team needed to understand the layout and structure of the existing data. Each numpy file represented a specific geographic tile, and the goal was to map these files into a Zarr v3 store. This involved determining the appropriate chunk size and sharding strategy to ensure that data could be accessed efficiently.
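Choosing chunk and shard sizes is largely arithmetic: smaller chunks mean less data per lookup but a larger index, and larger shards mean fewer objects but bigger files. The sketch below works through that trade-off for one hypothetical configuration. The channel count, data type, and sizes are assumptions made for illustration, and the index size assumes 16 bytes per chunk plus a 4-byte checksum.

```python
def shard_budget(chunk, shard, channels, itemsize):
    """Return rough uncompressed sizes for one shard of a (rows, cols, channels) array."""
    chunks_per_shard = (shard[0] // chunk[0]) * (shard[1] // chunk[1])
    chunk_bytes = chunk[0] * chunk[1] * channels * itemsize
    return {
        "chunks_per_shard": chunks_per_shard,
        "chunk_bytes": chunk_bytes,
        # Two unsigned 64-bit integers (offset, length) per chunk, plus a checksum.
        "index_bytes": 16 * chunks_per_shard + 4,
        "shard_bytes": chunks_per_shard * chunk_bytes,
    }


# Hypothetical values: 128 one-byte channels, 32x32 chunks, 1024x1024 shards.
budget = shard_budget(chunk=(32, 32), shard=(1024, 1024), channels=128, itemsize=1)
for name, value in budget.items():
    print(f"{name}: {value}")
```

Under these assumptions a single-pixel lookup transfers about 16 KiB of index and at most 128 KiB of chunk data before compression, from a shard that holds 128 MiB.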
Once the data layout was established, the team began the migration. Each numpy file was converted into a Zarr chunk, and the chunks were written into shards held in an object store. The choice of storage was crucial: it needed to sustain high read and write throughput, scale horizontally, and serve byte-range reads over HTTP. Amazon S3 was selected for this purpose.
With the data migrated, the next step was to set up efficient HTTP streaming. Zarr v3 is a storage format rather than a service, so it provides no API of its own. A store is simply a set of objects that any HTTP server or object store honoring Range headers can serve. Clients work out which shard holds the data they want, then issue range requests for the shard index and for the relevant chunks. This lets TESSERA clients retrieve a single pixel or an entire region with few round trips.
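The two-step read can be sketched with nothing but byte arithmetic. The example below builds a small synthetic shard in memory, then derives the Range header values a client would send: a suffix range for the index, followed by an exact range for one chunk. It assumes the common sharding configuration in which the index sits at the end of the shard as little-endian unsigned 64-bit (offset, length) pairs followed by a 4-byte checksum. All helper names are hypothetical, and the checksum here is a placeholder rather than a real CRC32C.

```python
import struct

EMPTY = 2**64 - 1  # offset and length value that marks an absent chunk


def index_range_header(n_chunks, checksum_bytes=4):
    """Suffix range that fetches only the index at the end of a shard."""
    return f"bytes=-{16 * n_chunks + checksum_bytes}"


def parse_index(tail, n_chunks):
    """Decode the index into a list of (offset, nbytes) pairs, one per chunk."""
    values = struct.unpack(f"<{2 * n_chunks}Q", tail[:16 * n_chunks])
    return list(zip(values[0::2], values[1::2]))


def chunk_range_header(entry):
    """Range header for one chunk, or None if the chunk is absent."""
    offset, nbytes = entry
    if offset == EMPTY and nbytes == EMPTY:
        return None
    return f"bytes={offset}-{offset + nbytes - 1}"


def build_fake_shard(payloads):
    """Concatenate chunk payloads, then append the index and a dummy checksum."""
    entries, offset = [], 0
    for payload in payloads:
        if payload:
            entries += [offset, len(payload)]
            offset += len(payload)
        else:
            entries += [EMPTY, EMPTY]
    index = struct.pack(f"<{len(entries)}Q", *entries)
    return b"".join(payloads) + index + b"\x00" * 4  # placeholder checksum


payloads = [b"a" * 10, b"", b"c" * 7, b"d" * 3]
shard = build_fake_shard(payloads)

# Request 1: the last 68 bytes hold the index for four chunks.
tail = shard[-(16 * 4 + 4):]
entries = parse_index(tail, 4)

# Request 2: the byte range of the third chunk.
header = chunk_range_header(entries[2])
print(index_range_header(4), header)
```

In a real client each slice of `shard` would be an HTTP GET with the corresponding Range header, and a production reader would also verify the checksum and decompress the chunk.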
The benefits of this restructuring are significant. A mobile application looking up a single pixel needs only a couple of range requests, one for the shard index and one for the chunk containing the pixel, which keeps latency low enough for interactive use. For regional-scale analysis, a client reads the relevant chunks from a few shards rather than opening a separate file for every tile, which reduces the time and resources needed compared to the old system.
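As a rough illustration of how request counts scale, the sketch below counts the shards that a rectangular pixel window overlaps. The 1024x1024 shard size and the `shards_touched` helper are assumptions for the example. Each shard touched costs one index read plus one or more chunk reads.

```python
def shards_touched(row0, col0, height, width, shard=(1024, 1024)):
    """Count the shards overlapped by a window whose top-left pixel is (row0, col0)."""
    first_row, last_row = row0 // shard[0], (row0 + height - 1) // shard[0]
    first_col, last_col = col0 // shard[1], (col0 + width - 1) // shard[1]
    return (last_row - first_row + 1) * (last_col - first_col + 1)


single_pixel = shards_touched(2500, 70, 1, 1)
region = shards_touched(500, 500, 3000, 3000)
print(single_pixel, region)  # 1 16
```

A single pixel always falls in exactly one shard, while the 3000x3000 window in this example spans a 4x4 block of shards.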
Moreover, the use of Zarr v3 has made the TESSERA system more scalable. As the dataset grows, new shards can be added to the store without rewriting existing ones or disrupting ongoing reads. This ensures that the system can continue to meet the growing demands of users and applications.
In conclusion, TESSERA's migration from individual numpy files to sharded Zarr v3 stores has transformed how its geospatial embeddings are stored and served. By combining chunked, sharded storage with plain HTTP range requests, the project has improved both data access speed and scalability. The restructuring benefits existing applications and opens up new possibilities for geospatial analysis, from mobile devices to large-scale regional studies.