Feature Request: Optional OpenTelemetry Integration for Observability and Performance Tuning #2958


Open
jhamman opened this issue Apr 6, 2025 · 3 comments

Comments


jhamman commented Apr 6, 2025

Summary

This feature request proposes adding an optional integration of OpenTelemetry to the Zarr-Python codebase. OpenTelemetry is a widely adopted, vendor-neutral standard for generating, collecting, and exporting telemetry data (traces, metrics, and logs) used by many modern observability platforms. The goal is to improve observability, facilitate performance tuning, and enable integration with full-stack monitoring systems — all while preserving a lightweight default behavior.


📌 Motivation

Zarr is widely used in performance-critical and production environments such as:

  • Large-scale data processing
  • Scientific computing
  • Cloud-native workflows
  • Backend data source for web APIs (e.g. Xpublish)

Currently, Zarr provides limited visibility into internal operations like:

  • Chunk reads/writes
  • Compression and decompression
  • Storage backend access
  • Performance bottlenecks

By integrating OpenTelemetry (OTel), Zarr users and developers would benefit from:

  • Enhanced observability into internal workflows
  • Easier performance tuning via traces and profiling tools (e.g., Jaeger, Zipkin, Grafana Tempo)
  • Seamless integration into modern observability pipelines

☝ Each of these is particularly important following Zarr's recent adoption of asyncio, where the execution of concurrent operations is increasingly hard to track explicitly.


🧩 Proposal

  • Introduce optional support for OpenTelemetry instrumentation in key parts of the Zarr codebase:
    • Data access (inside stores)
    • Compression/decompression
    • Encoding/decoding
  • Provide a clean interface or hooks to register and emit OpenTelemetry traces.
  • Default behavior should be:
    • No-op (i.e. tracing is disabled unless explicitly enabled)
    • Optionally fall back to a plain Python logger for lightweight introspection
  • Ensure zero overhead when OpenTelemetry is not enabled (a rough sketch of such an opt-in hook follows below)
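
To make the intended default concrete, here is a minimal sketch of what such an opt-in hook could look like. Everything in it is hypothetical: the `span()` helper and the `ZARR_TRACING` environment variable are placeholders for illustration, not an existing or proposed API.

```python
# Hypothetical zarr-internal helper; span() and ZARR_TRACING are placeholders.
import logging
import os
from contextlib import contextmanager

logger = logging.getLogger("zarr.tracing")

try:
    from opentelemetry import trace as otel_trace
except ImportError:  # opentelemetry-api not installed -> logging/no-op path only
    otel_trace = None


@contextmanager
def span(name: str, **attributes):
    """Record an OTel span when tracing is enabled; otherwise emit a debug log
    (effectively a no-op unless the logger is configured)."""
    if otel_trace is not None and os.environ.get("ZARR_TRACING") == "otel":
        tracer = otel_trace.get_tracer("zarr")
        with tracer.start_as_current_span(name) as current:
            for key, value in attributes.items():
                current.set_attribute(key, value)
            yield
    else:
        logger.debug("span=%s attributes=%r", name, attributes)
        yield
```

Instrumented code would then wrap hot paths in `with span("codec.decode", codec="blosc"): ...`; when tracing is off, the cost is roughly an environment lookup plus a normally suppressed debug-log call.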

✅ Benefits

  • Opt-in observability with minimal performance impact
  • Compatibility with OpenTelemetry-native tools and frameworks
  • Aids in debugging and performance analysis
  • Foundation for future enhancements (e.g., metrics, structured logging)

🛠️ Implementation Notes

  • Introduce a tracing.py module (or similar) to encapsulate OpenTelemetry usage
  • Use context managers or Tracer.start_as_current_span() decorators in key areas
  • Gate instrumentation on configuration or environment variable(s) (see the user-side sketch below)
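
On the "conditional instrumentation" point, it is worth noting how the OpenTelemetry API behaves on the user side: until an SDK TracerProvider is installed, trace.get_tracer() hands out no-op tracers, which is what keeps un-instrumented runs cheap. A user who wants traces would do something like the following (ConsoleSpanExporter is just the simplest exporter for illustration):

```python
# User-side sketch: turn the no-op API into real traces by installing an SDK
# provider and exporter. Without this, instrumented code records nothing.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# From here on, spans created via trace.get_tracer(...).start_as_current_span(...)
# are exported; in production the exporter would point at Jaeger, Tempo, etc.
```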

🙋‍♂️ Call for Feedback

We would love to hear from maintainers and the community:

  • Does OpenTelemetry seem like a good fit for Zarr?
  • Are there specific areas of the codebase that would benefit most from tracing?
  • Would a structured logger fallback be helpful in low-overhead environments?
TomAugspurger (Contributor) commented:

Does OpenTelemetry seem like a good fit for Zarr?

Probably, especially the distributed traces.

I'm guessing the most common usage of OTel would be in combination with a Cloud object store. Those libraries should already have OTel available, which will provide traces for the actual HTTP calls made to Blob Storage. The bit of context we'd be able to layer on top is that a particular Array.getitem(...) at some path triggered these N reads from Blob Storage. You can kinda infer that based on the paths in blob storage, but that's not foolproof.

I don't think that Store classes themselves will have much to add to the trace context.

I'm not sure what metrics we'd want to export, if any. I would need to think about that a bit more.

One note on the implementation: there's a split between whether libraries implement OTel natively or whether it's done through an "instrumentation library" (https://opentelemetry.io/docs/languages/python/libraries/). Instrumentation libraries, like opentelemetry-instrumentation-httpx, are separate packages. I think they are typically developed outside the main library and monkeypatch it to add tracing. If we're doing it here, then we wouldn't need to monkeypatch (which has always scared me, though I never actually had any issues with the libraries I used for tracing httpx, FastAPI, and asyncpg).
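
To make the distinction concrete, an external instrumentation package in that style would look roughly like this. The package name and the choice to patch Array.__getitem__ are purely illustrative; no such package exists:

```python
# Rough illustration of the "instrumentation library" pattern: a separate
# package wraps (monkeypatches) the target library's entry points in spans.
# Hypothetical -- there is no opentelemetry-instrumentation-zarr package.
from opentelemetry import trace

import zarr

_tracer = trace.get_tracer("opentelemetry-instrumentation-zarr")
_original_getitem = zarr.Array.__getitem__


def _traced_getitem(self, selection):
    with _tracer.start_as_current_span("zarr.Array.__getitem__"):
        return _original_getitem(self, selection)


zarr.Array.__getitem__ = _traced_getitem  # the monkeypatch
```

Native integration avoids this indirection because the spans live inside Zarr itself.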

Would a structured logger fallback be helpful in low-overhead environments?

I'm a big fan of structlog. In this case, though, I feel like logs (structured or otherwise) will be much more valuable if they can be correlated with logs from the storage provider.

I have a decent amount of experience with OpenTelemetry and would be happy to provide reviews.


jhamman commented Apr 9, 2025

Thanks @TomAugspurger for the thoughtful reply! I'm still thinking on it, but a monkey-patching approach could be a good fit here.

I don't think that Store classes themselves will have much to add to the trace context.

I did some basic experiments with a store wrapper last week and found even the store-level traces to be quite interesting -- illuminating the async behavior in some stores and the blocking behavior in others. In a perfect world, we could also instrument calling libraries (like Xarray). That would allow us to really understand behavior and usage all the way through the stack.
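
A sketch of what such a store wrapper could look like (this is a guess at the shape, not the actual experiment; TracedStore is not part of Zarr, and the *args/**kwargs delegation deliberately glosses over the exact Store.get/set signatures):

```python
# Hypothetical store wrapper that times each get/set as a span; everything else
# is delegated untouched to the wrapped store.
from opentelemetry import trace

_tracer = trace.get_tracer("zarr.store")


class TracedStore:
    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Delegate everything we don't explicitly wrap.
        return getattr(self._inner, name)

    async def get(self, key, *args, **kwargs):
        with _tracer.start_as_current_span("store.get") as span:
            span.set_attribute("zarr.key", key)
            return await self._inner.get(key, *args, **kwargs)

    async def set(self, key, *args, **kwargs):
        with _tracer.start_as_current_span("store.set") as span:
            span.set_attribute("zarr.key", key)
            return await self._inner.set(key, *args, **kwargs)
```

Because the spans open and close around the awaited call, a trace viewer makes it visible whether a store's gets actually overlap or run back to back.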

I'm a big fan of structlog. In this case, though, I feel like logs (structured or otherwise) will be much more valuable if they can be correlated with logs from the storage provider.

Me too! I've used ASGI correlation-ID approaches with structlog elsewhere. The tricky parts for us would be a) defining the context to correlate under and b) passing that on to the storage provider (which will only be possible in certain store types).
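
For reference on (a), structlog's contextvars support makes binding a correlation id fairly painless; the request_id name here is illustrative, and part (b), shipping that id to the storage provider, is the genuinely store-specific piece:

```python
# Sketch: bind a correlation id once; every subsequent log event carries it.
import uuid

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,  # pull bound context into each event
        structlog.processors.JSONRenderer(),
    ]
)

structlog.contextvars.bind_contextvars(request_id=str(uuid.uuid4()))

log = structlog.get_logger()
log.info("chunk_read", key="group/array/c/0/0")
# -> {"key": "group/array/c/0/0", "request_id": "...", "event": "chunk_read"}
```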

TomAugspurger (Contributor) commented:

illuminating the async behavior in some stores and the blocking behavior in others

Ah, that's a good point. I was assuming that any store-level .get() would overlap exactly 1:1 with whatever trace you get from the storage library, but we don't live in a perfect world :) And maybe obstore / object_store.rs doesn't yet have tracing, so this would be the best way to get the timings for a particular get / put.

defining the context to correlate under and b) passing that on to the storage provider

Yeah, this is where integrating with OpenTelemetry is probably the right choice: then we don't have to worry about different ways of setting / propagating the trace ID.
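
That propagation piece comes largely for free with OTel: the W3C trace-context propagator can inject the current trace id into whatever header dict an HTTP-backed store is about to send, assuming we can reach that dict at all (which is the store-specific part):

```python
# Sketch: attach the current trace context to outgoing request headers so the
# storage provider's server-side spans (if any) join the same trace.
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # adds e.g. a "traceparent" header for the active span context
# `headers` would then be passed to the HTTP client used by the store.
```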


Working backwards from the questions we'd like to answer to the spots that ought to be traced:

  1. How many reads did this Zarr operation (Group.open, Array.getitem, etc.) take?
     • Spans at each Node / Store boundary
  2. How long did each Store operation take?
     • Spans around each Store.get / Store.put
  3. Where is time spent in this Array.getitem / Array.setitem call?
     • Spans around each stage of the CodecPipeline (the ByteGetter is covered above; also decompression / other transforms) -- roughly as sketched below.
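
Here is a rough sketch of how those spans could nest; the span names and the two-chunk loop are illustrative only, and in real code the spans would live inside Array indexing, the store, and the codec pipeline:

```python
# Sketch: question 1 falls out as the number of store.get child spans,
# question 2 as their durations, question 3 as the codec/decode spans.
from opentelemetry import trace

tracer = trace.get_tracer("zarr")


def getitem(selection):  # stand-in for Array.__getitem__
    with tracer.start_as_current_span("zarr.array.getitem"):
        for key in ("c/0/0", "c/0/1"):  # chunk keys resolved from `selection`
            with tracer.start_as_current_span("zarr.store.get") as span:
                span.set_attribute("zarr.key", key)
                ...  # Store.get -- the object-store client's HTTP span nests here
            with tracer.start_as_current_span("zarr.codec.decode"):
                ...  # decompression / other transforms
```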
