Feature Request: Optional OpenTelemetry Integration for Observability and Performance Tuning #2958


Open
jhamman opened this issue Apr 6, 2025 · 3 comments

Comments


jhamman commented Apr 6, 2025

Summary

This feature request proposes adding an optional integration of OpenTelemetry to the Zarr-Python codebase. OpenTelemetry is a widely adopted, vendor-neutral standard for generating, collecting, and exporting telemetry data (traces, metrics, and logs) used by many modern observability platforms. The goal is to improve observability, facilitate performance tuning, and enable integration with full-stack monitoring systems — all while preserving a lightweight default behavior.


📌 Motivation

Zarr is widely used in performance-critical and production environments such as:

  • Large-scale data processing
  • Scientific computing
  • Cloud-native workflows
  • Backend data source for web APIs (e.g. Xpublish)

Currently, Zarr provides limited visibility into internal operations like:

  • Chunk reads/writes
  • Compression and decompression
  • Storage backend access
  • Performance bottlenecks

By integrating OpenTelemetry (OTel), Zarr users and developers would benefit from:

  • Enhanced observability into internal workflows
  • Easier performance tuning via traces and profiling tools (e.g., Jaeger, Zipkin, Grafana Tempo)
  • Seamless integration into modern observability pipelines

☝ Each of these is particularly important following Zarr's recent adoption of asyncio, where the execution of concurrent operations is increasingly hard to track explicitly.


🧩 Proposal

  • Introduce optional support for OpenTelemetry instrumentation in key parts of the Zarr codebase:
    • Data access (inside stores)
    • Compression/decompression
    • Encoding/decoding
  • Provide a clean interface or hooks to register and emit OpenTelemetry traces.
  • Default behavior should be:
    • No-op (i.e. tracing is disabled unless explicitly enabled)
    • Optionally fall back to a plain Python logger for lightweight introspection
  • Ensure zero overhead when OpenTelemetry is not enabled (a rough sketch of such an opt-in hook follows below)
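
To make the intended default concrete, here is a minimal sketch of what such an opt-in hook could look like. Everything in it is hypothetical: the `span()` helper and the `ZARR_TRACING` environment variable are placeholders for illustration, not an existing or proposed API.

```python
# Hypothetical zarr-internal helper; span() and ZARR_TRACING are placeholders.
import logging
import os
from contextlib import contextmanager

logger = logging.getLogger("zarr.tracing")

try:
    from opentelemetry import trace as otel_trace
except ImportError:  # opentelemetry-api not installed -> logging/no-op path only
    otel_trace = None


@contextmanager
def span(name: str, **attributes):
    """Record an OTel span when tracing is enabled; otherwise emit a debug log
    (effectively a no-op unless the logger is configured)."""
    if otel_trace is not None and os.environ.get("ZARR_TRACING") == "otel":
        tracer = otel_trace.get_tracer("zarr")
        with tracer.start_as_current_span(name) as current:
            for key, value in attributes.items():
                current.set_attribute(key, value)
            yield
    else:
        logger.debug("span=%s attributes=%r", name, attributes)
        yield
```

Instrumented code would then wrap hot paths in `with span("codec.decode", codec="blosc"): ...`; when tracing is off, the cost is roughly an environment lookup plus a normally suppressed debug-log call.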

✅ Benefits

  • Opt-in observability with minimal performance impact
  • Compatibility with OpenTelemetry-native tools and frameworks
  • Aids in debugging and performance analysis
  • Foundation for future enhancements (e.g., metrics, structured logging)

🛠️ Implementation Notes

  • Introduce a tracing.py module (or similar) to encapsulate OpenTelemetry usage
  • Use context managers or Tracer.start_as_current_span() decorators in key areas
  • Gate instrumentation on configuration or environment variable(s) (see the user-side sketch below)
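
On the "conditional instrumentation" point, it is worth noting how the OpenTelemetry API behaves on the user side: until an SDK TracerProvider is installed, trace.get_tracer() hands out no-op tracers, which is what keeps un-instrumented runs cheap. A user who wants traces would do something like the following (ConsoleSpanExporter is just the simplest exporter for illustration):

```python
# User-side sketch: turn the no-op API into real traces by installing an SDK
# provider and exporter. Without this, instrumented code records nothing.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# From here on, spans created via trace.get_tracer(...).start_as_current_span(...)
# are exported; in production the exporter would point at Jaeger, Tempo, etc.
```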

🙋‍♂️ Call for Feedback

We would love to hear from maintainers and the community:

  • Does OpenTelemetry seem like a good fit for Zarr?
  • Are there specific areas of the codebase that would benefit most from tracing?
  • Would a structured logger fallback be helpful in low-overhead environments?
TomAugspurger (Contributor) commented:

Does OpenTelemetry seem like a good fit for Zarr?

Probably, especially the distributed traces.

I'm guessing the most common usage of OTel would be in combination with a Cloud object store. Those libraries should already have OTel available, which will provide traces for the actual HTTP calls made to Blob Storage. The bit of context we'd be able to layer on top is that a particular Array.getitem(...) at some path triggered these N reads from Blob Storage. You can kinda infer that based on the paths in blob storage, but that's not foolproof.

I don't think that Store classes themselves will have much to add to the trace context.

I'm not sure what metrics we'd want to export, if any. I would need to think about that a bit more.

One note on the implementation: there's a split between whether libraries implement OTel natively or whether it's done through an "instrumentation library" (https://opentelemetry.io/docs/languages/python/libraries/). Instrumentation libraries, like opentelemetry-instrumentation-httpx, are separate packages. I think they are typically developed outside the main library and monkeypatch it to add tracing. If we're doing it here, then we wouldn't need to monkeypatch (which has always scared me, though I never actually had any issues with the libraries I used for tracing httpx, FastAPI, and asyncpg).
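
To make the distinction concrete, an external instrumentation package in that style would look roughly like this. The package name and the choice to patch Array.__getitem__ are purely illustrative; no such package exists:

```python
# Rough illustration of the "instrumentation library" pattern: a separate
# package wraps (monkeypatches) the target library's entry points in spans.
# Hypothetical -- there is no opentelemetry-instrumentation-zarr package.
from opentelemetry import trace

import zarr

_tracer = trace.get_tracer("opentelemetry-instrumentation-zarr")
_original_getitem = zarr.Array.__getitem__


def _traced_getitem(self, selection):
    with _tracer.start_as_current_span("zarr.Array.__getitem__"):
        return _original_getitem(self, selection)


zarr.Array.__getitem__ = _traced_getitem  # the monkeypatch
```

Native integration avoids this indirection because the spans live inside Zarr itself.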

Would a structured logger fallback be helpful in low-overhead environments?

I'm a big fan of structlog. In this case, though, I feel like logs (structured or otherwise) will be much more valuable if they can be correlated with logs from the storage provider.

I have a decent amount of experience with OpenTelemetry and would be happy to provide reviews.


jhamman commented Apr 9, 2025

Thanks @TomAugspurger for the thoughtful reply! I'm still thinking on it, but a monkey-patching approach could be a good fit here.

I don't think that Store classes themselves will have much to add to the trace context.

I did some basic experiments with a store wrapper last week and found even the store-level traces to be quite interesting -- illuminating the async behavior in some stores and the blocking behavior in others. In a perfect world, we could also instrument calling libraries (like Xarray). That would allow us to really understand behavior and usage all the way through the stack.
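
A sketch of what such a store wrapper could look like (this is a guess at the shape, not the actual experiment; TracedStore is not part of Zarr, and the *args/**kwargs delegation deliberately glosses over the exact Store.get/set signatures):

```python
# Hypothetical store wrapper that times each get/set as a span; everything else
# is delegated untouched to the wrapped store.
from opentelemetry import trace

_tracer = trace.get_tracer("zarr.store")


class TracedStore:
    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Delegate everything we don't explicitly wrap.
        return getattr(self._inner, name)

    async def get(self, key, *args, **kwargs):
        with _tracer.start_as_current_span("store.get") as span:
            span.set_attribute("zarr.key", key)
            return await self._inner.get(key, *args, **kwargs)

    async def set(self, key, *args, **kwargs):
        with _tracer.start_as_current_span("store.set") as span:
            span.set_attribute("zarr.key", key)
            return await self._inner.set(key, *args, **kwargs)
```

Because the spans open and close around the awaited call, a trace viewer makes it visible whether a store's gets actually overlap or run back to back.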

I'm a big fan of structlog. In this case, though, I feel like logs (structured or otherwise) will be much more valuable if they can be correlated with logs from the storage provider.

Me too! I've used ASGI correlation-ID approaches with structlog elsewhere. The tricky parts for us would be a) defining the context to correlate under and b) passing that on to the storage provider (which will only be possible in certain store types).
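
For reference on (a), structlog's contextvars support makes binding a correlation id fairly painless; the request_id name here is illustrative, and part (b), shipping that id to the storage provider, is the genuinely store-specific piece:

```python
# Sketch: bind a correlation id once; every subsequent log event carries it.
import uuid

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,  # pull bound context into each event
        structlog.processors.JSONRenderer(),
    ]
)

structlog.contextvars.bind_contextvars(request_id=str(uuid.uuid4()))

log = structlog.get_logger()
log.info("chunk_read", key="group/array/c/0/0")
# -> {"key": "group/array/c/0/0", "request_id": "...", "event": "chunk_read"}
```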

TomAugspurger (Contributor) commented:

illuminating the async behavior in some stores and the blocking behavior in others

Ah, that's a good point. I was assuming that any store-level .get() would overlap exactly 1:1 with whatever trace you get from the storage library, but we don't live in a perfect world :) And maybe obstore / object_store.rs doesn't yet have tracing, so this would be the best way to get the timings for a particular get / put.

defining the context to correlate under and b) passing that on to the storage provider

Yeah, this is where integrating with OpenTelemetry is probably the right choice: then we don't have to worry about different ways of setting / propagating the trace ID.
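
That propagation piece comes largely for free with OTel: the W3C trace-context propagator can inject the current trace id into whatever header dict an HTTP-backed store is about to send, assuming we can reach that dict at all (which is the store-specific part):

```python
# Sketch: attach the current trace context to outgoing request headers so the
# storage provider's server-side spans (if any) join the same trace.
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # adds e.g. a "traceparent" header for the active span context
# `headers` would then be passed to the HTTP client used by the store.
```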


Working backwards from the questions we'd like to answer to the spots that ought to be traced:

  1. How many reads did this Zarr operation (Group.open, Array.getitem, etc.) take?
     • Spans at each Node / Store boundary
  2. How long did each Store operation take?
     • Spans around each Store.get / Store.put
  3. Where is time spent in this Array.getitem / Array.setitem call?
     • Spans around each stage of the CodecPipeline (the ByteGetter is covered above; also decompression / other transforms) -- roughly as sketched below.
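
Here is a rough sketch of how those spans could nest; the span names and the two-chunk loop are illustrative only, and in real code the spans would live inside Array indexing, the store, and the codec pipeline:

```python
# Sketch: question 1 falls out as the number of store.get child spans,
# question 2 as their durations, question 3 as the codec/decode spans.
from opentelemetry import trace

tracer = trace.get_tracer("zarr")


def getitem(selection):  # stand-in for Array.__getitem__
    with tracer.start_as_current_span("zarr.array.getitem"):
        for key in ("c/0/0", "c/0/1"):  # chunk keys resolved from `selection`
            with tracer.start_as_current_span("zarr.store.get") as span:
                span.set_attribute("zarr.key", key)
                ...  # Store.get -- the object-store client's HTTP span nests here
            with tracer.start_as_current_span("zarr.codec.decode"):
                ...  # decompression / other transforms
```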
