"How do I structure my observability team?" is one of the most common questions folks leading software teams ask me. My advice: Don't create a centralized "observability team" that's responsible for all the observability within an organization.
Observability shouldn't exist as a silo. It touches many parts of an organization, from development to production, and should be treated as a team sport.
As we know, our systems can only be considered observable if they emit telemetry. No data means that we can't understand what is happening in our systems. Fortunately, the OpenTelemetry® (OTel) ecosystem from the Cloud Native Computing Foundation (CNCF) has become the de facto standard for instrumenting, generating, collecting, and exporting telemetry data.
What does this mean for observability adoption in an organization? Let's dig in.
Observability is everyone's responsibility
Reliability can't happen without observability. Observability must be looked at holistically. It is not the sole responsibility of any one team or individual. Everyone has an important part to play, and to a certain extent, the parts weave into each other.
Instrumenting code
There are two types of OpenTelemetry instrumentation:
Code-based instrumentation should be done by application developers, and not by an "observability team." Developers know their applications best. Asking someone else to instrument your application is like asking someone else to write your code comments. Please never do that.
Zero-code instrumentation usually involves a shim or bytecode instrumentation wrapper around your code. If you're a developer writing code in a language that supports OpenTelemetry auto-instrumentation, you should understand how to implement both zero-code and code-based instrumentation. In doing so, you can use the instrumentation to troubleshoot your own code.
In some environments, zero-code instrumentation may be managed by the OTel Operator. If this is the case, the responsibility often falls to SRE or platform engineering teams. Even in those cases, developers should understand, at least at a high level, how zero-code instrumentation is configured with the OTel Operator.
Managing observability infrastructure
Observability infrastructure still needs to be managed, whether you're using a SaaS vendor (e.g. Dynatrace) or an open source stack. If you're using OpenTelemetry, chances are you're managing at least one OTel Collector, and perhaps many. If you're running your applications on Kubernetes, you'll likely deploy and manage Collectors within the cluster as well. In most organizations, this responsibility falls under platform engineering or SRE teams, and these teams are essential to robust, reliable software delivery in large, complex environments.
That said, developers should still understand how the OpenTelemetry Collector is configured. It's true that you don't need to go through a Collector to send OTel data to an observability backend for non-production use. However, the Collector offers capabilities that sending directly from the application doesn't (e.g. batching data, masking sensitive data, and automatic retries), and I still highly recommend using it, even in development.
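Those capabilities map onto standard Collector components. A hypothetical configuration sketch (the endpoint and masked attribute key are placeholders, not a recommended setup):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch: {}                 # batch telemetry before export
  attributes:
    actions:
      - key: user.email     # mask potentially sensitive data
        action: update
        value: "****"

exporters:
  otlphttp:
    endpoint: https://backend.example.com/otlp   # placeholder endpoint
    retry_on_failure:
      enabled: true          # automatic retries on export failure

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlphttp]
```

Even a sketch like this shows why developers benefit from Collector literacy: the processors in the pipeline change what their telemetry looks like by the time it reaches the backend.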
Making CI/CD pipelines observable
DevOps engineers can't escape observability either, because guess what? We can make CI/CD pipelines observable too. While CI/CD pipelines may not be a production environment that external users interact with, they most certainly are a production environment that internal users interact with (i.e. software engineers, platform engineers, and SREs).
CI/CD pipelines are defined by code, and like it or not, that code can still fail. Making our application code observable helps us make sense of things when they fail in production. So, it stands to reason that having pipeline observability can help us understand what's going on when CI/CD pipelines fail.
There's been some great buzz around the observability of CI/CD pipelines, especially now that there's an official OTel CI/CD Special Interest Group (SIG). This will give our favorite CI/CD tools a shared language for the observability of CI/CD pipelines, creating a foundation for them to support OpenTelemetry tools in this context.
We're not there yet, which means that right now we must stitch a few tools together to achieve CI/CD observability. Fortunately, things are moving nicely in this space, and if you haven't considered CI/CD pipeline observability in your organization before, now's the time to start thinking about it. To learn more about what's happening with OTel CI/CD observability, check out the #otel-cicd channel on CNCF Slack.
Troubleshooting
The beauty of observability is that once you instrument your code, you put the ability to troubleshoot in the hands of many. Consider the ripple effect when developers instrument their code:
- Developers: Instrumentation allows developers to debug their code as they're writing it.
- QA testers: Instrumentation allows testers to troubleshoot failed tests and file more detailed bug reports. If QA can't track down an issue, that points to missing instrumentation that developers need to add to their code. This turns observability into a quality gate.
- SREs: Instrumentation allows SREs to troubleshoot production issues, gain insight into system performance, and ensure overall system reliability.
Ensuring adherence to observability practices
Remember how I advised against creating an "observability team" responsible for all observability within an organization? I still stand by that. That said, I do believe that organizations should have an observability team responsible for enterprise-wide observability oversight and advocacy: a team that defines and disseminates observability standards and practices within that organization. This team would need to stay up to date on the latest observability practices, vendor offerings, and the OpenTelemetry ecosystem, not just as an observer but as a project contributor, while also encouraging developers, platform engineers, and SREs to contribute.
This "observability practices team" can't, however, exist on an island. First, it needs to be aligned with leadership to ensure that everyone is on the same page when it comes to observability. It also needs support from individual practitioners, so it must work with developers, SREs, platform engineers, QA, and DevOps engineers to ensure that the practices and standards it comes up with make sense.
If observability is to be a team sport, it needs coordination and guidance. There should be guardrails in place to ensure standard tooling, standard practices, and enforcement of those practices. Practices and standards include things like standard Collector configurations and standard attributes emitted to your chosen observability backend(s).
Standardizing tooling is important because I've seen far too many "tool jungles" in organizations, where each team or department has its own tooling and practices, and it ends up being a recipe for disaster: too much redundancy and overlap.
In addition, the observability practices team should not be responsible for instrumenting developers' code, nor should it be managing infrastructure. It's there to work with these other groups and to make sure that things are done right.
Final thoughts
Observability weaves its way into various aspects of an organization. It's not just a developer concern. It's not just an SRE concern. It's not just a QA concern. And it's certainly not the concern of a single "observability team." Framing it that way downplays its importance, takes away our collective responsibility for observability, and dilutes the promise of observability. The only way to make this work is by ensuring that the teams participating in this team sport we call observability don't operate in silos.
The post Observability is a team sport appeared first on Dynatrace news.