/Introducing Coinbase Watchdog: A new approach to UI and code-driven automation for Datadog

Introducing Coinbase Watchdog: A new approach to UI and code-driven automation for Datadog

At Coinbase, we use Datadog to collect system and application metrics, implement SLIs and SLOs, create dashboards, and more. As the number of dashboards and monitors grew, we began to see the need to codify them. We were worried that we didn’t have tools to detect accidental or malicious modification. Imagine a production incident that was not noticed by engineers because of an accidentally muted monitor.

Codification solves this because modifications are explicit (through code) and are stored in version control (benefitting from notification and code review systems).

One way of solving this problem would be to store the Datadog components (e.g. Datadog timeboards, screenboards, and monitors) in a version control system and apply changes done through code. The downside of this approach is that managing screenboards or timeboards though the code is not convenient or friendly.

Another way of solving this problem would be to use a Datadog UI-driven approach. This would be much more convenient and friendlier to use. This would require detecting changes, creating a request to submit it back to a version control system and the ability to revert/rollback changes that were not approved.

To achieve the best of both potential solutions (code and UI-driven approaches) we first started by reviewing existing projects we could find:

Each project had their pros and cons but this came down to two issues: not being able to codify dashboards in addition to monitors (even if we forked and contributed back) or being too complex for our customers.

Introducing Coinbase Watchdog

Coinbase Watchdog is a GitHub app and a Golang service that uses the Datadog API to watch for changes in Datadog, achieving the best of both a code and UI-driven approach. When it sees a change, it automatically creates a Pull Request (PR) with the changes in a dedicated Datadog GitHub repository. We have control and consensus mechanisms (you can read more about Heimdall here) that provide us the guarantees that a sufficient number of people have reviewed the change before it can land on master. If a PR was not approved and closed by a customer, Watchdog will call Datadog APIs to restore the components from the master branch in source control.

This approach gives us a UI-driven codification bot. All changes made in the Datadog UI will be automatically picked up by the bot and corresponding Pull Requests will be created. Watchdog can also detect if a user modified the code (Datadog component JSON) and apply the change to Datadog.

The two workflows are pictured here:

Coinbase Watchdog workflows

Configuring Watchdog

The Watchdog service has two types of configuration:

  1. System configuration: this configuration includes all required parameters such as Datadog API/APP keys, GitHub application private key, GitHub project URL, GitHub app installation ID, etc. This configuration is passed to the service with environment variables.
  2. User configuration: this is used by customers. Simple YAML files with a list of Datadog component IDs and metadata such as team, project name, etc. You may have many YAML files in the configuration directory and a configuration folder can have any arbitrary number of subfolders.

This is an example of a User configuration YAML file:

The components that you see above are:

  • meta: team: An arbitrary team name identifier.
  • meta: slack: A Slack room name to notify of changes.
  • dashboards: A list of dashboard IDs to watch.
  • monitors: A list of monitor IDs to watch.
  • screenboards: A list of screenboard IDs to watch.

How does Coinbase Watchdog detect changes?

The Watchdog service is completely stateless. There are two ways in which Watchdog detects changes: Full sync and Incremental.

Full sync

When the Watchdog service starts for the first time, it will query all components by ID and check each against the components stored in GitHub. If some component files are different, new PRs will be created per user configuration file. Depending on users’ configuration files, this step could potentially make many Datadog API calls. However, this will only happen once on service startup to verify that the current state in Datadog is consistent with the source in GitHub.

Watching for incremental changes

This is a fairly straight-forward task, and consists of several steps:

  1. The Datadog APIs conveniently expose a field “modified” which contains a component modification date. Watchdog polls the APIs https://api.datadoghq.com/api/v1/monitor, https://api.datadoghq.com/api/v1/screen, https://api.datadoghq.com/api/v1/dash . every N minutes (in our case every 10 minutes) and checks if the current time minus modified field value is less then 10 minutes.
  2. Watchdog uses a git implementation written in Golang under the hood. If step 1 was successful, Watchdog will pull the latest changes from the master branch, create a new local branch, make an HTTP GET request to retrieve the Datadog component from the Datadog API, save the component to a file, and run the equivalent of the git status command to see if the file in the master branch is different from the API response.
  3. Watchdog will then query GitHub APIs to find if duplicate PRs have been opened and if so it will ignore the current one.

An example API response from the Datadog API showing the “modified” key:

Notifications

When Watchdog creates a new PR it is important to notify the relevant team so that they can review the PR. We use a GitHub’s CODEOWNERS feature for that. In the GitHub repository root we have a CODEOWNERS file with the following lines:

config/reliability/ @engineering/reliability

data/infra/reliability/ @engineering/reliability

If a PR affects files in config/reliability/* or data/infra/reliability/* the Reliability Engineering team will be notified by email.

Furthermore, a team can opt-in to Slack notifications by setting a “slack” field under “meta” in user config file with a slack channel (see above picture).

Future features

In the future, we would like to add more features and are already working on a way to automatically revert changes based on PR expiration. We’re also looking forward to see if others who use Datadog find this service useful and are able and willing to contribute.

Datadog introduced new dashboard APIs which is currently not supported by Watchdog. We are planning to add this feature soon and meanwhile PRs are highly appreciated 😉

If you’re interested in contributing to this project, check it out on GitHub here!

If you’re interested in helping us build a modern, scalable platform for the future of crypto markets, we’re hiring in San Francisco!

This website may contain links to third-party websites or other content for information purposes only (“Third-Party Sites”). The Third-Party Sites are not under the control of Coinbase, Inc., and its affiliates (“Coinbase”), and Coinbase is not responsible for the content of any Third-Party Site, including without limitation any link contained in a Third-Party Site, or any changes or updates to a Third-Party Site. Coinbase is not responsible for webcasting or any other form of transmission received from any Third-Party Site. Coinbase is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement, approval or recommendation by Coinbase of the site or any association with its operators.

Unless otherwise noted, all images provided herein are by Coinbase.