This article describes how EverQuote uses AWS-managed OpenDistro ElasticSearch clusters as persistent storage for the Jaeger backend (Collector + Query).

Disclaimer: Jaeger also supports Cassandra as a storage backend as well as a variety of deployment strategies. This particular setup should therefore not be construed as an endorsement or recommendation. Use your own judgment before adopting anything covered by this article.

In order to keep this article to a reasonable length, I will skip over the various sizing-, performance-, and operational concerns that go into provisioning an ElasticSearch cluster.

Instead I will focus on the configuration of the ElasticSearch cluster itself once it has been provisioned.

Specifically, I will show how to configure auth/z for Jaeger and how to use Index State Management (which is OpenDistro’s flavor of Index Lifecycle Management) to roll over and migrate indices.

Note that the configuration of both these features in OpenDistro differs from how it is done in an Elastic.co ElasticSearch cluster.

Assumptions

Before I proceed, I have to point out a handful of assumptions about how the ElasticSearch cluster was provisioned. For brevity, I will express them as Terraform code.

resource "aws_elasticsearch_domain" "default" {

  domain_name           = "jaeger"
  elasticsearch_version = "7.10"

  cluster_config {
    dedicated_master_enabled = true
    warm_enabled             = true
    ...
  }

  advanced_options = {
    "rest.action.multi.allow_explicit_index" = "true"
    ...
  }

  advanced_security_options {
    enabled                        = true
    internal_user_database_enabled = true
    master_user_options {
      master_user_name     = "admin"
      master_user_password = "Ch8nge-me!"
    }
  }

  ...
}

As promised, I will not go into what the various options do. The above is an incomplete config and is only meant to show that we are adding Amazon’s UltraWarm nodes to our cluster and that we are using the internal user database for auth/z. These two particular choices are relevant for the configurations you will see below.

I am not entirely sure if the advanced option rest.action.multi.allow_explicit_index is required but since Jaeger performs bulk operations it may be. Be aware that this setting has security implications so read up on it if Jaeger is going to share an ElasticSearch cluster with other services.

⚠ Note: If you use Terraform to provision an ElasticSearch cluster in AWS, be advised that Terraform stores sensitive information (including the master user password) in plain text in its state file. You should therefore change the password outside of Terraform after provisioning the cluster, with something like the following:

$ CREDENTIALS=$(
    jq --null-input --arg user 'admin' --arg password 'Super5ecre+' '
    {
      MasterUserOptions: {
        MasterUserName: $user,
        MasterUserPassword: $password
      }
    }'
  )

$ history -c    # don’t leave sensitive information in your shell history

$ aws es update-elasticsearch-domain-config \
         --domain-name jaeger \
         --advanced-security-options "$CREDENTIALS"

⚠ Note: I am using the jq utility here in order to guarantee that I end up with valid JSON (in case the password contains e.g. double quotes or backslashes or anything else that must be handled specially in JSON). This is entirely optional of course.
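If jq is not available, the same guarantee can be had from any JSON serializer. The following is a minimal sketch in Python, not part of the original setup; the function name master_user_payload is made up for illustration. It builds the same MasterUserOptions document and relies on json.dumps to escape double quotes and backslashes.

```python
import json


def master_user_payload(user: str, password: str) -> str:
    """Return the JSON document passed to --advanced-security-options.

    json.dumps takes care of escaping characters such as double quotes
    and backslashes, which is the job jq does in the shell version.
    """
    return json.dumps(
        {
            "MasterUserOptions": {
                "MasterUserName": user,
                "MasterUserPassword": password,
            }
        }
    )


# Example with a password that would break naive string concatenation.
payload = master_user_payload("admin", 'pa"ss\\word')
print(payload)
```

The printed string can be handed to the aws CLI in place of "$CREDENTIALS". The same caveat about not leaving the password in files or shell history applies.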

Jaeger Authentication and Authorization

We don’t want just anyone with network access to our ElasticSearch cluster to be able to command it.

Amazon allows you to restrict access to ElasticSearch clusters via IAM policies and/or the ElasticSearch Security Plugin.

We are not using the former, so we attach a domain access policy that allows every principal and thereby leaves auth/z entirely to the ElasticSearch Security Plugin:

data "aws_iam_policy_document" "default" {
  statement {
    actions = ["es:*"]
    resources = [format("%s/*", aws_elasticsearch_domain.default.arn)]
    principals {
      type        = "*"
      identifiers = ["*"]
    }
  }
}

resource "aws_elasticsearch_domain_policy" "default" {
  domain_name     = aws_elasticsearch_domain.default.domain_name
  access_policies = data.aws_iam_policy_document.default.json
}

With that out of the way, we can create a user in the ElasticSearch cluster. Let’s call it jaeger.

$ CREDENTIALS=$(
    jq --null-input --arg password '<jaeger_password>' '{password: $password}'
  )

$ history -c    # don’t leave sensitive information in your shell history

$ curl --silent \
       --user admin \
       --header "Content-Type: application/json" \
       --request PUT \
       --data "$CREDENTIALS" \
       https://jaeger-es/_opendistro/_security/api/internalusers/jaeger

⚠ Note: For brevity and clarity, I will use the base URL https://jaeger-es/ to denote the ElasticSearch cluster endpoint, rather than the much lengthier real one (e.g. https://vpc-jaeger-<random_gibberish>.<region>.es.amazonaws.com/).

Next, we have to create an ElasticSearch role with the right permissions for Jaeger. Let’s create the following file:

{
  "description": "Jaeger permissions",
  "cluster_permissions": [
    "cluster:monitor/main",
    "indices:data/write/bulk",
    "indices:data/read/msearch"
  ],
  "index_permissions": [
    {
      "index_patterns": [
        "*jaeger-*"
      ],
      "fls": [],
      "masked_fields": [],
      "allowed_actions": [
        "*"
      ]
    },
    {
      "index_patterns": [
        "*"
      ],
      "fls": [],
      "masked_fields": [],
      "allowed_actions": [
        "indices:admin/aliases/get",
        "indices_monitor"
      ]
    }
  ]
}
jaeger_role.json

⚠ Note: I used this GitHub Issue as a starting point to figure out what permissions Jaeger needs. The above permissions assume that Jaeger will not do index lifecycle management.
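To get a feel for what the two index_permissions blocks grant, here is a small Python sketch that resolves the allowed actions for a given index name. It is an approximation under stated assumptions: it uses fnmatch-style globbing as a stand-in for the security plugin's wildcard matching, and the helper name allowed_actions is hypothetical.

```python
from fnmatch import fnmatchcase

# The index_permissions from jaeger_role.json, reduced to the fields used here.
ROLE_INDEX_PERMISSIONS = [
    {"index_patterns": ["*jaeger-*"], "allowed_actions": ["*"]},
    {
        "index_patterns": ["*"],
        "allowed_actions": ["indices:admin/aliases/get", "indices_monitor"],
    },
]


def allowed_actions(index_name: str) -> list:
    """Collect allowed_actions from every block whose pattern matches.

    Assumption: glob matching approximates the plugin's wildcard rules.
    """
    actions = []
    for block in ROLE_INDEX_PERMISSIONS:
        if any(fnmatchcase(index_name, p) for p in block["index_patterns"]):
            actions.extend(block["allowed_actions"])
    return actions


print(allowed_actions("jaeger-span-000001"))
print(allowed_actions("prod-jaeger-service-000003"))
print(allowed_actions("logs-2021.03.01"))
```

Under this approximation, the leading wildcard in *jaeger-* also covers indices created with an index prefix, while every other index only gets the alias lookup and monitoring actions.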

Now let’s create the ElasticSearch role and also call it jaeger.

$ curl --silent \
       --user admin \
       --header "Content-Type: application/json" \
       --request PUT \
       --data @jaeger_role.json \
       https://jaeger-es/_opendistro/_security/api/roles/jaeger

Finally, we have to associate the jaeger user with the jaeger role.

$ curl --silent \
       --user admin \
       --header "Content-Type: application/json" \
       --request PUT \
       --data '{"users": ["jaeger"]}' \
       https://jaeger-es/_opendistro/_security/api/rolesmapping/jaeger

Voilà. We now have a jaeger user with the right permissions. Go ahead and verify that it works:

$ curl --silent --user "jaeger" --request "GET" https://jaeger-es/
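The three security API calls above (user, role, role mapping) can also be scripted. The sketch below only builds the requests with Python's standard library and does not send them; the function name security_requests is made up for illustration, and authentication as the admin user is deliberately left out.

```python
import json
from urllib.request import Request

BASE_URL = "https://jaeger-es"  # placeholder endpoint used throughout this article
SECURITY_API = "/_opendistro/_security/api"


def security_requests(name: str, password: str, role_doc: dict) -> list:
    """Build (but do not send) the three PUT requests in the order used above.

    The user and the role share the same name, as in the article.
    """
    steps = [
        (f"/internalusers/{name}", {"password": password}),
        (f"/roles/{name}", role_doc),
        (f"/rolesmapping/{name}", {"users": [name]}),
    ]
    return [
        Request(
            url=BASE_URL + SECURITY_API + path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        for path, body in steps
    ]


for request in security_requests("jaeger", "<jaeger_password>", {"cluster_permissions": []}):
    print(request.get_method(), request.full_url)
```

Sending the requests would additionally require basic auth credentials for the admin user, for example via an Authorization header, and a real cluster endpoint.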

Index State Management

Since we want to take advantage of Amazon’s S3-backed UltraWarm nodes (which requires more advanced index lifecycle management than what was available in Jaeger when I wrote this article), we have to configure ElasticSearch to do the lifecycle management on its own.

The diagram below shows the index lifecycle strategy we want to implement and how the various pieces fit together.

⚠ Note: Jaeger uses two indices but the diagram only shows one of them for clarity. The other index is configured with the same index state management policy.

As you can see from the diagram, we want to create new indices (=rollover) on a regular cadence, migrate the older, less frequently accessed indices to the UltraWarm tier, and eventually delete the really old indices that (hopefully) nobody will be using anymore.

The first step towards this goal is to use the jaegertracing/jaeger-es-rollover docker image to initialize the required indices and index templates in ElasticSearch.

$ docker run --rm --net=host \
             --env SHARDS=6 \
             --env REPLICAS=1 \
             --env ES_USERNAME=admin \
             --env ES_PASSWORD='Super5ecre+' \
             --env ES_TLS_CA=/etc/ssl/cert.pem \
             jaegertracing/jaeger-es-rollover:latest \
             init https://jaeger-es

$ history -c    # don’t leave sensitive information in your shell history

⚠ Note: Obviously adjust the number of shards / replicas to fit your particular ElasticSearch design.

⚠ Note: Do not use ES_USE_ILM=true as Jaeger is currently (at the time I wrote this article) incompatible with OpenDistro's flavor of index lifecycle management. We will have to configure this part ourselves.

When ElasticSearch performs a rollover it uses index templates to configure the new index. The settings from all matching templates are merged such that settings from a higher priority (=order) template beat out conflicting settings from a lower priority template.
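The merge behavior can be illustrated with a few lines of Python. This is a simplified model, not ElasticSearch's actual implementation: it treats settings as nested dictionaries, and the shard and replica values in the example are made up.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def effective_settings(templates: list) -> dict:
    """Apply templates from lowest to highest order, so higher orders win."""
    result = {}
    for template in sorted(templates, key=lambda t: t["order"]):
        result = deep_merge(result, template.get("settings", {}))
    return result


# Hypothetical low-priority template (order 0) and a higher-priority one (order 10).
low = {"order": 0, "settings": {"index": {"number_of_shards": 6, "number_of_replicas": 1}}}
high = {
    "order": 10,
    "settings": {
        "index": {
            "number_of_replicas": 2,
            "opendistro": {"index_state_management": {"policy_id": "jaeger"}},
        }
    },
}
print(effective_settings([high, low]))
```

In this model the new index keeps number_of_shards from the order 0 template, takes number_of_replicas from the order 10 template, and gains the ISM settings that only the order 10 template defines.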

So, let’s start by creating the required templates.

{
  "order": 10,
  "index_patterns": [
    "jaeger-span-*"
  ],
  "aliases": {
    "jaeger-span-read": {}
  },
  "settings": {
    "index": {
      "opendistro": {
        "index_state_management": {
          "policy_id": "jaeger",
          "rollover_alias": "jaeger-span-write"
        }
      }
    }
  }
}
jaeger_ism_span_template.json

⚠ Note: Jaeger supports index prefixes. If you use an index prefix, you obviously have to prepend that prefix here. I.e. replace jaeger-span with <prefix>-jaeger-span.

{
  "order": 10,
  "index_patterns": [
    "jaeger-service-*"
  ],
  "aliases": {
    "jaeger-service-read": {}
  },
  "settings": {
    "index": {
      "opendistro": {
        "index_state_management": {
          "policy_id": "jaeger",
          "rollover_alias": "jaeger-service-write"
        }
      }
    }
  }
}
jaeger_ism_service_template.json

⚠ Note: Jaeger supports index prefixes. If you use an index prefix, you obviously have to prepend that prefix here. I.e. replace jaeger-service with <prefix>-jaeger-service.
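Since the two templates differ only in the index kind (and, optionally, the prefix), they can be generated rather than hand-written. The following Python sketch does that; the function name ism_template and its parameters are hypothetical, and the prefix handling simply follows the <prefix>-jaeger-span convention described in the notes above.

```python
import json


def ism_template(kind: str, prefix: str = "", policy_id: str = "jaeger", order: int = 10) -> dict:
    """Build the ISM index template for kind ("span" or "service").

    With a prefix, "jaeger-span" becomes "<prefix>-jaeger-span" everywhere.
    """
    base = f"{prefix}-jaeger-{kind}" if prefix else f"jaeger-{kind}"
    return {
        "order": order,
        "index_patterns": [f"{base}-*"],
        "aliases": {f"{base}-read": {}},
        "settings": {
            "index": {
                "opendistro": {
                    "index_state_management": {
                        "policy_id": policy_id,
                        "rollover_alias": f"{base}-write",
                    }
                }
            }
        },
    }


# Without a prefix this reproduces jaeger_ism_span_template.json.
print(json.dumps(ism_template("span"), indent=2))
# With a made-up prefix "prod".
print(json.dumps(ism_template("service", prefix="prod"), indent=2))
```

Generating both documents from one function keeps the read alias, write alias, and index pattern in sync, which is easy to get wrong when editing two nearly identical files by hand.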

Send the templates to ElasticSearch.

$ curl --silent \
       --user admin \
       --header "Content-Type: application/json" \
       --request PUT \
       --data @jaeger_ism_span_template.json \
       https://jaeger-es/_template/jaeger-span-ism

$ curl --silent \
       --user admin \
       --header "Content-Type: application/json" \
       --request PUT \
       --data @jaeger_ism_service_template.json \
       https://jaeger-es/_template/jaeger-service-ism

⚠ Note: ElasticSearch 7.8 introduced new composable index templates. Those new templates (located at /_index_template/ as opposed to /_template/) supersede legacy templates. So if / when Jaeger switches to the new templates, the above must be updated accordingly.

Check your work.

$ curl --silent --user admin https://jaeger-es/_cat/templates
>>>
jaeger-span        [*jaeger-span-*] 0
jaeger-service     [*jaeger-service-*] 0
jaeger-span-ism    [jaeger-span-*] 10
jaeger-service-ism [jaeger-service-*] 10

Now we can create the ISM (Index State Management) policy.

{
  "policy": {
    "description": "Jaeger index lifecycle management policy",
    "default_state": "hot",
    "schema_version": 1,
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_index_age": "1d"
            }
          }
        ],
        "transitions": [
          {
            "state_name": "warm",
            "conditions": {
              "min_index_age": "10d"
            }
          }
        ]
      },
      {
        "name": "warm",
        "actions": [
          {
            "warm_migration": {},
            "timeout": "24h",
            "retry": {
              "count": 5,
              "delay": "1h"
            }
          }
        ],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": {
              "min_index_age": "30d"
            }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "delete": {}
          }
        ],
        "transitions": []
      }
    ]
  }
}
jaeger_ism_policy.json
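To sanity-check the thresholds in this policy, the sketch below models which state an index should be in at a given age, and roughly how many indices sit in each tier once the cluster reaches steady state. It is a simplification under stated assumptions: ages are measured from index creation, rollover happens exactly once per day, and the delay introduced by ISM's periodic evaluation is ignored. The function names are made up for illustration.

```python
from datetime import timedelta

# Thresholds copied from jaeger_ism_policy.json.
ROLLOVER_AFTER = timedelta(days=1)
WARM_AFTER = timedelta(days=10)
DELETE_AFTER = timedelta(days=30)


def expected_state(index_age: timedelta) -> str:
    """Return the state the policy should have reached for an index of this age."""
    if index_age >= DELETE_AFTER:
        return "delete"
    if index_age >= WARM_AFTER:
        return "warm"
    return "hot"


def steady_state_counts(rollover_every: timedelta = ROLLOVER_AFTER) -> dict:
    """Approximate number of indices per tier, per alias, at steady state."""
    return {
        "hot": WARM_AFTER // rollover_every,
        "warm": (DELETE_AFTER - WARM_AFTER) // rollover_every,
    }


print(expected_state(timedelta(days=3)))
print(expected_state(timedelta(days=12)))
print(steady_state_counts())
```

Numbers like these are useful as a starting point when estimating hot-tier versus UltraWarm storage needs, but actual counts will drift slightly because rollovers and transitions only happen when ISM next evaluates the index.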

Send the ISM policy to ElasticSearch.

$ curl --silent \
       --user admin \
       --header "Content-Type: application/json" \
       --request PUT \
       --data @jaeger_ism_policy.json \
       https://jaeger-es/_opendistro/_ism/policies/jaeger

⚠ Note: OpenDistro for ElasticSearch v1.13.0 introduced a better way to map ISM policies to indices. Once this change takes effect, the opendistro.index_state_management.policy_id config must be removed from the index templates above and an ism_template block must be added to the ISM policy instead. This change will also obviate the next (and final) step.
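For illustration, here is a sketch of how the policy document might be extended once ism_template is available. The field names index_patterns and priority are taken from the OpenDistro ISM documentation as best understood at the time of writing and should be verified against the version you run; the helper name with_ism_template and the priority value are assumptions.

```python
import copy


def with_ism_template(policy_doc: dict, index_patterns: list, priority: int = 100) -> dict:
    """Return a copy of the policy with an ism_template block added.

    Assumption: ism_template takes "index_patterns" and "priority".
    """
    updated = copy.deepcopy(policy_doc)
    updated["policy"]["ism_template"] = {
        "index_patterns": list(index_patterns),
        "priority": priority,
    }
    return updated


# Abbreviated stand-in for jaeger_ism_policy.json.
policy = {"policy": {"description": "Jaeger index lifecycle management policy", "default_state": "hot", "states": []}}
updated = with_ism_template(policy, ["jaeger-span-*", "jaeger-service-*"])
print(updated["policy"]["ism_template"])
```

Note that this only replaces the policy_id mapping. The rollover_alias setting is a per-index setting, so it would presumably still have to come from the index templates.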

The final step is to patch the two indices that were created by the jaegertracing/jaeger-es-rollover docker image.

$ curl --silent \
       --user admin \
       --header "Content-Type: application/json" \
       --request PUT \
       --data '{"index": {"opendistro": {"index_state_management": {"policy_id" : "jaeger", "rollover_alias": "jaeger-span-write"}}}}' \
       https://jaeger-es/jaeger-span-000001/_settings

$ curl --silent \
       --user admin \
       --header "Content-Type: application/json" \
       --request PUT \
       --data '{"index": {"opendistro": {"index_state_management": {"policy_id" : "jaeger", "rollover_alias": "jaeger-service-write"}}}}' \
       https://jaeger-es/jaeger-service-000001/_settings

Check your work.

$ curl --silent --user admin \
       'https://jaeger-es/_opendistro/_ism/explain/jaeger-*?pretty'
>>>
{
  "jaeger-span-000001" : {
    "index.opendistro.index_state_management.policy_id" : "jaeger",
    "index" : "jaeger-span-000001",
    "index_uuid" : "qIP4e1L5SnmnULRM0nEAfg",
    "policy_id" : "jaeger",
    ...
  },
  "jaeger-service-000001" : {
    "index.opendistro.index_state_management.policy_id" : "jaeger",
    "index" : "jaeger-service-000001",
    "index_uuid" : "j4XaocW3Qg2GOiVvXUxvcA",
    "policy_id" : "jaeger",
    ...
  }
}
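Instead of eyeballing the explain output, the check can be automated. The following Python sketch takes a parsed explain response and lists the indices that are not managed by the expected policy. The function name indices_missing_policy is hypothetical, and the sample response is an abbreviated version of the output above with one index deliberately left unmanaged.

```python
import json


def indices_missing_policy(explain_response: dict, expected_policy: str = "jaeger") -> list:
    """Return the names of indices whose policy_id is not expected_policy.

    Non-dict top-level values are skipped, in case the response carries
    summary fields next to the per-index entries.
    """
    return sorted(
        name
        for name, info in explain_response.items()
        if isinstance(info, dict) and info.get("policy_id") != expected_policy
    )


sample = json.loads(
    """
    {
      "jaeger-span-000001": {"index": "jaeger-span-000001", "policy_id": "jaeger"},
      "jaeger-service-000001": {"index": "jaeger-service-000001", "policy_id": null}
    }
    """
)
print(indices_missing_policy(sample))
```

An empty list means every index matched by the explain call is attached to the jaeger policy.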

⚠ Note: In case anyone is wondering whether the is_write_index setting is needed here, I refer you to this post in the OpenDistro community forum.

And that’s it. We now have an Amazon-managed OpenDistro ElasticSearch cluster that does auth/z and handles index lifecycle management for Jaeger’s indices.

The next article picks up from here and shows how we configure the Jaeger Backend (Collector & Query) to use this ElasticSearch cluster.