Usage

Druid requires Zookeeper and an SQL database to run. If HDFS is used as the backend storage (so-called "deep storage"), the HDFS operator is required as well.

Setup Prerequisites

Zookeeper operator

Please refer to the Zookeeper operator and docs.
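
The Druid cluster example further below references a discovery ConfigMap named simple-zk (see zookeeperConfigMapName). As a rough, illustrative sketch only (the field names and version value are assumptions; consult the Zookeeper operator docs for the authoritative schema), a minimal ZookeeperCluster producing that ConfigMap could look like this:

---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: simple-zk            # referenced by zookeeperConfigMapName below
spec:
  version: 3.8.0             # illustrative; pick a version supported by the operator
  servers:
    roleGroups:
      default:
        replicas: 3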

HDFS operator (optional)

Please refer to the HDFS operator and docs.

SQL Database

Druid requires a MySQL or Postgres database.

For testing purposes, you can spin up a PostgreSQL database with the Bitnami PostgreSQL Helm chart. Add the Bitnami repository:

helm repo add bitnami https://charts.bitnami.com/bitnami

And set up the Postgres database:

helm install druid bitnami/postgresql \
--version=11 \
--set auth.username=druid \
--set auth.password=druid \
--set auth.database=druid
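
With the release name druid used above, the Bitnami chart typically creates a Service named druid-postgresql, which the DruidCluster example below references as the database host. You can double-check the Service name like this:

kubectl get svc druid-postgresql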

Creating a Druid Cluster

With the prerequisites fulfilled, the CRD for this operator must be created:

kubectl apply -f /etc/stackable/druid-operator/crd

Then a cluster can be deployed using the example below.

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: simple-druid
spec:
  version: 0.22.1
  zookeeperConfigMapName: simple-zk
  metadataStorageDatabase:
    dbType: postgresql
    connString: jdbc:postgresql://druid-postgresql/druid
    host: druid-postgresql    # this is the name of the Postgres service
    port: 5432
    user: druid
    password: druid
  deepStorage:
    storageType: hdfs
    storageDirectory: hdfs://path/to/druidDeepStorage
  brokers:
    roleGroups:
      default:
        config: {}
        replicas: 1
  coordinators:
    roleGroups:
      default:
        config: {}
        replicas: 1
  historicals:
    roleGroups:
      default:
        config: {}
        replicas: 1
  middleManagers:
    roleGroups:
      default:
        config: {}
        replicas: 1
  routers:
    roleGroups:
      default:
        config: {}
        replicas: 1
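
Assuming the manifest above is saved as druid-cluster.yaml (any filename works), it can be applied and the resulting pods observed with:

kubectl apply -f druid-cluster.yaml
kubectl get pods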

The Router hosts the web UI; the operator creates a NodePort service to expose it. Connect to the simple-druid-router NodePort service and follow the Druid documentation on how to load and query sample data.
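
To find the port assigned to the NodePort service, or to reach the UI through a local port-forward instead (assuming the service exposes Druid's default Router port 8888), something like the following can be used:

kubectl get service simple-druid-router
kubectl port-forward service/simple-druid-router 8888

With the port-forward in place, the web UI is available at http://localhost:8888.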

Using S3

The Stackable Platform uses a common set of resource definitions for S3 across all operators, explained in detail on the S3 resources page. In general, you can configure an S3 connection or bucket inline, or as a reference to a dedicated object.

In Druid, S3 can be used for two things:

  • Ingesting data from a bucket

  • Using it as a backend for deep storage

You can specify a connection/bucket for just one of these or for both, but Druid only supports a single S3 endpoint under the hood, so if two connections are specified, they must be identical. This is easiest to ensure if a dedicated S3 connection resource is used.
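
As a sketch of such a dedicated object (the S3Connection kind and the resource name here are assumptions based on the S3 resources page; the fields mirror the inline connection shown below), the shared connection could be defined once and then referenced from both the ingestion and deep storage sections:

---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Connection
metadata:
  name: my-s3-connection    # hypothetical name
spec:
  host: test-minio
  port: 9000
  secretClass: minio-credentials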

TLS for S3 is not yet supported.

S3 for ingestion

To ingest data from S3, you need to specify at least an endpoint to use; further settings can be configured as well:

spec:
  ingestion:
    s3connection:
      host: yourhost.com  (1)
      port: 80 # optional (2)
      secretClass: my-s3-credentials  # optional (3)
1 The S3 host, not optional.
2 Port, optional, defaults to 80.
3 Credentials to use. Since these might be bucket-dependent, they can instead be given in the ingestion job.

S3 deep storage

Druid can use S3 as a backend for deep storage:

spec:
  deepStorage:
    s3:
      inline:
        bucketName: my-bucket  (1)
        connection:
          inline:
            host: test-minio  (2)
            port: 9000  (3)
            secretClass: minio-credentials  (4)
1 Bucket name.
2 Bucket host.
3 Optional bucket port.
4 Name of the Secret object expected to contain the following keys: ACCESS_KEY_ID and SECRET_ACCESS_KEY.
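
Following callout 4 above, a minimal sketch of such a Secret could look like this (the values are placeholders; depending on your setup the credentials may be managed differently, see the S3 resources page):

---
apiVersion: v1
kind: Secret
metadata:
  name: minio-credentials
stringData:
  ACCESS_KEY_ID: minioAccessKey        # placeholder
  SECRET_ACCESS_KEY: minioSecretKey    # placeholder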

It is also possible to configure the bucket connection details as a separate Kubernetes resource and only refer to that object from the DruidCluster like this:

spec:
  deepStorage:
    s3:
      reference: my-bucket-resource (1)
1 Name of the bucket resource with connection details.

The resource named my-bucket-resource is then defined as shown below:

---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Bucket
metadata:
  name: my-bucket-resource
spec:
  bucketName: my-bucket-name
  connection:
    inline:
      host: test-minio
      port: 9000
      secretClass: minio-credentials

This has the advantage that bucket configuration can be shared across `DruidCluster`s (and other Stackable CRDs) and reduces the cost of updating these details.

Using Open Policy Agent (OPA) for Authorization

Druid can connect to an Open Policy Agent (OPA) instance for authorization policy decisions. You need a running OPA instance to connect to; for setting one up, we refer to the OPA operator docs. How to write RegoRules for Druid is explained below.

Once you have defined your rules, you need to configure the OPA cluster name and endpoint to use for Druid authorization requests. Add a section to the spec for OPA:

opa:
  configMapName: simple-opa (1)
  package: my-druid-rules (2)
1 The name of your OPA cluster (simple-opa in this case).
2 The RegoRule package to use for policy decisions. The package should contain an allow rule. This is optional and will default to the name of the Druid cluster.

Defining RegoRules

For a general explanation of how rules are written, we refer to the OPA documentation. Inside your rule you have access to input from Druid, which provides the following data on which to base your policy decisions:

{
  "user": "someUsername", (1)
  "action": "READ", (2)
  "resource": {
    "type": "DATASOURCE", (3)
    "name": "myTable" (4)
  }
}
1 The authenticated identity of the user that wants to perform the action.
2 The action type; either READ or WRITE.
3 The resource type; one of STATE, CONFIG or DATASOURCE.
4 For a DATASOURCE this is the table name; for STATE it is simply STATE, and likewise for CONFIG.
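
As a minimal, illustrative sketch (the package name matches the opa section above, and the user and table are just the example values from the input shown here), a rule granting read access to that datasource could look like this:

package my-druid-rules

default allow = false

# allow someUsername to read the myTable datasource
allow {
    input.user == "someUsername"
    input.action == "READ"
    input.resource.type == "DATASOURCE"
    input.resource.name == "myTable"
}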

For more details consult the Druid Authentication and Authorization Model.

Connecting to Druid from other Services

The operator creates a ConfigMap with the name of the cluster, containing connection information. Following the example above (the cluster is named simple-druid), a ConfigMap named simple-druid will be created containing three keys:

  • DRUID_ROUTER with the format <host>:<port>, which points to the Router process's HTTP endpoint. Here you can connect to the web UI, or use REST endpoints such as /druid/v2/sql/ to query data. More information in the Druid Docs.

  • DRUID_AVATICA_JDBC contains a JDBC connect string which can be used together with the Avatica JDBC Driver to connect to Druid and query data. More information in the Druid Docs.

  • DRUID_SQLALCHEMY contains a connection string used to connect to Druid with SQLAlchemy, for example in Apache Superset.
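
Other workloads can pick these values up as environment variables by mounting the ConfigMap, for example (a generic Kubernetes sketch with a hypothetical Pod name and image):

apiVersion: v1
kind: Pod
metadata:
  name: druid-client               # hypothetical client Pod
spec:
  containers:
    - name: client
      image: alpine:3              # placeholder image
      command: ["sh", "-c", "echo Router at $DRUID_ROUTER && sleep infinity"]
      envFrom:
        - configMapRef:
            name: simple-druid     # the ConfigMap created by the operator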

Monitoring

The managed Druid instances are automatically configured to export Prometheus metrics. See Monitoring for more details.

Configuration & Environment Overrides

The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).

Overriding certain properties which are set by the operator (such as the HTTP port) can interfere with the operator and lead to problems.

Configuration Properties

For a role or role group, at the same level as config, you can specify configOverrides for the runtime.properties file. For example, if you want to set druid.server.http.numThreads for the Router to 100, adapt the routers section of the cluster resource like so:

routers:
  roleGroups:
    default:
      config: {}
      configOverrides:
        runtime.properties:
          druid.server.http.numThreads: "100"
      replicas: 1

Just as for the config, it is possible to specify this at role level as well:

routers:
  configOverrides:
    runtime.properties:
      druid.server.http.numThreads: "100"
  roleGroups:
    default:
      config: {}
      replicas: 1

All override property values must be strings.

For a full list of configuration options we refer to the Druid Configuration Reference.

Environment Variables

In a similar fashion, environment variables can be (over)written. For example per role group:

routers:
  roleGroups:
    default:
      config: {}
      envOverrides:
        MY_ENV_VAR: "MY_VALUE"
      replicas: 1

or per role:

routers:
  envOverrides:
    MY_ENV_VAR: "MY_VALUE"
  roleGroups:
    default:
      config: {}
      replicas: 1