Usage

Trino works together with the Apache Hive metastore and S3 bucket.

Prerequisites

  • Deployed Stackable Apache Hive metastore

  • Accessible S3 Bucket

    • Endpoint, access-key and secret-key

    • Data in the Bucket (we use the Iris dataset here)

  • Optional deployed Stackable Secret-Operator for certificates when deploying for HTTPS

  • Optional for authorization: Deployed Stackable OPA-Operator

  • Optional Trino CLI to test SQL queries

Installation

In the following we explain or link the required installation steps.

S3 bucket

Please refer to the S3 provider.

Hive operator

Please refer to the Hive operator and docs.

Both Hive and Trino need the same S3 authentication.

OPA operator

Please refer to the OPA operator and docs.

Authentication

We provide user authentication via secret that can be referred in the custom resource:

authentication:
  method:
    multiUser:
      userCredentialsSecret:
        namespace: default
        name: simple-trino-users-secret

These secrets need to be created manually before startup. The secret may look like the following snippet:

apiVersion: v1
kind: Secret
metadata:
  name: simple-trino-users-secret
type: kubernetes.io/opaque
stringData:
  admin: $2y$10$89xReovvDLacVzRGpjOyAOONnayOgDAyIS2nW9bs5DJT98q17Dy5i
  alice: $2y$10$HcCa4k9v2DRrD/g7e5vEz.Bk.1xg00YTEHOZjPX7oK3KqMSt2xT8W
  bob: $2y$10$xVRXtYZnYuQu66SmruijPO8WHFM/UK5QPHTr.Nzf4JMcZSqt3W.2.

The <user>:<password> combinations are provided in the stringData field. The hashes are created using bcrypt with 10 rounds or more.

htpasswd -nbBC 10 admin admin

Authorization

In order to authorize Trino via OPA, a ConfigMap containing Rego rules for Trino has to be applied. The following example is an all access Rego rule for testing with the user admin. Do not use it in production!

apiVersion: v1
kind: ConfigMap
metadata:
  name: opa-bundle-trino
  labels:
    opa.stackable.tech/bundle: "trino"
data:
  trino.rego: |
    package trino

    import future.keywords.in

    default allow = false

    allow {
      is_admin
    }

    is_admin() {
      input.context.identity.user == "admin"
    }

Users should write their own rego rules for more complex OPA authorization.

Trino

With the prerequisites fulfilled, the CRD for this operator must be created:

kubectl apply -f /etc/stackable/trino-operator/crd/trinocluster.crd.yaml

Insecure for testing:

Create an insecure single node Trino (v387) cluster for testing. You will access the UI/CLI via http and no user / password or authorization is required. Please adapt the s3 settings with your credentials (check examples/simple-trino-cluster.yaml for an example setting up Hive and Trino):

apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCluster
metadata:
  name: simple-trino
spec:
  version: 387-stackable0.1.0
  hiveConfigMapName: simple-hive-derby
  s3:
    inline:
      host: test-minio
      port: 9000
      accessStyle: Path
      credentials:
        secretClass: minio-credentials
  coordinators:
    roleGroups:
      default:
        selector:
          matchLabels:
            kubernetes.io/os: linux
        replicas: 1
        config: {}
  workers:
    roleGroups:
      default:
        selector:
          matchLabels:
            kubernetes.io/os: linux
        replicas: 1
        config: {}

To access the CLI please execute:

./trino-cli-387-executable.jar --debug --server http://<node>:<http-port> --user=admin

Secure (https) for production:

Create a secure single node Trino (v387) cluster. This will disable the UI access via http and requires username and password from the secret above. Please adapt the s3 settings with your credentials (check examples/simple-trino-cluster-authentication-opa-authorization.yaml for a full example setting up Hive, OPA, Secrets and Trino):

apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCluster
metadata:
  name: simple-trino
spec:
  version: 387-stackable0.1.0
  hiveConfigMapName: simple-hive-derby
  opa:
    configMapName: simple-opa
    package: trino
  s3:
    inline:
      host: test-minio
      port: 9000
      accessStyle: Path
      credentials:
        secretClass: minio-credentials
  authentication:
    method:
      multiUser:
        userCredentialsSecret:
          namespace: default
          name: simple-trino-users-secret
  coordinators:
    roleGroups:
      default:
        selector:
          matchLabels:
            kubernetes.io/os: linux
        replicas: 1
        config: {}
  workers:
    roleGroups:
      default:
        selector:
          matchLabels:
            kubernetes.io/os: linux
        replicas: 1
        config: {}

To access the CLI please download the Trino CLI and execute:

./trino-cli-387-executable.jar --debug --server https://<host>:<https-port> --user=admin --password --insecure

If you use self signed certificates, you also need the --insecure flag above which can be omitted otherwise.

S3 connection specification

You can specify S3 connection details directly inside the Trino specification or by referring to an external S3Connection custom resource.

To specify S3 connection details directly as part of the Trino resource, you add an inline connection configuration as shown below:

s3: (1)
  inline:
    host: test-minio (2)
    port: 9000 (3)
    pathStyleAccess: true (4)
    secretClass: minio-credentials  (5)
1 Entry point for the connection configuration
2 Connection host
3 Optional connection port
4 Optional flag if path-style URLs should be used; This defaults to false which means virtual hosted-style URLs are used.
5 Name of the Secret object expected to contain the following keys: accessKey and secretKey

It is also possible to configure the bucket connection details as a separate Kubernetes resource and only refer to that object from the Trino specification like this:

s3:
  reference: my-connection-resource (1)
1 Name of the connection resource with connection details

The resource named my-connection-resource is then defined as shown below:

---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Connection
metadata:
  name: my-connection-resource
spec:
  host: test-minio
  port: 9000
  accessStyle: Path
  credentials:
    secretClass: minio-credentials

This has the advantage that the connection configuration can be shared across applications and reduces the cost of updating these details.

Test Trino with Hive and S3

Create a schema and a table for the Iris data located in S3 and query data. This assumes to have the Iris data set in the PARQUET format available in the S3 bucket which can be downloaded here

Create schema

CREATE SCHEMA IF NOT EXISTS hive.iris
WITH (location = 's3a://iris/');

which should return:

CREATE SCHEMA

Create table

CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (
  sepal_length DOUBLE,
  sepal_width  DOUBLE,
  petal_length DOUBLE,
  petal_width  DOUBLE,
  class        VARCHAR
)
WITH (
  external_location = 's3a://iris/parq',
  format = 'PARQUET'
);

which should return:

CREATE TABLE

Query data

SELECT
    sepal_length,
    class
FROM hive.iris.iris_parquet
LIMIT 10;

which should return something like this:

 sepal_length |    class
--------------+-------------
          5.1 | Iris-setosa
          4.9 | Iris-setosa
          4.7 | Iris-setosa
          4.6 | Iris-setosa
          5.0 | Iris-setosa
          5.4 | Iris-setosa
          4.6 | Iris-setosa
          5.0 | Iris-setosa
          4.4 | Iris-setosa
          4.9 | Iris-setosa
(10 rows)

Query 20220210_161615_00000_a8nka, FINISHED, 1 node
https://172.18.0.5:30299/ui/query.html?20220210_161615_00000_a8nka
Splits: 18 total, 18 done (100.00%)
CPU Time: 0.7s total,    20 rows/s, 11.3KB/s, 74% active
Per Node: 0.3 parallelism,     5 rows/s, 3.02KB/s
Parallelism: 0.3
Peak Memory: 0B
2.67 [15 rows, 8.08KB] [5 rows/s, 3.02KB/s]

Monitoring

The managed Trino instances are automatically configured to export Prometheus metrics. See Monitoring for more details.

Configuration & Environment Overrides

The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).

Do not override port numbers. This will lead to faulty installations.

Configuration Properties

For a role or role group, at the same level of config, you can specify: configOverrides for:

  • config.properties

  • node.properties

  • log.properties

  • password-authenticator.properties

  • hive.properties

For a list of possible configuration properties consult the Trino Properties Reference.

workers:
  roleGroups:
    default:
      config: {}
      replicas: 1
      configOverrides:
        config.properties:
          query.max-memory-per-node: "2GB"

Just as for the config, it is possible to specify this at role level as well:

workers:
  configOverrides:
    config.properties:
      query.max-memory-per-node: "2GB"
  roleGroups:
    default:
      config: {}
      replicas: 1

All override property values must be strings. The properties will be passed on without any escaping or formatting.

Environment Variables

Environment variables can be (over)written by adding the envOverrides property.

For example per role group:

workers:
  roleGroups:
    default:
      config: {}
      replicas: 1
      envOverrides:
        JAVA_HOME: "path/to/java"

or per role:

workers:
  envOverrides:
    JAVA_HOME: "path/to/java"
  roleGroups:
    default:
      config: {}
      replicas: 1

Here too, overriding properties such as http-server.https.port will lead to broken installations.