Usage
Trino works together with the Apache Hive metastore and S3 bucket.
Prerequisites
-
Deployed Stackable Apache Hive metastore
-
Accessible S3 Bucket
-
Endpoint, access-key and secret-key
-
Data in the Bucket (we use the Iris dataset here)
-
-
Optional deployed Stackable Secret Operator for certificates when deploying for TLS
-
Optional deployed Stackable Commons Operator for certificates when deploying for TLS authentication
-
Optional for authorization: Deployed Stackable OPA Operator[OPA-Operator]
-
Optional Trino CLI to test SQL queries
Installation
In the following we explain or link the required installation steps.
Hive operator
Please refer to the Hive Operator docs.
Both Hive and Trino need the same S3 authentication.
OPA operator
Please refer to the OPA Operator docs.
Authentication
We provide user authentication via secret that can be referred in the custom resource:
authentication:
method:
multiUser:
userCredentialsSecret:
namespace: default
name: simple-trino-users-secret
These secrets need to be created manually before startup. The secret may look like the following snippet:
apiVersion: v1
kind: Secret
metadata:
name: simple-trino-users-secret
type: kubernetes.io/opaque
stringData:
admin: $2y$10$89xReovvDLacVzRGpjOyAOONnayOgDAyIS2nW9bs5DJT98q17Dy5i
alice: $2y$10$HcCa4k9v2DRrD/g7e5vEz.Bk.1xg00YTEHOZjPX7oK3KqMSt2xT8W
bob: $2y$10$xVRXtYZnYuQu66SmruijPO8WHFM/UK5QPHTr.Nzf4JMcZSqt3W.2.
The <user>:<password> combinations are provided in the stringData
field. The hashes are created using bcrypt with 10 rounds or more.
htpasswd -nbBC 10 admin admin
Authorization
In order to authorize Trino via OPA, a ConfigMap
containing Rego rules for Trino has to be applied. The following example is an all access Rego rule for testing with the user admin
. Do not use it in production!
apiVersion: v1
kind: ConfigMap
metadata:
name: opa-bundle-trino
labels:
opa.stackable.tech/bundle: "trino"
data:
trino.rego: |
package trino
import future.keywords.in
default allow = false
allow {
is_admin
}
is_admin() {
input.context.identity.user == "admin"
}
Users should write their own rego rules for more complex OPA authorization.
Trino
With the prerequisites fulfilled, the CRD for this operator must be created:
kubectl apply -f /etc/stackable/trino-operator/crd/trinocluster.crd.yaml
Insecure for testing:
Create an insecure single node Trino (v387) cluster for testing. You will access the UI/CLI via http and no user / password or authorization is required. Please adapt the s3
settings with your credentials (check examples/simple-trino-cluster.yaml
for an example setting up Hive and Trino):
apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCluster
metadata:
name: simple-trino
spec:
version: 387-stackable0.1.0
catalogLabelSelector:
matchLabels:
trino: simple-trino
coordinators:
roleGroups:
default:
replicas: 1
workers:
roleGroups:
default:
replicas: 1
---
apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCatalog
metadata:
name: hive
labels:
trino: simple-trino
spec:
connector:
hive:
metastore:
configMap: simple-hive-derby
s3:
inline:
host: test-minio
port: 9000
accessStyle: Path
credentials:
secretClass: minio-credentials
To access the CLI please execute:
./trino-cli-387-executable.jar --debug --server http://<node>:<http-port> --user=admin
Secure for production:
There are multiple steps that must be taken to secure a Trino cluster:
-
Enable authentication
-
Enable TLS between the clients and coordinator
-
Enable internal TLS for communications between coordinators and workers
For testing purposes we use the Trino CLI.
Via authentication
If authentication is enabled, TLS for the coordinator as well as a shared secret for internal communications (this is base64 and not encrypted) must be configured.
Securing the Trino cluster will disable all HTTP ports and disable the web interface on the HTTP port as well.
apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCluster
metadata:
name: simple-trino
spec:
version: 387-stackable0.1.0
config:
tls:
secretClass: trino-tls
authentication:
method:
multiUser:
userCredentialsSecret:
name: simple-trino-users-secret
[..]
If no config.tls.secretClass
is provided but authentication is enabled, it will default to tls
provided by the Secret Operator.
./trino-cli-387-executable.jar --debug --server https://<host>:<https-port> --user=admin --keystore-path=keystore.p12 --keystore-password=changeit
or
./trino-cli-387-executable.jar --debug --server https://<host>:<https-port> --user=admin --insecure
Via TLS only
This will disable the HTTP port and UI access and encrypt client-server communications.
apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCluster
metadata:
name: simple-trino
spec:
version: 387-stackable0.1.0
config:
tls:
secretClass: trino-tls
[..]
./trino-cli-387-executable.jar --debug --server https://<host>:<https-port> --user=admin --keystore-path=keystore.p12 --keystore-password=changeit
Via internal TLS
Internal TLS is for encrypted and authenticated communications between coordinators and workers. Since this applies to all the data send and processed between the processes, this may reduce the performance significantly.
apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCluster
metadata:
name: simple-trino
spec:
version: 387-stackable0.1.0
config:
internalTls:
secretClass: trino-internal-tls
[..]
Since Trino has internal and external communications running over a single port, this will enable the HTTPS port but not expose it. Cluster access is only possible via HTTP.
./trino-cli-387-executable.jar --debug --server http://<host>:<http-port> --user=admin
S3 connection specification
You can specify S3 connection details directly inside the TrinoCatalog
specification
or by referring to an external S3Connection
custom resource.
To specify S3 connection details directly as part of the TrinoCatalog
resource, you
add an inline connection configuration as shown below:
s3: (1)
inline:
host: test-minio (2)
port: 9000 (3)
pathStyleAccess: true (4)
secretClass: minio-credentials (5)
tls:
verification:
server:
caCert:
secretClass: minio-tls-certificates (6)
1 | Entry point for the connection configuration |
2 | Connection host |
3 | Optional connection port |
4 | Optional flag if path-style URLs should be used; This defaults to false
which means virtual hosted-style URLs are used. |
5 | Name of the Secret object expected to contain the following keys:
accessKey and secretKey |
6 | Optional TLS settings for encrypted traffic. The secretClass can be provided by the Secret Operator or yourself. |
A self provided S3 TLS secret can be specified like this:
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
name: minio-tls-certificates
spec:
backend:
k8sSearch:
searchNamespace:
pod: {}
---
apiVersion: v1
kind: Secret
metadata:
name: minio-tls-certificates
labels:
secrets.stackable.tech/class: minio-tls-certificates
data:
ca.crt: <your-base64-encoded-ca>
tls.crt: <your base64-encoded-public-key>
tls.key: <your-base64-encoded-private-key>
It is also possible to configure the bucket connection details as a separate
Kubernetes resource and only refer to that object from the TrinoCatalog
specification
like this:
s3:
reference: my-connection-resource (1)
1 | Name of the connection resource with connection details |
The resource named my-connection-resource
is then defined as shown below:
---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Connection
metadata:
name: my-connection-resource
spec:
host: test-minio
port: 9000
accessStyle: Path
credentials:
secretClass: minio-credentials
This has the advantage that the connection configuration can be shared across applications and reduces the cost of updating these details.
Test Trino with Hive and S3
Create a schema and a table for the Iris data located in S3 and query data. This assumes to have the Iris data set in the PARQUET
format available in the S3 bucket which can be downloaded here
Create schema
CREATE SCHEMA IF NOT EXISTS hive.iris
WITH (location = 's3a://iris/');
which should return:
CREATE SCHEMA
Create table
CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (
sepal_length DOUBLE,
sepal_width DOUBLE,
petal_length DOUBLE,
petal_width DOUBLE,
class VARCHAR
)
WITH (
external_location = 's3a://iris/parq',
format = 'PARQUET'
);
which should return:
CREATE TABLE
Query data
SELECT
sepal_length,
class
FROM hive.iris.iris_parquet
LIMIT 10;
which should return something like this:
sepal_length | class --------------+------------- 5.1 | Iris-setosa 4.9 | Iris-setosa 4.7 | Iris-setosa 4.6 | Iris-setosa 5.0 | Iris-setosa 5.4 | Iris-setosa 4.6 | Iris-setosa 5.0 | Iris-setosa 4.4 | Iris-setosa 4.9 | Iris-setosa (10 rows) Query 20220210_161615_00000_a8nka, FINISHED, 1 node https://172.18.0.5:30299/ui/query.html?20220210_161615_00000_a8nka Splits: 18 total, 18 done (100.00%) CPU Time: 0.7s total, 20 rows/s, 11.3KB/s, 74% active Per Node: 0.3 parallelism, 5 rows/s, 3.02KB/s Parallelism: 0.3 Peak Memory: 0B 2.67 [15 rows, 8.08KB] [5 rows/s, 3.02KB/s]
Monitoring
The managed Trino instances are automatically configured to export Prometheus metrics. See Monitoring for more details.
Configuration & Environment Overrides
The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).
Do not override port numbers. This will lead to faulty installations. |
Configuration Properties
For a role or role group, at the same level of config
, you can specify: configOverrides
for:
-
config.properties
-
node.properties
-
log.properties
-
password-authenticator.properties
For a list of possible configuration properties consult the Trino Properties Reference.
workers:
roleGroups:
default:
config: {}
replicas: 1
configOverrides:
config.properties:
query.max-memory-per-node: "2GB"
Just as for the config
, it is possible to specify this at role level as well:
workers:
configOverrides:
config.properties:
query.max-memory-per-node: "2GB"
roleGroups:
default:
config: {}
replicas: 1
All override property values must be strings. The properties will be passed on without any escaping or formatting.
Environment Variables
Environment variables can be (over)written by adding the envOverrides
property.
For example per role group:
workers:
roleGroups:
default:
config: {}
replicas: 1
envOverrides:
JAVA_HOME: "path/to/java"
or per role:
workers:
envOverrides:
JAVA_HOME: "path/to/java"
roleGroups:
default:
config: {}
replicas: 1
Here too, overriding properties such as http-server.https.port
will lead to broken installations.
Storage for data volumes
You can mount a volume where data (config and logs of Trino) is stored by specifying PersistentVolumeClaims for each individual role or role group:
workers:
config:
resources:
storage:
data:
capacity: 2Gi
roleGroups:
default:
config:
resources:
storage:
data:
capacity: 3Gi
In the above example, all Trino workers in the default group will store data (the location of the property --data-dir
) on a 3Gi
volume. Additional role groups not specifying any resources will inherit the config provided on the role level (2Gi
volume). This works the same for memory or CPU requests.
By default, in case nothing is configured in the custom resource for a certain role group, each Pod will have a 2Gi
large local volume mount for the data location containing mainly logs.
Memory requests
You can request a certain amount of memory for each individual role group as shown below:
workers:
roleGroups:
default:
config:
resources:
memory:
limit: '2Gi'
In this example, each Trino container in the default
group will have a maximum of 2 gigabytes of memory. To be more precise, these memory limits apply to the container running Trino but not to any sidecar containers that are part of the pod.
Setting this property will also automatically set the maximum Java heap size for the corresponding process to 80% of the available memory. Be aware that if the memory constraint is too low, the cluster might fail to start. If pods terminate with an 'OOMKilled' status and the cluster doesn’t start, try increasing the memory limit.
For more details regarding Kubernetes memory requests and limits see: Assign Memory Resources to Containers and Pods.
CPU requests
Similarly to memory resources, you can also configure CPU limits, as shown below:
workers:
roleGroups:
default:
config:
resources:
cpu:
max: '500m'
min: '250m'
Defaults
If nothing is specified, the operator will automatically set the following default values for resources:
workers:
roleGroups:
default:
config:
resources:
requests:
cpu: 200m
memory: 2Gi
limits:
cpu: "4"
memory: 2Gi
storage:
data:
capacity: 2Gi
The default values are most likely not sufficient to run a proper cluster in production. Please adapt according to your requirements. |
For more details regarding Kubernetes CPU limits see: Assign CPU Resources to Containers and Pods.