trino-taxi-data
This guide assumes that you already have the demo |
This demo will
-
Install the required Stackable operators
-
Spin up the follow data products
-
Superset: A modern data exploration and visualization platform. This demo utilizes Superset to retrieve data from Trino via SQL queries and build dashboards on top of that data
-
Trino: A fast distributed SQL query engine for big data analytics that helps you explore your data universe. This demo uses it to enable SQL access to the data
-
MinIO: A S3 compatible object store. This demo uses it as persistent storage to store all the data used
-
Hive metastore: A service that stores metadata related to Apache Hive and other services. This demo uses it as metadata storage for Trino
-
Open policy agent (OPA): An open source, general-purpose policy engine that unifies policy enforcement across the stack. This demo uses it as the authorizer for Trino, which decides which user is able to query which data.
-
-
Load testdata into S3. It contains 2.5 years of New York City taxi trips
-
Make data accessible via SQL in Trino
-
Create Superset dashboards for visualization of the data
You can see the deployed products as well as their relationship in the following diagram:
List deployed Stackable services
To list the installed Stackable services run the following command:
$ stackablectl services list --all-namespaces
PRODUCT NAME NAMESPACE ENDPOINTS EXTRA INFOS
hive hive default hive 172.18.0.5:30298
metrics 172.18.0.5:31633
opa opa default http http://172.18.0.5:30316
superset superset default external-superset http://172.18.0.4:32295 Admin user: admin, password: adminadmin
trino trino default coordinator-http http://172.18.0.3:30167
coordinator-metrics 172.18.0.3:31818
coordinator-https https://172.18.0.3:30141
minio minio-trino default http http://172.18.0.3:31062 Third party service
console-http http://172.18.0.3:30503 Admin user: admin, password: adminadmin
When a product instance has not finished starting yet, the service will have no endpoint. Starting all the product instances might take an considerable amount of time depending on your internet connectivity. In case the product is not ready yet a warning might be shown. |
Inspect data in S3
The S3 provided by MinIO is used as a persistent storage to store all the data used.
You can have a look at the testdata within the MinIO web interface by opening the given minio
endpoint console-http
from your stackablectl services list
command output.
You have to use the endpoint from your command output, in this case it is http://172.18.0.3:30503.
Open it with your favorite browser.
Log in with the username admin
and password adminadmin
.
Click on the blue button Browse
on the bucket demo
and open the folders ny-taxi-data
→ raw
.
As you can see the demo uploaded 1GB of parquet files, one file per month.
The data contain taxi rides in New York City.
You can see the file size (and therefore the number of rides) decrease drastically because of the Covid-19 pandemic starting from 2020-03
.
Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
Use Trino webinterface
Trino offers SQL access to the data within S3.
Open the trino
endpoint coordinator-https
in your browser (https://172.18.0.3:30141
in this case).
If you get a warning regarding the self signed certificate (e.g. Warning: Potential Security Risk Ahead
) you have to tell your browser to trust the website and continue.
Log in with the username admin
and password adminadmin
.
When you start executing SQL queries you will see the queries getting processed here.
Use Superset webinterface
Superset gives the ability to execute SQL queries and build dashboards.
Open the superset
endpoint external-superset
in your browser (http://172.18.0.4:32295
in this case).
Log in with the username admin
and password adminadmin
.
View the dashboard
On the top click on the tab Dashboards
.
Click on the dashboard called Taxi data
.
It might take some time until the dashboards renders all the included charts.
You can clearly see the impact of Covid-19 on the taxi business.
Execute arbitrary SQL statements
Within Superset you can not only create dashboards but also run arbitrary SQL statements.
On the top click on the tab SQL Lab
→ SQL Editor
.
On the left select the database Trino
, the schema demo
and set See table schema
to ny_taxi_data
.
On the right textbox enter the desired SQL statement. If you do not want to make on up you can use the following:
select
format_datetime(tpep_pickup_datetime, 'YYYY/MM') as month,
count(*) as trips,
sum(total_amount) as sales,
avg(duration_min) as avg_duration_min
from ny_taxi_data
group by 1
order by 1
Summary
The demo loaded 2.5 years of taxi trip data from New York City with 68 million records and a total size of 1GB in parquet files. The data was put into the S3 storage. Trino enables you to query the data via SQL. Superset was used as a web-based frontend to execute SQL statements and build dashboards.
Where to go from here
There are multiple paths to go from here. The following sections can give you some ideas on what to explore next. You can find the description of the taxi data on the New York City website.
Execute arbitrary SQL statements
Within Superset you can execute arbitrary SQL statements to explore the taxi data. Can you answer the following questions by executing SQL statements? The Trino documentation on their SQL language might help you.
How many taxi trips there where in the year 2021?
See the answer
select
count(*) as trips
from ny_taxi_data
where year(tpep_pickup_datetime) = 2021
returns 30.903.982 trips.
What was the maximum amount of passengers?
See the answer
select
max(passenger_count) as max_passenger_count
from ny_taxi_data;
returns 112 passengers.
Well that’s weird. Lets examine the passengers distribution.
select
passenger_count,
count(*) as frequency
from ny_taxi_data
group by 1
order by 1 desc
limit 100
returns
passenger_count | frequency
-----------------+-----------
112.0 | 1
96.0 | 1
9.0 | 98
8.0 | 156
7.0 | 229
6.0 | 1089568
5.0 | 1715439
4.0 | 1052834
3.0 | 2504112
2.0 | 9575299
1.0 | 48133494
0.0 | 1454268
NULL | 2698591
We can see that one trip had 112 and another one 96 passengers. All the other trips start with a more "realistic" number of 9 passengers.
As a bonus question: What exactly did the large number of passenger do?
select *
from ny_taxi_data
where passenger_count > 50
returns
vendorid | tpep_pickup_datetime | tpep_dropoff_datetime | duration_min | passenger_count | trip_distance | payment_type | fare_amount | tip_amount | total_amount
----------+-------------------------+-------------------------+--------------+-----------------+---------------+--------------+-------------+------------+--------------
2 | 2021-08-01 19:47:43.000 | 2021-08-01 19:57:54.000 | 10 | 112.0 | 1.8 | Credit card | 9.0 | 2.46 | 14.76
2 | 2021-08-03 11:51:58.000 | 2021-08-03 12:09:29.000 | 17 | 96.0 | 1.56 | Credit card | 11.5 | 2.22 | 17.02
Pretty cheap for that amount of people! This probably are invalid records.
What was the highest tip (measured in percentage of the original fee) ever given?
See the answer
select
total_amount as fee,
tip_amount as tip,
tip_amount / total_amount * 100 as tip_percentage
from ny_taxi_data
where total_amount > 0
order by 3 desc
limit 5
returns
fee | tip | tip_percentage
------+------+--------------------
4.2 | 10.0 | 238.0952380952381
18.2 | 25.0 | 137.36263736263737
8.24 | 9.24 | 112.13592233009709
0.66 | 0.66 | 100.0
0.01 | 0.01 | 100.0
Create additional dashboards
You also have the possibility to create additional charts and bundle them together in a Dashboard. Have a look at the Superset documentation on how to do that.
Load additional data
You can use the MinIO webinterface to upload additional data.
As an alternative you can use the S3 API with an S3 client like s3cmd.
It is recommended to put the data into a folder (prefix) in the demo
bucket.
Have a look at the defined tables inside the hive
.demo
schema on how to inform Trino about the newly available data.
Table definitions
show create table hive.demo.ny_taxi_data_raw
produces something like
CREATE TABLE IF NOT EXISTS hive.demo.ny_taxi_data_raw (
VendorID BIGINT,
tpep_pickup_datetime TIMESTAMP,
tpep_dropoff_datetime TIMESTAMP,
passenger_count DOUBLE,
trip_distance DOUBLE,
payment_type BIGINT,
Fare_amount DOUBLE,
Tip_amount DOUBLE,
Total_amount DOUBLE
) WITH (
external_location = 's3a://demo/ny-taxi-data/raw/',
format = 'parquet'
)
If you want to transform or filter your data in any way before using it e.g. in Superset you can create a view as follows:
show create view hive.demo.ny_taxi_data
produces something like
create or replace view hive.demo.ny_taxi_data as
select
vendorid,
tpep_pickup_datetime,
tpep_dropoff_datetime,
date_diff('minute', tpep_pickup_datetime, tpep_dropoff_datetime) as duration_min,
passenger_count,
trip_distance,
case payment_type when 1 then 'Credit card' when 2 then 'Cash' when 3 then 'No charge' when 4 then 'Dispute' when 6 then 'Voided trino' else 'Unknown' end as payment_type,
fare_amount,
tip_amount,
total_amount
from hive.demo.ny_taxi_data_raw
where tpep_pickup_datetime >= from_iso8601_timestamp('2019-12-01T00:00:00')
and tpep_pickup_datetime <= from_iso8601_timestamp('2022-05-31T00:00:00')
Connect to Trino via CLI, Python or DBeaver
If you prefer running your SQL statements via command-line, a Python script or a graphical Database manager like DBeaver please have a look at the the Trino documentation on how to do that.