Querying Hive Data as a Graph
Summary
In this tutorial, you will:
- Create a Hive database on HDFS with Kerberos, and load it with example data;
- Start a PuppyGraph Docker container and query the Hive data as a graph.
Prerequisites
Docker
Please ensure that docker
is available.
Docker is required to run the PuppyGraph server. You can download Docker from here.
The installation can be verified by running:
Hive Deployment and configuration preparation
Run the following command to create docker network and volume. In this guide We will use the volume to share files between Hive and PuppyGraph containers.
Start a Kerberized Hive server in docker.
docker run -d --name datametaserver \
--hostname datametaserver.com \
-p 88:88 -p 9000:9000 -p 9083:9083 -p 10000:10000 -p 10002:10002 \
--network puppy-hive \
-v puppy-hive:/home/share \
puppygraph/hive:kerberos-hdfs-1.0
Enter the Hive docker container, and copy XML configuration files to the share folder. Replace localhost
with the container host name datametaserver.com
in krb5.conf
because PuppyGraph will use this configuration to connect to KDC server in Hive docker container. Replace krb5cc_cli_
with krb5cc_
because PuppyGraph reads tickets from file cache which locate in /tmp/krb5cc_
.
docker exec -it datametaserver bash
cp /usr/local/hive/conf/hive-site.xml /home/share/
cp /usr/local/hadoop/etc/hadoop/core-site.xml /home/share/
cp /usr/local/hadoop/etc/hadoop/hdfs-site.xml /home/share/
cp /etc/krb5.conf /home/share/
sed -i 's/localhost/datametaserver.com/g' /home/share/krb5.conf
sed -i "s/krb5cc_cli_/krb5cc_/" /home/share/krb5.conf
Generate keytab file for PuppyGraph in Hive docker container.
docker exec -it datametaserver bash
kadmin.local -q "add_principal puppygraph/puppygraph.com"
kadmin.local -q "ktadd -k /home/share/puppygraph.keytab puppygraph/puppygraph.com@HADOOPKRB"
After executing these command, there should be 5 files in the /home/share
folder
root@datametaserver:/# ls /home/share/
core-site.xml hdfs-site.xml hive-site.xml krb5.conf puppygraph.keytab
Data Preparation
This tutorial is designed to be comprehensive and standalone, so it includes steps to populate data in Hive. In practical scenarios, PuppyGraph can query data directly from your existing Hive databases.
Run the following command to start a beeline shell:
Then execute the following SQL statements in the shell to create tables and insert data.
CREATE database modern_demo location "hdfs://datametaserver.com:9000/user/hive/warehouse/modern_demo.db";
create table modern_demo.person (id string, name string, age int) location "hdfs://datametaserver.com:9000/user/hive/warehouse/modern_demo.db/person";
insert into modern_demo.person values
('v1', 'marko', 29),
('v2', 'vadas', 27),
('v4', 'josh', 32),
('v6', 'peter', 35);
create table modern_demo.software (id string, name string, lang string) location "hdfs://datametaserver.com:9000/user/hive/warehouse/modern_demo.db/software";
insert into modern_demo.software values
('v3', 'lop', 'java'),
('v5', 'ripple', 'java');
create table modern_demo.created (id string, from_id string, to_id string, weight double) location "hdfs://datametaserver.com:9000/user/hive/warehouse/modern_demo.db/created";
insert into modern_demo.created values
('e9', 'v1', 'v3', 0.4),
('e10', 'v4', 'v5', 1.0),
('e11', 'v4', 'v3', 0.4),
('e12', 'v6', 'v3', 0.2);
create table modern_demo.knows (id string, from_id string, to_id string, weight double) location "hdfs://datametaserver.com:9000/user/hive/warehouse/modern_demo.db/knows";
insert into modern_demo.knows values
('e7', 'v1', 'v2', 0.5),
('e8', 'v1', 'v4', 1.0);
The above SQL creates the following tables:
id | name | age |
---|---|---|
v1 | marko | 29 |
v2 | vadas | 27 |
v4 | josh | 32 |
v6 | peter | 35 |
id | name | lang |
---|---|---|
v3 | lop | java |
v5 | ripple | java |
id | from_id | to_id | weight |
---|---|---|---|
e7 | v1 | v2 | 0.5 |
e8 | v1 | v4 | 1.0 |
id | from_id | to_id | weight |
---|---|---|---|
e9 | v1 | v3 | 0.4 |
e10 | v4 | v5 | 1.0 |
e11 | v4 | v3 | 0.4 |
e12 | v6 | v3 | 0.2 |
Type !exit
to quit the beeline shell.
PuppyGraph Deployment
Run the follow commands to start PuppyGraph
docker run -d --name puppygraph \
--hostname puppygraph.com \
-p 8081:8081 -p 8182:8182 \
--network puppy-hive \
-v puppy-hive:/home/share \
-e AUTHENTICATION_HADOOP_CORESITEXMLPATH=/home/share/core-site.xml \
-e AUTHENTICATION_HADOOP_HIVESITEXMLPATH=/home/share/hive-site.xml \
-e AUTHENTICATION_HADOOP_HDFSSITEXMLPATH=/home/share/hdfs-site.xml \
-e AUTHENTICATION_KERBEROS_CONFIGPATH=/home/share/krb5.conf \
puppygraph/puppygraph:stable
Enter PuppyGraph container to init Kerberos ticket.
docker exec -uroot -it puppygraph bash
export KRB5_CONFIG=/home/share/krb5.conf
kinit -kt /home/share/puppygraph.keytab puppygraph/puppygraph.com@HADOOPKRB
After these operations, execute klist
inside the container to confirm PuppyGraph container get the correct tickets.
root@puppygraph:/home/ubuntu# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: puppygraph/puppygraph.com@HADOOPKRB
Valid starting Expires Service principal
01/10/25 01:56:32 01/11/25 01:56:32 krbtgt/HADOOPKRB@HADOOPKRB
renew until 01/17/25 01:56:32
Modeling a Graph
We then define a graph on top of the data tables we just created. Actually, this is the "Modern" graph defined by Apache Tinkerpop.
A schema instructs PuppyGraph on mapping data from the Hive into a graph. PuppyGraph offers various methods for schema creation. For this tutorial, we've already prepared a schema to help save time.
Create a PuppyGraph schema file schema.json
with the following content:
schema.json
{
"catalogs": [
{
"name": "puppygraph",
"type": "hive",
"metastore": {
"type": "HMS",
"hiveMetastoreUrl": "thrift://datametaserver.com:9083"
}
}
],
"vertices": [
{
"label": "person",
"attributes": [
{
"type": "String",
"name": "name"
},
{
"type": "Int",
"name": "age"
}
],
"mappedTableSource": {
"catalog": "puppygraph",
"schema": "modern_demo",
"table": "person",
"metaFields": {
"id": "id"
}
}
},
{
"label": "software",
"attributes": [
{
"type": "String",
"name": "name"
},
{
"type": "String",
"name": "lang"
}
],
"mappedTableSource": {
"catalog": "puppygraph",
"schema": "modern_demo",
"table": "software",
"metaFields": {
"id": "id"
}
}
}
],
"edges": [
{
"label": "created",
"from": "person",
"to": "software",
"attributes": [
{
"type": "Double",
"name": "weight"
}
],
"mappedTableSource": {
"catalog": "puppygraph",
"schema": "modern_demo",
"table": "created",
"metaFields": {
"from": "from_id",
"id": "id",
"to": "to_id"
}
}
},
{
"label": "knows",
"from": "person",
"to": "person",
"attributes": [
{
"type": "Double",
"name": "weight"
}
],
"mappedTableSource": {
"catalog": "puppygraph",
"schema": "modern_demo",
"table": "knows",
"metaFields": {
"from": "from_id",
"id": "id",
"to": "to_id"
}
}
}
]
}
Log into PuppyGraph Web UI at http://localhost:8081 with username puppygraph
and password puppygraph123
.
Upload the schema by selecting the file schema.json
in the Upload Graph Schema JSON
block and clicking on Upload
.
Once the schema is uploaded, the schema page shows the visualized graph schema as follows.
Alternative: Schema Uploading via CLI
Alternatively, run the following command to upload the schema file:
curl -XPOST -H "content-type: application/json" --data-binary @./schema.json --user "puppygraph:puppygraph123" localhost:8081/schema
The response shows that graph schema has been uploaded successfully:
Querying the Graph
In this tutorial we will use the Gremlin query language to query the Graph. Gremlin is a graph query language developed by Apache TinkerPop. Prior knowledge of Gremlin is not necessary to follow the tutorial. To learn more about it, visit https://tinkerpop.apache.org/gremlin.html.
Click on the Query
panel the left side. The Gremlin Query tab offers an interactive environment for querying the graph using Gremlin.
Queries are entered on the left side, and the right side displays the graph visualization.
The first query retrieves the property of the person named "marko".
Copy the following query, paste it in the query input, and click on the run button.
The output is plain text like the following:
Now let's also leverage the visualization. The next query gets all the software created by people known to "marko".
Copy the following query, paste it in the query input, and click on the run button.
The output is as follows. There are two paths in the result as "marko" knows "josh" who created "lop" and "ripple".
Alternative: Querying the graph via CLI
Alternatively, we can query the graph via CLI.
Execute the following command to access the PuppyGraph Gremlin CLI
The welcome screen appears as follows:
____ ____ _
| _ \ _ _ _ __ _ __ _ _ / ___| _ __ __ _ _ __ | |__
| |_) | | | | | | '_ \ | '_ \ | | | | | | _ | '__| / _` | | '_ \ | '_ \
| __/ | |_| | | |_) | | |_) | | |_| | | |_| | | | | (_| | | |_) | | | | |
|_| \__,_| | .__/ | .__/ \__, | \____| |_| \__,_| | .__/ |_| |_|
|_| |_| |___/ |_|
Welcome to PuppyGraph!
version: 0.10
puppy-gremlin>
Run the following queries in the console to query the Graph.
Properties of the person named "marko":
To exit PuppyGraph Gremlin Console, enter the command:
Cleaning up
Run the following command to shut down and remove the containers:
Remove the data volume if needed: