Connecting to Hive
Prerequisites
- Hive Metastore and HDFS are accessible over network from the PuppyGraph instance.
- PuppyGraph supports Kerberos authentication for Hive. If your Hive server is Kerberized, you need to set up the application configuration before starting it. See Querying Kerberized Hive Data as a Graph for more details.
Configuration
Hive Metastore Configuration
Configuration | Explanation |
---|---|
Hive metastore URI | The URI of your Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port> . |
Tutorials
See Querying Kerberized Hive Data as a Graph for a tutorial on how to query Kerberized Hive data as a graph in production.
Here is another tutorial on how to connect PuppyGraph to tables managed by an existing Apache Hive server.
Demo
Assumptions
This guide demonstrates how to connect PuppyGraph to tables managed by an existing Apache Hive server.
It assumes the Hive server is running at localhost:10000
, with its metastore service available at localhost:9083
. If you need help setting up a Hive server with a metastore, refer to the official documentation at https://hive.apache.org/.
Additionally, Hive data should reside on HDFS, and the necessary ports must be accessible.
PuppyGraph is assumed to be deployed at http://localhost:8081, following one of the instructions in the Installation section. For this demo, the username is puppygraph
and the password is puppygraph123
.
Prepare Data
The guide will create the following two tables in the Hive database hive_onhdfs
. Feel free to skip this step if you would like to query some existing tables.
ID | Age | Name |
---|---|---|
v1 | 29 | marko |
v2 | 27 | vadas |
RefID | Source | Referred | Weight |
---|---|---|---|
e1 | v1 | v2 | 0.5 |
Use the Hive beeline client to connect to the Hive Server. The command assumes that the Hive home path is /opt/hive
. If the Hive Server is not at localhost
, change the URL accordingly.
Create the tables by typing the following statements in the Hive beeline console.
CREATE DATABASE hive_onhdfs;
CREATE TABLE hive_onhdfs.person (ID string, age int, name string);
CREATE TABLE hive_onhdfs.referral (refId string, source string, referred string, weight double);
INSERT INTO hive_onhdfs.person VALUES ('v1', 29, 'marko'), ('v2', 27, 'vadas');
INSERT INTO hive_onhdfs.referral VALUES ('e1', 'v1', 'v2', 0.5);
Define the Graph
We then define a graph on top of the Hive tables we just created. Create a PuppyGraph schema file named hive_hdfs.json
with the following content:
hive_hdfs.json
{
"catalogs": [
{
"name": "hive_hdfs",
"type": "hive",
"metastore": {
"type": "HMS",
"hiveMetastoreUrl": "thrift://localhost:9083"
}
}
],
"vertices": [
{
"label": "person",
"mappedTableSource": {
"catalog": "hive_hdfs",
"schema": "hive_onhdfs",
"table": "person",
"metaFields": {
"id": "id"
}
},
"attributes": [
{
"type": "Int",
"name": "age"
},
{
"type": "String",
"name": "name"
}
]
}
],
"edges": [
{
"label": "knows",
"mappedTableSource": {
"catalog": "hive_hdfs",
"schema": "hive_onhdfs",
"table": "referral",
"metaFields": {
"id": "refId",
"from": "source",
"to": "referred"
}
},
"from": "person",
"to": "person",
"attributes": [
{
"type": "Double",
"name": "weight"
}
]
}
]
}
The schema defines a Hive Catalog:
{
"name": "hive_hdfs",
"type": "hive",
"metastore": {
"type": "HMS",
"hiveMetastoreUrl": "thrift://localhost:9083"
}
}
- The name
hive_hdfs
defines a reference within the JSON schema. It is used by the definition of nodes (vertices) and edges. - The catalog type must be
hive
, and its metastore type has to beHMS
. - The
metastore.hiveMetastoreUrl
specifies the URL of the Hive Metastore Service. Change the hostname accordingly if it is not deployed at localhost.
Once the schema file hive_hdfs.json
is created, upload it in the PuppyGraph Web GUI at http://localhost:8081 or using the following shell command:
curl -XPOST -H "content-type: application/json" --data-binary @./hive_hdfs.json --user "puppygraph:puppygraph123" localhost:8081/schema
Query the Graph
Connect to PuppyGraph Web GUI at http://localhost:8081 and start a gremlin console by clicking at the "Start query" button:
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
Welcome to PuppyGraph!
...
gremlin>
Now we have connected to the Gremlin Console. In order to query the graph on top of the Hive tables, we run the following query which finds out the names of people known by someone:
The result is like this: