Connecting to Hive

Prerequisites

The Hive Metastore and HDFS must be reachable over the network from the PuppyGraph instance.
PuppyGraph supports Kerberos authentication for Hive. If your Hive server is Kerberized, you need to set up the application configuration before starting PuppyGraph. See Querying Kerberized Hive Data as a Graph for more details.

Configuration

Hive Metastore Configuration

Configuration	Explanation
Hive metastore URI	The URI of your Hive metastore. Format: `thrift://<metastore_IP_address>:<metastore_port>`.

Tutorials

See Querying Kerberized Hive Data as a Graph for a tutorial on how to query Kerberized Hive data as a graph in production.

The demo below is a second tutorial. It shows how to connect PuppyGraph to tables managed by an existing Apache Hive server.

Demo

Assumptions

This guide demonstrates how to connect PuppyGraph to tables managed by an existing Apache Hive server.

It assumes the Hive server is running at localhost:10000, with its metastore service available at localhost:9083. If you need help setting up a Hive server with a metastore, refer to the official documentation at https://hive.apache.org/.

Additionally, the Hive data should reside on HDFS, and the metastore and HDFS ports must be reachable from the PuppyGraph instance.
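Before going further, it can help to confirm that the ports are reachable from the machine that runs PuppyGraph. The following is a minimal sketch in Python using only the standard library. The helper name is_port_open is hypothetical, and the host and ports are the demo defaults stated above. HDFS is left out because this guide does not state its port.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Demo defaults from this guide; adjust them for your deployment.
    for name, port in [("HiveServer2", 10000), ("Hive metastore", 9083)]:
        state = "reachable" if is_port_open("localhost", port) else "NOT reachable"
        print(f"{name} (localhost:{port}): {state}")
```

A successful connection only shows that something is listening on the port. It does not prove that the listener is the Hive service.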

PuppyGraph is assumed to be deployed at http://localhost:8081, following one of the instructions in the Installation section. For this demo, the username is puppygraph and the password is puppygraph123.

Prepare Data

This guide creates the following two tables in the Hive database hive_onhdfs. Feel free to skip this step if you would rather query existing tables.

Person Table

ID	Age	Name
v1	29	marko
v2	27	vadas

Referral Table

RefID	Source	Referred	Weight
e1	v1	v2	0.5

Use the Hive beeline client to connect to the Hive Server. The command assumes that the Hive home path is /opt/hive. If the Hive Server is not at localhost, change the URL accordingly.

/opt/hive/bin/beeline -u 'jdbc:hive2://localhost:10000/'

Create the tables by typing the following statements in the Hive beeline console.

CREATE DATABASE hive_onhdfs;
CREATE TABLE hive_onhdfs.person (ID string, age int, name string);
CREATE TABLE hive_onhdfs.referral (refId string, source string, referred string, weight double);
INSERT INTO hive_onhdfs.person VALUES ('v1', 29, 'marko'), ('v2', 27, 'vadas');
INSERT INTO hive_onhdfs.referral VALUES ('e1', 'v1', 'v2', 0.5);

Define the Graph

We then define a graph on top of the Hive tables we just created. Create a PuppyGraph schema file named hive_hdfs.json with the following content:

hive_hdfs.json

{
  "catalogs": [
    {
      "name": "hive_hdfs",
      "type": "hive",
      "metastore": {
        "type": "HMS",
        "hiveMetastoreUrl": "thrift://localhost:9083"
      }
    }
  ],
  "vertices": [
    {
      "label": "person",
      "mappedTableSource": {
        "catalog": "hive_hdfs",
        "schema": "hive_onhdfs",
        "table": "person",
        "metaFields": {
          "id": "id"
        }
      },
      "attributes": [
        {
          "type": "Int",
          "name": "age"
        },
        {
          "type": "String",
          "name": "name"
        }
      ]
    }
  ],
  "edges": [
    {
      "label": "knows",
      "mappedTableSource": {
        "catalog": "hive_hdfs",
        "schema": "hive_onhdfs",
        "table": "referral",
        "metaFields": {
          "id": "refId",
          "from": "source",
          "to": "referred"
        }
      },
      "from": "person",
      "to": "person",
      "attributes": [
        {
          "type": "Double",
          "name": "weight"
        }
      ]
    }
  ]
}

The schema defines a Hive Catalog:

{
  "name": "hive_hdfs",
  "type": "hive",
  "metastore": {
    "type": "HMS",
    "hiveMetastoreUrl": "thrift://localhost:9083"
  }
}

The name hive_hdfs identifies the catalog within the JSON schema. The definitions of nodes (vertices) and edges refer to it through their catalog field.
The catalog type must be hive, and its metastore type has to be HMS.
The metastore.hiveMetastoreUrl specifies the URL of the Hive Metastore Service. Change the hostname accordingly if it is not deployed at localhost.
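The three rules above can be checked mechanically before uploading. The following is a rough sketch, not a PuppyGraph tool: the function check_hive_schema is hypothetical, it assumes every catalog in the file is a Hive catalog as in this demo, and it checks nothing beyond these three rules.

```python
def check_hive_schema(schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    catalogs = {c.get("name"): c for c in schema.get("catalogs", [])}

    for name, catalog in catalogs.items():
        if catalog.get("type") != "hive":
            problems.append(f"catalog {name!r}: type must be 'hive'")
        if catalog.get("metastore", {}).get("type") != "HMS":
            problems.append(f"catalog {name!r}: metastore type must be 'HMS'")

    # Every vertex and edge must point at a catalog defined above.
    for kind in ("vertices", "edges"):
        for element in schema.get(kind, []):
            ref = element.get("mappedTableSource", {}).get("catalog")
            if ref not in catalogs:
                problems.append(
                    f"{kind} {element.get('label')!r}: unknown catalog {ref!r}"
                )
    return problems


# Trimmed copy of the demo schema (metaFields and attributes omitted).
demo = {
    "catalogs": [
        {
            "name": "hive_hdfs",
            "type": "hive",
            "metastore": {"type": "HMS", "hiveMetastoreUrl": "thrift://localhost:9083"},
        }
    ],
    "vertices": [{"label": "person", "mappedTableSource": {"catalog": "hive_hdfs"}}],
    "edges": [{"label": "knows", "mappedTableSource": {"catalog": "hive_hdfs"}}],
}
print(check_hive_schema(demo))  # prints []
```

To check the real file, load hive_hdfs.json with json.load from the standard library and pass the resulting dictionary to the function.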

Once the schema file hive_hdfs.json is created, upload it in the PuppyGraph Web GUI at http://localhost:8081 or using the following shell command:

curl -XPOST -H "content-type: application/json" --data-binary @./hive_hdfs.json --user "puppygraph:puppygraph123" localhost:8081/schema
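If you prefer to script the upload, the curl command can be mirrored with the Python standard library. This is a sketch under the same assumptions as the command above (the /schema endpoint, HTTP basic authentication, and the demo credentials). The function names are hypothetical, and the response body is returned as raw text because this guide does not describe its format.

```python
import base64
import urllib.request


def build_schema_upload(schema_bytes: bytes, base_url: str,
                        user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that uploads a schema."""
    # Same credentials encoding that curl --user produces for basic auth.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/schema",
        data=schema_bytes,  # sent unmodified, like curl --data-binary
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def upload_schema(schema_path: str, base_url: str = "http://localhost:8081",
                  user: str = "puppygraph", password: str = "puppygraph123"):
    """Send the schema file and return (HTTP status, response text)."""
    with open(schema_path, "rb") as f:
        request = build_schema_upload(f.read(), base_url, user, password)
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

Calling upload_schema("hive_hdfs.json") from the directory that contains the schema file sends the same request as the curl command.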

Query the Graph

Connect to the PuppyGraph Web GUI at http://localhost:8081 and start a Gremlin console by clicking the "Start query" button:

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
Welcome to PuppyGraph!
...
gremlin>

We are now connected to the Gremlin Console. To query the graph built on top of the Hive tables, run the following query, which returns the names of people who are known by someone:

g.V().hasLabel("person").out("knows").values("name")

The output looks like this:

gremlin> g.V().hasLabel("person").out("knows").values("name")
==>vadas
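To see why the only result is vadas, the traversal can be mimicked over the two sample tables in plain Python. This only illustrates what the query asks for; it is not how PuppyGraph executes it, and the function name is hypothetical.

```python
# Sample rows from the Prepare Data step.
person = {
    "v1": {"age": 29, "name": "marko"},
    "v2": {"age": 27, "name": "vadas"},
}
referral = [{"refId": "e1", "source": "v1", "referred": "v2", "weight": 0.5}]


def names_known_by_someone(person: dict, referral: list) -> list:
    """Mimic g.V().hasLabel("person").out("knows").values("name")."""
    names = []
    for vertex_id in person:  # g.V().hasLabel("person"): start at every person
        for edge in referral:  # out("knows"): follow edges leaving this person
            if edge["source"] == vertex_id:
                # values("name"): read the name of the vertex the edge points to
                names.append(person[edge["referred"]]["name"])
    return names


print(names_known_by_someone(person, referral))  # prints ['vadas']
```

The single edge e1 runs from v1 (marko) to v2 (vadas), so vadas is the only person reached by an outgoing knows edge.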

FAQ

MetaException: Got exception: java.net.URISyntaxException Illegal character in hostname at index.

Check that your Hive metastore URL contains both the IP address (or fully qualified domain name) and the port of the metastore host, for example thrift://127.0.0.1:9083 or thrift://hms.puppygraph:9085.
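The check can be automated before the URL goes into the schema file. The sketch below uses Python's urllib.parse; the function name check_metastore_uri is hypothetical. The character test on the host is a conservative heuristic assumed for this sketch (letters, digits, dots, and hyphens only), not the exact rule applied by the Java URI parser that raises the exception, and it would also flag IPv6 literals.

```python
import re
from typing import List
from urllib.parse import urlsplit

# Heuristic host pattern (an assumption of this sketch, see the text above).
_HOST_PATTERN = re.compile(r"^[A-Za-z0-9.-]+$")


def check_metastore_uri(uri: str) -> List[str]:
    """Return a list of problems found in a thrift://host:port URI."""
    problems = []
    parts = urlsplit(uri)
    if parts.scheme != "thrift":
        problems.append(f"scheme is {parts.scheme!r}, expected 'thrift'")

    host = parts.hostname
    if not host:
        problems.append("host is missing")
    elif not _HOST_PATTERN.match(host):
        problems.append(f"host {host!r} contains unexpected characters")

    try:
        port = parts.port  # raises ValueError for a non-numeric port
    except ValueError:
        problems.append("port is not a valid number")
    else:
        if port is None:
            problems.append("port is missing")
    return problems


for candidate in ("thrift://127.0.0.1:9083", "thrift://my_host:9083", "thrift://127.0.0.1"):
    print(candidate, "->", check_metastore_uri(candidate) or "looks fine")
```

An empty list means the URI has the thrift scheme, a plain host, and a numeric port. It does not confirm that a metastore is actually listening there.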