Connecting to Hive

Prerequisites

  • Hive Metastore and HDFS are accessible over network from the PuppyGraph instance.

Configuration

Hive Metastore Configuration

Demo

Prerequisites

In this guide, PuppyGraph connects to the tables in an existing Apache Hive server.

The guide assumes that the Hive Server is at localhost:10000 and its metastore service is at localhost:9083. See https://hive.apache.org/ for more information if you need to start a Hive server with metastore.

Moreover, the Hive data should be stored on HDFS whose ports need to be accessible.

The guide also assumes that PuppyGraph has been deployed at http://localhost:8081 according to one of guides in Launching PuppyGraph from AWS Marketplace or Launching PuppyGraph in Docker. In this demo, it uses the username puppygraph and password puppygraph123.

Prepare Data (Optional)

The guide will create the following two tables in the Hive database hive_onhdfs. Feel free to skip this step if you would like to query some existing tables.

Use the Hive beeline client to connect to the Hive Server. The command assumes that the Hive home path is /opt/hive. If the Hive Server is not at localhost, change the URL accordingly.

/opt/hive/bin/beeline -u 'jdbc:hive2://localhost:10000/'

Create the tables by typing the following statements in the Hive beeline console.

CREATE DATABASE hive_onhdfs;
CREATE TABLE hive_onhdfs.person (ID string, age int, name string);
CREATE TABLE hive_onhdfs.referral (refId string, source string, referred string, weight double);
INSERT INTO hive_onhdfs.person VALUES ('v1', 29, 'marko'), ('v2', 27, 'vadas');
INSERT INTO hive_onhdfs.referral VALUES ('e1', 'v1', 'v2', 0.5);

Define the Graph

We then defines a graph on top of the Hive tables we just created. Create a PuppyGraph schema file named hive_hdfs.json with the following content:

hive_hdfs.json
{
  "catalogs": [
    {
      "name": "hive_hdfs",
      "type": "hive",
      "metastore": {
        "type": "HMS",
        "hiveMetastoreUrl": "thrift://localhost:9083"
      }
    }
  ],
  "vertices": [
    {
      "label": "person",
      "mappedTableSource": {
        "catalog": "hive_hdfs",
        "schema": "hive_onhdfs",
        "table": "person",
        "metaFields": {
          "id": "id"
        }
      },
      "attributes": [
        {
          "type": "Int",
          "name": "age"
        },
        {
          "type": "String",
          "name": "name"
        }
      ]
    }
  ],
  "edges": [
    {
      "label": "knows",
      "mappedTableSource": {
        "catalog": "hive_hdfs",
        "schema": "hive_onhdfs",
        "table": "referral",
        "metaFields": {
          "id": "refId",
          "from": "source",
          "to": "referred"
        }
      },
      "from": "person",
      "to": "person",
      "attributes": [
        {
          "type": "Double",
          "name": "weight"
        }
      ]
    }
  ]
}

The schema defines a Hive Catalog:

{
  "name": "hive_hdfs",
  "type": "hive",
  "metastore": {
    "type": "HMS",
    "hiveMetastoreUrl": "thrift://localhost:9083"
  }
}
  • The name hive_hdfs defines a reference within the JSON schema. It is used by the definition of vertices and edges.

  • The catalog type must be hive, and its metastore type has to be HMS.

  • The metastore.hiveMetastoreUrl specifies the URL of the Hive Metastore Service. Change the hostname accordingly if it is not deployed at localhost.

Once the schema file hive_hdfs.json is created, upload it in the PuppyGraph Web GUI at http://localhost:8081 or using the following shell command:

curl -XPOST -H "content-type: application/json" --data-binary @./hive_hdfs.json --user "puppygraph:puppygraph123" localhost:8081/schema

Query the Graph

Connect to PuppyGraph Web GUI at http://localhost:8081 and start a gremlin console by clicking at the "Start query" button:

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
Welcome to PuppyGraph!
...
gremlin> 

Now we have connected to the Gremlin Console. In order to query the graph on top of the Hive tables, we run the following query which finds out the names of people known by someone:

g.V().hasLabel("person").out("knows").values("name")

The result is like this:

gremlin> g.V().hasLabel("person").out("knows").values("name")
==>vadas

Last updated