Connecting to Delta Lake

It is also possible to query existing data on Delta Lake directly without loading it to PuppyGraph. Here is an demo of the feature.

In the demo, the delta lake data source stores people and referral information. To query the data as a graph, we model people as vertices and the referral relationship between people as edges.

Prerequisites

The demo assumes that PuppyGraph has been deployed at localhost according to the instruction in Launching PuppyGraph from AWS Marketplace or Launching PuppyGraph in Docker.

In this demo, we use the username puppygraph and password puppygraph123.

Data Preparation (Optional)

IDAgeName

v1

29

marko

v2

27

vadas

The demo uses people and referral information as shown above.

Here is the shell command to start a SparkSQL instance for data preparation assuming that the delta lake data are stored on HDFS at 172.31.19.123:9000 and the Hive metastore is at 172.31.31.125:9083.

spark-sql \
--packages io.delta:delta-core_2.12:2.3.0 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.hadoop.fs.defaultFS=hdfs://172.31.19.123:9000 \
--conf spark.sql.warehouse.dir=hdfs://172.31.19.123:9000/spark-warehouse \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.puppy_delta=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.puppy_delta.type=hive \
--conf spark.sql.catalog.puppy_delta.uri=thrift://172.31.31.125:9083

Now we can use the following SparkSQL query to create data in the database onhdfs. The catalog name is puppy_delta as specified in the command above.

CREATE DATABASE puppy_delta.onhdfs;
USE puppy_delta.onhdfs;
CREATE EXTERNAL TABLE person (ID string, age int, name string) using delta;
INSERT INTO person VALUES ('v1', 29, 'marko'), ('v2', 27, 'vadas');
CREATE EXTERNAL TABLE referral (refId string, source string, referred string, weight double) using delta;
INSERT INTO referral VALUES ('e1', 'v1', 'v2', 0.5);

Upload the schema

Now the data are ready in Delta Lake. We need a PuppyGraph schema before querying it. Let's create a schema file deltalake.json:

deltalake.json
{
  "catalogs": [
    {
      "name": "catalog_test",
      "type": "deltalake",
      "metastore": {
        "type": "HMS",
        "hiveMetastoreUrl": "thrift://172.31.31.125:9083"
      }
    }
  ],
  "vertices": [
    {
      "label": "person",
      "mappedTableSource": {
        "catalog": "catalog_test",
        "schema": "onhdfs",
        "table": "person",
        "metaFields": {
          "id": "id"
        }
      },
      "attributes": [
        {
          "type": "Int",
          "name": "age"
        },
        {
          "type": "String",
          "name": "name"
        }
      ]
    }
  ],
  "edges": [
    {
      "label": "knows",
      "mappedTableSource": {
        "catalog": "catalog_test",
        "schema": "onhdfs",
        "table": "referral",
        "metaFields": {
          "id": "refId",
          "from": "source",
          "to": "referred"
        }
      },
      "from": "person",
      "to": "person",
      "attributes": [
        {
          "type": "Double",
          "name": "weight"
        }
      ]
    }
  ]
}

Here are some notes on this schema:

  1. A catalog catalog_test is added to specify the remote data source in Delta Lake. Note the hiveMetastoreUrl field has the same value as the one we used to create data.

  2. The label of the vertices and edges do not have to be the same as the names of corresponding tables in Delta Lake. There is a mappedTableSource field in each of the vertex and edge types specifying the actual schema (onhdfs) and table (referral).

  3. Additionally, the mappedTableSource marks meta columns in the tables. For example, the fieldsfrom and to describe which columns in the table form the endpoints of edges.

PuppyGraph supports query Iceberg / Hudi / Delta Lake with metastore: Hive metastore/ AWS Glue and with storage: HDFS/ AWS S3/ MinIO.

You can refer to catalog configuration examples we provide: Examples.

For more catalog parameters details, please refer to Data Lake Catalog.

Now we can upload the schema file deltalake.json to PuppyGraph with the following shell command, assuming that the PuppyGraph is running on localhost:

curl -XPOST -H "content-type: application/json" --data-binary @./deltalake.json --user "puppygraph:puppygraph123" localhost:8081/schema

Query the data

Connecting to PuppyGraph at http://localhost:8081 and start gremlin console from the "Query" section:

[PuppyGraph]> console
         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph

Now we have connected to the Gremlin Console. We can query the graph:

gremlin> g.V().hasLabel("person").out("knows").values("name")
==>vadas

\

Last updated