Connecting to Hudi

Prerequisites

Both the Hudi Catalog and storage are accessible over the network from the PuppyGraph instance.
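
One way to verify this prerequisite before configuring anything is a plain TCP probe run from the PuppyGraph host. The sketch below is only an illustration: the helper name `is_reachable` is hypothetical, and a successful TCP connection shows that the port is open, not that the service behind it is healthy.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `is_reachable("172.31.31.125", 9083)` would probe the Hive metastore address used in the demo later on this page.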

Configuration

The configuration consists of two parts: Metastore (for the Hudi Catalog) and Data Storage. Configure them according to your Hudi setup.

Metastore Configuration

AWS Glue

Configuration	Explanation
Region	The region of the AWS Glue Data Catalog. Example: `us-east-1`. See AWS Glue endpoints and quotas for more details.
Use instance profile	Whether to use role-based authentication (explicit IAM roles or an attached instance profile).
IAM Role ARN	The ARN of the IAM role for accessing the AWS Glue Data Catalog. Required by authentication with IAM roles.
Access key	The access key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys.
Secret key	The secret key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys.

Hive Metastore

Configuration	Explanation
Hive metastore URI	The URI of your Hive metastore. Format: `thrift://<metastore_IP_address>:<metastore_port>`.
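
A malformed URI is an easy mistake to make here, so it can help to check the value against the format above before entering it. The following is a minimal sketch that uses only the Python standard library; the function name `parse_metastore_uri` is hypothetical and not part of PuppyGraph.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_metastore_uri(uri: str) -> Tuple[str, int]:
    """Split thrift://<host>:<port> into (host, port), raising ValueError if malformed."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift":
        raise ValueError(f"expected scheme 'thrift', got '{parsed.scheme}'")
    if not parsed.hostname:
        raise ValueError("missing metastore host")
    if parsed.port is None:
        raise ValueError("missing metastore port")
    return parsed.hostname, parsed.port
```

With the address used in the demo on this page, `parse_metastore_uri("thrift://172.31.31.125:9083")` returns the host `172.31.31.125` and the port `9083`.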

Data Storage Configuration

Amazon S3 (Simple Storage Service)

PuppyGraph supports Amazon S3 (Simple Storage Service) for Hudi.

Configuration	Explanation
Region	The region of the Amazon S3 bucket. Example: `us-east-1`. See Amazon Simple Storage Service endpoints and quotas for more details.
Use instance profile	Whether to use role-based authentication (Explicit IAM roles or instance-profile attached).
IAM Role ARN	The ARN of the IAM role for accessing Amazon S3. Required by authentication with IAM roles.
Access key	The access key of the IAM user for accessing Amazon S3. Required by authentication with IAM User Access keys.
Secret key	The secret key of the IAM user for accessing Amazon S3. Required by authentication with IAM User Access keys.

S3 Compatible Storage

PuppyGraph supports S3 Compatible Storage (e.g. MinIO) for Hudi.

Configuration	Explanation
Endpoint	The S3 compatible storage endpoint.
Access key	The access key of an IAM user for accessing the S3 compatible storage.
Secret key	The secret key of an IAM user for accessing the S3 compatible storage.
Enable SSL	Whether to enable SSL connection for accessing the S3 compatible storage.
Enable path style access	Whether to use path-style access method when accessing the S3 compatible storage.
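
The last two options change how object URLs are formed. The sketch below illustrates the general S3 addressing convention rather than PuppyGraph's internal code: with path-style access the bucket is the first path segment, while virtual-hosted style puts it in the hostname. The function name, endpoint host, bucket, and key are all hypothetical.

```python
def object_url(endpoint_host: str, bucket: str, key: str,
               path_style: bool, ssl: bool) -> str:
    """Build the URL of an object under the two common S3 addressing styles."""
    scheme = "https" if ssl else "http"
    if path_style:
        # Path style: bucket is the first path segment (typical for MinIO).
        return f"{scheme}://{endpoint_host}/{bucket}/{key}"
    # Virtual-hosted style: bucket becomes a subdomain of the endpoint.
    return f"{scheme}://{bucket}.{endpoint_host}/{key}"


# Hypothetical MinIO endpoint with SSL disabled and path-style access enabled.
print(object_url("minio.example.com:9000", "lake", "person/data.parquet",
                 path_style=True, ssl=False))
```

Path-style access is commonly needed for self-hosted endpoints, where a per-bucket subdomain would not resolve without extra DNS setup.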

Demo

In the demo, the Hudi data source stores people and referral information. To query the data as a graph, we model people as nodes (vertices) and the referral relationship between people as edges.

Prerequisites

The demo assumes that PuppyGraph has been deployed at localhost following the instructions in Launching PuppyGraph from AWS Marketplace or Launching PuppyGraph in Docker.

In this demo, we use the username puppygraph and password puppygraph123.

Data Preparation (Optional)

Person Table

ID	Age	Name
v1	29	marko
v2	27	vadas

Referral Table

RefID	Source	Referred	Weight
e1	v1	v2	0.5

The demo uses people and referral information as shown above.

Here is the shell command to start a SparkSQL instance for data preparation, assuming that the Hudi data are stored on HDFS at 172.31.19.123:9000 and the Hive metastore is at 172.31.31.125:9083.

spark-sql --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.0 \
--conf spark.hadoop.fs.defaultFS=hdfs://172.31.19.123:9000 \
--conf spark.sql.warehouse.dir=hdfs://172.31.19.123:9000/spark-warehouse \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.puppy_hudi=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.sql.catalog.puppy_hudi.type=hive \
--conf spark.sql.catalog.puppy_hudi.uri=thrift://172.31.31.125:9083

Now we can use the following SparkSQL query to create data in the database hudi_onhdfs. The catalog name is puppy_hudi as specified in the command above.

CREATE DATABASE hudi_onhdfs;
USE hudi_onhdfs;
CREATE EXTERNAL TABLE person (
   ID string,
   age int,
   name string
) using hudi
tblproperties (
   primaryKey = 'ID'
);
INSERT INTO person VALUES ('v1', 29, 'marko'), ('v2', 27, 'vadas');

CREATE EXTERNAL TABLE referral (
   refId string,
   source string,
   referred string,
   weight double
) using hudi
tblproperties (
   primaryKey = 'refId'
);
INSERT INTO referral VALUES ('e1', 'v1', 'v2', 0.5);

Upload the schema

Now the data are ready in Hudi. We need a PuppyGraph schema before querying it. Let's create a schema file hudi.json:

hudi.json

{
  "catalogs": [
    {
      "name": "catalog_test",
      "type": "hudi",
      "metastore": {
        "type": "HMS",
        "hiveMetastoreUrl": "thrift://172.31.31.125:9083"
      }
    }
  ],
  "vertices": [
    {
      "label": "person",
      "mappedTableSource": {
        "catalog": "catalog_test",
        "schema": "hudi_onhdfs",
        "table": "person",
        "metaFields": {
          "id": "id"
        }
      },
      "attributes": [
        {
          "type": "Int",
          "name": "age"
        },
        {
          "type": "String",
          "name": "name"
        }
      ]
    }
  ],
  "edges": [
    {
      "label": "knows",
      "mappedTableSource": {
        "catalog": "catalog_test",
        "schema": "hudi_onhdfs",
        "table": "referral",
        "metaFields": {
          "id": "refId",
          "from": "source",
          "to": "referred"
        }
      },
      "from": "person",
      "to": "person",
      "attributes": [
        {
          "type": "Double",
          "name": "weight"
        }
      ]
    }
  ]
}

Here are some notes on this schema:

A catalog catalog_test is added to specify the remote data source in Hudi. Note the hiveMetastoreUrl field has the same value as the one we used to create data.
The labels of the nodes (vertices) and edges do not have to match the names of the corresponding tables in Hudi. Each node (vertex) and edge type has a mappedTableSource field specifying the actual schema (hudi_onhdfs) and table (for example, referral).
Additionally, the mappedTableSource marks meta columns in the tables. For example, the fields from and to describe which columns in the table form the endpoints of edges.
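
The cross-references described in these notes can be checked before uploading. The sketch below assumes only the field names visible in hudi.json above (`catalogs`, `vertices`, `edges`, `label`, `mappedTableSource`, `from`, `to`); the function `check_schema_references` is a hypothetical helper, not a PuppyGraph API, and it does not validate anything against the actual Hudi tables.

```python
def check_schema_references(schema: dict) -> list:
    """Return human-readable problems with catalog and vertex-label references."""
    catalogs = {c["name"] for c in schema.get("catalogs", [])}
    vertex_labels = {v["label"] for v in schema.get("vertices", [])}
    problems = []

    # Every vertex and edge must point at a catalog declared in "catalogs".
    for kind in ("vertices", "edges"):
        for item in schema.get(kind, []):
            catalog = item["mappedTableSource"]["catalog"]
            if catalog not in catalogs:
                problems.append(
                    f"{kind} '{item['label']}' uses unknown catalog '{catalog}'")

    # Every edge endpoint must name a declared vertex label.
    for edge in schema.get("edges", []):
        for end in ("from", "to"):
            if edge[end] not in vertex_labels:
                problems.append(
                    f"edge '{edge['label']}' has unknown '{end}' label '{edge[end]}'")
    return problems
```

Loading the file with `json.load` and passing the result to this function should return an empty list for the schema shown above.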

PuppyGraph supports querying Iceberg, Hudi, and Delta Lake tables, with Hive metastore or AWS Glue as the metastore and HDFS, AWS S3, or MinIO as the storage.

You can refer to catalog configuration examples we provide: Examples.

For more details on catalog parameters, please refer to Data Lake Catalog.

Now we can upload the schema file hudi.json to PuppyGraph with the following shell command, assuming that PuppyGraph is running on localhost:

curl -XPOST -H "content-type: application/json" --data-binary @./hudi.json --user "puppygraph:puppygraph123" localhost:8081/schema
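
If a script is more convenient than curl, the same request can be sketched with the Python standard library. This mirrors the command above (POST to /schema, JSON content type, HTTP basic authentication with the demo credentials); the function names are hypothetical, and the shape of the server's response is not assumed here.

```python
import base64
import urllib.request


def build_schema_upload_request(schema_bytes: bytes,
                                base_url: str = "http://localhost:8081",
                                username: str = "puppygraph",
                                password: str = "puppygraph123"):
    """Build the POST request that the curl command above would send."""
    # curl --user sends HTTP basic auth: base64("username:password").
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/schema",
        data=schema_bytes,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def upload_schema(path: str = "hudi.json"):
    """Send the schema file and return (HTTP status, response body as text)."""
    with open(path, "rb") as f:
        request = build_schema_upload_request(f.read())
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `upload_schema()` from the directory containing hudi.json performs the upload; `urlopen` raises `urllib.error.HTTPError` if the server answers with an error status.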

Query the data

Connect to PuppyGraph at http://localhost:8081 and start the Gremlin Console from the "Query" section:

[PuppyGraph]> console
         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph

Now we have connected to the Gremlin Console. We can query the graph:

gremlin> g.V().hasLabel("person").out("knows").values("name")
==>vadas
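
To make the traversal steps concrete, here is a plain Python sketch of what the query computes over the demo rows. It is only an illustration of the graph model, not how PuppyGraph executes queries; the dictionaries copy the person and referral tables, and the function name is hypothetical.

```python
# Rows from the demo tables: persons keyed by ID, referrals as edge records.
persons = {
    "v1": {"age": 29, "name": "marko"},
    "v2": {"age": 27, "name": "vadas"},
}
referrals = [
    {"refId": "e1", "source": "v1", "referred": "v2", "weight": 0.5},
]


def out_knows_names(persons: dict, referrals: list) -> list:
    """Mimic g.V().hasLabel("person").out("knows").values("name")."""
    names = []
    for person_id in persons:                  # g.V().hasLabel("person")
        for edge in referrals:                 # .out("knows"): follow outgoing edges
            if edge["source"] == person_id:
                target = persons[edge["referred"]]
                names.append(target["name"])   # .values("name")
    return names


print(out_knows_names(persons, referrals))  # prints ['vadas']
```

Only v1 has an outgoing edge, so the single result is the name of v2, matching the console output above.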

Examples

Please refer to Data Lake Catalog for detailed parameters for each type of catalog and storage.

Catalog Type	Storage Type	Example Configuration
Hive Metastore	Amazon S3	Hive Metastore + S3
Hive Metastore	MinIO	Hive Metastore + MinIO

Hive Metastore + S3

"catalogs": [
  {
    "name": "hudi_hms_s3",
    "type": "hudi",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]

Hive Metastore + MinIO

"catalogs": [
  {
    "name": "hudi_hms_minio",
    "type": "hudi",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false",
      "endpoint": "<s3_endpoint>",
      "enablePathStyleAccess": "true"
    }
  }
]
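
The examples above contain placeholders such as `<hive_metastore_uri>` that must be replaced with real values. A small check like the following sketch can catch any that were left in place before the schema is uploaded; the function name `find_placeholders` is hypothetical, and the only assumption is that placeholders are whole string values wrapped in angle brackets, as in the examples.

```python
import re

# A placeholder is a string value consisting entirely of <...>.
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def find_placeholders(node, path: str = "") -> list:
    """Return the paths of all string values that still look like <placeholder>."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            found += find_placeholders(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found += find_placeholders(value, f"{path}[{index}]")
    elif isinstance(node, str) and PLACEHOLDER.match(node):
        found.append(path)
    return found
```

Running it on a parsed schema returns paths such as `catalogs[0].storage.endpoint` for each value that still needs to be filled in.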