Connecting to Delta Lake
Prerequisites
- Both the Delta Lake Data Catalog and Storage are accessible over the network from the PuppyGraph instance
Configuration
The configuration consists of two parts: Metastore (for Data Catalog) and Data Storage. Please configure them according to you Delta Lake setup.
Metastore Configuration
AWS Glue
Configuration | Explanation |
---|---|
Region | The region of the AWS Glue Data Catalog. Example: us-east-1 . See AWS Glue endpoints and quotas for more details. |
Use instance profile | Whether to use role-based authentication (Explicit IAM roles or instance-profile attached) |
IAM Role ARN | The ARN of the IAM role for accessing the AWS Glue Data Catalog. Required by authentication with IAM roles. |
Access key | The access key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys. |
Secret key | The secret key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys. |
Hive Metastore
Configuration | Explanation |
---|---|
Hive metastore URI | The URI of your Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port> . |
Unity Catalog
Configuration | Explanation |
---|---|
Databricks host | The hostname of the Databricks URL. The format is $databricks-customer-prefix.cloud.databricks.com |
Databricks token | The access token of the Databricks user. See this page for more details. |
Databricks catalog name | The catalog name under the Unity Catalog instance. See this page for more details. |
Data Storage Configuration
Amazon S3 (Simple Storage Service)
PuppyGraph supports Amazon S3 (Simple Storage Service) for Delta Lake.
Configuration | Explanation |
---|---|
Region | The region of the Amazon S3. Example: us-east-1 . See Amazon Simple Storage Service endpoints and quotas for more details. |
Use instance profile | Whether to use role-based authentication (Explicit IAM roles or instance-profile attached). |
IAM Role ARN | The ARN of the IAM role for accessing the Amazon S3. Required by authentication with IAM roles. |
Access key | The access key of the IAM user for accessing the Amazon S3. Required by authentication with IAM User Access keys. |
Secret key | The ARN of the IAM role for accessing the Amazon S3. Required by authentication with IAM User Access keys. |
Amazon S3 Compatible Storage
PuppyGraph supports S3 Compatible Storage (e.g. MinIO) for Delta Lake.
Configuration | Explanation |
---|---|
Endpoint | The S3 compatible storage endpoint. |
Access key | The access key of an IAM user for accessing the S3 compatible storage. |
Secret key | The secret key of an IAM user for accessing the S3 compatible storage. |
Enable SSL | Whether to enable SSL connection for accessing the S3 compatible storage. |
Enable path style access | Whether to use path-style access method when accessing the S3 compatible storage. |
Demo
In the demo, the delta lake data source stores people and referral information. To query the data as a graph, we model people as vertices and the referral relationship between people as edges.
Prerequisites
The demo assumes that PuppyGraph has been deployed at localhost
according to the instruction in Launching PuppyGraph from AWS Marketplace or Launching PuppyGraph in Docker.
In this demo, we use the username puppygraph
and password puppygraph123
.
Data Preparation (Optional)
ID | Age | Name |
---|---|---|
v1 | 29 | marko |
v2 | 27 | vadas |
RefID | Source | Referred | Weight |
---|---|---|---|
e1 | v1 | v2 | 0.5 |
The demo uses people and referral information as shown above.
Here is the shell command to start a SparkSQL instance for data preparation assuming that the delta lake data are stored on HDFS at 172.31.19.123:9000
and the Hive metastore is at 172.31.31.125:9083
.
spark-sql \
--packages io.delta:delta-core_2.12:2.3.0 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.hadoop.fs.defaultFS=hdfs://172.31.19.123:9000 \
--conf spark.sql.warehouse.dir=hdfs://172.31.19.123:9000/spark-warehouse \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.puppy_delta=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.puppy_delta.type=hive \
--conf spark.sql.catalog.puppy_delta.uri=thrift://172.31.31.125:9083
Now we can use the following SparkSQL query to create data in the database onhdfs
. The catalog name is puppy_delta
as specified in the command above.
CREATE DATABASE puppy_delta.onhdfs;
USE puppy_delta.onhdfs;
CREATE EXTERNAL TABLE person (ID string, age int, name string) using delta;
INSERT INTO person VALUES ('v1', 29, 'marko'), ('v2', 27, 'vadas');
CREATE EXTERNAL TABLE referral (refId string, source string, referred string, weight double) using delta;
INSERT INTO referral VALUES ('e1', 'v1', 'v2', 0.5);
Upload the schema
Now the data are ready in Delta Lake. We need a PuppyGraph schema before querying it. Let's create a schema file deltalake.json
:
{
"catalogs": [
{
"name": "catalog_test",
"type": "deltalake",
"metastore": {
"type": "HMS",
"hiveMetastoreUrl": "thrift://172.31.31.125:9083"
}
}
],
"vertices": [
{
"label": "person",
"mappedTableSource": {
"catalog": "catalog_test",
"schema": "onhdfs",
"table": "person",
"metaFields": {
"id": "id"
}
},
"attributes": [
{
"type": "Int",
"name": "age"
},
{
"type": "String",
"name": "name"
}
]
}
],
"edges": [
{
"label": "knows",
"mappedTableSource": {
"catalog": "catalog_test",
"schema": "onhdfs",
"table": "referral",
"metaFields": {
"id": "refId",
"from": "source",
"to": "referred"
}
},
"from": "person",
"to": "person",
"attributes": [
{
"type": "Double",
"name": "weight"
}
]
}
]
}
Here are some notes on this schema:
- A catalog
catalog_test
is added to specify the remote data source in Delta Lake. Note thehiveMetastoreUrl
field has the same value as the one we used to create data. - The label of the vertices and edges do not have to be the same as the names of corresponding tables in Delta Lake. There is a
mappedTableSource
field in each of the vertex and edge types specifying the actual schema (onhdfs
) and table (referral
). - Additionally, the
mappedTableSource
marks meta columns in the tables. For example, the fieldsfrom
andto
describe which columns in the table form the endpoints of edges.
PuppyGraph supports query Iceberg / Hudi / Delta Lake with metastore: Hive metastore/ AWS Glue and with storage: HDFS/ AWS S3/ MinIO.
You can refer to catalog configuration examples we provide: Examples.
For more catalog parameters details, please refer to Data Lake Catalog.
Now we can upload the schema file deltalake.json
to PuppyGraph with the following shell command, assuming that the PuppyGraph is running on localhost
:
curl -XPOST -H "content-type: application/json" --data-binary @./deltalake.json --user "puppygraph:puppygraph123" localhost:8081/schema
Query the data
Connecting to PuppyGraph at http://localhost:8081 and start gremlin console from the "Query" section:
[PuppyGraph]> console
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
Now we have connected to the Gremlin Console. We can query the graph:
Example Configurations
Please refer to Data Lake Catalog for detailed parameters for each type of catalog and storage.
Catalog Type | Storage Type | Example Configuration |
---|---|---|
AWS Glue | Amazon S3 | #aws-glue--s3 |
Hive Metastore | HDFS | #hive-metastore--hdfs |
Hive Metastore + HDFS
"catalogs": [
{
"name": "delta_hms_hdfs",
"type": "deltalake",
"metastore": {
"type": "HMS",
"hiveMetastoreUrl": "<hive_metastore_uri>"
}
}
]
AWS Glue + S3
"catalogs": [
{
"name": "delta_glue_s3",
"type": "deltalake",
"metastore": {
"type": "glue",
"useInstanceProfile": "false",
"region": "<aws_glue_region>",
"accessKey": "<iam_user_access_key>",
"secretKey": "<iam_user_secret_key>"
},
"storage": {
"useInstanceProfile": "false",
"region": "<aws_s3_region>",
"accessKey": "<iam_user_access_key>",
"secretKey": "<iam_user_secret_key>",
"enableSsl": "false"
}
}
]