Connecting to Delta Lake
Delta Lake is an open-source storage framework that enables building a format-agnostic Lakehouse architecture. This document describes how to connect to Delta Lake tables in PuppyGraph.
See also Tutorials for end-to-end walkthroughs of connecting to different Delta Lake implementations, including Databricks.
Prerequisites
- Both the Delta Lake Metastore and Data Storage are accessible over the network from the PuppyGraph instance.
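One way to check this prerequisite is to attempt a TCP connection from the PuppyGraph host to the metastore or storage endpoint. The sketch below is a minimal illustration using Python's standard library and is not part of PuppyGraph; the function name `is_reachable` and the host and port in the usage comment are placeholders.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage (placeholder endpoint, replace with your own):
# is_reachable("metastore.example.internal", 9083)
```

A successful TCP connection only shows that the port is open; it does not verify credentials or permissions.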
Configuration
The configuration consists of two parts: Metastore and Data Storage. Configure them according to your Delta Lake setup.
Metastore Configuration
Unity Catalog
Configuration | Explanation |
---|---|
Databricks host | The hostname of the Databricks URL. The format is $databricks-customer-prefix.cloud.databricks.com |
Databricks token | The access token of the Databricks user. See this page for more details. |
Databricks catalog name | The catalog name under the Unity Catalog instance. See this page for more details. |
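The Databricks host field expects a bare hostname rather than a full workspace URL. As a minimal sketch, the hostname can be extracted from a copied URL with Python's standard library; the helper name `databricks_host` and the `dbc-example` prefix are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse


def databricks_host(workspace_url: str) -> str:
    """Extract the bare hostname from a Databricks workspace URL."""
    # urlparse only fills in the hostname when a scheme is present,
    # so add one if the caller passed a bare hostname.
    if "://" not in workspace_url:
        workspace_url = "https://" + workspace_url
    host = urlparse(workspace_url).hostname
    if not host:
        raise ValueError(f"no hostname found in {workspace_url!r}")
    return host


# Hypothetical workspace URL, for illustration only.
print(databricks_host("https://dbc-example.cloud.databricks.com/?o=123"))
```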
AWS Glue
Configuration | Explanation |
---|---|
Region | The region of the AWS Glue Data Catalog. Example: us-east-1. See AWS Glue endpoints and quotas for more details. |
Use instance profile | Whether to use role-based authentication (explicit IAM roles or an attached instance profile). |
IAM Role ARN | The ARN of the IAM role for accessing the AWS Glue Data Catalog. Required when authenticating with IAM roles. |
Access key | The access key of the IAM user for accessing the AWS Glue Data Catalog. Required when authenticating with IAM user access keys. |
Secret key | The secret key of the IAM user for accessing the AWS Glue Data Catalog. Required when authenticating with IAM user access keys. |
Hive Metastore
Configuration | Explanation |
---|---|
Hive metastore URI | The URI of your Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port>. |
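The URI format above can be checked before it is entered in the configuration. The following is a minimal sketch using Python's standard library, not a PuppyGraph API; the function name `parse_metastore_uri` and the address in the usage line are assumptions made for illustration.

```python
from urllib.parse import urlparse


def parse_metastore_uri(uri: str) -> tuple[str, int]:
    """Split a thrift://host:port URI into (host, port), validating its shape."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift":
        raise ValueError(f"expected a thrift:// URI, got {uri!r}")
    # parsed.port is None when the port is missing.
    if parsed.hostname is None or parsed.port is None:
        raise ValueError(f"URI must include both host and port: {uri!r}")
    return parsed.hostname, parsed.port


# Placeholder address, for illustration only.
host, port = parse_metastore_uri("thrift://10.0.0.5:9083")
print(host, port)
```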
Data Storage Configuration
Get from metastore
There is no need to specify Storage configuration when Delta Lake uses one of the following metastore and storage combinations:
- Hive Metastore with HDFS as the storage.
- Databricks Unity Catalog (credential vending enabled) with AWS S3 or Azure Data Lake Gen2 as the storage.
Select Get from metastore in the Web UI for these combinations.
Amazon S3 (Simple Storage Service)
PuppyGraph supports Amazon S3 (Simple Storage Service) for Delta Lake.
Configuration | Explanation |
---|---|
Region | The region of the Amazon S3 bucket. Example: us-east-1. See Amazon Simple Storage Service endpoints and quotas for more details. |
Use instance profile | Whether to use role-based authentication (explicit IAM roles or an attached instance profile). |
IAM Role ARN | The ARN of the IAM role for accessing Amazon S3. Required when authenticating with IAM roles. |
Access key | The access key of the IAM user for accessing Amazon S3. Required when authenticating with IAM user access keys. |
Secret key | The secret key of the IAM user for accessing Amazon S3. Required when authenticating with IAM user access keys. |
Amazon S3 Compatible Storage
PuppyGraph supports S3 Compatible Storage (e.g. MinIO) for Delta Lake.
Configuration | Explanation |
---|---|
Endpoint | The S3 compatible storage endpoint. |
Access key | The access key of an IAM user for accessing the S3 compatible storage. |
Secret key | The secret key of an IAM user for accessing the S3 compatible storage. |
Enable SSL | Whether to enable SSL connection for accessing the S3 compatible storage. |
Enable path style access | Whether to use path-style access method when accessing the S3 compatible storage. |
Google Cloud Storage
PuppyGraph supports Google Cloud Storage for Delta Lake.
Configuration | Explanation |
---|---|
useComputeEngineServiceAccount | Use Compute Engine service account |
Service Account Email | Service account email address associated with your Google Cloud project |
Service Account Private Key Id | Unique identifier for the service account's private key |
Service Account Private Key | Service account's private key for authentication |
Impersonation Service Account | Impersonation service account |
Azure Data Lake Gen2
PuppyGraph supports Azure Data Lake Gen2 for Delta Lake.
Configuration | Explanation |
---|---|
Storage Account | The name of your Azure v2 Storage account |
Shared Key | The access key used to authenticate with the storage account |
Tenant Id | The tenant ID for your organization in Entra ID |
Client Id | The OAuth 2.0 Client ID |
Client Secret | The OAuth 2.0 Client secret |
Client Endpoint | The OAuth 2.0 token endpoint |
Azure Blob Storage
PuppyGraph supports Azure Blob Storage for Delta Lake.
Configuration | Explanation |
---|---|
Storage Account | The name of the Azure Storage account |
Shared Key | The Shared Key of the Azure Storage account |
Storage Container | The name of the Storage Container |
SAS Token | The account or container SAS (Shared Access Signatures) Token. |
Tutorials
- Querying Databricks Data as a Graph
- Unity Catalog PuppyGraph Integration
- Querying Unity Catalog Data as a Graph
Data Type Mapping
Delta Lake Type | PuppyGraph Type | Description |
---|---|---|
BOOLEAN | Boolean | True or false |
TINYINT | Byte | 8-bit signed integers |
SMALLINT | Short | 16-bit signed integers |
INT | Int | 32-bit signed integers |
BIGINT | Long | 64-bit signed integers |
FLOAT | Float | 32-bit IEEE 754 floating point |
DOUBLE | Double | 64-bit IEEE 754 floating point |
DECIMAL(P, S) | Decimal(P, S) | Fixed-point decimal; precision P, scale S |
STRING | String | Arbitrary-length character sequences |
BINARY | Binary | Byte sequence values |
DATE | Date | Calendar date without timezone or time |
TIMESTAMP_NTZ | DateTime | Timestamp, microsecond precision, without timezone |
TIMESTAMP | DateTime | Timestamp, microsecond precision, with timezone (see the note below the table) |
ARRAY | Array<E> | A collection of values with some element type |
STRUCT | Struct<field1 E1[,field2 E2...]> | A tuple of typed values |
- Datetime with timezone will be converted to UTC in PuppyGraph. For example, 2024-12-01T12:00:00-08:00 is equivalent to 2024-12-01T20:00:00Z and will be stored as 2024-12-01T20:00:00 without timezone.
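The conversion in this note can be reproduced with Python's standard library. This is only an illustration of the arithmetic, not PuppyGraph's implementation; the variable names are arbitrary.

```python
from datetime import datetime, timezone

# The timestamp from the note, carrying a -08:00 offset.
local = datetime.fromisoformat("2024-12-01T12:00:00-08:00")

# Convert to UTC, then drop the timezone information.
as_utc = local.astimezone(timezone.utc)
stored = as_utc.replace(tzinfo=None)

print(as_utc.isoformat())  # 2024-12-01T20:00:00+00:00
print(stored.isoformat())  # 2024-12-01T20:00:00
```

Because the offset is discarded after conversion, two source timestamps that denote the same instant in different timezones end up with the same stored value.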
Example Configurations
Please refer to Data Lake Catalog for detailed parameters for each type of catalog and storage.
Catalog Type | Storage Type | Example Configuration |
---|---|---|
AWS Glue | Amazon S3 | #aws-glue-s3 |
Hive Metastore | HDFS | #hive-metastore-hdfs |
Unity Catalog (without credential vending) | AWS S3 | #unity-catalog-s3-without-credential |
Unity Catalog (credential vending) | AWS S3/Azure | #unity-catalog-credential-vending |
Hive Metastore + HDFS
"catalogs": [
  {
    "name": "delta_hms_hdfs",
    "type": "deltalake",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    }
  }
]
AWS Glue + S3
"catalogs": [
  {
    "name": "delta_glue_s3",
    "type": "deltalake",
    "metastore": {
      "type": "glue",
      "useInstanceProfile": "false",
      "region": "<aws_glue_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]
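When several catalogs share the same shape, the AWS Glue + S3 fragment above can be generated rather than written by hand. The sketch below is a hedged illustration: it reuses only the field names shown in that fragment, the helper name `glue_s3_catalog` is hypothetical, and the result is wrapped in an outer JSON object purely so it can be printed on its own.

```python
import json


def glue_s3_catalog(name, glue_region, s3_region, access_key, secret_key):
    """Build one catalog entry shaped like the AWS Glue + S3 example."""
    # Settings shared by the metastore and storage sections in that example.
    credentials = {
        "useInstanceProfile": "false",
        "accessKey": access_key,
        "secretKey": secret_key,
    }
    return {
        "name": name,
        "type": "deltalake",
        "metastore": {"type": "glue", "region": glue_region, **credentials},
        "storage": {"region": s3_region, "enableSsl": "false", **credentials},
    }


# Placeholder values; substitute real regions and credentials.
entry = glue_s3_catalog(
    "delta_glue_s3",
    "<aws_glue_region>",
    "<aws_s3_region>",
    "<iam_user_access_key>",
    "<iam_user_secret_key>",
)
print(json.dumps({"catalogs": [entry]}, indent=2))
```

Keeping the shared credential fields in one dictionary avoids the metastore and storage sections drifting apart when keys are rotated.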
Unity Catalog + AWS S3 (without credential vending)
"catalogs": [
  {
    "name": "unity_s3",
    "type": "deltalake",
    "metastore": {
      "type": "unity",
      "host": "<unity_server_host>",
      "token": "<unity_access_token>",
      "databricksCatalogName": "<catalog_name>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]