Connecting to Delta Lake

Delta Lake is an open-source storage framework that enables building a format-agnostic Lakehouse architecture. This document describes how to connect to Delta Lake tables in PuppyGraph.

See also Tutorials for end-to-end walkthroughs of connecting to different Delta Lake implementations, including Databricks.

Prerequisites

  • Both the Delta Lake Metastore and Data Storage are accessible over the network from the PuppyGraph instance.
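
A quick way to check this prerequisite is a plain TCP connection test run from the PuppyGraph host. The sketch below is only an illustration: the helper name `is_reachable` is hypothetical, and a successful TCP connection does not prove that credentials or permissions are correct.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, `is_reachable("metastore.example.internal", 9083)` would probe a Hive metastore on port 9083, the port Hive Metastore commonly listens on. The host name here is a placeholder; substitute the host and port of your own deployment.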

Configuration

The configuration consists of two parts: Metastore and Data Storage. Please configure them according to your Delta Lake setup.

Metastore Configuration

Unity Catalog

| Configuration | Explanation |
|---|---|
| Databricks host | The hostname of the Databricks URL. The format is $databricks-customer-prefix.cloud.databricks.com |
| Databricks token | The access token of the Databricks user. See this page for more details. |
| Databricks catalog name | The catalog name under the Unity Catalog instance. See this page for more details. |
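
The Databricks host field expects only the hostname, not the full workspace URL copied from a browser. As a minimal sketch, the hostname can be extracted with the Python standard library; the helper name `databricks_host` and the sample URL in the usage note are hypothetical.

```python
from urllib.parse import urlparse


def databricks_host(workspace_url: str) -> str:
    """Return the hostname part of a Databricks workspace URL."""
    # urlparse only recognises the host when a scheme is present, so add one if missing.
    if "://" not in workspace_url:
        workspace_url = "https://" + workspace_url
    host = urlparse(workspace_url).hostname
    if not host:
        raise ValueError(f"no hostname found in {workspace_url!r}")
    return host
```

Calling `databricks_host("https://dbc-a1b2c3d4-e5f6.cloud.databricks.com/?o=123")` returns the part to enter as the Databricks host, without the scheme, path, or query string.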

AWS Glue

| Configuration | Explanation |
|---|---|
| Region | The region of the AWS Glue Data Catalog. Example: us-east-1. See AWS Glue endpoints and quotas for more details. |
| Use instance profile | Whether to use role-based authentication (an explicit IAM role or an attached instance profile). |
| IAM Role ARN | The ARN of the IAM role for accessing the AWS Glue Data Catalog. Required when authenticating with an IAM role. |
| Access key | The access key of the IAM user for accessing the AWS Glue Data Catalog. Required when authenticating with IAM user access keys. |
| Secret key | The secret key of the IAM user for accessing the AWS Glue Data Catalog. Required when authenticating with IAM user access keys. |

Hive Metastore

| Configuration | Explanation |
|---|---|
| Hive metastore URI | The URI of your Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port>. |
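
To catch a malformed URI before entering it, the format above can be checked with a few lines of Python. This is a sketch of the documented `thrift://<metastore_IP_address>:<metastore_port>` shape only; the helper name `check_metastore_uri` is hypothetical and the check does not contact the metastore.

```python
from typing import Tuple
from urllib.parse import urlparse


def check_metastore_uri(uri: str) -> Tuple[str, int]:
    """Validate a thrift://host:port URI and return (host, port)."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift":
        raise ValueError(f"expected scheme 'thrift', got {parsed.scheme!r}")
    # parsed.port itself raises ValueError for a non-numeric or out-of-range port.
    if not parsed.hostname or parsed.port is None:
        raise ValueError("URI must include both a host and a port")
    return parsed.hostname, parsed.port
```

For instance, `check_metastore_uri("thrift://10.0.0.12:9083")` returns `("10.0.0.12", 9083)`, while a URI with an `http` scheme or a missing port raises `ValueError`. The address is an example value.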

Data Storage Configuration

Get from metastore

There is no need to specify Storage configuration when Delta Lake uses one of the following metastore and storage combinations:

Select Get from metastore in the Web UI for these implementation combinations.

Amazon S3 (Simple Storage Service)

PuppyGraph supports Amazon S3 (Simple Storage Service) for Delta Lake.

| Configuration | Explanation |
|---|---|
| Region | The AWS region of the Amazon S3 bucket. Example: us-east-1. See Amazon Simple Storage Service endpoints and quotas for more details. |
| Use instance profile | Whether to use role-based authentication (an explicit IAM role or an attached instance profile). |
| IAM Role ARN | The ARN of the IAM role for accessing Amazon S3. Required when authenticating with an IAM role. |
| Access key | The access key of the IAM user for accessing Amazon S3. Required when authenticating with IAM user access keys. |
| Secret key | The secret key of the IAM user for accessing Amazon S3. Required when authenticating with IAM user access keys. |

Amazon S3 Compatible Storage

PuppyGraph supports S3 Compatible Storage (e.g. MinIO) for Delta Lake.

| Configuration | Explanation |
|---|---|
| Endpoint | The S3 compatible storage endpoint. |
| Access key | The access key of an IAM user for accessing the S3 compatible storage. |
| Secret key | The secret key of an IAM user for accessing the S3 compatible storage. |
| Enable SSL | Whether to enable SSL connection for accessing the S3 compatible storage. |
| Enable path style access | Whether to use path-style access method when accessing the S3 compatible storage. |
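
The last two settings change the shape of the request URL. The sketch below illustrates the general S3 addressing convention (path-style versus virtual-hosted-style), not PuppyGraph's internal implementation; the function name `object_url` and the endpoint, bucket, and key values in the usage note are hypothetical.

```python
def object_url(endpoint: str, bucket: str, key: str,
               path_style: bool, ssl: bool = True) -> str:
    """Build the URL an S3 client would request for one object."""
    scheme = "https" if ssl else "http"
    if path_style:
        # Path-style: the bucket is the first path segment.
        return f"{scheme}://{endpoint}/{bucket}/{key}"
    # Virtual-hosted-style: the bucket becomes a subdomain of the endpoint.
    return f"{scheme}://{bucket}.{endpoint}/{key}"
```

With the endpoint `minio.example.internal:9000`, path-style access yields `http://minio.example.internal:9000/lake/t/part-0.parquet` when SSL is off, whereas virtual-hosted-style would need `lake.minio.example.internal` to resolve in DNS. Self-hosted stores such as MinIO are often set up for path-style access for that reason.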

Google Cloud Storage

PuppyGraph supports Google Cloud Storage for Delta Lake.

| Configuration | Explanation |
|---|---|
| useComputeEngineServiceAccount | Use Compute Engine service account |
| Service Account Email | Service account email address associated with your Google Cloud project |
| Service Account Private Key Id | Unique identifier for the service account's private key |
| Service Account Private Key | Service account's private key for authentication |
| Impersonation Service Account | Impersonation service account |

Azure Data Lake Gen2

PuppyGraph supports Azure Data Lake Gen2 for Delta Lake.

| Configuration | Explanation |
|---|---|
| Storage Account | The name of your Azure v2 Storage account |
| Shared Key | The access key used to authenticate with the storage account |
| Tenant Id | The tenant ID for your organization in Entra ID |
| Client Id | The OAuth 2.0 Client ID |
| Client Secret | The OAuth 2.0 Client secret |
| Client Endpoint | The OAuth 2.0 token endpoint |

Azure Blob Storage

PuppyGraph supports Azure Blob Storage for Delta Lake.

| Configuration | Explanation |
|---|---|
| Storage Account | The name of the Azure Storage account |
| Shared Key | The Shared Key of the Azure Storage account |
| Storage Container | The name of the Storage Container |
| SAS Token | The account or container SAS (Shared Access Signatures) Token. |

Tutorials

Data Type Mapping

| Delta Lake Type | PuppyGraph Type | Description |
|---|---|---|
| BOOLEAN | Boolean | True or false |
| TINYINT | Byte | 8-bit signed integers |
| SMALLINT | Short | 16-bit signed integers |
| INT | Int | 32-bit signed integers |
| BIGINT | Long | 64-bit signed integers |
| FLOAT | Float | 32-bit IEEE 754 floating point |
| DOUBLE | Double | 64-bit IEEE 754 floating point |
| DECIMAL(P, S) | Decimal(P, S) | Fixed-point decimal; precision P, scale S |
| STRING | String | Arbitrary-length character sequences |
| BINARY | Binary | Byte sequence values |
| DATE | Date | Calendar date without timezone or time |
| TIMESTAMP_NTZ | DateTime | Timestamp, microsecond precision, without timezone |
| TIMESTAMP | DateTime | Timestamp, microsecond precision, with timezone [1] |
| ARRAY | Array<E> | A collection of values with some element type |
| STRUCT | Struct<field1 E1[,field2 E2...]> | A tuple of typed values |

[1] Datetime with timezone will be converted to UTC in PuppyGraph. For example, 2024-12-01T12:00:00-08:00 is equivalent to 2024-12-01T20:00:00Z and will be stored as 2024-12-01T20:00:00 without timezone.
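
The UTC conversion described in the note can be reproduced with the Python standard library. This is only an illustration of the arithmetic, not PuppyGraph code; the helper name `to_utc_naive` is hypothetical.

```python
from datetime import datetime, timezone


def to_utc_naive(value: str) -> datetime:
    """Convert an ISO 8601 timestamp with an offset to a UTC datetime without tzinfo."""
    aware = datetime.fromisoformat(value)
    if aware.tzinfo is None:
        raise ValueError("expected a timestamp with a UTC offset")
    # Shift to UTC first, then drop the timezone marker.
    return aware.astimezone(timezone.utc).replace(tzinfo=None)


print(to_utc_naive("2024-12-01T12:00:00-08:00").isoformat())  # 2024-12-01T20:00:00
```

Queries that filter on such a column should therefore compare against UTC values rather than the original local time.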

Example Configurations

Please refer to Data Lake Catalog for detailed parameters for each type of catalog and storage.

| Catalog Type | Storage Type | Example Configuration |
|---|---|---|
| AWS Glue | Amazon S3 | See "AWS Glue + S3" below |
| Hive Metastore | HDFS | See "Hive Metastore + HDFS" below |
| Unity Catalog (without credential vending) | AWS S3 | See "Unity Catalog + AWS S3 (without credential vending)" below |
| Unity Catalog (credential vending) | AWS S3/Azure | See "Unity Catalog + AWS S3/Azure (credential vending)" below |

Hive Metastore + HDFS

"catalogs": [
  {
    "name": "delta_hms_hdfs",
    "type": "deltalake",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    }
  }
]

AWS Glue + S3

"catalogs": [
  {
    "name": "delta_glue_s3",
    "type": "deltalake",
    "metastore": {
      "type": "glue",
      "useInstanceProfile": "false",
      "region": "<aws_glue_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]

Unity Catalog + AWS S3 (without credential vending)

"catalogs": [
  {
    "name": "unity_s3",
    "type": "deltalake",
    "metastore": {
      "type": "unity",
      "host": "<unity_server_host>",
      "token": "<unity_access_token>",
      "databricksCatalogName": "<catalog_name>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]

Unity Catalog + AWS S3/Azure (credential vending)

"catalogs": [
  {
    "name": "unity_credential_vending",
    "type": "deltalake",
    "metastore": {
      "type": "unity",
      "host": "<unity_server_host>",
      "token": "<unity_access_token>",
      "databricksCatalogName": "<catalog_name>"
    }
  }
]
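
Each example above contains `<...>` placeholders that must be replaced with real values. As a hedged sketch, the following script lists any placeholders still left in a configuration before it is used. The helper name `find_placeholders` is hypothetical, and wrapping the fragment in braces is an assumption made only so that it parses as standalone JSON.

```python
import json
import re

# Matches values such as "<hive_metastore_uri>" that were never filled in.
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def find_placeholders(node, path=""):
    """Return the paths of all string values that still look like <placeholder>."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found += find_placeholders(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found += find_placeholders(value, f"{path}[{index}]")
    elif isinstance(node, str) and PLACEHOLDER.match(node):
        found.append(path)
    return found


# The fragments above start at "catalogs", so one is wrapped in braces here to parse it.
fragment = """
{
  "catalogs": [
    {
      "name": "delta_hms_hdfs",
      "type": "deltalake",
      "metastore": {
        "type": "HMS",
        "hiveMetastoreUrl": "<hive_metastore_uri>"
      }
    }
  ]
}
"""
print(find_placeholders(json.loads(fragment)))
# ['catalogs[0].metastore.hiveMetastoreUrl']
```

An empty list means every placeholder has been replaced. It does not mean the values themselves are valid.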