Connecting to Delta Lake

Delta Lake is an open-source storage framework that enables building a format-agnostic Lakehouse architecture. This document describes how to connect to Delta Lake tables in PuppyGraph.

See also Tutorials for end-to-end walkthroughs of connecting to different Delta Lake implementations, including Databricks.

Prerequisites

Both the Delta Lake Metastore and the Data Storage must be accessible over the network from the PuppyGraph instance.

Configuration

The configuration consists of two parts: Metastore and Data Storage. Configure them according to your Delta Lake setup.

Metastore Configuration

Unity Catalog

Configuration	Explanation
Databricks host	The hostname of the Databricks URL. The format is `$databricks-customer-prefix.cloud.databricks.com`
Databricks token	The access token of the Databricks user. See the Databricks documentation on personal access tokens for more details.
Databricks catalog name	The catalog name under the Unity Catalog instance. See the Databricks documentation on Unity Catalog for more details.
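
The Databricks host field expects only the hostname, not the full workspace URL. As a minimal sketch, the helper below strips the scheme, path and query from a URL copied out of the browser. The function name `databricks_host` and the example workspace prefix are hypothetical and are not part of PuppyGraph or Databricks.

```python
from urllib.parse import urlparse


def databricks_host(workspace_url: str) -> str:
    """Return only the hostname part of a Databricks workspace URL."""
    # urlparse fills in .hostname only when a scheme is present,
    # so add one if the caller passed a bare host name.
    if "://" not in workspace_url:
        workspace_url = "https://" + workspace_url
    host = urlparse(workspace_url).hostname
    if not host:
        raise ValueError(f"cannot extract a hostname from {workspace_url!r}")
    return host


# Hypothetical workspace URL, used only for illustration.
print(databricks_host("https://dbc-a1b2c3d4-e5f6.cloud.databricks.com/?o=1234567890"))
```

The printed value is what would go into the Databricks host field.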

AWS Glue

Configuration	Explanation
Region	The region of the AWS Glue Data Catalog. Example: `us-east-1`. See AWS Glue endpoints and quotas for more details.
Use instance profile	Whether to use role-based authentication (explicit IAM roles or an attached instance profile).
IAM Role ARN	The ARN of the IAM role for accessing the AWS Glue Data Catalog. Required by authentication with IAM roles.
Access key	The access key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys.
Secret key	The secret key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys.

Hive Metastore

Configuration	Explanation
Hive metastore URI	The URI of your Hive metastore. Format: `thrift://<metastore_IP_address>:<metastore_port>`.
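
A quick way to catch a malformed URI before entering it is to check it against the `thrift://<metastore_IP_address>:<metastore_port>` format described above. The sketch below does that with the Python standard library. The function name is hypothetical, and the address and port in the usage line are placeholders (9083 is a commonly used Hive metastore port, but yours may differ).

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_hive_metastore_uri(uri: str) -> Tuple[str, int]:
    """Validate a thrift://host:port URI and return (host, port)."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift":
        raise ValueError(f"expected a thrift:// URI, got {uri!r}")
    # .port is None when no port was given; the format requires one.
    if not parsed.hostname or parsed.port is None:
        raise ValueError(f"URI must include both host and port: {uri!r}")
    return parsed.hostname, parsed.port


# Placeholder address, used only for illustration.
print(parse_hive_metastore_uri("thrift://10.0.0.5:9083"))
```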

Data Storage Configuration

Get from metastore

There is no need to specify Storage configuration when Delta Lake uses one of the following metastore and storage combinations:

Hive Metastore with HDFS as the storage.
Databricks Unity Catalog (credential vending enabled) with AWS S3 or Azure Data Lake Gen2 as the storage.

Select Get from metastore in the Web UI for these combinations.
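
The rule above can be written down as a small lookup, which may be handy when scripting deployments. This is only an illustration of the two combinations listed in this section. The string labels and the function name are hypothetical and are not PuppyGraph configuration values.

```python
# Metastore/storage pairs for which, per the list above, storage settings
# are obtained from the metastore. Labels are illustrative only.
GET_FROM_METASTORE = {
    ("hive-metastore", "hdfs"),
    ("unity-catalog-credential-vending", "s3"),
    ("unity-catalog-credential-vending", "azure-data-lake-gen2"),
}


def needs_storage_config(metastore: str, storage: str) -> bool:
    """Return True when a separate storage configuration must be provided."""
    return (metastore.lower(), storage.lower()) not in GET_FROM_METASTORE


print(needs_storage_config("hive-metastore", "hdfs"))  # False
print(needs_storage_config("aws-glue", "s3"))          # True
```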

Amazon S3 (Simple Storage Service)

PuppyGraph supports Amazon S3 (Simple Storage Service) for Delta Lake.

Configuration	Explanation
Region	The AWS region of the Amazon S3 bucket. Example: `us-east-1`. See Amazon Simple Storage Service endpoints and quotas for more details.
Use instance profile	Whether to use role-based authentication (explicit IAM roles or an attached instance profile).
IAM Role ARN	The ARN of the IAM role for accessing Amazon S3. Required by authentication with IAM roles.
Access key	The access key of the IAM user for accessing Amazon S3. Required by authentication with IAM User Access keys.
Secret key	The secret key of the IAM user for accessing Amazon S3. Required by authentication with IAM User Access keys.

Amazon S3 Compatible Storage

PuppyGraph supports S3 Compatible Storage (e.g. MinIO) for Delta Lake.

Configuration	Explanation
Endpoint	The S3 compatible storage endpoint.
Access key	The access key of an IAM user for accessing the S3 compatible storage.
Secret key	The secret key of an IAM user for accessing the S3 compatible storage.
Enable SSL	Whether to enable SSL connection for accessing the S3 compatible storage.
Enable path style access	Whether to use path-style access method when accessing the S3 compatible storage.

Google Cloud Storage

PuppyGraph supports Google Cloud Storage for Delta Lake.

Configuration	Explanation
useComputeEngineServiceAccount	Whether to authenticate with the Compute Engine service account.
Service Account Email	The service account email address associated with your Google Cloud project.
Service Account Private Key Id	The unique identifier of the service account's private key.
Service Account Private Key	The service account's private key, used for authentication.
Impersonation Service Account	The service account to impersonate, if any.

Azure Data Lake Gen2

PuppyGraph supports Azure Data Lake Gen2 for Delta Lake.

Configuration	Explanation
Storage Account	The name of your Azure v2 Storage account
Shared Key	The access key used to authenticate with the storage account
Tenant Id	The tenant ID for your organization in Entra ID
Client Id	The OAuth 2.0 Client ID
Client Secret	The OAuth 2.0 Client secret
Client Endpoint	The OAuth 2.0 token endpoint

Azure Blob Storage

PuppyGraph supports Azure Blob Storage for Delta Lake.

Configuration	Explanation
Storage Account	The name of the Azure Storage account
Shared Key	The Shared Key of the Azure Storage account
Storage Container	The name of the Storage Container
SAS Token	The account or container SAS (Shared Access Signatures) Token.

Tutorials

Data Type Mapping

Delta Lake Type	PuppyGraph Type	Description
`BOOLEAN`	`Boolean`	True or false
`TINYINT`	`Byte`	8-bit signed integers
`SMALLINT`	`Short`	16-bit signed integers
`INT`	`Int`	32-bit signed integers
`BIGINT`	`Long`	64-bit signed integers
`FLOAT`	`Float`	32-bit IEEE 754 floating point
`DOUBLE`	`Double`	64-bit IEEE 754 floating point
`DECIMAL(P, S)`	`Decimal(P, S)`	Fixed-point decimal; precision P, scale S
`STRING`	`String`	Arbitrary-length character sequences
`BINARY`	`Binary`	Byte sequence values
`DATE`	`Date`	Calendar date without timezone or time
`TIMESTAMP_NTZ`	`DateTime`	Timestamp, microsecond precision, without timezone
`TIMESTAMP`	`DateTime`	Timestamp, microsecond precision, with timezone¹
`ARRAY`	`Array<E>`	A collection of values with some element type
`STRUCT`	`Struct<field1 E1[,field2 E2...]>`	A tuple of typed values

¹ Timestamps with a timezone are converted to UTC in PuppyGraph. For example, 2024-12-01T12:00:00-08:00 is equivalent to 2024-12-01T20:00:00Z and is stored as 2024-12-01T20:00:00 without a timezone.
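
The conversion in the note above can be reproduced with the Python standard library, which is useful for predicting the value a query will return. This is a sketch of the arithmetic only, not PuppyGraph's implementation, and the function name `to_utc_naive` is hypothetical.

```python
from datetime import datetime, timezone


def to_utc_naive(iso_timestamp: str) -> datetime:
    """Convert an offset-aware ISO 8601 timestamp to UTC and drop the offset."""
    aware = datetime.fromisoformat(iso_timestamp)
    if aware.tzinfo is None:
        raise ValueError("expected a timestamp with a UTC offset")
    # Shift to UTC first, then discard the timezone information.
    return aware.astimezone(timezone.utc).replace(tzinfo=None)


print(to_utc_naive("2024-12-01T12:00:00-08:00").isoformat())
# prints 2024-12-01T20:00:00
```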

Example Configurations

Please refer to Data Lake Catalog for detailed parameters for each type of catalog and storage.

Catalog Type	Storage Type	Example Configuration
AWS Glue	Amazon S3	See "AWS Glue + S3" below
Hive Metastore	HDFS	See "Hive Metastore + HDFS" below
Unity Catalog (without credential vending)	AWS S3	See "Unity Catalog + AWS S3 (without credential vending)" below
Unity Catalog (credential vending)	AWS S3/Azure	See "Unity Catalog + AWS S3/Azure (credential vending)" below

Hive Metastore + HDFS

"catalogs": [
  {
    "name": "delta_hms_hdfs",
    "type": "deltalake",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    }
  }
]

AWS Glue + S3

"catalogs": [
  {
    "name": "delta_glue_s3",
    "type": "deltalake",
    "metastore": {
      "type": "glue",
      "useInstanceProfile": "false",
      "region": "<aws_glue_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]
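
To avoid pasting credentials into a file by hand, the same fragment can be generated from environment variables. The sketch below uses only the keys shown in the example above and assumes, as that example does, that the same IAM user and region are used for both Glue and S3. The function name is hypothetical. `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_REGION` are the conventional AWS variable names, and reading them here is an assumption about your environment rather than something PuppyGraph requires.

```python
import json
import os


def glue_s3_catalog(name: str, env=os.environ) -> dict:
    """Build a Delta Lake catalog entry for AWS Glue + S3 from environment values."""
    # A missing variable raises KeyError instead of producing an incomplete config.
    region = env["AWS_REGION"]
    access_key = env["AWS_ACCESS_KEY_ID"]
    secret_key = env["AWS_SECRET_ACCESS_KEY"]
    return {
        "name": name,
        "type": "deltalake",
        "metastore": {
            "type": "glue",
            "useInstanceProfile": "false",
            "region": region,
            "accessKey": access_key,
            "secretKey": secret_key,
        },
        "storage": {
            "useInstanceProfile": "false",
            "region": region,
            "accessKey": access_key,
            "secretKey": secret_key,
            "enableSsl": "false",
        },
    }


if __name__ == "__main__":
    # Placeholder values so the sketch runs without real credentials.
    demo_env = {
        "AWS_REGION": "us-east-1",
        "AWS_ACCESS_KEY_ID": "<iam_user_access_key>",
        "AWS_SECRET_ACCESS_KEY": "<iam_user_secret_key>",
    }
    print(json.dumps({"catalogs": [glue_s3_catalog("delta_glue_s3", demo_env)]}, indent=2))
```

Calling `glue_s3_catalog("delta_glue_s3")` without the second argument reads the real process environment.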

Unity Catalog + AWS S3 (without credential vending)

"catalogs": [
  {
    "name": "unity_s3",
    "type": "deltalake",
    "metastore": {
      "type": "unity",
      "host": "<unity_server_host>",
      "token": "<unity_access_token>",
      "databricksCatalogName": "<catalog_name>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]

Unity Catalog + AWS S3/Azure (credential vending)

"catalogs": [
  {
    "name": "unity_credential_vending",
    "type": "deltalake",
    "metastore": {
      "type": "unity",
      "host": "<unity_server_host>",
      "token": "<unity_access_token>",
      "databricksCatalogName": "<catalog_name>"
    }
  }
]