Data Lake Catalog

Catalog Parameters Overview

{
  "name": "name",
  "type": "type",
  "metastore": {
    "type": "HMS/glue/rest/unity",
    "hiveMetastoreUrl": "HMS.hiveMetastoreUrl",
    "useInstanceProfile": "glue.useInstanceProfile",
    "region": "glue.region",
    "accessKey": "glue.accessKey",
    "secretKey": "glue.secretKey",
    "iamRoleArn": "glue.iamRoleArn",
    "uri": "rest.uri",
    "warehouse": "rest.warehouse",
    "security": "rest.security",
    "session": "rest.session",
    "credential": "rest.credential",
    "host": "databricks.host",
    "token": "databricks.token",
    "databricksCatalogName": "databricks.catalog"
  },
  "storage": {
    "type": "S3/GCS/AzureBlob/AzureDLS2",
    "useInstanceProfile": "S3.useInstanceProfile",
    "region": "S3.region",
    "accessKey": "S3.accessKey",
    "secretKey": "S3.secretKey",
    "iamRoleArn": "S3.iamRoleArn",
    "enableSsl": "S3.enableSsl",
    "endpoint": "S3.endpoint",
    "enablePathStyleAccess": "S3.enablePathStyleAccess",
    "useComputeEngineServiceAccount": "GCS.useComputeEngineServiceAccount",
    "serviceAccountEmail": "GCS.serviceAccountEmail",
    "serviceAccountPrivateKeyId": "GCS.serviceAccountPrivateKeyId",
    "serviceAccountPrivateKey": "GCS.serviceAccountPrivateKey",
    "impersonationServiceAccount": "GCS.impersonationServiceAccount",
    "storageAccount": "[AzureBlob/AzureDLS2].storageAccount",
    "sharedKey": "[AzureBlob/AzureDLS2].sharedKey",
    "storageContainer": "AzureBlob.storageContainer",
    "sasToken": "AzureBlob.sasToken",
    "clientId": "AzureDLS2.clientId",
    "clientSecret": "AzureDLS2.clientSecret",
    "clientEndpoint": "AzureDLS2.clientEndpoint",
    "useManagedIdentity": "AzureDLS2.useManagedIdentity",
    "tenantId": "AzureDLS2.tenantId"
  }
}
| Parameter | Required | Description |
| --- | --- | --- |
| name | Yes | The name of the catalog. |
| type | Yes | The type of the catalog. |
| metastore | Yes | Metastore parameters. |
| storage | Yes | Data storage parameters. |
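
As a rough illustration of the top-level layout above, the following Python sketch assembles a catalog entry and checks the required keys. The helper name `make_catalog` and the example values are hypothetical. Valid values for the catalog `type` are not listed in this section, so a placeholder is used. The sketch treats `storage` as optional because the HDFS section below notes that Hive Metastore with HDFS needs no storage section.

```python
import json


def make_catalog(name, catalog_type, metastore, storage=None):
    """Assemble a catalog entry (hypothetical helper, not a PuppyGraph API)."""
    required = {"name": name, "type": catalog_type, "metastore": metastore}
    missing = [key for key, value in required.items() if not value]
    if missing:
        raise ValueError(f"missing required catalog parameters: {missing}")
    catalog = dict(required)
    # Hive Metastore with HDFS needs no storage section, so it is optional here.
    if storage is not None:
        catalog["storage"] = storage
    return catalog


catalog = make_catalog(
    name="my_catalog",
    catalog_type="<catalog_type>",  # placeholder: valid values are not listed here
    metastore={"type": "glue", "useInstanceProfile": "true", "region": "us-east-1"},
    storage={"useInstanceProfile": "true", "region": "us-east-1"},
)
print(json.dumps(catalog, indent=2))
```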

Metastore Parameters

Hive Metastore

PuppyGraph supports Hive Metastore (HMS) as a catalog metastore:

"metastore": {
  "type": "HMS",
  "hiveMetastoreUrl": "<hive_metastore_uri>"
}

The table below outlines the Hive Metastore parameters in the metastore section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the metastore. Set the value to HMS. |
| hiveMetastoreUrl | Yes | The URI of the Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port>. |
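
The URI format can be checked before the configuration is submitted. The sketch below is a minimal example using Python's standard `urllib.parse`. The function name `hms_metastore` is hypothetical, and the address and port in the usage line are sample values only.

```python
from urllib.parse import urlparse


def hms_metastore(uri):
    """Return an HMS metastore section after a basic URI format check."""
    parsed = urlparse(uri)
    # Reading .port raises ValueError by itself if the port is not a number.
    if parsed.scheme != "thrift" or not parsed.hostname or parsed.port is None:
        raise ValueError(f"expected thrift://<host>:<port>, got {uri!r}")
    return {"type": "HMS", "hiveMetastoreUrl": uri}


metastore = hms_metastore("thrift://10.0.0.5:9083")
```

The check only covers the shape of the URI. It does not confirm that a metastore is reachable at that address.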

AWS Glue

PuppyGraph supports AWS Glue as a catalog metastore with the following authentication methods:

Authentication with Instance profile

"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>"
}

Authentication with IAM Roles

"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>",
  "iamRoleArn": "<iam_role_arn>"
}

Authentication with IAM User Access keys

"metastore": {
  "type": "glue",
  "useInstanceProfile": "false",
  "region": "<aws_glue_region>",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>"
}

The table below outlines the AWS Glue parameters in the metastore section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the metastore. Set the value to glue. |
| useInstanceProfile | Yes | Whether to use role-based authentication (an explicit IAM role or an attached instance profile). Set the value to true or false. |
| region | Yes | The region of the AWS Glue Data Catalog. Example: us-east-1. See AWS Glue endpoints and quotas for more details. |
| accessKey | No | The access key of the IAM user for accessing the AWS Glue Data Catalog. Required for authentication with IAM user access keys. |
| secretKey | No | The secret key of the IAM user for accessing the AWS Glue Data Catalog. Required for authentication with IAM user access keys. |
| iamRoleArn | No | The ARN of the IAM role for accessing the AWS Glue Data Catalog. Required for authentication with IAM roles. |
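
The three Glue authentication methods differ only in which optional parameters are set. The sketch below classifies a metastore section according to the rules in the table. The function name and the returned labels are hypothetical and are not part of PuppyGraph.

```python
def glue_auth_method(metastore):
    """Classify a Glue metastore section by its authentication method."""
    if metastore.get("type") != "glue":
        raise ValueError("not a Glue metastore section")
    if not metastore.get("region"):
        raise ValueError("region is required")
    use_profile = metastore.get("useInstanceProfile")
    if use_profile == "true":
        # An explicit role ARN selects IAM-role authentication.
        return "iam_role" if metastore.get("iamRoleArn") else "instance_profile"
    if use_profile == "false":
        if metastore.get("accessKey") and metastore.get("secretKey"):
            return "access_keys"
        raise ValueError("accessKey and secretKey are required "
                         "when useInstanceProfile is false")
    raise ValueError('useInstanceProfile must be "true" or "false"')
```

For example, `glue_auth_method({"type": "glue", "useInstanceProfile": "true", "region": "us-east-1"})` returns `"instance_profile"`.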

Iceberg REST Catalog

PuppyGraph supports Iceberg REST Catalog (including Tabular) as a catalog metastore. See the REST Catalog API for details.

Iceberg REST

The minimal configuration of an Iceberg REST metastore is as follows:

"metastore": {
  "type": "rest",
  "uri": "http://rest:8182"
}

Tabular

Tabular (tabular.io) is a managed Iceberg platform. An example of the Tabular metastore configuration is as follows:

"metastore": {
  "type": "rest",
  "uri": "https://api.tabular.io/ws",
  "warehouse": "sample_warehouse",
  "security": "oauth2",
  "session": "user",
  "credential": "t-xxxxxxxxxx:Gy_yyyyyyyyy"
}

The table below outlines the Iceberg REST parameters in the metastore section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the metastore. Set the value to rest. |
| uri | Yes | The server endpoint URI of the REST Catalog. |
| warehouse | No | The name of the Tabular warehouse. Required for a Tabular metastore. |
| credential | No | The Tabular authentication credential. Required for a Tabular metastore. |
| security | No | The security scheme of the REST Catalog. Set it to oauth2 when using a Tabular metastore. |
| session | No | Set it to user when using a Tabular metastore. |
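
The difference between the minimal and the Tabular configuration can be sketched as a small builder. The function name `rest_metastore` is hypothetical. It follows the table above by adding `security` and `session` with the fixed Tabular values only when a warehouse and credential are supplied.

```python
def rest_metastore(uri, warehouse=None, credential=None):
    """Build an Iceberg REST metastore section.

    Passing warehouse and credential adds the Tabular-specific settings.
    """
    metastore = {"type": "rest", "uri": uri}
    if warehouse is None and credential is None:
        return metastore  # plain Iceberg REST: only type and uri
    if not (warehouse and credential):
        raise ValueError("Tabular needs both warehouse and credential")
    metastore.update(
        warehouse=warehouse,
        credential=credential,
        security="oauth2",
        session="user",
    )
    return metastore


minimal = rest_metastore("http://rest:8182")
tabular = rest_metastore(
    "https://api.tabular.io/ws",
    warehouse="sample_warehouse",
    credential="t-xxxxxxxxxx:Gy_yyyyyyyyy",
)
```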

Unity Catalog

PuppyGraph supports both Databricks Unity Catalog and open-source (OSS) Unity Catalog as a catalog metastore:

"metastore": {
  "type": "unity",
  "host": "<databricks_or_unity_server_host_name>",
  "token": "<access_token>",
  "databricksCatalogName": "<catalog_name>"
}

The table below outlines the Unity Catalog parameters in the metastore section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the metastore. Set the value to unity. |
| host | Yes | The host name of the Unity Catalog server. |
| token | Yes | The access token used to authenticate requests to the Unity Catalog server. |
| databricksCatalogName | Yes | The catalog name under the Unity Catalog instance. |

Data Storage Parameters

HDFS

PuppyGraph supports HDFS as data storage with Hive Metastore. No storage section is needed for this combination.

Amazon S3

PuppyGraph supports Amazon S3 (Simple Storage Service) as data storage with the following authentication methods:

Authentication with Instance profile

"storage": {
  "useInstanceProfile": "true",
  "region": "<aws_s3_region>"
}

Authentication with IAM Roles

"storage": {
  "useInstanceProfile": "true",
  "region": "<aws_s3_region>",
  "iamRoleArn": "<iam_role_arn>"
}

Authentication with IAM User Access keys

"storage": {
  "useInstanceProfile": "false",
  "region": "<aws_s3_region>",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>"
}

The table below outlines the Amazon S3 parameters in the storage section.

| Parameter | Required | Description |
| --- | --- | --- |
| useInstanceProfile | Yes | Whether to use role-based authentication (an explicit IAM role or an attached instance profile). Set the value to true or false. |
| region | Yes | The AWS region for Amazon S3. Example: us-east-1. See Amazon Simple Storage Service endpoints and quotas for more details. |
| accessKey | No | The access key of the IAM user for accessing Amazon S3. Required for authentication with IAM user access keys. |
| secretKey | No | The secret key of the IAM user for accessing Amazon S3. Required for authentication with IAM user access keys. |
| iamRoleArn | No | The ARN of the IAM role for accessing Amazon S3. Required for authentication with IAM roles. |
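
To keep IAM user keys out of checked-in configuration, the storage section can be generated at deploy time. The sketch below reads the environment variable names conventionally used by the AWS CLI and SDKs (`AWS_REGION`, `AWS_DEFAULT_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`). Reading them is an assumption of this example and not something PuppyGraph does itself, and the function name is hypothetical.

```python
import os


def s3_storage_from_env(env=None, iam_role_arn=None):
    """Build an S3 storage section from AWS-style environment variables."""
    env = os.environ if env is None else env
    region = env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")
    if not region:
        raise ValueError("set AWS_REGION or AWS_DEFAULT_REGION")
    access_key = env.get("AWS_ACCESS_KEY_ID")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY")
    if access_key and secret_key:
        # Both keys present: authenticate with IAM user access keys.
        return {
            "useInstanceProfile": "false",
            "region": region,
            "accessKey": access_key,
            "secretKey": secret_key,
        }
    # No keys: fall back to role-based authentication.
    storage = {"useInstanceProfile": "true", "region": region}
    if iam_role_arn:
        storage["iamRoleArn"] = iam_role_arn
    return storage
```

Passing a plain dictionary as `env` makes the function easy to try without touching the real environment.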

S3 Compatible Storage

PuppyGraph supports S3-compatible storage (e.g., MinIO) as data storage.

"storage": {
  "useInstanceProfile": "false",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>",
  "enableSsl": "{true | false}",
  "endpoint": "<s3_endpoint>",
  "enablePathStyleAccess": "{true | false}"
}

The table below outlines the S3 Compatible parameters in the storage section.

| Parameter | Required | Description |
| --- | --- | --- |
| useInstanceProfile | Yes | Set the value to false. |
| accessKey | Yes | The access key of an IAM user for accessing the S3 compatible storage. |
| secretKey | Yes | The secret key of an IAM user for accessing the S3 compatible storage. |
| enableSsl | Yes | Whether to enable SSL connection for accessing the S3 compatible storage. Set the value to true or false. |
| endpoint | Yes | The S3 compatible storage endpoint. |
| enablePathStyleAccess | Yes | Whether to use path-style access method when accessing the S3 compatible storage. Set the value to true or false. |
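
The examples in this document write boolean settings as the strings "true" and "false". The sketch below builds an S3-compatible storage section from Python booleans and converts them to that form. The function name, the sample endpoint, and the sample credentials are hypothetical, and the exact endpoint format expected by PuppyGraph is not specified here.

```python
def s3_compatible_storage(endpoint, access_key, secret_key,
                          enable_ssl=False, path_style=True):
    """Build a storage section for an S3-compatible service such as MinIO."""

    def flag(value):
        # The configuration examples use string booleans.
        return "true" if value else "false"

    return {
        "useInstanceProfile": "false",  # always false for S3-compatible storage
        "accessKey": access_key,
        "secretKey": secret_key,
        "enableSsl": flag(enable_ssl),
        "endpoint": endpoint,
        "enablePathStyleAccess": flag(path_style),
    }


# Sample values only; replace with the endpoint and keys of your deployment.
storage = s3_compatible_storage("http://minio:9000", "minio_user", "minio_secret")
```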

Google Cloud Storage

PuppyGraph supports Google Cloud Storage (GCS) as data storage with the following authentication methods:

Authentication with Instance-associated Service Account

"storage": {
  "type": "GCS",
  "useComputeEngineServiceAccount": "true"
}

Authentication with Service Account Key

"storage": {
  "type": "GCS",
  "serviceAccountEmail": "acc_name@project.iam.gserviceaccount.com",
  "serviceAccountPrivateKeyId": "<service_account_private_key_id>",
  "serviceAccountPrivateKey": "-----BEGIN PRIVATE KEY-----\nabcded\n-----END PRIVATE KEY-----\n"
}

The table below outlines the GCS parameters in the storage section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the data storage. Set the value to GCS. |
| useComputeEngineServiceAccount | No | Whether to use the service account associated with the Compute Engine instance for accessing GCS. Set the value to true or false. |
| serviceAccountEmail | No | The email address of the service account for accessing GCS. Required for authentication with Service Account Key. |
| serviceAccountPrivateKeyId | No | The private key ID of the service account for accessing GCS. Required for authentication with Service Account Key. |
| serviceAccountPrivateKey | No | The private key of the service account for accessing GCS. Required for authentication with Service Account Key. |
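
The three service account values are typically taken from a service account key file downloaded from Google Cloud, which is a JSON document containing `client_email`, `private_key_id`, and `private_key` fields. The sketch below maps those fields to the storage parameters. The mapping is an assumption based on the parameter names, and the function name is hypothetical.

```python
import json


def gcs_storage_from_key(key_json):
    """Map a Google service account key (JSON text) to a GCS storage section."""
    key = json.loads(key_json)
    return {
        "type": "GCS",
        # Assumed mapping from key-file fields to storage parameters.
        "serviceAccountEmail": key["client_email"],
        "serviceAccountPrivateKeyId": key["private_key_id"],
        "serviceAccountPrivateKey": key["private_key"],
    }
```

In practice the key text would be read from the downloaded file, for example with `pathlib.Path("key.json").read_text()`, where `key.json` is a placeholder path.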

Azure Blob Storage

PuppyGraph supports Azure Blob Storage as data storage with the following authentication methods:

Authentication with Shared Key

"storage": {
  "type": "AzureBlob",
  "storageAccount": "<azure_storage_account_name>",
  "sharedKey": "<azure_shared_key>"
}

Authentication with SAS (Shared Access Signatures) Token

"storage": {
  "type": "AzureBlob",
  "storageAccount": "<azure_storage_account_name>",
  "storageContainer": "<azure_storage_container_name>",
  "sasToken": "<azure_storage_container_sas_token>"
}

The following table describes the parameters you need to configure in storage.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the data storage. Set the value to AzureBlob. |
| storageAccount | Yes | The name of the Azure Storage account. |
| sharedKey | No | The Shared Key of the Azure Storage account. Required for authentication with Shared Key. |
| storageContainer | No | The name of the storage container. |
| sasToken | No | The account or container SAS token. Required for authentication with SAS (Shared Access Signatures) token. |

Azure Data Lake Storage Gen2

PuppyGraph supports Azure Data Lake Storage Gen2 as data storage with the following authentication methods:

Authentication with Shared Key

"storage": {
  "type": "AzureDLS2",
  "storageAccount": "<azure_storage_account_name>",
  "sharedKey": "<azure_shared_key>"
}

Authentication with Client Secret of Service Principal

"storage": {
  "type": "AzureDLS2",
  "clientId": "<azure_service_principal_client_id>",
  "clientSecret": "<azure_service_principal_secret>",
  "clientEndpoint": "<azure_service_principal_endpoint>"
}

Authentication with Managed Identities

"storage": {
  "type": "AzureDLS2",
  "useManagedIdentity": "true",
  "clientId": "<azure_service_principal_client_id>",
  "tenantId": "<azure_service_principal_tenant_id>"
}

The following table describes the parameters you need to configure in storage.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the data storage. Set the value to AzureDLS2. |
| storageAccount | No | The name of the Azure Storage account. |
| sharedKey | No | The Shared Key of the Azure Storage account. |
| clientId | No | The Client ID of the service principal. |
| tenantId | No | The Tenant ID of the managed identity. |
| clientSecret | No | The Client Secret of the service principal. |
| clientEndpoint | No | The Client Endpoint of the service principal. |
| useManagedIdentity | No | Whether to authenticate with Managed Identities. Set the value to true or false. |
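
Because every parameter except `type` is optional, it is easy to mix settings from different authentication methods. The sketch below lists the parameters used by each of the three examples above and reports which method a storage section completes. The names `ADLS2_METHODS` and `adls2_auth_method` are hypothetical, and the rule that exactly one method must be complete is an assumption of this example.

```python
# Parameters used by each authentication example in this section.
ADLS2_METHODS = {
    "shared_key": ("storageAccount", "sharedKey"),
    "service_principal": ("clientId", "clientSecret", "clientEndpoint"),
    "managed_identity": ("useManagedIdentity", "clientId", "tenantId"),
}


def adls2_auth_method(storage):
    """Return the single authentication method a storage section completes."""
    if storage.get("type") != "AzureDLS2":
        raise ValueError("not an AzureDLS2 storage section")
    present = {key for key, value in storage.items() if value}
    # useManagedIdentity only counts when it is explicitly "true".
    if storage.get("useManagedIdentity") != "true":
        present.discard("useManagedIdentity")
    matches = [name for name, keys in ADLS2_METHODS.items()
               if set(keys) <= present]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one complete method, found {matches}")
    return matches[0]
```

For example, a section containing only `type`, `storageAccount`, and `sharedKey` is reported as `"shared_key"`.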