Data Lake Catalog

Catalog Parameters Overview

{
  "name": "name",
  "type": "type",
  "metastore": {
    "type": "HMS/glue/rest/unity",
    "hiveMetastoreUrl": "HMS.hiveMetastoreUrl",
    "useInstanceProfile": "glue.useInstanceProfile",
    "region": "glue.region",
    "accessKey": "glue.accessKey",
    "secretKey": "glue.secretKey",
    "iamRoleArn": "glue.iamRoleArn",
    "uri": "rest.uri",
    "warehouse": "rest.warehouse",
    "security": "rest.security",
    "session": "rest.session",
    "credential": "rest.credential",
    "host": "databricks.host",
    "token": "databricks.token",
    "databricksCatalogName": "databricks.catalog"
  },
  "storage": {
    "type": "S3/GCS/AzureBlob/AzureDLS2",
    "useInstanceProfile": "S3.useInstanceProfile",
    "region": "S3.region",
    "accessKey": "S3.accessKey",
    "secretKey": "S3.secretKey",
    "iamRoleArn": "S3.iamRoleArn",
    "enableSsl": "S3.enableSsl",
    "endpoint": "S3.endpoint",
    "enablePathStyleAccess": "S3.enablePathStyleAccess",
    "useComputeEngineServiceAccount": "GCS.useComputeEngineServiceAccount",
    "serviceAccountEmail": "GCS.serviceAccountEmail",
    "serviceAccountPrivateKeyId": "GCS.serviceAccountPrivateKeyId",
    "serviceAccountPrivateKey": "GCS.serviceAccountPrivateKey",
    "impersonationServiceAccount": "GCS.impersonationServiceAccount",
    "storageAccount": "[AzureBlob/AzureDLS2].storageAccount",
    "sharedKey": "[AzureBlob/AzureDLS2].sharedKey",
    "storageContainer": "AzureBlob.storageContainer",
    "sasToken": "AzureBlob.sasToken",
    "clientId": "AzureDLS2.clientId",
    "clientSecret": "AzureDLS2.clientSecret",
    "clientEndpoint": "AzureDLS2.clientEndpoint",
    "useManagedIdentity": "AzureDLS2.useManagedIdentity",
    "tenantId": "AzureDLS2.tenantId"
  }
}
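| Parameter | Required | Description |
| --- | --- | --- |
| name | Yes | The name of the catalog. |
| type | Yes | The type of the catalog. |
| metastore | Yes | Metastore parameters. |
| storage | Yes | Data storage parameters. Not needed when HDFS is used with Hive Metastore. |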
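
As an illustration of how the four top-level parameters fit together, the following Python sketch assembles one catalog entry and prints it as JSON. The helper name build_catalog and the catalog name are hypothetical, and the catalog type value "iceberg" is an assumption, since this section does not list the accepted types.

```python
import json


def build_catalog(name, catalog_type, metastore, storage=None):
    """Assemble one catalog entry from its top-level parameters."""
    if not name or not catalog_type or not metastore:
        raise ValueError("name, type and metastore are required")
    catalog = {"name": name, "type": catalog_type, "metastore": metastore}
    # The storage section is left out when None (HDFS with Hive Metastore).
    if storage is not None:
        catalog["storage"] = storage
    return catalog


catalog = build_catalog(
    name="example_catalog",   # hypothetical name
    catalog_type="iceberg",   # assumed value, not listed in this section
    metastore={"type": "glue", "useInstanceProfile": "true", "region": "us-east-1"},
    storage={"useInstanceProfile": "true", "region": "us-east-1"},
)
print(json.dumps(catalog, indent=2))
```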

Metastore Parameters

Hive Metastore

PuppyGraph supports Hive Metastore (HMS) as a catalog metastore:

"metastore": {
  "type": "HMS",
  "hiveMetastoreUrl": "<hive_metastore_uri>"
}

The table below outlines the Hive Metastore parameters in the metastore section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the metastore. Set the value to HMS. |
| hiveMetastoreUrl | Yes | The URI of the Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port>. |
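
A malformed URI is easy to miss in a hand-written configuration. The Python sketch below checks the thrift://<host>:<port> shape before building the section. The helper name hms_metastore is hypothetical, and the check is only a local sanity test. It does not contact the metastore.

```python
from urllib.parse import urlparse


def hms_metastore(url):
    """Return an HMS metastore section after checking the thrift://host:port shape."""
    parsed = urlparse(url)
    try:
        port = parsed.port  # raises ValueError when the port is not numeric
    except ValueError:
        port = None
    if parsed.scheme != "thrift" or not parsed.hostname or port is None:
        raise ValueError(f"expected thrift://<host>:<port>, got {url!r}")
    return {"type": "HMS", "hiveMetastoreUrl": url}


print(hms_metastore("thrift://10.0.0.5:9083"))
```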

AWS Glue

PuppyGraph supports AWS Glue as a catalog metastore with the following authentication methods:

Authentication with Instance profile

"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>"
}

Authentication with IAM Roles

"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>",
  "iamRoleArn": "<iam_role_arn>"
}

Authentication with IAM User Access keys

"metastore": {
  "type": "glue",
  "useInstanceProfile": "false",
  "region": "<aws_glue_region>",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>"
}

The table below outlines the AWS Glue parameters in the metastore section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the metastore. Set the value to glue. |
| useInstanceProfile | Yes | Whether to use role-based authentication (explicit IAM roles or an attached instance profile). Set the value to true or false. |
| region | Yes | The region of the AWS Glue Data Catalog. Example: us-east-1. See AWS Glue endpoints and quotas for more details. |
| accessKey | No | The access key of the IAM user for accessing the AWS Glue Data Catalog. Required for authentication with IAM user access keys. |
| secretKey | No | The secret key of the IAM user for accessing the AWS Glue Data Catalog. Required for authentication with IAM user access keys. |
| iamRoleArn | No | The ARN of the IAM role for accessing the AWS Glue Data Catalog. Required for authentication with IAM roles. |
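
The three Glue authentication methods differ only in a few keys, which the following Python sketch makes explicit. The helper name glue_metastore is hypothetical. Rejecting a mix of an IAM role and access keys is this sketch's own choice, made because the section presents them as separate methods.

```python
def glue_metastore(region, iam_role_arn=None, access_key=None, secret_key=None):
    """Build the Glue metastore section for one of the three authentication methods."""
    if (access_key is None) != (secret_key is None):
        raise ValueError("accessKey and secretKey must be provided together")
    if access_key is not None and iam_role_arn is not None:
        # Assumption of this sketch: the methods are mutually exclusive.
        raise ValueError("choose either an IAM role or IAM user access keys")

    section = {"type": "glue", "region": region}
    if access_key is not None:
        # IAM user access keys: role-based authentication is switched off.
        section.update(useInstanceProfile="false", accessKey=access_key, secretKey=secret_key)
    else:
        # Instance profile, optionally with an explicit IAM role.
        section["useInstanceProfile"] = "true"
        if iam_role_arn is not None:
            section["iamRoleArn"] = iam_role_arn
    return section


print(glue_metastore("us-east-1"))
print(glue_metastore("us-east-1", access_key="<iam_user_access_key>", secret_key="<iam_user_secret_key>"))
```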

Iceberg REST Catalog

PuppyGraph supports Iceberg REST Catalog (including Tabular) as a catalog metastore. See the REST Catalog API for details.

Iceberg REST

The minimal configuration of an Iceberg REST metastore is as follows:

"metastore": {
  "type": "rest",
  "uri": "http://rest:8182"
}

Tabular

Tabular (tabular.io) is a managed Iceberg platform. An example of the Tabular metastore configuration is as follows:

"metastore": {
  "type": "rest",
  "uri": "https://api.tabular.io/ws",
  "warehouse": "sample_warehouse",
  "security": "oauth2",
  "session": "user",
  "credential": "t-xxxxxxxxxx:Gy_yyyyyyyyy"
}

The table below outlines the Iceberg REST parameters in the metastore section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the metastore. Set the value to rest. |
| uri | Yes | The server endpoint URI of the REST Catalog. |
| warehouse | No | The name of the Tabular warehouse. Required for the Tabular metastore. |
| credential | No | The Tabular authentication credential. Required for the Tabular metastore. |
| security | No | The security scheme of the REST catalog. Set it to oauth2 when using the Tabular metastore. |
| session | No | Set it to user when using the Tabular metastore. |
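
The difference between the minimal REST configuration and the Tabular one can be sketched in Python as follows. The helper name rest_metastore is hypothetical. The fixed security and session values are taken from the table above.

```python
def rest_metastore(uri, warehouse=None, credential=None):
    """Build a REST metastore section, adding the Tabular keys when they are given."""
    section = {"type": "rest", "uri": uri}
    if warehouse is None and credential is None:
        return section  # minimal Iceberg REST configuration
    if warehouse is None or credential is None:
        raise ValueError("the Tabular metastore needs both warehouse and credential")
    section.update(
        warehouse=warehouse,
        security="oauth2",  # value the table prescribes for Tabular
        session="user",     # value the table prescribes for Tabular
        credential=credential,
    )
    return section


print(rest_metastore("http://rest:8182"))
print(rest_metastore("https://api.tabular.io/ws", "sample_warehouse", "<tabular_credential>"))
```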

Unity Catalog

PuppyGraph supports both Databricks Unity Catalog and OSS Unity Catalog as a catalog metastore:

"metastore": {
  "type": "unity",
  "host": "<databricks_or_unity_server_host_name>",
  "token": "<access_token>",
  "databricksCatalogName": "<catalog_name>"
}

The table below outlines the Unity Catalog parameters in the metastore section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the metastore. Set the value to unity. |
| host | Yes | The server host name of the Unity Catalog. |
| token | Yes | The access token used to send requests to the Unity Catalog server. |
| databricksCatalogName | Yes | The catalog name under the Unity Catalog instance. |
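
The "Required" columns of the four metastore tables can be collected into one lookup, as in the Python sketch below. The names REQUIRED_METASTORE_KEYS and missing_metastore_keys are hypothetical. The check covers only the always-required parameters, not those that depend on the authentication method, and it assumes the type value is matched exactly as written in the tables.

```python
# Always-required parameters per metastore type, copied from the tables above.
REQUIRED_METASTORE_KEYS = {
    "HMS": {"hiveMetastoreUrl"},
    "glue": {"useInstanceProfile", "region"},
    "rest": {"uri"},
    "unity": {"host", "token", "databricksCatalogName"},
}


def missing_metastore_keys(metastore):
    """Return the sorted list of required parameters absent from a metastore section."""
    mtype = metastore.get("type")
    if mtype not in REQUIRED_METASTORE_KEYS:
        raise ValueError(f"unknown metastore type: {mtype!r}")
    return sorted(REQUIRED_METASTORE_KEYS[mtype] - set(metastore))


print(missing_metastore_keys({"type": "unity", "host": "<host>"}))
```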

Data Storage Parameters

HDFS

PuppyGraph supports HDFS as data storage with Hive Metastore. No storage section is needed with this combination.

Amazon S3

PuppyGraph supports Amazon S3 (Simple Storage Service) as data storage with the following authentication methods:

Authentication with Instance profile

"storage": {
  "useInstanceProfile": "true",
  "region": "<aws_s3_region>"
}

Authentication with IAM Roles

"storage": {
  "useInstanceProfile": "true",
  "region": "<aws_s3_region>",
  "iamRoleArn": "<iam_role_arn>"
}

Authentication with IAM User Access keys

"storage": {
  "useInstanceProfile": "false",
  "region": "<aws_s3_region>",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>"
}

The table below outlines the Amazon S3 parameters in the storage section.

| Parameter | Required | Description |
| --- | --- | --- |
| useInstanceProfile | Yes | Whether to use role-based authentication (explicit IAM roles or an attached instance profile). Set the value to true or false. |
| region | Yes | The AWS region for Amazon S3. Example: us-east-1. See Amazon Simple Storage Service endpoints and quotas for more details. |
| accessKey | No | The access key of the IAM user for accessing Amazon S3. Required for authentication with IAM user access keys. |
| secretKey | No | The secret key of the IAM user for accessing Amazon S3. Required for authentication with IAM user access keys. |
| iamRoleArn | No | The ARN of the IAM role for accessing Amazon S3. Required for authentication with IAM roles. |
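
When reviewing an existing configuration, it can help to state which of the three S3 authentication methods a storage section selects. The Python sketch below infers it from the keys shown in the examples above. The function name s3_auth_method and the returned labels are hypothetical.

```python
def s3_auth_method(storage):
    """Name the authentication method implied by an Amazon S3 storage section."""
    use_profile = storage.get("useInstanceProfile")
    if use_profile == "true":
        # Role-based: an explicit role if iamRoleArn is set, else the instance profile.
        return "iam_role" if "iamRoleArn" in storage else "instance_profile"
    if use_profile == "false":
        if "accessKey" in storage and "secretKey" in storage:
            return "access_keys"
        raise ValueError("useInstanceProfile is false but accessKey or secretKey is missing")
    raise ValueError('useInstanceProfile must be the string "true" or "false"')


print(s3_auth_method({"useInstanceProfile": "true", "region": "us-east-1"}))
```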

S3 Compatible Storage

PuppyGraph supports S3 Compatible Storage (e.g. MinIO) as data storage.

"storage": {
  "useInstanceProfile": "false",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>",
  "enableSsl": "{true | false}",
  "endpoint": "<s3_endpoint>",
  "enablePathStyleAccess": "{true | false}"
}

The table below outlines the S3 Compatible parameters in the storage section.

| Parameter | Required | Description |
| --- | --- | --- |
| useInstanceProfile | Yes | Set the value to false. |
| accessKey | Yes | The access key of an IAM user for accessing the S3 compatible storage. |
| secretKey | Yes | The secret key of an IAM user for accessing the S3 compatible storage. |
| enableSsl | Yes | Whether to enable SSL connection for accessing the S3 compatible storage. Set the value to true or false. |
| endpoint | Yes | The S3 compatible storage endpoint. |
| enablePathStyleAccess | Yes | Whether to use path-style access method when accessing the S3 compatible storage. Set the value to true or false. |
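
All six S3 compatible parameters are required, and the boolean ones are written as strings in the examples above. The Python sketch below checks both points. The helper name check_s3_compatible and the MinIO endpoint value are hypothetical.

```python
S3_COMPATIBLE_REQUIRED = (
    "useInstanceProfile", "accessKey", "secretKey",
    "enableSsl", "endpoint", "enablePathStyleAccess",
)


def check_s3_compatible(storage):
    """Return a list of problems found in an S3 compatible storage section."""
    problems = [f"missing {key}" for key in S3_COMPATIBLE_REQUIRED if key not in storage]
    if storage.get("useInstanceProfile", "false") != "false":
        problems.append('useInstanceProfile must be "false"')
    for key in ("enableSsl", "enablePathStyleAccess"):
        if key in storage and storage[key] not in ("true", "false"):
            problems.append(f'{key} must be "true" or "false"')
    return problems


minio = {
    "useInstanceProfile": "false",
    "accessKey": "<iam_user_access_key>",
    "secretKey": "<iam_user_secret_key>",
    "enableSsl": "false",
    "endpoint": "http://minio:9000",  # hypothetical endpoint
    "enablePathStyleAccess": "true",
}
print(check_s3_compatible(minio))
```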

Google Cloud Storage

PuppyGraph supports Google Cloud Storage (GCS) as data storage with the following authentication methods:

Authentication with Instance-associated Service Account

"storage": {
  "type": "GCS",
  "useComputeEngineServiceAccount": "true"
}

Authentication with Service Account Key

"storage": {
  "type": "GCS",
  "serviceAccountEmail": "acc_name@project.iam.gserviceaccount.com",
  "serviceAccountPrivateKeyId": "<service_account_private_key_id>",
  "serviceAccountPrivateKey": "-----BEGIN PRIVATE KEY-----\nabcded\n-----END PRIVATE KEY-----\n"
}

The table below outlines the GCS parameters in the storage section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the data storage. Set the value to GCS. |
| useComputeEngineServiceAccount | No | Whether to use the service account associated with the Compute Engine instance for accessing GCS. Set the value to true or false. |
| serviceAccountEmail | No | The email address of the service account for accessing GCS. Required for authentication with Service Account Key. |
| serviceAccountPrivateKeyId | No | The private key ID of the service account for accessing GCS. Required for authentication with Service Account Key. |
| serviceAccountPrivateKey | No | The private key of the service account for accessing GCS. Required for authentication with Service Account Key. |
Azure Blob Storage

PuppyGraph supports Azure Blob Storage as data storage with the following authentication methods:

Authentication with Shared Key

"storage": {
  "type": "AzureBlob",
  "storageAccount": "<azure_storage_account_name>",
  "sharedKey": "<azure_shared_key>"
}

Authentication with SAS (Shared Access Signatures) Token

"storage": {
  "type": "AzureBlob",
  "storageAccount": "<azure_storage_account_name>",
  "storageContainer": "<azure_storage_container_name>",
  "sasToken": "<azure_storage_container_sas_token>"
}

The following table describes the parameters you need to configure in storage.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the data storage. Set the value to AzureBlob. |
| storageAccount | Yes | The name of the Azure Storage Account. |
| sharedKey | No | The Shared Key of the Azure Storage Account. |
| storageContainer | No | The name of the Storage Container. |
| sasToken | No | The account or container SAS Token. Required for authentication with SAS (Shared Access Signatures) Token. |
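
The two Azure Blob Storage variants can be sketched in Python as follows. The helper name azure_blob_storage is hypothetical. Requiring the container together with the SAS token mirrors the example above and is otherwise an assumption of this sketch.

```python
def azure_blob_storage(storage_account, shared_key=None, container=None, sas_token=None):
    """Build an AzureBlob storage section for Shared Key or SAS Token authentication."""
    section = {"type": "AzureBlob", "storageAccount": storage_account}
    if shared_key is not None and container is None and sas_token is None:
        section["sharedKey"] = shared_key
    elif shared_key is None and container is not None and sas_token is not None:
        section["storageContainer"] = container
        section["sasToken"] = sas_token
    else:
        raise ValueError("pass either shared_key, or container together with sas_token")
    return section


print(azure_blob_storage("<azure_storage_account_name>", shared_key="<azure_shared_key>"))
```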

Azure Data Lake Storage Gen2

PuppyGraph supports Azure Data Lake Storage Gen2 as data storage with the following authentication methods:

Authentication with Shared Key

"storage": {
  "type": "AzureDLS2",
  "storageAccount": "<azure_storage_account_name>",
  "sharedKey": "<azure_shared_key>"
}

Authentication with Client Secret of Service Principal

"storage": {
  "type": "AzureDLS2",
  "clientId": "<azure_service_principal_client_id>",
  "clientSecret": "<azure_service_principal_secret>",
  "clientEndpoint": "<azure_service_principal_endpoint>"
}

Authentication with Managed Identities

"storage": {
  "type": "AzureDLS2",
  "useManagedIdentity": "true",
  "clientId": "<azure_service_principal_client_id>",
  "tenantId": "<azure_service_principal_tenant_id>"
}

The following table describes the parameters you need to configure in storage.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the data storage. Set the value to AzureDLS2. |
| storageAccount | No | The name of the Azure Storage Account. |
| sharedKey | No | The Shared Key of the Azure Storage Account. |
| clientId | No | The Client ID of the service principal. |
| tenantId | No | The Tenant ID of the managed identity. |
| clientSecret | No | The Client Secret of the service principal. |
| clientEndpoint | No | The Client Endpoint of the service principal. |
| useManagedIdentity | No | Whether to authenticate with Managed Identities. Set the value to true or false. |
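
Because every authentication parameter in this table is optional on its own, it is easy to end up with a section that matches none of the three methods. The Python sketch below names the method a section selects, using the key combinations from the examples above. The function name adls2_auth_method and the returned labels are hypothetical, and insisting on exactly one match is this sketch's own rule.

```python
def adls2_auth_method(storage):
    """Name the single authentication method an AzureDLS2 storage section matches."""
    if storage.get("type") != "AzureDLS2":
        raise ValueError("type must be AzureDLS2")
    keys = set(storage)
    matches = []
    if {"storageAccount", "sharedKey"} <= keys:
        matches.append("shared_key")
    if {"clientId", "clientSecret", "clientEndpoint"} <= keys:
        matches.append("service_principal")
    if storage.get("useManagedIdentity") == "true" and {"clientId", "tenantId"} <= keys:
        matches.append("managed_identity")
    if len(matches) != 1:
        raise ValueError(f"expected exactly one authentication method, found {matches}")
    return matches[0]


print(adls2_auth_method({
    "type": "AzureDLS2",
    "useManagedIdentity": "true",
    "clientId": "<azure_service_principal_client_id>",
    "tenantId": "<azure_service_principal_tenant_id>",
}))
```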