Data Lake Catalog
Catalog Parameters Overview
{
"name": "name",
"type": "type",
"metastore": {
"type": "HMS/glue/rest/unity",
"hiveMetastoreUrl": "HMS.hiveMetastoreUrl",
"useInstanceProfile": "glue.useInstanceProfile",
"region": "glue.region",
"accessKey": "glue.accessKey",
"secretKey": "glue.secretKey",
"iamRoleArn": "glue.iamRoleArn",
"uri": "rest.uri",
"warehouse": "rest.warehouse",
"security": "rest.security",
"session": "rest.session",
"credential": "rest.credential",
"host": "databricks.host",
"token": "databricks.token",
"databricksCatalogName": "databricks.catalog"
},
"storage": {
"type": "S3/GCS/AzureBlob/AzureDLS2",
"useInstanceProfile": "S3.useInstanceProfile",
"region": "S3.region",
"accessKey": "S3.accessKey",
"secretKey": "S3.secretKey",
"iamRoleArn": "S3.iamRoleArn",
"enableSsl": "S3.enableSsl",
"endpoint": "S3.endpoint",
"enablePathStyleAccess": "S3.enablePathStyleAccess",
"useComputeEngineServiceAccount": "GCS.useComputeEngineServiceAccount",
"serviceAccountEmail": "GCS.serviceAccountEmail",
"serviceAccountPrivateKeyId": "GCS.serviceAccountPrivateKeyId",
"serviceAccountPrivateKey": "GCS.serviceAccountPrivateKey",
"impersonationServiceAccount": "GCS.impersonationServiceAccount",
"storageAccount": "[AzureBlob/AzureDLS2].storageAccount",
"sharedKey": "[AzureBlob/AzureDLS2].sharedKey",
"storageContainer": "AzureBlob.storageContainer",
"sasToken": "AzureBlob.sasToken",
"clientId": "AzureDLS2.clientId",
"clientSecret": "AzureDLS2.clientSecret",
"clientEndpoint": "AzureDLS2.clientEndpoint",
"useManagedIdentity": "AzureDLS2.useManagedIdentity",
"tenantId": "AzureDLS2.tenantId"
}
}
Parameter | Required | Description |
---|---|---|
name | Yes | The name of the catalog |
type | Yes | The type of the catalog |
metastore | Yes | Metastore parameters |
storage | Yes, except with HDFS | Data storage parameters. Omit this section when using HDFS with Hive Metastore. |
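As a rough sketch of how the four top-level parameters fit together, a catalog that pairs AWS Glue with Amazon S3 using IAM user access keys might look like the following. The `<catalog_name>` and `<catalog_type>` values are placeholders introduced here for illustration (this page does not list the accepted catalog types), and the metastore and storage blocks simply repeat the fragments from the AWS Glue and Amazon S3 sections.

```json
{
  "name": "<catalog_name>",
  "type": "<catalog_type>",
  "metastore": {
    "type": "glue",
    "useInstanceProfile": "false",
    "region": "<aws_glue_region>",
    "accessKey": "<iam_user_access_key>",
    "secretKey": "<iam_user_secret_key>"
  },
  "storage": {
    "useInstanceProfile": "false",
    "region": "<aws_s3_region>",
    "accessKey": "<iam_user_access_key>",
    "secretKey": "<iam_user_secret_key>"
  }
}
```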
Metastore Parameters
Hive Metastore
PuppyGraph supports Hive Metastore (HMS) as a catalog metastore.
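A minimal sketch of a Hive Metastore configuration, assuming only the two parameters this section marks as required, could look like this. The address and port are placeholders to replace with your own values.

```json
"metastore": {
  "type": "HMS",
  "hiveMetastoreUrl": "thrift://<metastore_IP_address>:<metastore_port>"
}
```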
The table below outlines the Hive Metastore parameters in the metastore section.
Parameter | Required | Description |
---|---|---|
type | Yes | The type of the metastore. Set the value to HMS. |
hiveMetastoreUrl | Yes | The URI of the Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port>. |
AWS Glue
PuppyGraph supports AWS Glue as a catalog metastore with the following authentication methods:
Authentication with Instance profile
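A sketch of what the instance-profile variant might look like, assuming it is the IAM role configuration without an explicit `iamRoleArn` (the parameter table marks `iamRoleArn` as required only for authentication with IAM roles):

```json
"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>"
}
```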
Authentication with IAM Roles
"metastore": {
"type": "glue",
"useInstanceProfile": "true",
"region": "<aws_glue_region>",
"iamRoleArn": "<iam_role_arn>"
}
Authentication with IAM User Access keys
"metastore": {
"type": "glue",
"useInstanceProfile": "false",
"region": "<aws_glue_region>",
"accessKey": "<iam_user_access_key>",
"secretKey": "<iam_user_secret_key>"
}
The table below outlines the AWS Glue parameters in the metastore section.
Parameter | Required | Description |
---|---|---|
type | Yes | The type of the metastore. Set the value to glue. |
useInstanceProfile | Yes | Whether to use role-based authentication (an explicit IAM role or an attached instance profile). Set the value to true or false. |
region | Yes | The region of the AWS Glue Data Catalog. Example: us-east-1. See AWS Glue endpoints and quotas for more details. |
accessKey | No | The access key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys. |
secretKey | No | The secret key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys. |
iamRoleArn | No | The ARN of the IAM role for accessing the AWS Glue Data Catalog. Required by authentication with IAM roles. |
Iceberg REST Catalog
PuppyGraph supports Iceberg REST Catalog (including Tabular) as a catalog metastore. See the REST Catalog API for details.
Iceberg REST
The minimal configuration of an Iceberg REST metastore is as follows:
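As a sketch, assuming "minimal" means only the two parameters marked as required in the parameter table for this section, the configuration might look like this. The `<rest_catalog_uri>` placeholder is introduced here for illustration and stands for the endpoint of your REST catalog server.

```json
"metastore": {
  "type": "rest",
  "uri": "<rest_catalog_uri>"
}
```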
Tabular
Tabular (tabular.io) is a managed Iceberg platform. An example of the Tabular metastore configuration is as follows:
"metastore": {
"type": "rest",
"uri": "https://api.tabular.io/ws",
"warehouse": "sample_warehouse",
"security": "oauth2",
"session": "user",
"credential": "t-xxxxxxxxxx:Gy_yyyyyyyyy"
}
The table below outlines the Iceberg REST parameters in the metastore section.
Parameter | Required | Description |
---|---|---|
type | Yes | The type of the metastore. Set the value to rest. |
uri | Yes | The server endpoint URI of the REST Catalog. |
warehouse | No | The name of the Tabular warehouse. Required by Tabular metastore. |
credential | No | The Tabular authentication credential. Required by Tabular metastore. |
security | No | The security scheme of the REST catalog. Set it to oauth2 when using Tabular metastore. |
session | No | Set it to user when using Tabular metastore. |
Unity Catalog
PuppyGraph supports both Databricks Unity Catalog and OSS Unity Catalog as a catalog metastore:
"metastore": {
"type": "unity",
"host": "<databricks_or_unity_server_host_name>",
"token": "<access_token>",
"databricksCatalogName": "<catalog_name>"
}
The table below outlines the Unity Catalog parameters in the metastore section.
Parameter | Required | Description |
---|---|---|
type | Yes | The type of the metastore. Set the value to unity. |
host | Yes | The server host name of the Unity Catalog. |
token | Yes | The access token used to authenticate requests to the Unity Catalog server. |
databricksCatalogName | Yes | The catalog name under the Unity Catalog instance. |
Data Storage Parameters
HDFS
PuppyGraph supports HDFS as data storage with Hive Metastore. No storage section is needed with this combination.
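To illustrate, a catalog for this combination might contain only the name, type, and metastore entries. This is a sketch: `<catalog_name>` and `<catalog_type>` are placeholders introduced here, and the metastore block follows the Hive Metastore parameter table.

```json
{
  "name": "<catalog_name>",
  "type": "<catalog_type>",
  "metastore": {
    "type": "HMS",
    "hiveMetastoreUrl": "thrift://<metastore_IP_address>:<metastore_port>"
  }
}
```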
Amazon S3
PuppyGraph supports Amazon S3 (Simple Storage Service) as data storage with the following authentication methods:
Authentication with Instance profile
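A sketch of the instance-profile variant, assuming it matches the IAM role configuration without an explicit `iamRoleArn` (the parameter table marks `iamRoleArn` as required only for authentication with IAM roles):

```json
"storage": {
  "useInstanceProfile": "true",
  "region": "<aws_s3_region>"
}
```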
Authentication with IAM Roles
"storage": {
"useInstanceProfile": "true",
"region": "<aws_s3_region>",
"iamRoleArn": "<iam_role_arn>"
}
Authentication with IAM User Access keys
"storage": {
"useInstanceProfile": "false",
"region": "<aws_s3_region>",
"accessKey": "<iam_user_access_key>",
"secretKey": "<iam_user_secret_key>"
}
The table below outlines the Amazon S3 parameters in the storage section.
Parameter | Required | Description |
---|---|---|
useInstanceProfile | Yes | Whether to use role-based authentication (an explicit IAM role or an attached instance profile). Set the value to true or false. |
region | Yes | The AWS region of the Amazon S3 bucket. Example: us-east-1. See Amazon Simple Storage Service endpoints and quotas for more details. |
accessKey | No | The access key of the IAM user for accessing Amazon S3. Required by authentication with IAM User Access keys. |
secretKey | No | The secret key of the IAM user for accessing Amazon S3. Required by authentication with IAM User Access keys. |
iamRoleArn | No | The ARN of the IAM role for accessing Amazon S3. Required by authentication with IAM roles. |
S3 Compatible Storage
PuppyGraph supports S3 Compatible Storage (e.g. MinIO) as data storage.
"storage": {
"useInstanceProfile": "false",
"accessKey": "<iam_user_access_key>",
"secretKey": "<iam_user_secret_key>",
"enableSsl": "{true | false}",
"endpoint": "<s3_endpoint>",
"enablePathStyleAccess": "{true | false}"
}
The table below outlines the S3 Compatible parameters in the storage section.
Parameter | Required | Description |
---|---|---|
useInstanceProfile | Yes | Set the value to false. |
accessKey | Yes | The access key of an IAM user for accessing the S3 compatible storage. |
secretKey | Yes | The secret key of an IAM user for accessing the S3 compatible storage. |
enableSsl | Yes | Whether to enable SSL connection for accessing the S3 compatible storage. Set the value to true or false. |
endpoint | Yes | The S3 compatible storage endpoint. |
enablePathStyleAccess | Yes | Whether to use path-style access method when accessing the S3 compatible storage. Set the value to true or false. |
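As a concrete illustration, a self-hosted MinIO server reached over plain HTTP might be configured roughly as follows. This is a sketch under assumptions: `<minio_host>`, `<minio_access_key>`, and `<minio_secret_key>` are placeholders introduced here, port 9000 is MinIO's usual default API port and may differ in your deployment, and path-style access is enabled because MinIO deployments commonly expect it.

```json
"storage": {
  "useInstanceProfile": "false",
  "accessKey": "<minio_access_key>",
  "secretKey": "<minio_secret_key>",
  "enableSsl": "false",
  "endpoint": "http://<minio_host>:9000",
  "enablePathStyleAccess": "true"
}
```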
Google Cloud Storage
PuppyGraph supports Google Cloud Storage (GCS) as data storage with the following authentication methods:
Authentication with Instance-associated Service Account
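A sketch of this variant, assuming it needs only the storage type and the `useComputeEngineServiceAccount` flag described in the parameter table for this section:

```json
"storage": {
  "type": "GCS",
  "useComputeEngineServiceAccount": "true"
}
```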
Authentication with Service Account Key
"storage": {
"type": "GCS",
"serviceAccountEmail": "acc_name@project.iam.gserviceaccount.com",
"serviceAccountPrivateKeyId": "<service_account_private_key_id>",
"serviceAccountPrivateKey": "-----BEGIN PRIVATE KEY-----\nabcded\n-----END PRIVATE KEY-----\n"
}
The table below outlines the GCS parameters in the storage section.
Parameter | Required | Description |
---|---|---|
type | Yes | The type of the data storage. Set the value to GCS. |
useComputeEngineServiceAccount | No | Whether to use the service account associated with the Compute Engine instance for accessing GCS. Set the value to true or false. |
serviceAccountEmail | No | The email address of the service account for accessing GCS. Required by authentication with Service Account Key. |
serviceAccountPrivateKeyId | No | The private key ID of the service account for accessing GCS. Required by authentication with Service Account Key. |
serviceAccountPrivateKey | No | The private key of the service account for accessing GCS. Required by authentication with Service Account Key. |
Azure Blob Storage
PuppyGraph supports Azure Blob Storage as data storage with the following authentication methods:
Authentication with Shared Key
"storage": {
"type": "AzureBlob",
"storageAccount": "<azure_storage_account_name>",
"sharedKey": "<azure_shared_key>"
}
Authentication with SAS (Shared Access Signatures) Token
"storage": {
"type": "AzureBlob",
"storageAccount": "<azure_storage_account_name>",
"storageContainer": "<azure_storage_container_name>",
"sasToken": "<azure_storage_container_sas_token>"
}
The following table describes the parameters you need to configure in the storage section.
Parameter | Required | Description |
---|---|---|
type | Yes | The type of the data storage. Set the value to AzureBlob. |
storageAccount | Yes | The name of the Azure Storage Account. |
sharedKey | No | The Shared Key of the Azure Storage Account. Required by authentication with Shared Key. |
storageContainer | No | The name of the Storage Container. |
sasToken | No | The account or container SAS Token. Required by authentication with SAS (Shared Access Signatures) Token. |
Azure Data Lake Storage Gen2
PuppyGraph supports Azure Data Lake Storage Gen2 as data storage with the following authentication methods:
Authentication with Shared Key
"storage": {
"type": "AzureDLS2",
"storageAccount": "<azure_storage_account_name>",
"sharedKey": "<azure_shared_key>"
}
Authentication with Client Secret of Service Principal
"storage": {
"type": "AzureDLS2",
"clientId": "<azure_service_principal_client_id>",
"clientSecret": "<azure_service_principal_secret>",
"clientEndpoint": "<azure_service_principal_endpoint>"
}
Authentication with Managed Identities
"storage": {
"type": "AzureDLS2",
"useManagedIdentity": "true",
"clientId": "<azure_service_principal_client_id>",
"tenantId": "<azure_service_principal_tenant_id>"
}
The following table describes the parameters you need to configure in the storage section.
Parameter | Required | Description |
---|---|---|
type | Yes | The type of the data storage. Set the value to AzureDLS2. |
storageAccount | No | The name of the Azure Storage Account. |
sharedKey | No | The Shared Key of the Azure Storage Account. Required by authentication with Shared Key. |
clientId | No | The Client ID of the service principal. |
tenantId | No | The Tenant ID of the managed identity. |
clientSecret | No | The Client Secret of the service principal. Required by authentication with Client Secret of Service Principal. |
clientEndpoint | No | The Client Endpoint of the service principal. Required by authentication with Client Secret of Service Principal. |
useManagedIdentity | No | Whether to authenticate with Managed Identities. Set the value to true or false. |