Data Lake Catalog

Detailed explanation of parameters in PuppyGraph schemas for accessing Data Lakes.

Catalog Parameters Overview

{
  "name": "name",
  "type": "type",
  "metastore": {
    "type": "HMS/glue/rest",
    "hiveMetastoreUrl": "HMS.hiveMetastoreUrl",
    "useInstanceProfile": "glue.useInstanceProfile",
    "region": "glue.region",
    "accessKey": "glue.accessKey",
    "secretKey": "glue.secretKey",
    "iamRoleArn": "glue.iamRoleArn",
    "uri": "rest.uri",
    "warehouse": "rest.warehouse",
    "security": "rest.security",
    "session": "rest.session",
    "credential": "rest.credential"
  },
  "storage": {
    "type": "S3/GCS/AzureBlob/AzureDLS2",
    "useInstanceProfile": "S3.useInstanceProfile",
    "region": "S3.region",
    "accessKey": "S3.accessKey",
    "secretKey": "S3.secretKey",
    "iamRoleArn": "S3.iamRoleArn",
    "enableSsl": "S3.enableSsl",
    "endpoint": "S3.endpoint",
    "enablePathStyleAccess": "S3.enablePathStyleAccess",
    "useComputeEngineServiceAccount": "GCS.useComputeEngineServiceAccount",
    "serviceAccountEmail": "GCS.serviceAccountEmail",
    "serviceAccountPrivateKeyId": "GCS.serviceAccountPrivateKeyId",
    "serviceAccountPrivateKey": "GCS.serviceAccountPrivateKey",
    "impersonationServiceAccount": "GCS.impersonationServiceAccount",
    "storageAccount": "[AzureBlob/AzureDLS2].storageAccount",
    "sharedKey": "[AzureBlob/AzureDLS2].sharedKey",
    "storageContainer": "AzureBlob.storageContainer",
    "sasToken": "AzureBlob.sasToken",
    "clientId": "AzureDLS2.clientId",
    "clientSecret": "AzureDLS2.clientSecret",
    "clientEndpoint": "AzureDLS2.clientEndpoint",
    "useManagedIdentity": "AzureDLS2.useManagedIdentity",
    "tenantId": "AzureDLS2.tenantId"
  }
}
| Parameter | Required | Description |
| --- | --- | --- |
| name | Yes | The name of the catalog. |
| type | Yes | The type of the catalog. |
| metastore | Yes | Metastore parameters. |
| storage | Yes | Data storage parameters. |
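
As a rough illustration of the table above, the following sketch checks a catalog definition for its top-level keys before it is placed in a schema. The helper name `missing_catalog_keys` is hypothetical and not part of PuppyGraph. The exemption for HMS reflects the HDFS note later in this document (no storage section is needed with Hive Metastore on HDFS); treating every HMS catalog as exempt is a simplifying assumption.

```python
REQUIRED_KEYS = ("name", "type", "metastore")


def missing_catalog_keys(catalog: dict) -> list:
    """Return the top-level catalog keys that are absent (hypothetical helper)."""
    missing = [key for key in REQUIRED_KEYS if key not in catalog]
    metastore_type = catalog.get("metastore", {}).get("type")
    # Assumption: an HMS catalog may sit on HDFS, where no storage section is needed.
    if "storage" not in catalog and metastore_type != "HMS":
        missing.append("storage")
    return missing


if __name__ == "__main__":
    # Placeholder values; the catalog type string is deliberately left generic.
    catalog = {
        "name": "example_catalog",
        "type": "<catalog_type>",
        "metastore": {"type": "glue", "useInstanceProfile": "true", "region": "us-east-1"},
    }
    print(missing_catalog_keys(catalog))
```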

Metastore Parameters

Hive Metastore

PuppyGraph supports Hive Metastore (HMS) as a catalog metastore:

"metastore": {
  "type": "HMS",
  "hiveMetastoreUrl": "<hive_metastore_uri>"
}

The table below outlines the Hive Metastore parameters in the metastore section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the metastore. Set the value to HMS. |
| hiveMetastoreUrl | Yes | The URI of the Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port>. |
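
The `thrift://<metastore_IP_address>:<metastore_port>` format can be checked before the schema is submitted. The sketch below uses only the Python standard library; `is_valid_hms_url` is a hypothetical helper name, and the check is purely syntactic (it does not contact the metastore).

```python
from urllib.parse import urlparse


def is_valid_hms_url(url: str) -> bool:
    """Check that a URL has the thrift scheme, a host, and a numeric port."""
    try:
        parsed = urlparse(url)
        port = parsed.port  # raises ValueError if the port is not a valid number
    except ValueError:
        return False
    return parsed.scheme == "thrift" and bool(parsed.hostname) and port is not None


if __name__ == "__main__":
    # Example address only; substitute the address of your own metastore.
    print(is_valid_hms_url("thrift://10.0.0.5:9083"))
```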

AWS Glue

PuppyGraph supports AWS Glue as a catalog metastore with the following authentication methods:

Authentication with Instance profile

"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>"
}

Authentication with IAM Roles

"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>",
  "iamRoleArn": "<iam_role_arn>"
}

Authentication with IAM User Access keys

"metastore": {
  "type": "glue",
  "useInstanceProfile": "false",
  "region": "<aws_glue_region>",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>"
}

The table below outlines the AWS Glue parameters in the metastore section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the metastore. Set the value to glue. |
| useInstanceProfile | Yes | Whether to use role-based authentication (explicit IAM roles or an attached instance profile). Set the value to true or false. |
| region | Yes | The region of the AWS Glue Data Catalog. Example: us-east-1. See AWS Glue endpoints and quotas for more details. |
| accessKey | No | The access key of the IAM user for accessing the AWS Glue Data Catalog. Required for authentication with IAM user access keys. |
| secretKey | No | The secret key of the IAM user for accessing the AWS Glue Data Catalog. Required for authentication with IAM user access keys. |
| iamRoleArn | No | The ARN of the IAM role for accessing the AWS Glue Data Catalog. Required for authentication with IAM roles. |
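
The three Glue authentication patterns differ only in `useInstanceProfile` and in which credential fields are present. A minimal sketch of a builder that picks the pattern from its arguments is shown below. The function `glue_metastore` is hypothetical, and rejecting a mix of access keys and a role ARN is an assumption made here because the examples above never combine them.

```python
def glue_metastore(region, access_key=None, secret_key=None, iam_role_arn=None):
    """Build a Glue metastore section for one of the three documented patterns."""
    section = {"type": "glue", "region": region}
    if access_key or secret_key:
        if not (access_key and secret_key):
            raise ValueError("accessKey and secretKey must be provided together")
        if iam_role_arn:
            # Assumption: the documented examples never mix keys with a role ARN.
            raise ValueError("use either access keys or iamRoleArn, not both")
        section.update(
            {"useInstanceProfile": "false", "accessKey": access_key, "secretKey": secret_key}
        )
    else:
        # Instance profile, optionally with an explicit IAM role.
        section["useInstanceProfile"] = "true"
        if iam_role_arn:
            section["iamRoleArn"] = iam_role_arn
    return section


if __name__ == "__main__":
    print(glue_metastore("us-east-1", iam_role_arn="<iam_role_arn>"))
```

Boolean values are emitted as the strings "true" and "false", matching the examples in this section.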

Iceberg REST Catalog

PuppyGraph supports Iceberg REST Catalog (including Tabular) as a catalog metastore. See the REST Catalog API to learn more about the details.

Iceberg REST

The minimal configuration of an Iceberg REST metastore is as follows:

"metastore": {
  "type": "rest",
  "uri": "http://rest:8182"
}

Tabular

Tabular (tabular.io) is a managed Iceberg platform. An example of the Tabular metastore configuration is as follows:

"metastore": {
  "type": "rest",
  "uri": "https://api.tabular.io/ws",
  "warehouse": "sample_warehouse",
  "security": "oauth2",
  "session": "user",
  "credential": "t-xxxxxxxxxx:Gy_yyyyyyyyy"
}

The table below outlines the Iceberg REST parameters in the metastore section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the metastore. Set the value to rest. |
| uri | Yes | The server endpoint URI of the REST Catalog. |
| warehouse | No | The name of the Tabular warehouse. Required for a Tabular metastore. |
| credential | No | The Tabular authentication credential. Required for a Tabular metastore. |
| security | No | The security scheme of the REST catalog. Set the value to oauth2 when using a Tabular metastore. |
| session | No | Set the value to user when using a Tabular metastore. |
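
The difference between the minimal and the Tabular configuration can be captured in a small builder. This is a sketch only: `rest_metastore` is a hypothetical name, and it assumes that `warehouse` and `credential` are always supplied together, as in the Tabular example above.

```python
def rest_metastore(uri, warehouse=None, credential=None):
    """Build an Iceberg REST metastore section (minimal or Tabular-style)."""
    section = {"type": "rest", "uri": uri}
    if warehouse is None and credential is None:
        return section  # minimal configuration: type and uri only
    if warehouse is None or credential is None:
        raise ValueError("warehouse and credential must be provided together")
    # Tabular-style configuration, using the fixed values from the table above.
    section.update(
        {
            "warehouse": warehouse,
            "credential": credential,
            "security": "oauth2",
            "session": "user",
        }
    )
    return section


if __name__ == "__main__":
    print(rest_metastore("http://rest:8182"))
```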

Data Storage Parameters

HDFS

PuppyGraph supports HDFS as data storage with Hive Metastore. No storage section is needed for this combination.

Amazon S3

PuppyGraph supports Amazon S3 (Simple Storage Service) as data storage with the following authentication methods:

Authentication with Instance profile

"storage": {
  "useInstanceProfile": "true",
  "region": "<aws_s3_region>"
}

Authentication with IAM Roles

"storage": {
  "useInstanceProfile": "true",
  "region": "<aws_s3_region>",
  "iamRoleArn": "<iam_role_arn>"
}

Authentication with IAM User Access keys

"storage": {
  "useInstanceProfile": "false",
  "region": "<aws_s3_region>",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>"
}

The table below outlines the AWS S3 parameters in the storage section.

| Parameter | Required | Description |
| --- | --- | --- |
| useInstanceProfile | Yes | Whether to use role-based authentication (explicit IAM roles or an attached instance profile). Set the value to true or false. |
| region | Yes | The Amazon S3 region. Example: us-east-1. See Amazon Simple Storage Service endpoints and quotas for more details. |
| accessKey | No | The access key of the IAM user for accessing Amazon S3. Required for authentication with IAM user access keys. |
| secretKey | No | The secret key of the IAM user for accessing Amazon S3. Required for authentication with IAM user access keys. |
| iamRoleArn | No | The ARN of the IAM role for accessing Amazon S3. Required for authentication with IAM roles. |

S3 Compatible Storage

PuppyGraph supports S3 Compatible Storage (e.g. MinIO) as data storage.

"storage": {
  "useInstanceProfile": "false",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>",
  "enableSsl": "{true | false}",
  "endpoint": "<s3_endpoint>",
  "enablePathStyleAccess": "{true | false}"
}

The table below outlines the S3 Compatible parameters in the storage section.

| Parameter | Required | Description |
| --- | --- | --- |
| useInstanceProfile | Yes | Set the value to false. |
| accessKey | Yes | The access key of an IAM user for accessing the S3 compatible storage. |
| secretKey | Yes | The secret key of an IAM user for accessing the S3 compatible storage. |
| enableSsl | Yes | Whether to enable SSL connection for accessing the S3 compatible storage. Set the value to true or false. |
| endpoint | Yes | The S3 compatible storage endpoint. |
| enablePathStyleAccess | Yes | Whether to use path-style access method when accessing the S3 compatible storage. Set the value to true or false. |
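
Because all six S3 compatible parameters are required, a quick check can catch a missing key or a template placeholder such as `{true | false}` that was left unfilled. The helper below is a hypothetical sketch, not a PuppyGraph API; the endpoint value in the usage example is made up.

```python
REQUIRED_FIELDS = (
    "useInstanceProfile",
    "accessKey",
    "secretKey",
    "enableSsl",
    "endpoint",
    "enablePathStyleAccess",
)
BOOLEAN_FIELDS = ("enableSsl", "enablePathStyleAccess")


def s3_compatible_problems(storage: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = [f"missing {key}" for key in REQUIRED_FIELDS if key not in storage]
    if storage.get("useInstanceProfile", "false") != "false":
        problems.append('useInstanceProfile must be "false"')
    for key in BOOLEAN_FIELDS:
        if key in storage and storage[key] not in ("true", "false"):
            problems.append(f'{key} must be "true" or "false"')
    return problems


if __name__ == "__main__":
    storage = {
        "useInstanceProfile": "false",
        "accessKey": "<iam_user_access_key>",
        "secretKey": "<iam_user_secret_key>",
        "enableSsl": "false",
        "endpoint": "<s3_endpoint>",  # placeholder, fill in your own endpoint
        "enablePathStyleAccess": "true",
    }
    print(s3_compatible_problems(storage))
```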

Google Cloud Storage

PuppyGraph supports Google Cloud Storage (GCS) as data storage with the following authentication methods:

Authentication with Compute Engine Service Account

"storage": {
  "type": "GCS",
  "useComputeEngineServiceAccount": "true"
}

Authentication with Service Account Key

"storage": {
  "type": "GCS",
  "serviceAccountEmail": "acc_name@project.iam.gserviceaccount.com",
  "serviceAccountPrivateKeyId": "AKIAIOSFODNN7EXAMPLE",
  "serviceAccountPrivateKey": "-----BEGIN PRIVATE KEY-----\nabcded\n-----END PRIVATE KEY-----\n"
}

The table below outlines the GCS parameters in the storage section.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the data storage. Set the value to GCS. |
| useComputeEngineServiceAccount | No | Whether to use the service account attached to the Compute Engine instance for accessing GCS. Set the value to true or false. |
| serviceAccountEmail | No | The email address of the service account for accessing GCS. Required for authentication with Service Account Key. |
| serviceAccountPrivateKeyId | No | The private key ID of the service account for accessing GCS. Required for authentication with Service Account Key. |
| serviceAccountPrivateKey | No | The private key of the service account for accessing GCS. Required for authentication with Service Account Key. |
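
The three Service Account Key fields correspond to entries in the JSON key file that Google Cloud issues for a service account (`client_email`, `private_key_id`, `private_key`). The sketch below copies them into a storage section. The helper name `gcs_storage_from_key` is hypothetical, and the field mapping is an assumption based on the parameter names above rather than something this document states.

```python
import json


def gcs_storage_from_key(key_json: str) -> dict:
    """Map a service account key (JSON text) onto the GCS storage parameters."""
    key = json.loads(key_json)
    return {
        "type": "GCS",
        "serviceAccountEmail": key["client_email"],
        "serviceAccountPrivateKeyId": key["private_key_id"],
        "serviceAccountPrivateKey": key["private_key"],
    }


if __name__ == "__main__":
    # Dummy key content for illustration; a real key file holds more fields.
    sample = json.dumps(
        {
            "client_email": "acc_name@project.iam.gserviceaccount.com",
            "private_key_id": "<private_key_id>",
            "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
        }
    )
    print(json.dumps(gcs_storage_from_key(sample), indent=2))
```

Serializing the result with `json.dumps` keeps the newlines in the private key escaped as `\n`, which matches the form used in the example above.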

Azure Blob Storage

PuppyGraph supports Azure Blob Storage as data storage with the following authentication methods:

Authentication with Shared Key

"storage": {
  "type": "AzureBlob",
  "storageAccount": "<azure_storage_account_name>",
  "sharedKey": "<azure_shared_key>"
}

Authentication with SAS Token

"storage": {
  "type": "AzureBlob",
  "storageAccount": "<azure_storage_account_name>",
  "storageContainer": "<azure_storage_container_name>",
  "sasToken": "<azure_storage_container_sas_token>"
}

The following table describes the parameters you need to configure in storage.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the data storage. Set the value to AzureBlob. |
| storageAccount | Yes | The name of the Azure Storage account. |
| sharedKey | No | The Shared Key of the Azure Storage account. |
| storageContainer | No | The name of the storage container. |
| sasToken | No | The account or container SAS token. Required for authentication with SAS (Shared Access Signature) Token. |

Azure Data Lake Storage Gen2

PuppyGraph supports Azure Data Lake Storage Gen2 as data storage with the following authentication methods:

Authentication with Shared Key

"storage": {
  "type": "AzureDLS2",
  "storageAccount": "<azure_storage_account_name>",
  "sharedKey": "<azure_shared_key>"
}

Authentication with Service Principal

"storage": {
  "type": "AzureDLS2",
  "clientId": "<azure_service_principal_client_id>",
  "clientSecret": "<azure_service_principal_secret>",
  "clientEndpoint": "<azure_service_principal_endpoint>"
}

Authentication with Managed Identities

"storage": {
  "type": "AzureDLS2",
  "useManagedIdentity": "true",
  "clientId": "<azure_service_principal_client_id>",
  "tenantId": "<azure_service_principal_tenant_id>"
}

The following table describes the parameters you need to configure in storage.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of the data storage. Set the value to AzureDLS2. |
| storageAccount | No | The name of the Azure Storage account. |
| sharedKey | No | The Shared Key of the Azure Storage account. |
| clientId | No | The Client ID of the service principal. |
| tenantId | No | The Tenant ID of the managed identity. |
| clientSecret | No | The Client Secret of the service principal. |
| clientEndpoint | No | The Client Endpoint of the service principal. |
| useManagedIdentity | No | Whether to authenticate with Managed Identities. Set the value to true or false. |
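
Since every Azure Data Lake Storage Gen2 parameter except `type` is optional on its own, it can help to check which of the three example configurations a storage section matches. The sketch below is a hypothetical helper: the required field sets are read off the three examples above, and the order of the checks is an assumption.

```python
def adls2_auth_method(storage: dict) -> str:
    """Name the authentication pattern a storage section matches, or raise ValueError."""
    if storage.get("type") != "AzureDLS2":
        raise ValueError('type must be "AzureDLS2"')
    if storage.get("useManagedIdentity") == "true":
        method, required = "managed identity", ("clientId", "tenantId")
    elif "sharedKey" in storage:
        method, required = "shared key", ("storageAccount", "sharedKey")
    else:
        # The clientId / clientSecret / clientEndpoint example (service principal).
        method, required = "service principal", ("clientId", "clientSecret", "clientEndpoint")
    missing = [key for key in required if key not in storage]
    if missing:
        raise ValueError(f"{method} authentication is missing: {', '.join(missing)}")
    return method


if __name__ == "__main__":
    storage = {
        "type": "AzureDLS2",
        "storageAccount": "<azure_storage_account_name>",
        "sharedKey": "<azure_shared_key>",
    }
    print(adls2_auth_method(storage))
```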
