Data Lake Catalog

Detailed explanation of parameters in PuppyGraph schemas for accessing Data Lakes.

Catalog Parameters Overview

{
  "name": "name",
  "type": "type",
  "metastore": {
    "type": "HMS/glue/rest",
    "hiveMetastoreUrl": "HMS.hiveMetastoreUrl",
    "useInstanceProfile": "glue.useInstanceProfile",
    "region": "glue.region",
    "accessKey": "glue.accessKey",
    "secretKey": "glue.secretKey",
    "iamRoleArn": "glue.iamRoleArn",
    "uri": "rest.uri",
    "warehouse": "rest.warehouse",
    "security": "rest.security",
    "session": "rest.session",
    "credential": "rest.credential"
  },
  "storage": {
    "type": "S3/GCS/AzureBlob/AzureDLS2",
    "useInstanceProfile": "S3.useInstanceProfile",
    "region": "S3.region",
    "accessKey": "S3.accessKey",
    "secretKey": "S3.secretKey",
    "iamRoleArn": "S3.iamRoleArn",
    "enableSsl": "S3.enableSsl",
    "endpoint": "S3.endpoint",
    "enablePathStyleAccess": "S3.enablePathStyleAccess",
    "useComputeEngineServiceAccount": "GCS.useComputeEngineServiceAccount",
    "serviceAccountEmail": "GCS.serviceAccountEmail",
    "serviceAccountPrivateKeyId": "GCS.serviceAccountPrivateKeyId",
    "serviceAccountPrivateKey": "GCS.serviceAccountPrivateKey",
    "impersonationServiceAccount": "GCS.impersonationServiceAccount",
    "storageAccount": "[AzureBlob/AzureDLS2].storageAccount",
    "sharedKey": "[AzureBlob/AzureDLS2].sharedKey",
    "storageContainer": "AzureBlob.storageContainer",
    "sasToken": "AzureBlob.sasToken",
    "clientId": "AzureDLS2.clientId",
    "clientSecret": "AzureDLS2.clientSecret",
    "clientEndpoint": "AzureDLS2.clientEndpoint",
    "useManagedIdentity": "AzureDLS2.useManagedIdentity",
    "tenantId": "AzureDLS2.tenantId"
  }
}

| Parameter | Required | Description |
| --- | --- | --- |
| name | Yes | The name of the catalog. |
| type | Yes | The type of your data source. Set the value to iceberg, hudi, or deltalake. For JDBC, refer to JDBC Catalog. |
| metastore | Yes | A set of metastore (catalog) parameters for your data source. |
| storage | Yes | A set of storage parameters for your data source. Not needed when you use HDFS as storage or Tabular as the metastore. |
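As a rough illustration of the top-level rules above, the following Python sketch checks a catalog entry before it is written into a schema file. The helper name `check_catalog` is hypothetical and not part of PuppyGraph. `storage` is deliberately not checked, because the HDFS example on this page omits it.

```python
# Hypothetical helper, not part of PuppyGraph: checks top-level catalog keys.
ALLOWED_TYPES = {"iceberg", "hudi", "deltalake"}


def check_catalog(catalog):
    """Return a list of problems found in a data lake catalog entry."""
    problems = []
    # "storage" is skipped on purpose: the HDFS example on this page omits it.
    for key in ("name", "type", "metastore"):
        if key not in catalog:
            problems.append(f"missing required key: {key}")
    if "type" in catalog and catalog["type"] not in ALLOWED_TYPES:
        problems.append(f"unsupported type: {catalog['type']}")
    return problems


ok = {"name": "c", "type": "hudi", "metastore": {"type": "HMS"}}
print(check_catalog(ok))
print(check_catalog({"name": "c", "type": "jdbc"}))
```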

Metastore Parameters

A set of parameters about how PuppyGraph integrates with the metastore of your data source.

Hive Metastore (HMS)

If you choose Hive Metastore as the metastore of your data source, configure metastore as follows:

"metastore": {
  "type": "HMS",
  "hiveMetastoreUrl": "<hive_metastore_uri>"
}

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of metastore that you use for your data source. Set the value to HMS. |
| hiveMetastoreUrl | Yes | The URI of your Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port>. |
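The URI format above can be checked before the value goes into the schema. This is a minimal sketch using only the Python standard library. The function names and the sample address are illustrative assumptions, not PuppyGraph APIs.

```python
from urllib.parse import urlparse


def is_valid_hms_uri(uri):
    """Check that uri looks like thrift://<host>:<port>."""
    parsed = urlparse(uri)
    try:
        port = parsed.port  # raises ValueError if the port is not a valid number
    except ValueError:
        return False
    return parsed.scheme == "thrift" and bool(parsed.hostname) and port is not None


def hms_metastore(uri):
    """Build the metastore block for a Hive metastore (hypothetical helper)."""
    if not is_valid_hms_uri(uri):
        raise ValueError(f"expected thrift://<host>:<port>, got {uri!r}")
    return {"type": "HMS", "hiveMetastoreUrl": uri}


# The address below is a made-up example value.
print(hms_metastore("thrift://10.0.0.5:9083"))
```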

AWS Glue

If you choose AWS Glue as the metastore of your data source, which is supported only when you choose AWS S3 as storage, take one of the following actions:

  • To choose the instance profile-based authentication method, configure metastore as follows:

"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>"
}
  • To choose the assumed role-based authentication method, configure metastore as follows:

"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>",
  "iamRoleArn": "<iam_role_arn>"
}
  • To choose the IAM user-based authentication method, configure metastore as follows:

"metastore": {
  "type": "glue",
  "useInstanceProfile": "false",
  "region": "<aws_glue_region>",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>"
}

The following table describes the parameters you need to configure in metastore.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of metastore that you use for your data source. Set the value to glue. |
| useInstanceProfile | Yes | Specifies whether to enable the instance profile-based authentication method and the assumed role-based authentication method. Valid values: true and false. |
| region | Yes | The region in which your AWS Glue Data Catalog resides. Example: us-west-1. |
| accessKey | No | The access key of your AWS IAM user. Required if you use the IAM user-based authentication method to access AWS Glue. |
| secretKey | No | The secret key of your AWS IAM user. Required if you use the IAM user-based authentication method to access AWS Glue. |
| iamRoleArn | No | The ARN of the IAM role that has privileges on your AWS Glue Data Catalog. Required if you use the assumed role-based authentication method to access AWS Glue. |
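The three Glue authentication methods differ only in which optional keys accompany `useInstanceProfile`. The sketch below makes that mapping explicit. `glue_metastore` is a hypothetical helper, and treating IAM user keys and an assumed role as mutually exclusive is an assumption drawn from the three separate configurations above.

```python
def glue_metastore(region, iam_role_arn=None, access_key=None, secret_key=None):
    """Build a Glue metastore block for one of the three authentication methods."""
    if (access_key is None) != (secret_key is None):
        raise ValueError("accessKey and secretKey must be given together")
    if access_key is not None and iam_role_arn is not None:
        # Assumption: the page lists these as separate methods, so do not mix them.
        raise ValueError("use either IAM user keys or an assumed role, not both")

    block = {"type": "glue", "region": region}
    if access_key is not None:
        # IAM user-based authentication
        block["useInstanceProfile"] = "false"
        block["accessKey"] = access_key
        block["secretKey"] = secret_key
    else:
        # Instance profile-based authentication, optionally with an assumed role
        block["useInstanceProfile"] = "true"
        if iam_role_arn is not None:
            block["iamRoleArn"] = iam_role_arn
    return block


print(glue_metastore("us-west-1"))
```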

Iceberg Rest/Tabular

If you choose Iceberg Rest or Tabular as the metastore of your data source, take one of the following actions:

  • To use Iceberg Rest, configure metastore as follows:

"metastore": {
  "type": "rest",
  "uri": "http://rest:8182"
}
  • To use Tabular, configure metastore as follows. When Tabular is the metastore, you do not need to set storage parameters.

"metastore": {
  "type": "rest",
  "uri": "https://api.tabular.io/ws",
  "warehouse": "sample_warehouse",
  "security": "oauth2",
  "session": "user",
  "credential": "t-xdfefadffd:Gy_dflksdfex"
}

The following table describes the parameters you need to configure in metastore.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The type of metastore that you use for your data source. Set the value to rest. |
| uri | Yes | The URI of the REST catalog server. |
| warehouse | No | Used for Tabular. The warehouse name. |
| credential | No | Used for Tabular. The authentication secret for the Tabular service. |
| security | No | Used for Tabular. Fixed value: oauth2. |
| session | No | Used for Tabular. Fixed value: user. |
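The difference between a plain Iceberg REST catalog and Tabular comes down to four extra keys, two of which have fixed values. A small sketch of that rule follows. `rest_metastore` is a hypothetical helper, and requiring `warehouse` and `credential` together is an assumption based on the Tabular example above.

```python
def rest_metastore(uri, warehouse=None, credential=None):
    """Build a REST metastore block; pass warehouse and credential for Tabular."""
    block = {"type": "rest", "uri": uri}
    if warehouse is not None or credential is not None:
        if warehouse is None or credential is None:
            # Assumption: the Tabular example sets both, so require both here.
            raise ValueError("Tabular needs both warehouse and credential")
        block.update({
            "warehouse": warehouse,
            "security": "oauth2",  # fixed value for Tabular
            "session": "user",     # fixed value for Tabular
            "credential": credential,
        })
    return block


print(rest_metastore("http://rest:8182"))
```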

Storage Parameters

A set of parameters about how PuppyGraph integrates with your storage system.

HDFS

If you use HDFS as storage, you do not need to configure storage.

AWS S3

If you choose AWS S3 as storage for your data source, take one of the following actions:

  • To choose the instance profile-based authentication method, configure storage as follows:

"storage": {
  "useInstanceProfile": "true",
  "region": "<aws_s3_region>"
}
  • To choose the IAM user-based authentication method, configure storage as follows:

"storage": {
  "useInstanceProfile": "false",
  "region": "<aws_s3_region>",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>"
}

The following table describes the parameters you need to configure in storage.

| Parameter | Required | Description |
| --- | --- | --- |
| useInstanceProfile | Yes | Specifies whether to enable the instance profile-based authentication method and the assumed role-based authentication method. Valid values: true and false. |
| region | Yes | The region in which your AWS S3 bucket resides. Example: us-west-1. |
| accessKey | No | The access key of your IAM user. Required when the IAM user-based authentication method is used. |
| secretKey | No | The secret key of your IAM user. Required when the IAM user-based authentication method is used. |
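The dependency between `useInstanceProfile` and the key pair can be checked mechanically. The sketch below is a hypothetical validator, not a PuppyGraph API. It assumes that boolean-like values must be the strings "true" and "false", as they appear in the examples on this page.

```python
def check_s3_storage(storage):
    """Return a list of problems found in an S3 storage block."""
    problems = []
    for key in ("useInstanceProfile", "region"):
        if key not in storage:
            problems.append(f"missing required key: {key}")

    mode = storage.get("useInstanceProfile")
    if mode not in (None, "true", "false"):
        # Assumption: values are strings, matching the examples on this page.
        problems.append('useInstanceProfile must be the string "true" or "false"')
    if mode == "false":
        # IAM user-based authentication needs both keys.
        for key in ("accessKey", "secretKey"):
            if not storage.get(key):
                problems.append(f"{key} is required when useInstanceProfile is false")
    return problems


print(check_s3_storage({"useInstanceProfile": "true", "region": "us-west-1"}))
```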

MinIO

If you choose MinIO as storage for your data source, configure storage as follows:

"storage": {
  "useInstanceProfile": "false",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>",
  "enableSsl": "{true | false}",
  "endpoint": "<s3_endpoint>",
  "enablePathStyleAccess": "{true | false}"
}

The following table describes the parameters you need to configure in storage.

| Parameter | Required | Description |
| --- | --- | --- |
| useInstanceProfile | Yes | Set the value to false. |
| accessKey | Yes | The access key of your IAM user. |
| secretKey | Yes | The secret key of your IAM user. |
| enableSsl | Yes | Specifies whether to enable SSL connection. Valid values: true and false. |
| endpoint | Yes | The endpoint that is used to connect to your MinIO storage system instead of AWS S3. |
| enablePathStyleAccess | Yes | Specifies whether to enable path-style access. Valid values: true and false. |
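Because all six MinIO parameters are required and the flags are written as the strings "true" and "false" in the examples, a small builder can help avoid passing native booleans by mistake. `minio_storage` is a hypothetical helper. Its defaults mirror the Hudi + Hive metastore + MinIO example on this page, and the endpoint value shown is made up.

```python
def minio_storage(endpoint, access_key, secret_key, enable_ssl=False, path_style=True):
    """Build a MinIO storage block with string-valued flags."""

    def flag(value):
        # The examples on this page write flags as "true"/"false" strings.
        return "true" if value else "false"

    return {
        "useInstanceProfile": "false",  # always false for MinIO
        "accessKey": access_key,
        "secretKey": secret_key,
        "enableSsl": flag(enable_ssl),
        "endpoint": endpoint,
        "enablePathStyleAccess": flag(path_style),
    }


# "http://minio:9000" is an illustrative endpoint, not a documented default.
print(minio_storage("http://minio:9000", "<access_key>", "<secret_key>"))
```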

Google GCS

If you choose Google GCS as storage for your data source, take one of the following actions:

  • To choose the instance VM-based authentication method, configure storage as follows:

"storage": {
  "type": "GCS",
  "useComputeEngineServiceAccount": "true"
}
  • To choose the service account based authentication method, configure storage as follows:

"storage": {
  "type": "GCS",
  "serviceAccountEmail": "acc_name@project.iam.gserviceaccount.com",
  "serviceAccountPrivateKeyId": "AKIAIOSFODNN7EXAMPLE",
  "serviceAccountPrivateKey": "-----BEGIN PRIVATE KEY-----\nabcded\n-----END PRIVATE KEY-----\n"
}

The following table describes the parameters you need to configure in storage.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The storage type. Fixed value: GCS. |
| useComputeEngineServiceAccount | No | Specifies whether to enable the instance VM-based authentication method. Valid value: true. |
| serviceAccountEmail | No | The email address of the service account. |
| serviceAccountPrivateKeyId | No | The private key ID of the service account. |
| serviceAccountPrivateKey | No | The private key of the service account. |
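For the service account method, the three values usually come from a Google service account key file in JSON format, which contains `client_email`, `private_key_id`, and `private_key` fields. The sketch below copies them into a storage block. The mapping from those fields to the parameters above is an assumption based on their names, and `gcs_storage_from_key` is a hypothetical helper.

```python
import json


def gcs_storage_from_key(key_json):
    """Build a GCS storage block from the text of a service account key file."""
    key = json.loads(key_json)
    return {
        "type": "GCS",
        # Assumed mapping from key-file fields to the parameters on this page.
        "serviceAccountEmail": key["client_email"],
        "serviceAccountPrivateKeyId": key["private_key_id"],
        "serviceAccountPrivateKey": key["private_key"],
    }


# A fake key file body, used only to show the shape of the data.
sample_key = json.dumps({
    "client_email": "acc_name@project.iam.gserviceaccount.com",
    "private_key_id": "example-key-id",
    "private_key": "-----BEGIN PRIVATE KEY-----\nabc\n-----END PRIVATE KEY-----\n",
})
print(gcs_storage_from_key(sample_key)["serviceAccountEmail"])
```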

Azure Blob Storage

If you choose Azure Blob Storage as storage for your data source, take one of the following actions:

  • To choose the shared key authentication method, configure storage as follows:

"storage": {
  "type": "AzureBlob",
  "storageAccount": "<Azure_storage_account_name>",
  "sharedKey": "<Azure_shared_key>"
}
  • To choose the SAS token authentication method, configure storage as follows:

"storage": {
  "type": "AzureBlob",
  "storageAccount": "<Azure_storage_account_name>",
  "storageContainer": "<Azure_storage_container_name>",
  "sasToken": "<Azure_storage_container_sas_token>"
}

The following table describes the parameters you need to configure in storage.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The storage type. Fixed value: AzureBlob. |
| storageAccount | Yes | The username of your Blob Storage account. |
| sharedKey | No | The shared key of your Blob Storage account. |
| storageContainer | No | The name of the container that stores your data. |
| sasToken | No | The account or container SAS token used to access your data. |
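The two Azure Blob Storage methods share `type` and `storageAccount` and differ in the credential keys. The following sketch builds either variant. `azure_blob_storage` is a hypothetical helper, and accepting exactly one of the shared key or the SAS token is an assumption based on the two separate configurations above.

```python
def azure_blob_storage(storage_account, shared_key=None, sas_token=None, container=None):
    """Build an AzureBlob storage block for shared key or SAS token authentication."""
    if (shared_key is None) == (sas_token is None):
        # Assumption: the page presents the two methods as alternatives.
        raise ValueError("provide exactly one of shared_key or sas_token")

    block = {"type": "AzureBlob", "storageAccount": storage_account}
    if shared_key is not None:
        block["sharedKey"] = shared_key
    else:
        if container is not None:
            block["storageContainer"] = container
        block["sasToken"] = sas_token
    return block


print(azure_blob_storage("<account>", sas_token="<token>", container="<container>"))
```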

Azure Data Lake Storage Gen2

If you choose Azure Data Lake Storage Gen2 as storage for your data source, take one of the following actions:

  • To choose the shared key authentication method, configure storage as follows:

"storage": {
  "type": "AzureDLS2",
  "storageAccount": "<Azure_storage_account_name>",
  "sharedKey": "<Azure_shared_key>"
}
  • To choose the service principal authentication method, configure storage as follows:

"storage": {
  "type": "AzureDLS2",
  "clientId": "<Azure_service_principal_client_id>",
  "clientSecret": "<Azure_service_principal_secret>",
  "clientEndpoint": "<Azure_service_principal_endpoint>"
}
  • To choose the Managed Identity authentication method, configure storage as follows:

"storage": {
  "type": "AzureDLS2",
  "useManagedIdentity": "true",
  "clientId": "<Azure_managed_identity_client_id>",
  "tenantId": "<Azure_tenant_id>"
}

The following table describes the parameters you need to configure in storage.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The storage type. Fixed value: AzureDLS2. |
| storageAccount | No | The username of your Data Lake Storage Gen2 storage account. |
| sharedKey | No | The shared key of your Data Lake Storage Gen2 storage account. |
| clientId | No | The client ID of the service principal, or the client ID of the managed identity. |
| clientSecret | No | The client secret of the service principal. |
| clientEndpoint | No | The client endpoint of the service principal. |
| useManagedIdentity | No | Specifies whether to enable the Managed Identity authentication method. Valid value: true. |
| tenantId | No | The ID of the tenant whose data you want to access. |
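Since every AzureDLS2 parameter except `type` is optional, the authentication method is implied by which keys are present. The sketch below infers it from a storage block. `adls2_auth_method` is a hypothetical helper, and the order of the checks is an assumption, not documented PuppyGraph behavior.

```python
def adls2_auth_method(storage):
    """Guess which AzureDLS2 authentication method a storage block uses."""
    if storage.get("type") != "AzureDLS2":
        raise ValueError("not an AzureDLS2 storage block")
    if storage.get("useManagedIdentity") == "true":
        return "managed identity"
    if "clientSecret" in storage:
        return "service principal"
    if "sharedKey" in storage:
        return "shared key"
    raise ValueError("no authentication parameters found")


block = {
    "type": "AzureDLS2",
    "useManagedIdentity": "true",
    "clientId": "<client_id>",
    "tenantId": "<tenant_id>",
}
print(adls2_auth_method(block))
```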

Examples

Hudi + Hive metastore + S3

"catalogs": [
  {
    "name": "hudi_hms_s3",
    "type": "hudi",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]

Hudi + Hive metastore + MinIO

"catalogs": [
  {
    "name": "hudi_hms_minio",
    "type": "hudi",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false",
      "endpoint": "<s3_endpoint>",
      "enablePathStyleAccess": "true"
    }
  }
]

DeltaLake + Hive metastore + HDFS

"catalogs": [
  {
    "name": "delta_hms_hdfs",
    "type": "deltalake",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    }
  }
]

DeltaLake + AWS Glue + S3

"catalogs": [
  {
    "name": "delta_glue_s3",
    "type": "deltalake",
    "metastore": {
      "type": "glue",
      "useInstanceProfile": "false",
      "region": "<aws_glue_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]
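The examples above can also be assembled programmatically. The following sketch builds an Iceberg + AWS Glue + S3 entry with instance profile-based authentication and prints it as JSON. The catalog name and region are placeholder values, and the output covers only the `catalogs` fragment shown on this page.

```python
import json

# Placeholder values; replace the name and regions with your own.
catalog = {
    "name": "iceberg_glue_s3",
    "type": "iceberg",
    "metastore": {
        "type": "glue",
        "useInstanceProfile": "true",
        "region": "us-west-1",
    },
    "storage": {
        "useInstanceProfile": "true",
        "region": "us-west-1",
    },
}

# Wrap the entry in the "catalogs" list used by the examples on this page.
schema_fragment = {"catalogs": [catalog]}
text = json.dumps(schema_fragment, indent=2)
print(text)
```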
