Data Lake Catalog

Detailed explanation of parameters in PuppyGraph schemas for accessing Data Lakes.

Catalog Parameters Overview

{
  "name": "name",
  "type": "type",
  "metastore": {
    "type": "HMS/glue/rest",
    "hiveMetastoreUrl": "HMS.hiveMetastoreUrl",
    "useInstanceProfile": "glue.useInstanceProfile",
    "region": "glue.region",
    "accessKey": "glue.accessKey",
    "secretKey": "glue.secretKey",
    "iamRoleArn": "glue.iamRoleArn",
    "uri": "rest.uri",
    "warehouse": "rest.warehouse",
    "security": "rest.security",
    "session": "rest.session",
    "credential": "rest.credential"
  },
  "storage": {
    "type": "S3/GCS/AzureBlob/AzureDLS2",
    "useInstanceProfile": "S3.useInstanceProfile",
    "region": "S3.region",
    "accessKey": "S3.accessKey",
    "secretKey": "S3.secretKey",
    "iamRoleArn": "S3.iamRoleArn",
    "enableSsl": "S3.enableSsl",
    "endpoint": "S3.endpoint",
    "enablePathStyleAccess": "S3.enablePathStyleAccess",
    "useComputeEngineServiceAccount": "GCS.useComputeEngineServiceAccount",
    "serviceAccountEmail": "GCS.serviceAccountEmail",
    "serviceAccountPrivateKeyId": "GCS.serviceAccountPrivateKeyId",
    "serviceAccountPrivateKey": "GCS.serviceAccountPrivateKey",
    "impersonationServiceAccount": "GCS.impersonationServiceAccount",
    "storageAccount": "[AzureBlob/AzureDLS2].storageAccount",
    "sharedKey": "[AzureBlob/AzureDLS2].sharedKey",
    "storageContainer": "AzureBlob.storageContainer",
    "sasToken": "AzureBlob.sasToken",
    "clientId": "AzureDLS2.clientId",
    "clientSecret": "AzureDLS2.clientSecret",
    "clientEndpoint": "AzureDLS2.clientEndpoint",
    "useManagedIdentity": "AzureDLS2.useManagedIdentity",
    "tenantId": "AzureDLS2.tenantId"
  }
}

Metastore Parameters

These parameters define how PuppyGraph integrates with the metastore of your data source.

Hive Metastore (HMS)

If you choose Hive Metastore as the metastore of your data source, configure metastore as follows:

"metastore": {
  "type": "HMS",
  "hiveMetastoreUrl": "<hive_metastore_uri>"
}
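
Hive Metastore endpoints are typically Thrift URIs, and 9083 is the conventional default port. As an illustration only, with a hypothetical host name, a filled-in block might look like this:

```json
{
  "metastore": {
    "type": "HMS",
    "hiveMetastoreUrl": "thrift://metastore.example.internal:9083"
  }
}
```

The outer braces are added only so that the snippet is valid JSON on its own; in a schema, the metastore block sits inside a catalog entry.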

AWS Glue

AWS Glue is supported as the metastore only when AWS S3 is the storage. If you choose AWS Glue as the metastore of your data source, take one of the following actions:

  • To choose the instance profile-based authentication method, configure metastore as follows:

"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>"
}

  • To choose the assumed role-based authentication method, configure metastore as follows:

"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>",
  "iamRoleArn": "<iam_role_arn>"
}

  • To choose the IAM user-based authentication method, configure metastore as follows:

"metastore": {
  "type": "glue",
  "useInstanceProfile": "false",
  "region": "<aws_glue_region>",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>"
}
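
As a sketch of how these pieces fit together, the following catalog pairs the assumed role-based Glue configuration with the instance profile-based S3 storage described under AWS S3. The catalog name is hypothetical, the deltalake type is borrowed from the examples on this page, and only the catalogs portion of a schema is shown.

```json
{
  "catalogs": [
    {
      "name": "delta_glue_role_s3",
      "type": "deltalake",
      "metastore": {
        "type": "glue",
        "useInstanceProfile": "true",
        "region": "<aws_glue_region>",
        "iamRoleArn": "<iam_role_arn>"
      },
      "storage": {
        "useInstanceProfile": "true",
        "region": "<aws_s3_region>"
      }
    }
  ]
}
```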

The following table describes the parameters you need to configure in metastore.

Iceberg Rest/Tabular

If you choose Iceberg Rest or Tabular as the metastore of your data source, take one of the following actions:

  • To use Iceberg Rest, configure metastore as follows:

"metastore": {
  "type": "rest",
  "uri": "http://rest:8182"
}

  • To use Tabular, configure metastore as follows. With Tabular as the metastore, you do not need to set storage parameters.

"metastore": {
  "type": "rest",
  "uri": "https://api.tabular.io/ws",
  "warehouse": "sample_warehouse",
  "security": "oauth2",
  "session": "user",
  "credential": "t-xdfefadffd:Gy_dflksdfex"
}
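
Unlike Tabular, a plain Iceberg Rest metastore still needs a storage block. The following is a minimal sketch that combines the Iceberg Rest settings above with the MinIO storage settings described later on this page. The catalog name is hypothetical, and the "iceberg" type value is an assumption, since this page only shows "hudi" and "deltalake".

```json
{
  "catalogs": [
    {
      "name": "iceberg_rest_minio",
      "type": "iceberg",
      "metastore": {
        "type": "rest",
        "uri": "http://rest:8182"
      },
      "storage": {
        "useInstanceProfile": "false",
        "accessKey": "<iam_user_access_key>",
        "secretKey": "<iam_user_secret_key>",
        "enableSsl": "false",
        "endpoint": "<s3_endpoint>",
        "enablePathStyleAccess": "true"
      }
    }
  ]
}
```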

The following table describes the parameters you need to configure in metastore.

Storage Parameters

These parameters define how PuppyGraph integrates with your storage system.

HDFS

If you use HDFS as storage, you do not need to configure storage.

AWS S3

If you choose AWS S3 as storage for your data source, take one of the following actions:

  • To choose the instance profile-based authentication method, configure storage as follows:

"storage": {
  "useInstanceProfile": "true",
  "region": "<aws_s3_region>"
}

  • To choose the IAM user-based authentication method, configure storage as follows:

"storage": {
  "useInstanceProfile": "false",
  "region": "<aws_s3_region>",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>"
}

The following table describes the parameters you need to configure in storage.

MinIO

If you choose MinIO as storage for your data source, configure storage as follows:

"storage": {
  "useInstanceProfile": "false",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>",
  "enableSsl": "{true | false}",
  "endpoint": "<s3_endpoint>",
  "enablePathStyleAccess": "{true | false}"
}

The following table describes the parameters you need to configure in storage.

Google GCS

If you choose Google GCS as storage for your data source, take one of the following actions:

  • To choose the instance VM-based authentication method, configure storage as follows:

"storage": {
  "type": "GCS",
  "useComputeEngineServiceAccount": "true"
}

  • To choose the service account-based authentication method, configure storage as follows:

"storage": {
  "type": "GCS",
  "serviceAccountEmail": "acc_name@project.iam.gserviceaccount.com",
  "serviceAccountPrivateKeyId": "AKIAIOSFODNN7EXAMPLE",
  "serviceAccountPrivateKey": "-----BEGIN PRIVATE KEY-----\nabcded\n-----END PRIVATE KEY-----\n"
}
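
For context, here is a sketch of a complete catalog entry that uses a Hive Metastore with the instance VM-based GCS configuration. The catalog name is hypothetical, and the deltalake type is carried over from the examples on this page; whether a given table format works with GCS should be confirmed for your deployment.

```json
{
  "catalogs": [
    {
      "name": "delta_hms_gcs",
      "type": "deltalake",
      "metastore": {
        "type": "HMS",
        "hiveMetastoreUrl": "<hive_metastore_uri>"
      },
      "storage": {
        "type": "GCS",
        "useComputeEngineServiceAccount": "true"
      }
    }
  ]
}
```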

The following table describes the parameters you need to configure in storage.

Azure Blob Storage

If you choose Azure Blob Storage as storage for your data source, take one of the following actions:

  • To choose the shared key authentication method, configure storage as follows:

"storage": {
  "type": "AzureBlob",
  "storageAccount": "<Azure_storage_account_name>",
  "sharedKey": "<Azure_shared_key>"
}

  • To choose the SAS token authentication method, configure storage as follows:

"storage": {
  "type": "AzureBlob",
  "storageAccount": "<Azure_storage_account_name>",
  "storageContainer": "<Azure_storage_container_name>",
  "sasToken": "<Azure_storage_container_sas_token>"
}
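
The SAS token variant is the only Azure Blob Storage configuration that names a container. A sketch of a full catalog entry using it with a Hive Metastore might look like the following; the catalog name is hypothetical and the deltalake type is taken from the examples on this page rather than from a confirmed Azure setup.

```json
{
  "catalogs": [
    {
      "name": "delta_hms_azureblob",
      "type": "deltalake",
      "metastore": {
        "type": "HMS",
        "hiveMetastoreUrl": "<hive_metastore_uri>"
      },
      "storage": {
        "type": "AzureBlob",
        "storageAccount": "<Azure_storage_account_name>",
        "storageContainer": "<Azure_storage_container_name>",
        "sasToken": "<Azure_storage_container_sas_token>"
      }
    }
  ]
}
```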

The following table describes the parameters you need to configure in storage.

Azure Data Lake Storage Gen2

If you choose Azure Data Lake Storage Gen2 as storage for your data source, take one of the following actions:

  • To choose the shared key authentication method, configure storage as follows:

"storage": {
  "type": "AzureDLS2",
  "storageAccount": "<Azure_storage_account_name>",
  "sharedKey": "<Azure_shared_key>"
}

  • To choose the service principal authentication method, configure storage as follows:

"storage": {
  "type": "AzureDLS2",
  "clientId": "<Azure_service_principal_client_id>",
  "clientSecret": "<Azure_service_principal_secret>",
  "clientEndpoint": "<Azure_service_principal_endpoint>"
}

  • To choose the Managed Identity authentication method, configure storage as follows:

"storage": {
  "type": "AzureDLS2",
  "useManagedIdentity": "true",
  "clientId": "<Azure_service_principal_client_id>",
  "tenantId": "<Azure_tenant_id>"
}
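
To show one of these blocks in place, the sketch below embeds the service principal configuration in a full catalog entry backed by a Hive Metastore. The catalog name is hypothetical, and the deltalake type is borrowed from the examples on this page, so treat the pairing as an assumption to verify.

```json
{
  "catalogs": [
    {
      "name": "delta_hms_adls2",
      "type": "deltalake",
      "metastore": {
        "type": "HMS",
        "hiveMetastoreUrl": "<hive_metastore_uri>"
      },
      "storage": {
        "type": "AzureDLS2",
        "clientId": "<Azure_service_principal_client_id>",
        "clientSecret": "<Azure_service_principal_secret>",
        "clientEndpoint": "<Azure_service_principal_endpoint>"
      }
    }
  ]
}
```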

The following table describes the parameters you need to configure in storage.

Examples

Hudi + Hive metastore + S3

"catalogs": [
  {
    "name": "hudi_hms_s3",
    "type": "hudi",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]

Hudi + Hive metastore + MinIO

"catalogs": [
  {
    "name": "hudi_hms_minio",
    "type": "hudi",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false",
      "endpoint": "<s3_endpoint>",
      "enablePathStyleAccess": "true"
    }
  }
]

DeltaLake + Hive metastore + HDFS

"catalogs": [
  {
    "name": "delta_hms_hdfs",
    "type": "deltalake",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    }
  }
]

DeltaLake + AWS Glue + S3

"catalogs": [
  {
    "name": "delta_glue_s3",
    "type": "deltalake",
    "metastore": {
      "type": "glue",
      "useInstanceProfile": "false",
      "region": "<aws_glue_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]
