Data Lake Catalog

This page explains the catalog parameters in a PuppyGraph schema that are used to access data lakes.

Catalog Parameters Overview

{
  "name": "name",
  "type": "type",
  "metastore": {
    "type": "metastore.type",
    "hiveMetastoreUrl": "metastore.hiveMetastoreUrl",
    "useInstanceProfile": "metastore.useInstanceProfile",
    "region": "metastore.region",
    "accessKey": "metastore.accessKey",
    "secretKey": "metastore.secretKey",
    "iamRoleArn": "metastore.iamRoleArn"
  },
  "storage": {
    "useInstanceProfile": "storage.useInstanceProfile",
    "region": "storage.region",
    "accessKey": "storage.accessKey",
    "secretKey": "storage.secretKey",
    "iamRoleArn": "storage.iamRoleArn",
    "enableSsl": "storage.enableSsl",
    "endpoint": "storage.endpoint",
    "enablePathStyleAccess": "storage.enablePathStyleAccess"
  }
}

Parameter | Required | Description
name | Yes | The name of the catalog.
type | Yes | The type of your data source. Set the value to iceberg, hudi, or deltalake. For JDBC sources, refer to JDBC Catalog.
metastore | Yes | The parameters for connecting to the catalog or metastore of your data source.
storage | Yes, except for HDFS | The parameters for connecting to the storage system of your data source.

Metastore Parameters

These parameters control how PuppyGraph connects to the metastore of your data source.

Hive Metastore (HMS)

If you choose Hive Metastore as the metastore of your data source, configure metastore as follows:

"metastore": {
  "type": "HMS",
  "hiveMetastoreUrl": "<hive_metastore_uri>"
}

Parameter | Required | Description
type | Yes | The type of metastore that you use for your data source. Set the value to HMS.
hiveMetastoreUrl | Yes | The URI of your Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port>.
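
As an illustration, a filled-in HMS configuration might look like the fragment below. The IP address is a made-up placeholder for this sketch. Port 9083 is the port a Hive metastore Thrift service listens on by default, so substitute whatever your deployment actually uses.

```json
"metastore": {
  "type": "HMS",
  "hiveMetastoreUrl": "thrift://192.0.2.10:9083"
}
```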

AWS Glue

AWS Glue is supported as the metastore only when AWS S3 is the storage. If you choose AWS Glue as the metastore of your data source, take one of the following actions:

  • To choose the instance profile-based authentication method, configure metastore as follows:

"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>"
}
  • To choose the assumed role-based authentication method, configure metastore as follows:

"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "<aws_glue_region>",
  "iamRoleArn": "<iam_role_arn>"
}
  • To choose the IAM user-based authentication method, configure metastore as follows:

"metastore": {
  "type": "glue",
  "useInstanceProfile": "false",
  "region": "<aws_glue_region>",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>"
}

The following table describes the parameters you need to configure in metastore.

Parameter | Required | Description
type | Yes | The type of metastore that you use for your data source. Set the value to glue.
useInstanceProfile | Yes | Specifies whether to enable the instance profile-based and assumed role-based authentication methods. Valid values: true and false. Set the value to false for IAM user-based authentication.
region | Yes | The region in which your AWS Glue Data Catalog resides. Example: us-west-1.
accessKey | No | The access key of your AWS IAM user. Required only if you use the IAM user-based authentication method to access AWS Glue.
secretKey | No | The secret key of your AWS IAM user. Required only if you use the IAM user-based authentication method to access AWS Glue.
iamRoleArn | No | The ARN of the IAM role that has privileges on your AWS Glue Data Catalog. Required only if you use the assumed role-based authentication method to access AWS Glue.
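
For reference, here is what the assumed role-based variant could look like with values filled in. The region is the example region from the table. The account ID and role name in the ARN are hypothetical and only show the standard arn:aws:iam::<account_id>:role/<role_name> shape.

```json
"metastore": {
  "type": "glue",
  "useInstanceProfile": "true",
  "region": "us-west-1",
  "iamRoleArn": "arn:aws:iam::123456789012:role/puppygraph-glue-reader"
}
```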

Storage Parameters

These parameters control how PuppyGraph connects to your storage system.

HDFS

If you use HDFS as storage, you do not need to configure storage.

AWS S3

If you choose AWS S3 as storage for your data source, take one of the following actions:

  • To choose the instance profile-based authentication method, configure storage as follows:

"storage": {
  "useInstanceProfile": "true",
  "region": "<aws_s3_region>"
}
  • To choose the IAM user-based authentication method, configure storage as follows:

"storage": {
  "useInstanceProfile": "false",
  "region": "<aws_s3_region>",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>"
}

The following table describes the parameters you need to configure in storage.

Parameter | Required | Description
useInstanceProfile | Yes | Specifies whether to enable the instance profile-based and assumed role-based authentication methods. Valid values: true and false. Set the value to false for IAM user-based authentication.
region | Yes | The region in which your AWS S3 bucket resides. Example: us-west-1.
accessKey | No | The access key of your IAM user. Required only if you use the IAM user-based authentication method.
secretKey | No | The secret key of your IAM user. Required only if you use the IAM user-based authentication method.

MinIO

If you choose MinIO as storage for your data source, configure storage as follows:

"storage": {
  "useInstanceProfile": "false",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>",
  "enableSsl": "{true | false}",
  "endpoint": "<s3_endpoint>",
  "enablePathStyleAccess": "{true | false}"
}

The following table describes the parameters you need to configure in storage.

Parameter | Required | Description
useInstanceProfile | Yes | Set the value to false.
accessKey | Yes | The access key of your IAM user.
secretKey | Yes | The secret key of your IAM user.
enableSsl | Yes | Specifies whether to enable SSL connection. Valid values: true and false.
endpoint | Yes | The endpoint that is used to connect to your MinIO storage system instead of AWS S3.
enablePathStyleAccess | Yes | Specifies whether to enable path-style access. Valid values: true and false.
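
To make the MinIO settings more concrete, the fragment below sketches a configuration for a MinIO server reached over plain HTTP. The host name minio.example.internal is hypothetical, 9000 is the default MinIO API port, and writing the endpoint as a full URL with a scheme is an assumption of this sketch, so check it against your deployment. Path-style access is enabled here because MinIO buckets are commonly addressed in the URL path rather than as a subdomain.

```json
"storage": {
  "useInstanceProfile": "false",
  "accessKey": "<iam_user_access_key>",
  "secretKey": "<iam_user_secret_key>",
  "enableSsl": "false",
  "endpoint": "http://minio.example.internal:9000",
  "enablePathStyleAccess": "true"
}
```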

Examples

Hudi + Hive metastore + S3

"catalogs": [
  {
    "name": "hudi_hms_s3",
    "type": "hudi",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]

Hudi + Hive metastore + MinIO

"catalogs": [
  {
    "name": "hudi_hms_minio",
    "type": "hudi",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false",
      "endpoint": "<s3_endpoint>",
      "enablePathStyleAccess": "true"
    }
  }
]

DeltaLake + Hive metastore + HDFS

"catalogs": [
  {
    "name": "delta_hms_hdfs",
    "type": "deltalake",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "<hive_metastore_uri>"
    }
  }
]

DeltaLake + AWS Glue + S3

"catalogs": [
  {
    "name": "delta_glue_s3",
    "type": "deltalake",
    "metastore": {
      "type": "glue",
      "useInstanceProfile": "false",
      "region": "<aws_glue_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "<aws_s3_region>",
      "accessKey": "<iam_user_access_key>",
      "secretKey": "<iam_user_secret_key>",
      "enableSsl": "false"
    }
  }
]
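
Iceberg + AWS Glue + S3 (instance profile)

The examples above cover hudi and deltalake catalogs with IAM user credentials. As a further sketch, the same parameters can be combined for an Iceberg catalog that uses AWS Glue and S3 with instance profile-based authentication. The catalog name iceberg_glue_s3 is an arbitrary choice for illustration, and the combination is assembled from the parameter tables on this page rather than taken from a tested deployment.

```json
"catalogs": [
  {
    "name": "iceberg_glue_s3",
    "type": "iceberg",
    "metastore": {
      "type": "glue",
      "useInstanceProfile": "true",
      "region": "<aws_glue_region>"
    },
    "storage": {
      "useInstanceProfile": "true",
      "region": "<aws_s3_region>"
    }
  }
]
```

With instance profile-based authentication, no accessKey or secretKey appears in either block, so no credentials need to be stored in the schema file.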
