Connecting to Iceberg

Apache Iceberg is a high-performance format for huge analytic tables. PuppyGraph supports Iceberg tables as a data source.

Prerequisites

Configuration

The configuration consists of two parts: Metastore (Catalog) and Data Storage. Configure them according to your Iceberg implementation.

See also Tutorials for end-to-end examples for connecting to different Iceberg implementations.

Metastore Configuration

Iceberg REST Catalog

The Iceberg REST Catalog defines a REST protocol that different Iceberg catalog implementations can serve. To use it, select Iceberg-Rest as the metastore type in the Web UI.

Configuration Explanation
RestUri The server endpoint URI of the REST Catalog.
Warehouse The name of the Iceberg warehouse.
Security Security authentication mode of the REST catalog.
Credential The OAuth2 client authentication credential.
Scope Authorization scope for client credentials or token exchange.

Here are specific configurations for different REST Catalog implementations:

Nessie

Project Nessie is a Transactional Catalog for Data Lakes with Git-like semantics. It supports Iceberg REST Protocol.

PuppyGraph works with Nessie version 0.95.0 or higher. Please first configure Nessie Catalog with Iceberg REST before connecting.

  • Set RestUri to the Nessie REST endpoint. Typically this ends with /iceberg, for example http://127.0.0.1:19120/iceberg.
  • Set Warehouse to the Nessie warehouse. This is typically the (base) storage location, for example s3://my-bucket/.
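
Before entering these values in the Web UI, it can help to confirm that the endpoint answers for the chosen warehouse. The Iceberg REST specification defines a GET /v1/config endpoint that accepts an optional warehouse query parameter. The sketch below only builds that URL. The helper name rest_config_url is hypothetical, and whether your Nessie deployment requires authentication for this call is something to verify separately.

```python
from typing import Optional
from urllib.parse import urlencode


def rest_config_url(rest_uri: str, warehouse: Optional[str] = None) -> str:
    """Build the URL of the Iceberg REST `GET /v1/config` endpoint.

    `rest_uri` is the value you plan to use as RestUri and `warehouse`
    the value you plan to use as Warehouse (both are sample inputs here).
    """
    base = rest_uri.rstrip("/") + "/v1/config"
    if warehouse:
        # urlencode percent-encodes ':' and '/' in the warehouse value.
        return base + "?" + urlencode({"warehouse": warehouse})
    return base


if __name__ == "__main__":
    print(rest_config_url("http://127.0.0.1:19120/iceberg", "s3://my-bucket/"))
```

Fetching the printed URL with any HTTP client should return a JSON document if RestUri and Warehouse are consistent with the catalog's setup.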
Polaris

Apache Polaris is an open-source, fully-featured catalog for Iceberg that implements Iceberg's REST API. PuppyGraph supports both Apache Polaris and Snowflake Open Catalog, a managed service for Polaris.

  • Set RestUri to the Polaris REST endpoint.
  • Set Warehouse to the Polaris warehouse accordingly.
  • Set Scope to the proper PRINCIPAL_ROLE in Polaris. The format is PRINCIPAL_ROLE:<ROLE_NAME>.
  • Set Credential to the client auth credential if needed. The format is <CLIENT-ID>:<CLIENT-SECRET>.
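
The Scope and Credential formats above are plain string conventions, so they are easy to assemble and check programmatically. The following is a minimal sketch; the function names are hypothetical and the role name and client values are placeholders, not values from a real Polaris deployment.

```python
def polaris_scope(role_name: str) -> str:
    """Return a Scope string in the form PRINCIPAL_ROLE:<ROLE_NAME>."""
    if not role_name:
        raise ValueError("role_name must not be empty")
    return f"PRINCIPAL_ROLE:{role_name}"


def polaris_credential(client_id: str, client_secret: str) -> str:
    """Return a Credential string in the form <CLIENT-ID>:<CLIENT-SECRET>."""
    if not client_id or not client_secret:
        raise ValueError("client_id and client_secret must not be empty")
    if ":" in client_id:
        # The first ':' separates the two parts, so the ID must not contain one.
        raise ValueError("client_id must not contain ':'")
    return f"{client_id}:{client_secret}"


if __name__ == "__main__":
    print(polaris_scope("my_role"))                      # placeholder role name
    print(polaris_credential("my-client", "my-secret"))  # placeholder values
```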
Tabular

Tabular is a cloud-native managed storage engine that provides a range of services on top of Apache Iceberg tables.

  • Set RestUri to the Tabular REST endpoint. It is https://api.tabular.io/ws by default.
  • Set Warehouse to the name of the Tabular warehouse.
  • Set Credential to the Tabular member or service credential.

AWS Glue

Configuration Explanation
Region The region of the AWS Glue Data Catalog. Example: us-east-1. See AWS Glue endpoints and quotas for more details.
Use instance profile Whether to use role-based authentication (explicit IAM roles or an attached instance profile).
IAM Role ARN The ARN of the IAM role for accessing the AWS Glue Data Catalog. Required by authentication with IAM roles.
Access key The access key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys.
Secret key The secret key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys.

Hive Metastore

Configuration Explanation
Hive metastore URI The URI of your Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port>.
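
A malformed metastore URI is a common source of connection failures, so it can be worth checking the value against the thrift://<metastore_IP_address>:<metastore_port> format before saving it. The sketch below uses only the Python standard library; the function name parse_metastore_uri is hypothetical.

```python
from typing import Tuple
from urllib.parse import urlsplit


def parse_metastore_uri(uri: str) -> Tuple[str, int]:
    """Validate a Hive metastore URI and return (host, port)."""
    parts = urlsplit(uri)
    if parts.scheme != "thrift":
        raise ValueError(f"expected scheme 'thrift', got {parts.scheme!r}")
    # Accessing .port raises ValueError by itself if the port is not numeric.
    if not parts.hostname or parts.port is None:
        raise ValueError("URI must include both a host and a port")
    return parts.hostname, parts.port


if __name__ == "__main__":
    print(parse_metastore_uri("thrift://127.0.0.1:9083"))
```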

Data Storage Configuration

Get from metastore

There is no need to specify Storage configuration with the following implementation of Iceberg:

Select Get from metastore in the Web UI for these implementations.

Amazon S3 (Simple Storage Service)

PuppyGraph supports Amazon S3 (Simple Storage Service) for Iceberg.

Configuration Explanation
Region The AWS region of the Amazon S3 bucket. Example: us-east-1. See Amazon Simple Storage Service endpoints and quotas for more details.
Use instance profile Whether to use role-based authentication (explicit IAM roles or an attached instance profile).
IAM Role ARN The ARN of the IAM role for accessing Amazon S3. Required for authentication with IAM roles.
Access key The access key of the IAM user for accessing Amazon S3. Required for authentication with IAM user access keys.
Secret key The secret key of the IAM user for accessing Amazon S3. Required for authentication with IAM user access keys.

Amazon S3 Compatible Storage

PuppyGraph supports S3 Compatible Storage (e.g. MinIO) for Iceberg.

Configuration Explanation
Endpoint The S3 compatible storage endpoint.
Access key The access key of an IAM user for accessing the S3 compatible storage.
Secret key The secret key of an IAM user for accessing the S3 compatible storage.
Enable SSL Whether to enable SSL connection for accessing the S3 compatible storage.
Enable path style access Whether to use path-style access method when accessing the S3 compatible storage.
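
In the JSON form of the configuration, these options appear as a storage object. As a rough sketch, the helper below assembles such an object using the key names and string-valued booleans that appear in the MinIO example configurations on this page. The function name and its defaults are hypothetical, and the endpoint and credentials are the sample values from those examples.

```python
import json


def s3_compatible_storage(endpoint: str, access_key: str, secret_key: str,
                          enable_ssl: bool = False,
                          path_style_access: bool = True) -> dict:
    """Build a storage object for S3 compatible storage such as MinIO."""
    def flag(value: bool) -> str:
        # The example configurations write booleans as the strings "true"/"false".
        return "true" if value else "false"

    return {
        "useInstanceProfile": "false",
        "accessKey": access_key,
        "secretKey": secret_key,
        "enableSsl": flag(enable_ssl),
        "endpoint": endpoint,
        "enablePathStyleAccess": flag(path_style_access),
    }


if __name__ == "__main__":
    storage = s3_compatible_storage("http://127.0.0.1:9000", "admin", "password")
    print(json.dumps(storage, indent=2))
```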

Tutorials

Data Type Mapping

Iceberg Type PuppyGraph Type Description
boolean Boolean True or false
int Int 32-bit signed integers
long Long 64-bit signed integers
float Float 32-bit IEEE 754 floating point
double Double 64-bit IEEE 754 floating point
decimal(P, S) Decimal(P, S) Fixed-point decimal; precision P, scale S
string String Arbitrary-length character sequences
date Date Calendar date without timezone or time
time Time Time of day, microsecond precision, without date or timezone
timestamp DateTime Timestamp, microsecond precision, without timezone
list Array<E> A collection of values with some element type
struct Struct<field1 E1[,field2 E2...]> A tuple of typed values
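
When planning a schema, the table above can be applied mechanically to a list of Iceberg column types. The lookup below is an illustrative sketch that covers only the primitive types and decimal(P, S) from the table; list and struct are left out because they are parameterized by nested types. The names ICEBERG_TO_PUPPYGRAPH and puppygraph_type are hypothetical and not part of any PuppyGraph API.

```python
import re

# Primitive mappings copied from the table above.
ICEBERG_TO_PUPPYGRAPH = {
    "boolean": "Boolean",
    "int": "Int",
    "long": "Long",
    "float": "Float",
    "double": "Double",
    "string": "String",
    "date": "Date",
    "time": "Time",
    "timestamp": "DateTime",
}

_DECIMAL = re.compile(r"decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def puppygraph_type(iceberg_type: str) -> str:
    """Map a primitive or decimal Iceberg type name to its PuppyGraph type."""
    name = iceberg_type.strip().lower()
    if name in ICEBERG_TO_PUPPYGRAPH:
        return ICEBERG_TO_PUPPYGRAPH[name]
    match = _DECIMAL.fullmatch(name)
    if match:
        precision, scale = match.groups()
        return f"Decimal({precision}, {scale})"
    raise ValueError(f"no mapping in this sketch for {iceberg_type!r}")


if __name__ == "__main__":
    for column_type in ["long", "timestamp", "decimal(10, 2)"]:
        print(column_type, "->", puppygraph_type(column_type))
```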

Example Configurations

Please refer to Data Lake Catalog for detailed parameters for each type of catalog and storage.

Catalog Type      Storage Type           Example Configuration
REST Catalog      Amazon S3              REST Catalog + Amazon S3
REST Catalog      MinIO                  REST Catalog + MinIO
AWS Glue          Amazon S3              AWS Glue + Amazon S3
Hive Metastore    HDFS                   Hive Metastore + HDFS
Hive Metastore    Amazon S3              Hive Metastore + Amazon S3
Hive Metastore    MinIO                  Hive Metastore + MinIO
Hive Metastore    Google GCS             Hive Metastore + Google GCS
Hive Metastore    Azure Blob             Hive Metastore + Azure Blob
Hive Metastore    Azure Data Lake Gen2   Hive Metastore + Azure Data Lake Gen2

REST Catalog + Amazon S3

"catalogs": [
  {
    "name": "iceberg_rest_s3",
    "type": "iceberg",
    "metastore": {
      "type": "rest",
      "uri": "http://127.0.0.1:8181"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "enableSsl": "false"
    }
  }
]
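
Note that a fragment like the one above is a "catalogs" member, not a complete JSON document on its own. As a quick sanity check before pasting it into a larger schema file, it can be wrapped in braces and parsed. The sketch below does that and checks for the keys that every example on this page contains (name, type, metastore). The function name load_catalogs is hypothetical, and the check is not an official schema validation.

```python
import json

REQUIRED_KEYS = {"name", "type", "metastore"}


def load_catalogs(fragment: str) -> list:
    """Parse a '"catalogs": [...]' fragment and check each entry's keys."""
    document = json.loads("{" + fragment + "}")
    catalogs = document["catalogs"]
    for catalog in catalogs:
        missing = REQUIRED_KEYS - catalog.keys()
        if missing:
            raise ValueError(f"catalog entry is missing keys: {sorted(missing)}")
    return catalogs


if __name__ == "__main__":
    fragment = '''
    "catalogs": [
      {
        "name": "iceberg_rest_s3",
        "type": "iceberg",
        "metastore": {"type": "rest", "uri": "http://127.0.0.1:8181"}
      }
    ]
    '''
    for catalog in load_catalogs(fragment):
        print(catalog["name"], catalog["metastore"]["type"])
```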

REST Catalog + MinIO

"catalogs": [
  {
    "name": "iceberg_rest_minio",
    "type": "iceberg",
    "metastore": {
      "type": "rest",
      "uri": "http://127.0.0.1:8181"
    },
    "storage": {
      "useInstanceProfile": "false",
      "accessKey": "admin",
      "secretKey": "password",
      "enableSsl": "false",
      "endpoint": "http://127.0.0.1:9000",
      "enablePathStyleAccess": "true"
    }
  }
]

AWS Glue + Amazon S3

"catalogs": [
  {
    "name": "iceberg_glue_s3",
    "type": "iceberg",
    "metastore": {
      "type": "glue",
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "enableSsl": "false"
    }
  }
]

Hive Metastore + HDFS

"catalogs": [
  {
    "name": "iceberg_hms_hdfs",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    }
  }
]

Hive Metastore + MinIO

"catalogs": [
  {
    "name": "iceberg_hms_minio",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "useInstanceProfile": "false",
      "accessKey": "admin",
      "secretKey": "password",
      "enableSsl": "false",
      "endpoint": "http://127.0.0.1:9000",
      "enablePathStyleAccess": "true"
    }
  }
]

Hive Metastore + Amazon S3

"catalogs": [
  {
    "name": "iceberg_hms_s3",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "enableSsl": "false"
    }
  }
]

Hive Metastore + Google GCS

"catalogs": [
  {
    "name": "iceberg_hms_gcs",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "type": "GCS",
      "serviceAccountEmail": "acc_name@project.iam.gserviceaccount.com",
      "serviceAccountPrivateKeyId": "AKIAIOSFODNN7EXAMPLE",
      "serviceAccountPrivateKey": "-----BEGIN PRIVATE KEY-----\nabcded\n-----END PRIVATE KEY-----\n"
    }
  }
]

Hive Metastore + Azure Blob

"catalogs": [
  {
    "name": "iceberg_hms_azblob",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "type": "AzureBlob",
      "storageAccount": "account_name",
      "storageContainer": "container_name",
      "sasToken": "sp=rl&st=2020-12-15T03:19:48Z&se=2024-12-12T11:19:48Z&sv=2022-11-02&sr=c&sig=1"
    }
  }
]

Hive Metastore + Azure Data Lake Gen2

"catalogs": [
  {
    "name": "iceberg_hms_azgen2",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "type": "AzureDLS2",
      "clientId": "000000-avaf-aaaa-bbbb-aba988azfa",
      "clientSecret": "EXAMPLEvonefPJabcde",
      "clientEndpoint": "https://login.microsoftonline.com/000000-avaf-aaaa-bbbb-aba988azfa/oauth2/token"
    }
  }
]