Connecting to Iceberg

Apache Iceberg is a high-performance format for huge analytic tables. This document describes how to connect to Iceberg tables in PuppyGraph.

See also Tutorials for end-to-end tutorials for connecting to different Iceberg implementations.

Prerequisites

Configuration

The configuration consists of two parts: Metastore (Catalog) and Data Storage. Configure both according to your Iceberg implementation.

Metastore Configuration

Iceberg REST

Iceberg REST Catalog defines a REST protocol allowing different Iceberg catalog implementations. To use it, select Iceberg-Rest as the metastore type in the Web UI.

Configuration Explanation
RestUri The server endpoint URI of the REST Catalog.
Warehouse The name of the Iceberg warehouse.
Security Security authentication mode of the REST catalog.
Credential The OAuth2 client authentication credential.
Scope The authorization scope for client credentials or token exchange.

The following sections provide specific configurations for different REST Catalog implementations.

Nessie

Project Nessie is a transactional catalog for data lakes with Git-like semantics. PuppyGraph supports Nessie through the Iceberg REST protocol.

PuppyGraph works with Nessie version 0.95.0 or higher. Please first configure Nessie Catalog with Iceberg REST before connecting.

After selecting Iceberg-Rest as the metastore type:

  • Set RestUri to the Nessie REST endpoint. Typically, this ends with /iceberg, for example http://127.0.0.1:19120/iceberg.
  • Set Warehouse to the Nessie warehouse. This is typically the (base) storage location, for example s3://my-bucket/.
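Before entering the RestUri in the Web UI, a quick sanity check can catch a wrong endpoint. The sketch below is only a heuristic built on the "typically ends with /iceberg" note above; the function name check_nessie_rest_uri is hypothetical and is not part of PuppyGraph or Nessie.

```python
from typing import List
from urllib.parse import urlparse


def check_nessie_rest_uri(uri: str) -> List[str]:
    """Return warnings for a candidate RestUri; an empty list means it looks plausible.

    Hypothetical helper: it only inspects the string and never contacts the server.
    """
    warnings = []
    parsed = urlparse(uri)
    # The endpoint should be an absolute HTTP(S) URL with a host part.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        warnings.append("RestUri should be an absolute http(s) URL")
    # Nessie's Iceberg REST endpoint typically ends with /iceberg (a trailing slash is tolerated).
    if not parsed.path.rstrip("/").endswith("/iceberg"):
        warnings.append("Nessie's Iceberg REST endpoint typically ends with /iceberg")
    return warnings
```

A warning here does not prove the URI is wrong, since a deployment may expose the endpoint under a different path.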

See also the following tutorials:

Polaris

Apache Polaris is an open-source, fully-featured catalog for Iceberg that implements Iceberg's REST API. PuppyGraph supports both Apache Polaris and Snowflake Open Catalog, a managed service for Polaris.

After selecting Iceberg-Rest as the metastore type:

  • Set RestUri to the Polaris REST endpoint.
  • Set Warehouse to the Polaris warehouse accordingly.
  • Set Scope to the proper PRINCIPAL_ROLE in Polaris. The format is PRINCIPAL_ROLE:<ROLE_NAME>.
  • Set Credential to the client auth credential if needed. The format is <CLIENT-ID>:<CLIENT-SECRET>.
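The Scope and Credential values are plain strings in the formats given above. As a minimal sketch, the two helpers below assemble them; the function names and the sample role and client values are hypothetical, and only the string formats come from this page.

```python
def polaris_scope(role_name: str) -> str:
    """Build the Scope value in the PRINCIPAL_ROLE:<ROLE_NAME> format."""
    if not role_name:
        raise ValueError("role_name must not be empty")
    return f"PRINCIPAL_ROLE:{role_name}"


def polaris_credential(client_id: str, client_secret: str) -> str:
    """Build the Credential value in the <CLIENT-ID>:<CLIENT-SECRET> format."""
    if not client_id or not client_secret:
        raise ValueError("client_id and client_secret must not be empty")
    return f"{client_id}:{client_secret}"


# Sample values for illustration only.
scope = polaris_scope("my_role")                            # "PRINCIPAL_ROLE:my_role"
credential = polaris_credential("my-client", "my-secret")   # "my-client:my-secret"
```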

See also the following tutorials for Polaris and Snowflake Open Catalog:

Tabular

Tabular is a cloud-native managed storage engine that provides a range of services on top of Apache Iceberg tables.

After selecting Iceberg-Rest as the metastore type:

  • Set RestUri to the Tabular REST endpoint. It is https://api.tabular.io/ws by default.
  • Set Warehouse to the name of the Tabular warehouse.
  • Set Credential to the Tabular member or service credential.

AWS Glue

Configuration Explanation
Region The region of the AWS Glue Data Catalog. Example: us-east-1. See AWS Glue endpoints and quotas for more details.
Use instance profile Whether to use role-based authentication (explicit IAM roles or an attached instance profile).
IAM Role ARN The ARN of the IAM role for accessing the AWS Glue Data Catalog. Required by authentication with IAM roles.
Access key The access key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys.
Secret key The secret key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys.

Hive Metastore

Configuration Explanation
Hive metastore URI The URI of your Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port>.
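A malformed metastore URI is easier to catch before it is entered. The following sketch checks a value against the thrift://<metastore_IP_address>:<metastore_port> format using only the Python standard library; the function name parse_hive_metastore_uri is hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_hive_metastore_uri(uri: str) -> Tuple[str, int]:
    """Split a thrift://host:port URI into (host, port), raising ValueError if malformed."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift":
        raise ValueError("Hive metastore URI must start with thrift://")
    # parsed.port itself raises ValueError when the port is not a valid number.
    if not parsed.hostname or parsed.port is None:
        raise ValueError("Hive metastore URI must include both host and port")
    return parsed.hostname, parsed.port
```

For the address used in the examples on this page, parse_hive_metastore_uri("thrift://127.0.0.1:9083") returns ("127.0.0.1", 9083).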

Data Storage Configuration

Get from metastore

There is no need to specify Storage configuration with the following implementation of Iceberg:

Select Get from metastore in the Web UI for these implementations.

Amazon S3 (Simple Storage Service)

PuppyGraph supports Amazon S3 (Simple Storage Service) for Iceberg.

Configuration Explanation
Region The region of the Amazon S3. Example: us-east-1. See Amazon Simple Storage Service endpoints and quotas for more details.
Use instance profile Whether to use role-based authentication (Explicit IAM roles or instance-profile attached).
IAM Role ARN The ARN of the IAM role for accessing Amazon S3. Required by authentication with IAM roles.
Access key The access key of the IAM user for accessing Amazon S3. Required by authentication with IAM User Access keys.
Secret key The secret key of the IAM user for accessing Amazon S3. Required by authentication with IAM User Access keys.

Amazon S3 Compatible Storage

PuppyGraph supports S3 Compatible Storage (e.g. MinIO) for Iceberg.

Configuration Explanation
Endpoint The S3 compatible storage endpoint.
Access key The access key of an IAM user for accessing the S3 compatible storage.
Secret key The secret key of an IAM user for accessing the S3 compatible storage.
Enable SSL Whether to enable SSL connection for accessing the S3 compatible storage.
Enable path style access Whether to use path-style access method when accessing the S3 compatible storage.
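The fields above map onto the "storage" block used in the MinIO example configurations on this page. As a rough sketch, the helper below builds that block; the function name is hypothetical, the key names and string-valued booleans are copied from those examples, and deriving enableSsl from the endpoint scheme is an assumption rather than documented behavior.

```python
from urllib.parse import urlparse


def s3_compatible_storage(endpoint: str, access_key: str, secret_key: str,
                          path_style: bool = True) -> dict:
    """Build a "storage" block for S3 compatible storage such as MinIO."""
    scheme = urlparse(endpoint).scheme
    if scheme not in ("http", "https"):
        raise ValueError("endpoint must start with http:// or https://")
    return {
        "useInstanceProfile": "false",
        "accessKey": access_key,
        "secretKey": secret_key,
        # Assumption: SSL is enabled exactly when the endpoint uses https.
        "enableSsl": "true" if scheme == "https" else "false",
        "endpoint": endpoint,
        "enablePathStyleAccess": "true" if path_style else "false",
    }
```

Calling s3_compatible_storage("http://127.0.0.1:9000", "admin", "password") reproduces the storage block of the MinIO examples.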

GCS (Google Cloud Storage)

PuppyGraph supports GCS (Google Cloud Storage) for Iceberg.

Using VM instance

Configuration Explanation
useComputeEngineServiceAccount Whether to use the Compute Engine service account. Set to true for this mode.

Using a VM instance to impersonate a service account

Configuration Explanation
useComputeEngineServiceAccount Whether to use the Compute Engine service account. Set to true for this mode.
impersonationServiceAccount The service account to impersonate.

Using service account

Configuration Explanation
serviceAccountEmail Service account email address associated with your Google Cloud project
serviceAccountPrivateKeyId Unique identifier for the service account's private key
serviceAccountPrivateKey Service account's private key for authentication

Using a service account to impersonate another service account

Configuration Explanation
serviceAccountEmail Service account email address associated with your Google Cloud project
serviceAccountPrivateKeyId Unique identifier for the service account's private key
serviceAccountPrivateKey Service account's private key for authentication
impersonationServiceAccount Impersonation service account
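The four GCS modes differ only in which fields are present. The sketch below builds a "storage" block for each of them; the function name is hypothetical, the key names come from the tables above, "type": "GCS" comes from the Hive Metastore + Google GCS example, and writing true as the string "true" is an assumption based on the string-valued booleans in the other examples.

```python
from typing import Optional


def gcs_storage_config(
    service_account_email: Optional[str] = None,
    private_key_id: Optional[str] = None,
    private_key: Optional[str] = None,
    impersonation_service_account: Optional[str] = None,
) -> dict:
    """Return a "storage" block for one of the four GCS authentication modes."""
    storage = {"type": "GCS"}
    key_fields = (service_account_email, private_key_id, private_key)
    if all(key_fields):
        # Service account mode: authenticate with an explicit private key.
        storage["serviceAccountEmail"] = service_account_email
        storage["serviceAccountPrivateKeyId"] = private_key_id
        storage["serviceAccountPrivateKey"] = private_key
    elif any(key_fields):
        raise ValueError("service account mode needs email, private key id and private key")
    else:
        # VM instance mode: rely on the Compute Engine service account.
        storage["useComputeEngineServiceAccount"] = "true"
    if impersonation_service_account:
        # Either mode can additionally impersonate another service account.
        storage["impersonationServiceAccount"] = impersonation_service_account
    return storage
```

With no arguments the function returns the VM instance configuration; passing only impersonation_service_account yields the VM instance impersonation mode.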

Tutorials

Data Type Mapping

Iceberg Type PuppyGraph Type Description
boolean Boolean True or false
int Int 32-bit signed integers
long Long 64-bit signed integers
float Float 32-bit IEEE 754 floating point
double Double 64-bit IEEE 754 floating point
decimal(P, S) Decimal(P, S) Fixed-point decimal; precision P, scale S
string String Arbitrary-length character sequences
date Date Calendar date without timezone or time
time Time Time of day, microsecond precision, without date or timezone
timestamp DateTime Timestamp, microsecond precision, without timezone
list Array<E> A collection of values with some element type
struct Struct<field1 E1[,field2 E2...]> A tuple of typed values
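The primitive rows of the table can be expressed as a lookup. This is an illustrative sketch only: the names ICEBERG_TO_PUPPYGRAPH and puppygraph_type are hypothetical, and the nested list and struct types are deliberately left out because their element and field types need recursive handling.

```python
import re

# Primitive type names taken from the mapping table above.
ICEBERG_TO_PUPPYGRAPH = {
    "boolean": "Boolean",
    "int": "Int",
    "long": "Long",
    "float": "Float",
    "double": "Double",
    "string": "String",
    "date": "Date",
    "time": "Time",
    "timestamp": "DateTime",
}

# Matches decimal(P, S) with optional whitespace around the numbers.
_DECIMAL = re.compile(r"decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def puppygraph_type(iceberg_type: str) -> str:
    """Map a primitive Iceberg type name to the PuppyGraph type listed in the table."""
    name = iceberg_type.strip().lower()
    match = _DECIMAL.fullmatch(name)
    if match:
        return f"Decimal({match.group(1)}, {match.group(2)})"
    if name not in ICEBERG_TO_PUPPYGRAPH:
        raise ValueError(f"no mapping listed for Iceberg type {iceberg_type!r}")
    return ICEBERG_TO_PUPPYGRAPH[name]
```

For example, puppygraph_type("timestamp") gives "DateTime" and puppygraph_type("decimal(10, 2)") gives "Decimal(10, 2)".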

Example Configurations

Please refer to Data Lake Catalog for detailed parameters for each type of catalog and storage.

Catalog Type      Storage Type            Example Configuration
REST Catalog      Amazon S3               REST Catalog + Amazon S3
REST Catalog      MinIO                   REST Catalog + MinIO
AWS Glue          Amazon S3               AWS Glue + Amazon S3
Hive Metastore    HDFS                    Hive Metastore + HDFS
Hive Metastore    Amazon S3               Hive Metastore + Amazon S3
Hive Metastore    MinIO                   Hive Metastore + MinIO
Hive Metastore    Google GCS              Hive Metastore + Google GCS
Hive Metastore    Azure Blob              Hive Metastore + Azure Blob
Hive Metastore    Azure Data Lake Gen2    Hive Metastore + Azure Data Lake Gen2

REST Catalog + Amazon S3

"catalogs": [
  {
    "name": "iceberg_rest_s3",
    "type": "iceberg",
    "metastore": {
      "type": "rest",
      "uri": "http://127.0.0.1:8181"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "enableSsl": "false"
    }
  }
]

REST Catalog + MinIO

"catalogs": [
  {
    "name": "iceberg_rest_minio",
    "type": "iceberg",
    "metastore": {
      "type": "rest",
      "uri": "http://127.0.0.1:8181"
    },
    "storage": {
      "useInstanceProfile": "false",
      "accessKey": "admin",
      "secretKey": "password",
      "enableSsl": "false",
      "endpoint": "http://127.0.0.1:9000",
      "enablePathStyleAccess": "true"
    }
  }
]

AWS Glue + Amazon S3

"catalogs": [
  {
    "name": "iceberg_glue_s3",
    "type": "iceberg",
    "metastore": {
      "type": "glue",
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "enableSsl": "false"
    }
  }
]

Hive Metastore + HDFS

"catalogs": [
  {
    "name": "iceberg_hms_hdfs",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    }
  }
]

Hive Metastore + MinIO

"catalogs": [
  {
    "name": "iceberg_hms_minio",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "useInstanceProfile": "false",
      "accessKey": "admin",
      "secretKey": "password",
      "enableSsl": "false",
      "endpoint": "http://127.0.0.1:9000",
      "enablePathStyleAccess": "true"
    }
  }
]

Hive Metastore + Amazon S3

"catalogs": [
  {
    "name": "iceberg_hms_s3",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "enableSsl": "false"
    }
  }
]

Hive Metastore + Google GCS

"catalogs": [
  {
    "name": "iceberg_hms_gcs",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "type": "GCS",
      "serviceAccountEmail": "acc_name@project.iam.gserviceaccount.com",
      "serviceAccountPrivateKeyId": "AKIAIOSFODNN7EXAMPLE",
      "serviceAccountPrivateKey": "-----BEGIN PRIVATE KEY-----\nabcded\n-----END PRIVATE KEY-----\n"
    }
  }
]

Hive Metastore + Azure Blob

"catalogs": [
  {
    "name": "iceberg_hms_azblob",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "type": "AzureBlob",
      "storageAccount": "account_name",
      "storageContainer": "container_name",
      "sasToken": "sp=rl&st=2020-12-15T03:19:48Z&se=2024-12-12T11:19:48Z&sv=2022-11-02&sr=c&sig=1"
    }
  }
]

Hive Metastore + Azure Data Lake Gen2

"catalogs": [
  {
    "name": "iceberg_hms_azgen2",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "type": "AzureDLS2",
      "clientId": "000000-avaf-aaaa-bbbb-aba988azfa",
      "clientSecret": "EXAMPLEvonefPJabcde",
      "clientEndpoint": "https://login.microsoftonline.com/000000-avaf-aaaa-bbbb-aba988azfa/oauth2/token"
    }
  }
]