Connecting to Iceberg
Apache Iceberg is a high-performance format for huge analytic tables. PuppyGraph supports Iceberg tables as a data source.
Prerequisites
- Both the Iceberg Catalog and Storage are accessible over the network from the PuppyGraph instance.
- PuppyGraph supports various Iceberg implementations. Please see Metastore Configuration and Data Storage Configuration for supported implementations.
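One way to check the first prerequisite is to attempt a TCP connection from the PuppyGraph host to each endpoint. The Python sketch below does that. The function name `is_reachable` is hypothetical, and a successful connection only shows that the port is open, not that the catalog or its credentials are valid.

```python
import socket
from urllib.parse import urlparse


def is_reachable(uri: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URI's host and port succeeds."""
    parsed = urlparse(uri)
    # Only http/https have a well-known default port; other schemes
    # (for example thrift://) must state the port explicitly.
    default_ports = {"http": 80, "https": 443}
    port = parsed.port or default_ports.get(parsed.scheme)
    if parsed.hostname is None or port is None:
        raise ValueError(f"cannot determine host and port from {uri!r}")
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, `is_reachable("http://127.0.0.1:8181")` would probe a REST catalog at the address used in the example configurations on this page.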
Configuration
The configuration consists of two parts: Metastore (Catalog) and Data Storage. Please configure them according to your Iceberg implementation.
See also Tutorials for end-to-end examples for connecting to different Iceberg implementations.
Metastore Configuration
Iceberg REST Catalog
Iceberg REST Catalog defines a REST protocol allowing different Iceberg catalog implementations.
To use it, select Iceberg-Rest as the metastore type in the Web UI.
Configuration | Explanation |
---|---|
RestUri | The server endpoint URI of the REST Catalog. |
Warehouse | The name of the Iceberg warehouse. |
Security | Security authentication mode of the REST catalog. |
Credential | The OAuth2 client authentication credential. |
Scope | Authorization scope for client credentials or token exchange. |
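As a sanity check on RestUri and Warehouse, the Iceberg REST specification defines a `GET /v1/config` route that accepts an optional `warehouse` query parameter. The sketch below only builds that URL so it can be opened with any HTTP client. The helper name `config_url` is hypothetical, and whether the route answers without authentication depends on the catalog.

```python
from typing import Optional
from urllib.parse import urlencode


def config_url(rest_uri: str, warehouse: Optional[str] = None) -> str:
    """Build the Iceberg REST /v1/config URL for a given RestUri."""
    # Avoid a double slash when RestUri is written with a trailing "/".
    base = rest_uri.rstrip("/") + "/v1/config"
    if warehouse:
        # urlencode percent-encodes characters such as ":" and "/".
        return base + "?" + urlencode({"warehouse": warehouse})
    return base


print(config_url("http://127.0.0.1:8181/", "my_wh"))
```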
Here are specific configurations for different REST Catalog implementations:
Nessie
Project Nessie is a Transactional Catalog for Data Lakes with Git-like semantics. It supports Iceberg REST Protocol.
PuppyGraph works with Nessie version 0.95.0 or higher. Please first configure Nessie Catalog with Iceberg REST before connecting.
- Set RestUri to the Nessie REST endpoint. Typically, this ends with /iceberg, like http://127.0.0.1:19120/iceberg.
- Set Warehouse to the Nessie warehouse. This is typically the (base) storage location, for example s3://my-bucket/.
Polaris
Apache Polaris is an open-source, fully-featured catalog for Iceberg that implements Iceberg's REST API. PuppyGraph supports both Apache Polaris and Snowflake Open Catalog, a managed service for Polaris.
- Set RestUri to the Polaris REST endpoint.
- Set Warehouse to the Polaris warehouse accordingly.
- Set Scope to the proper PRINCIPAL_ROLE in Polaris. The format is PRINCIPAL_ROLE:<ROLE_NAME>.
- Set Credential to the client auth credential if needed. The format is <CLIENT-ID>:<CLIENT-SECRET>.
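The two string formats above are easy to get wrong when they are assembled by hand or in scripts. The following sketch builds them from their parts. Both helper names and the sample values (`analyst`, `my-client-id`, `my-client-secret`) are hypothetical placeholders, not values defined by Polaris or PuppyGraph.

```python
def polaris_scope(role_name: str) -> str:
    """Return a Scope value in the PRINCIPAL_ROLE:<ROLE_NAME> format."""
    if not role_name:
        raise ValueError("role_name must not be empty")
    return f"PRINCIPAL_ROLE:{role_name}"


def client_credential(client_id: str, client_secret: str) -> str:
    """Return a Credential value in the <CLIENT-ID>:<CLIENT-SECRET> format."""
    if not client_id or not client_secret:
        raise ValueError("client_id and client_secret must not be empty")
    return f"{client_id}:{client_secret}"


print(polaris_scope("analyst"))
print(client_credential("my-client-id", "my-client-secret"))
```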
Tabular
Tabular is a cloud-native managed storage engine that provides a range of services on top of Apache Iceberg tables.
- Set RestUri to the Tabular REST endpoint. It is https://api.tabular.io/ws by default.
- Set Warehouse to the name of the Tabular warehouse.
- Set Credential to the Tabular member or service credential.
AWS Glue
Configuration | Explanation |
---|---|
Region | The region of the AWS Glue Data Catalog. Example: us-east-1. See AWS Glue endpoints and quotas for more details. |
Use instance profile | Whether to use role-based authentication (Explicit IAM roles or instance-profile attached). |
IAM Role ARN | The ARN of the IAM role for accessing the AWS Glue Data Catalog. Required by authentication with IAM roles. |
Access key | The access key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys. |
Secret key | The secret key of the IAM user for accessing the AWS Glue Data Catalog. Required by authentication with IAM User Access keys. |
Hive Metastore
Configuration | Explanation |
---|---|
Hive metastore URI | The URI of your Hive metastore. Format: thrift://<metastore_IP_address>:<metastore_port>. |
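A malformed metastore URI (wrong scheme, missing port) is a common cause of connection failures. The sketch below checks a URI against the format in the table before it is entered. The function name `parse_hms_uri` is hypothetical, and the check is purely syntactic: it does not contact the metastore.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_hms_uri(uri: str) -> Tuple[str, int]:
    """Validate a thrift://<host>:<port> URI and return (host, port)."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift":
        raise ValueError(f"expected scheme 'thrift', got {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("missing metastore host")
    # Accessing .port raises ValueError itself if the port is not a valid number.
    port = parsed.port
    if port is None:
        raise ValueError("missing metastore port")
    return parsed.hostname, port


print(parse_hms_uri("thrift://127.0.0.1:9083"))
```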
Data Storage Configuration
Get from metastore
There is no need to specify a storage configuration for the following Iceberg implementations:
- HDFS with Hive Metastore
- Iceberg REST Catalog with credential vending
Select Get from metastore in the Web UI for these implementations.
Amazon S3 (Simple Storage Service)
PuppyGraph supports Amazon S3 (Simple Storage Service) for Iceberg.
Configuration | Explanation |
---|---|
Region | The region of the Amazon S3 bucket. Example: us-east-1. See Amazon Simple Storage Service endpoints and quotas for more details. |
Use instance profile | Whether to use role-based authentication (Explicit IAM roles or instance-profile attached). |
IAM Role ARN | The ARN of the IAM role for accessing Amazon S3. Required by authentication with IAM roles. |
Access key | The access key of the IAM user for accessing Amazon S3. Required by authentication with IAM User Access keys. |
Secret key | The secret key of the IAM user for accessing Amazon S3. Required by authentication with IAM User Access keys. |
Amazon S3 Compatible Storage
PuppyGraph supports S3 Compatible Storage (e.g. MinIO) for Iceberg.
Configuration | Explanation |
---|---|
Endpoint | The S3 compatible storage endpoint. |
Access key | The access key of an IAM user for accessing the S3 compatible storage. |
Secret key | The secret key of an IAM user for accessing the S3 compatible storage. |
Enable SSL | Whether to enable SSL connection for accessing the S3 compatible storage. |
Enable path style access | Whether to use path-style access method when accessing the S3 compatible storage. |
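To illustrate what the path-style option changes, the sketch below builds an object URL in both of the general S3 addressing styles: path-style puts the bucket in the URL path, while virtual-hosted style puts it in the host name. This shows the S3 convention only, not PuppyGraph's internal behavior; the function name, bucket, and object key are hypothetical.

```python
from urllib.parse import urlparse


def object_url(endpoint: str, bucket: str, key: str, path_style: bool) -> str:
    """Build an object URL in path-style or virtual-hosted style."""
    parsed = urlparse(endpoint)
    key = key.lstrip("/")
    if path_style:
        # Bucket is the first path segment: <endpoint>/<bucket>/<key>
        return f"{parsed.scheme}://{parsed.netloc}/{bucket}/{key}"
    # Bucket is a subdomain of the endpoint host: <bucket>.<endpoint>/<key>
    return f"{parsed.scheme}://{bucket}.{parsed.netloc}/{key}"


print(object_url("http://127.0.0.1:9000", "warehouse", "db/t/data.parquet", True))
print(object_url("https://s3.example.com", "warehouse", "db/t/data.parquet", False))
```

Virtual-hosted style needs DNS to resolve the bucket subdomain, which is why endpoints addressed by IP and port, such as a local MinIO instance, are typically used with path-style access enabled.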
Tutorials
- Querying Iceberg Data as a Graph
- Querying Snowflake Open Catalog Data as a Graph
- Querying Polaris Data as a Graph
- Run Graph Queries on Apache Iceberg Tables with Dremio & Puppygraph
Data Type Mapping
Iceberg Type | PuppyGraph Type | Description |
---|---|---|
boolean | Boolean | True or false |
int | Int | 32-bit signed integers |
long | Long | 64-bit signed integers |
float | Float | 32-bit IEEE 754 floating point |
double | Double | 64-bit IEEE 754 floating point |
decimal(P, S) | Decimal(P, S) | Fixed-point decimal; precision P, scale S |
string | String | Arbitrary-length character sequences |
date | Date | Calendar date without timezone or time |
time | Time | Time of day, microsecond precision, without date, timezone |
timestamp | DateTime | Timestamp, microsecond precision, without timezone |
list | Array<E> | A collection of values with some element type |
struct | Struct<field1 E1[,field2 E2...]> | A tuple of typed values |
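When a graph schema is written by hand, the mapping above can be encoded as a small lookup. The sketch below covers the primitive types and decimal(P, S). The nested list and struct types are left out because they depend on their element and field types. The function name `puppygraph_type` is hypothetical and is not part of any PuppyGraph API.

```python
import re

# Primitive Iceberg types and their PuppyGraph counterparts, as in the table.
SIMPLE_TYPES = {
    "boolean": "Boolean",
    "int": "Int",
    "long": "Long",
    "float": "Float",
    "double": "Double",
    "string": "String",
    "date": "Date",
    "time": "Time",
    "timestamp": "DateTime",
}

_DECIMAL = re.compile(r"^decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)$")


def puppygraph_type(iceberg_type: str) -> str:
    """Map a primitive or decimal Iceberg type name to its PuppyGraph type."""
    name = iceberg_type.strip().lower()
    if name in SIMPLE_TYPES:
        return SIMPLE_TYPES[name]
    match = _DECIMAL.match(name)
    if match:
        precision, scale = match.groups()
        return f"Decimal({precision}, {scale})"
    # Anything else (including list and struct) is outside this sketch.
    raise ValueError(f"no mapping in this sketch for {iceberg_type!r}")


print(puppygraph_type("timestamp"))
print(puppygraph_type("decimal(10, 2)"))
```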
Example Configurations
Please refer to Data Lake Catalog for detailed parameters for each type of catalog and storage.
Catalog Type | Storage Type | Example Configuration |
---|---|---|
REST Catalog | Amazon S3 | #rest-catalog-amazon-s3 |
REST Catalog | MinIO | #rest-catalog-minio |
AWS Glue | Amazon S3 | #aws-glue-amazon-s3 |
Hive Metastore | HDFS | #hive-metastore-hdfs |
Hive Metastore | Amazon S3 | #hive-metastore-amazon-s3 |
Hive Metastore | MinIO | #hive-metastore-minio |
Hive Metastore | Google GCS | #hive-metastore-google-gcs |
Hive Metastore | Azure Blob | #hive-metastore-azure-blob |
Hive Metastore | Azure Data Lake Gen2 | #hive-metastore-azure-data-lake-gen2 |
REST Catalog + Amazon S3
"catalogs": [
  {
    "name": "iceberg_rest_s3",
    "type": "iceberg",
    "metastore": {
      "type": "rest",
      "uri": "http://127.0.0.1:8181"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "enableSsl": "false"
    }
  }
]
REST Catalog + MinIO
"catalogs": [
  {
    "name": "iceberg_rest_minio",
    "type": "iceberg",
    "metastore": {
      "type": "rest",
      "uri": "http://127.0.0.1:8181"
    },
    "storage": {
      "useInstanceProfile": "false",
      "accessKey": "admin",
      "secretKey": "password",
      "enableSsl": "false",
      "endpoint": "http://127.0.0.1:9000",
      "enablePathStyleAccess": "true"
    }
  }
]
AWS Glue + Amazon S3
"catalogs": [
  {
    "name": "iceberg_glue_s3",
    "type": "iceberg",
    "metastore": {
      "type": "glue",
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "enableSsl": "false"
    }
  }
]
Hive Metastore + HDFS
"catalogs": [
  {
    "name": "iceberg_hms_hdfs",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    }
  }
]
Hive Metastore + MinIO
"catalogs": [
  {
    "name": "iceberg_hms_minio",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "useInstanceProfile": "false",
      "accessKey": "admin",
      "secretKey": "password",
      "enableSsl": "false",
      "endpoint": "http://127.0.0.1:9000",
      "enablePathStyleAccess": "true"
    }
  }
]
Hive Metastore + Amazon S3
"catalogs": [
  {
    "name": "iceberg_hms_s3",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "enableSsl": "false"
    }
  }
]
Hive Metastore + Google GCS
"catalogs": [
  {
    "name": "iceberg_hms_gcs",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "type": "GCS",
      "serviceAccountEmail": "acc_name@project.iam.gserviceaccount.com",
      "serviceAccountPrivateKeyId": "AKIAIOSFODNN7EXAMPLE",
      "serviceAccountPrivateKey": "-----BEGIN PRIVATE KEY-----\nabcded\n-----END PRIVATE KEY-----\n"
    }
  }
]
Hive Metastore + Azure Blob
"catalogs": [
  {
    "name": "iceberg_hms_azblob",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "type": "AzureBlob",
      "storageAccount": "account_name",
      "storageContainer": "container_name",
      "sasToken": "sp=rl&st=2020-12-15T03:19:48Z&se=2024-12-12T11:19:48Z&sv=2022-11-02&sr=c&sig=1"
    }
  }
]
Hive Metastore + Azure Data Lake Gen2
"catalogs": [
  {
    "name": "iceberg_hms_azgen2",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "type": "AzureDLS2",
      "clientId": "000000-avaf-aaaa-bbbb-aba988azfa",
      "clientSecret": "EXAMPLEvonefPJabcde",
      "clientEndpoint": "https://login.microsoftonline.com/000000-avaf-aaaa-bbbb-aba988azfa/oauth2/token"
    }
  }
]
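All of the examples in this section share one shape: a name, the type iceberg, a metastore object, and an optional storage object (omitted when storage is resolved from the metastore, as in the HDFS example). The Python sketch below generates such entries. The helper name `iceberg_catalog` is hypothetical, and wrapping the list in a top-level `{"catalogs": ...}` object is an assumption, because this page shows only the `"catalogs"` fragment and not the enclosing document. Boolean settings are kept as strings, matching the examples.

```python
import json
from typing import Optional


def iceberg_catalog(name: str, metastore: dict, storage: Optional[dict] = None) -> dict:
    """Build one entry of the "catalogs" list in the shape used on this page."""
    entry = {"name": name, "type": "iceberg", "metastore": metastore}
    if storage is not None:
        # Left out entirely when storage settings come from the metastore.
        entry["storage"] = storage
    return entry


catalogs = [
    iceberg_catalog(
        "iceberg_hms_hdfs",
        {"type": "HMS", "hiveMetastoreUrl": "thrift://127.0.0.1:9083"},
    ),
    iceberg_catalog(
        "iceberg_rest_minio",
        {"type": "rest", "uri": "http://127.0.0.1:8181"},
        {
            "useInstanceProfile": "false",
            "accessKey": "admin",
            "secretKey": "password",
            "enableSsl": "false",
            "endpoint": "http://127.0.0.1:9000",
            "enablePathStyleAccess": "true",
        },
    ),
]

# Assumption: the fragment lives inside a larger JSON object.
print(json.dumps({"catalogs": catalogs}, indent=2))
```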