Connecting to Apache Iceberg

PuppyGraph enables querying the data in Apache Iceberg as graphs.

Prerequisites

In this guide, PuppyGraph connects to an existing Iceberg installation. If you are looking for a quick start without a prior Iceberg installation, please refer to Iceberg.
A Spark installation is required if you plan to follow Prepare Data (Optional) to create tables and insert sample data into the existing Iceberg installation.
The guide also assumes that PuppyGraph has been deployed at localhost according to one of the guides in AMI Version or Docker Deployment.
In this demo, we use the username puppygraph and the password 888888.

Prepare Data (Optional)

The guide creates the following two tables in Iceberg. Feel free to skip this step if you would like to query some existing tables.

Person Table

  ID   Age   Name
  v1   29    marko
  v2   27    vadas

Referral Table

  RefID   Source   Referred   Weight
  e1      v1       v2         0.5
Here is the shell command to start a Spark SQL shell for data preparation. The spark-sql executable is in the bin folder of the Spark directory.
The shell command assumes that the Iceberg data are stored on HDFS at 172.31.19.123:9000 and the Hive Metastore is at 172.31.31.125:9083.
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0 \
  --conf spark.hadoop.fs.defaultFS=hdfs://172.31.19.123:9000 \
  --conf spark.sql.warehouse.dir=hdfs://172.31.19.123:9000/spark-warehouse \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.puppy_iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.puppy_iceberg.type=hive \
  --conf spark.sql.catalog.puppy_iceberg.uri=thrift://172.31.31.125:9083
Now we can use the following Spark SQL statements to create the tables person and referral in the database onhdfs and insert data. The catalog name is puppy_iceberg, as specified in the shell command above.
CREATE DATABASE puppy_iceberg.onhdfs;
USE puppy_iceberg.onhdfs;
CREATE EXTERNAL TABLE person (id string, age int, name string) USING iceberg;
INSERT INTO person VALUES ('v1', 29, 'marko'), ('v2', 27, 'vadas');
CREATE EXTERNAL TABLE referral (refId string, source string, referred string, weight double) USING iceberg;
INSERT INTO referral VALUES ('e1', 'v1', 'v2', 0.5);

Define the Graph

We need to define the graph before querying it. Create a PuppyGraph schema file iceberg.json:
iceberg.json
{
  "catalogs": [
    {
      "name": "catalog_test",
      "type": "iceberg",
      "metastore": {
        "type": "HMS",
        "hiveMetastoreUrl": "thrift://172.31.31.125:9083"
      }
    }
  ],
  "vertices": [
    {
      "label": "person",
      "mappedTableSource": {
        "catalog": "catalog_test",
        "schema": "onhdfs",
        "table": "person",
        "metaFields": {
          "id": "id"
        }
      },
      "attributes": [
        {
          "type": "Int",
          "name": "age"
        },
        {
          "type": "String",
          "name": "name"
        }
      ]
    }
  ],
  "edges": [
    {
      "label": "knows",
      "mappedTableSource": {
        "catalog": "catalog_test",
        "schema": "onhdfs",
        "table": "referral",
        "metaFields": {
          "id": "refId",
          "from": "source",
          "to": "referred"
        }
      },
      "from": "person",
      "to": "person",
      "attributes": [
        {
          "type": "Double",
          "name": "weight"
        }
      ]
    }
  ]
}
Here are some notes on this schema:
  • A catalog object named catalog_test defines an Iceberg data source. The hiveMetastoreUrl field matches the Hive Metastore URL in Prepare Data (Optional).
  • The labels of the vertices and edges do not have to be the same as the names of the tables in Iceberg. There is a mappedTableSource object in each vertex and edge type specifying the actual schema (database) name (onhdfs) and table name (person or referral).
  • The mappedTableSource also refers to the necessary columns in the tables.
    • The id field refers to the column storing the ID for vertices (id) and edges (refId).
    • The fields from and to refer to the columns in the table that serve as the ends of edges. The values of the two columns need to match the id of the vertices. In this example, each row in the referral table models an edge in the graph from source to referred.
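Because the from and to columns must match vertex ids, it can help to confirm this before defining the graph. Here is a minimal sketch in Python; the rows are hard-coded to mirror the sample tables above, whereas against a real deployment you would read them from Iceberg:

```python
# Sample rows mirroring the person and referral tables above.
person_rows = [
    {"id": "v1", "age": 29, "name": "marko"},
    {"id": "v2", "age": 27, "name": "vadas"},
]
referral_rows = [
    {"refId": "e1", "source": "v1", "referred": "v2", "weight": 0.5},
]

def dangling_edges(vertices, edges):
    """Return edge IDs whose source or referred value matches no vertex id."""
    vertex_ids = {v["id"] for v in vertices}
    return [
        e["refId"]
        for e in edges
        if e["source"] not in vertex_ids or e["referred"] not in vertex_ids
    ]

print(dangling_edges(person_rows, referral_rows))  # -> [] (all edges consistent)
```

An empty result means every edge endpoint resolves to a vertex; any refId in the list points at a row that would produce a dangling edge in the graph.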
Once the schema file iceberg.json is created, upload it to PuppyGraph with the following shell command:
curl -XPOST -H "content-type: application/json" --data-binary @./iceberg.json --user "puppygraph:888888" localhost:8081/schema
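The same upload can also be scripted. Below is a hedged Python sketch using only the standard library; it builds the HTTP request without sending it, taking the endpoint, path, and credentials from the curl command above:

```python
import base64
import json
import urllib.request

def build_schema_upload(schema: dict, host="localhost:8081",
                        user="puppygraph", password="888888"):
    """Build (but do not send) the HTTP request that uploads a PuppyGraph schema."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"http://{host}/schema",
        data=json.dumps(schema).encode(),
        headers={
            "content-type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )

req = build_schema_upload({"catalogs": [], "vertices": [], "edges": []})
# urllib.request.urlopen(req) would perform the actual upload.
```

This simply mirrors the curl invocation: same endpoint, same Basic auth, same JSON body.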

Query the Graph

Connect to PuppyGraph at http://localhost:8081 and start the Gremlin Console from the "Query" section:
[PuppyGraph]> console
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
Now that we are connected to the Gremlin Console, we can query the graph. For example, this query finds the names of people who are known by someone:
gremlin> g.V().hasLabel("person").out("knows").values("name")
==>vadas
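What this Gremlin traversal computes can be sketched in plain Python over the sample rows (hard-coded here to mirror the tables above; the real query runs inside PuppyGraph):

```python
# Vertices and edges mirroring the person and referral tables above.
people = {"v1": {"name": "marko", "age": 29}, "v2": {"name": "vadas", "age": 27}}
knows_edges = [{"source": "v1", "referred": "v2", "weight": 0.5}]

def out_knows_names(vertices, edges):
    """Emulate g.V().hasLabel('person').out('knows').values('name')."""
    # Follow each 'knows' edge outward and collect the target's name.
    return [vertices[e["referred"]]["name"] for e in edges if e["source"] in vertices]

print(out_knows_names(people, knows_edges))  # -> ['vadas']
```

marko knows vadas, so following the single outgoing knows edge yields vadas, matching the console output above.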

Supported Iceberg Implementations

PuppyGraph supports the following implementations of Iceberg.
Please refer to Data Lake Catalog for detailed parameters for each type of catalog and storage.
  Catalog Type     Storage Type   Config Example
  REST Catalog     Amazon S3      REST Catalog + Amazon S3
  REST Catalog     MinIO          REST Catalog + MinIO
  AWS Glue         Amazon S3      AWS Glue + Amazon S3
  Hive Metastore   HDFS           Hive Metastore + HDFS
  Hive Metastore   Amazon S3      Hive Metastore + Amazon S3
  Hive Metastore   MinIO          Hive Metastore + MinIO

REST Catalog + Amazon S3

"catalogs": [
{
"name": "iceberg_rest_s3",
"type": "iceberg",
"metastore": {
"type": "rest",
"uri": "http://127.0.0.1:8181"
},
"storage": {
"useInstanceProfile": "false",
"region": "us-west-2",
"accessKey": "AKIAIOSFODNN7EXAMPLE",
"secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
"enableSsl": "false"
}
}
]

REST Catalog + MinIO

"catalogs": [
{
"name": "iceberg_rest_minio",
"type": "iceberg",
"metastore": {
"type": "rest",
"uri": "http://127.0.0.1:8181"
},
"storage": {
"useInstanceProfile": "false",
"accessKey": "admin",
"secretKey": "password",
"enableSsl": "false",
"endpoint": "http://127.0.0.1:9000",
"enablePathStyleAccess": "true"
}
}
]

AWS Glue + Amazon S3

"catalogs": [
{
"name": "iceberg_glue_s3",
"type": "iceberg",
"metastore": {
"type": "glue",
"useInstanceProfile": "false",
"region": "us-west-2",
"accessKey": "AKIAIOSFODNN7EXAMPLE",
"secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
},
"storage": {
"useInstanceProfile": "false",
"region": "us-west-2",
"accessKey": "AKIAIOSFODNN7EXAMPLE",
"secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
"enableSsl": "false"
}
}
]

Hive Metastore + HDFS

"catalogs": [
{
"name": "iceberg_hms_hdfs",
"type": "iceberg",
"metastore": {
"type": "HMS",
"hiveMetastoreUrl": "thrift://127.0.0.1:9083"
}
}
]

Hive Metastore + MinIO

"catalogs": [
{
"name": "iceberg_hms_minio",
"type": "iceberg",
"metastore": {
"type": "HMS",
"hiveMetastoreUrl": "thrift://127.0.0.1:9083"
},
"storage": {
"useInstanceProfile": "false",
"accessKey": "admin",
"secretKey": "password",
"enableSsl": "false",
"endpoint": "http://127.0.0.1:9000",
"enablePathStyleAccess": "true"
}
}
]

Hive Metastore + Amazon S3

"catalogs": [
{
"name": "iceberg_hms_hdfs",
"type": "iceberg",
"metastore": {
"type": "HMS",
"hiveMetastoreUrl": "thrift://127.0.0.1:9083"
},
"storage": {
"useInstanceProfile": "false",
"region": "us-west-2",
"accessKey": "AKIAIOSFODNN7EXAMPLE",
"secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
"enableSsl": "false"
}
}
]