PuppyGraph enables querying data in Apache Iceberg as a graph.
Prerequisites
In this guide, PuppyGraph connects to an existing Iceberg installation. If you are looking for a quick start without any prior Iceberg installation, please refer to Querying Iceberg Data as a Graph.
A Spark installation is required if you plan to follow Prepare Data (Optional) to create tables and insert sample data into the existing Iceberg installation.
The guide also assumes that PuppyGraph has been deployed at localhost according to one of the guides in Launching PuppyGraph from AWS Marketplace or Launching PuppyGraph in Docker. In this demo, we use the username puppygraph and the password puppygraph123.
Prepare Data (Optional)
The guide will create the following two tables in Iceberg, as shown below. Feel free to skip this step if you would like to query existing tables.
Person Table

ID  Age  Name
v1  29   marko
v2  27   vadas

Referral Table

RefID  Source  Referred  Weight
e1     v1      v2        0.5
Here is the shell command to start a Spark SQL shell for data preparation. The spark-sql executable is in the bin folder of the Spark directory.
The shell command assumes that the Iceberg data are stored on HDFS at 172.31.19.123:9000 and that the Hive Metastore is at 172.31.31.125:9083.
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0 \
  --conf spark.hadoop.fs.defaultFS=hdfs://172.31.19.123:9000 \
  --conf spark.sql.warehouse.dir=hdfs://172.31.19.123:9000/spark-warehouse \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.puppy_iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.puppy_iceberg.type=hive \
  --conf spark.sql.catalog.puppy_iceberg.uri=thrift://172.31.31.125:9083
Now we can use the following Spark SQL statements to create tables person and referral in the database onhdfs and insert data. The catalog name is puppy_iceberg, as specified in the shell command above.
CREATE DATABASE puppy_iceberg.onhdfs;
USE puppy_iceberg.onhdfs;
CREATE EXTERNAL TABLE person (id string, age int, name string) USING iceberg;
INSERT INTO person VALUES ('v1', 29, 'marko'), ('v2', 27, 'vadas');
CREATE EXTERNAL TABLE referral (refId string, source string, referred string, weight double) USING iceberg;
INSERT INTO referral VALUES ('e1', 'v1', 'v2', 0.5);
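To confirm the inserts before moving on, you can run a quick check in the same spark-sql session (optional):

-- Both queries should return the rows inserted above
SELECT * FROM person;
SELECT * FROM referral;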
Define the Graph
We need to define the graph before querying it. Create a PuppyGraph schema file iceberg.json:
iceberg.json
{
"catalogs" : [
{
"name" : "catalog_test" ,
"type" : "iceberg" ,
"metastore" : {
"type" : "HMS" ,
"hiveMetastoreUrl" : "thrift://172.31.31.125:9083"
}
}
] ,
"vertices" : [
{
"label" : "person" ,
"mappedTableSource" : {
"catalog" : "catalog_test" ,
"schema" : "onhdfs" ,
"table" : "person" ,
"metaFields" : {
"id" : "id"
}
} ,
"attributes" : [
{
"type" : "Int" ,
"name" : "age"
} ,
{
"type" : "String" ,
"name" : "name"
}
]
}
] ,
"edges" : [
{
"label" : "knows" ,
"mappedTableSource" : {
"catalog" : "catalog_test" ,
"schema" : "onhdfs" ,
"table" : "referral" ,
"metaFields" : {
"id" : "refId" ,
"from" : "source" ,
"to" : "referred"
}
} ,
"from" : "person" ,
"to" : "person" ,
"attributes" : [
{
"type" : "Double" ,
"name" : "weight"
}
]
}
]
}
Here are some notes on this schema:
A catalog object named catalog_test defines an Iceberg data source. The hiveMetastoreUrl field matches the Hive Metastore URL in Prepare Data (Optional).
The labels of the vertices and edges do not have to be the same as the names of the tables in Iceberg. There is a mappedTableSource object in each vertex and edge type specifying the actual schema (database) name (onhdfs) and table name (for example, referral). The mappedTableSource object also identifies the necessary columns in the tables.
The id field refers to the column storing the ID for vertices (id) and edges (refId).
The fields from and to refer to the columns in the table that hold the two ends of an edge. The values of these two columns need to match the id values of the vertices. In this example, each row in the referral table models an edge in the graph from source to referred.
Once the schema file iceberg.json is created, upload it to PuppyGraph with the following shell command:
curl -XPOST -H "content-type: application/json" --data-binary @./iceberg.json --user "puppygraph:puppygraph123" localhost:8081/schema
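If the upload fails, a common cause is a malformed schema file, so it can help to confirm that iceberg.json parses as valid JSON before retrying. Here python3 -m json.tool is used purely as a convenient local validator; any JSON linter works:

# Prints an error and exits non-zero if iceberg.json is not valid JSON
python3 -m json.tool iceberg.json > /dev/null && echo "iceberg.json parses as valid JSON"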
Query the Graph
Connect to PuppyGraph at http://localhost:8081 and start the Gremlin Console from the "Query" section:
[PuppyGraph]> console
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
Now that we are connected to the Gremlin Console, we can query the graph. For example, this query finds the names of people who are known by someone:
g.V().hasLabel("person").out("knows").values("name")
gremlin> g.V().hasLabel("person").out("knows").values("name")
==>vadas
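Because the single referral row links v1 to v2 with a weight of 0.5, you can also inspect the edge itself. Here are a couple of follow-up traversals to try; the expected results in the comments assume the sample data loaded earlier:

// Expect 0.5, the weight of the only knows edge
g.E().hasLabel("knows").values("weight")
// Expect vadas, the person that marko knows
g.V().has("person", "name", "marko").out("knows").values("name")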
Supported Iceberg Implementations
PuppyGraph supports the following Iceberg catalog and storage combinations.
Please refer to Data Lake Catalog for detailed parameters for each type of catalog and storage.
REST Catalog + Amazon S3
"catalogs" : [
{
"name" : "iceberg_rest_s3" ,
"type" : "iceberg" ,
"metastore" : {
"type" : "rest" ,
"uri" : "http://127.0.0.1:8181"
} ,
"storage" : {
"useInstanceProfile" : "false" ,
"region" : "us-west-2" ,
"accessKey" : "AKIAIOSFODNN7EXAMPLE" ,
"secretKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" ,
"enableSsl" : "false"
}
}
]
REST Catalog + MinIO
"catalogs" : [
{
"name" : "iceberg_rest_minio" ,
"type" : "iceberg" ,
"metastore" : {
"type" : "rest" ,
"uri" : "http://127.0.0.1:8181"
} ,
"storage" : {
"useInstanceProfile" : "false" ,
"accessKey" : "admin" ,
"secretKey" : "password" ,
"enableSsl" : "false" ,
"endpoint" : "http://127.0.0.1:9000" ,
"enablePathStyleAccess" : "true"
}
}
]
AWS Glue + Amazon S3
"catalogs" : [
{
"name" : "iceberg_glue_s3" ,
"type" : "iceberg" ,
"metastore" : {
"type" : "glue" ,
"useInstanceProfile" : "false" ,
"region" : "us-west-2" ,
"accessKey" : "AKIAIOSFODNN7EXAMPLE" ,
"secretKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
} ,
"storage" : {
"useInstanceProfile" : "false" ,
"region" : "us-west-2" ,
"accessKey" : "AKIAIOSFODNN7EXAMPLE" ,
"secretKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" ,
"enableSsl" : "false"
}
}
]
Hive Metastore + HDFS
"catalogs" : [
{
"name" : "iceberg_hms_hdfs" ,
"type" : "iceberg" ,
"metastore" : {
"type" : "HMS" ,
"hiveMetastoreUrl" : "thrift://127.0.0.1:9083"
}
}
]
Hive Metastore + MinIO
"catalogs" : [
{
"name" : "iceberg_hms_minio" ,
"type" : "iceberg" ,
"metastore" : {
"type" : "HMS" ,
"hiveMetastoreUrl" : "thrift://127.0.0.1:9083"
} ,
"storage" : {
"useInstanceProfile" : "false" ,
"accessKey" : "admin" ,
"secretKey" : "password" ,
"enableSsl" : "false" ,
"endpoint" : "http://127.0.0.1:9000" ,
"enablePathStyleAccess" : "true"
}
}
]
Hive Metastore + Amazon S3
"catalogs" : [
{
"name" : "iceberg_hms_s3" ,
"type" : "iceberg" ,
"metastore" : {
"type" : "HMS" ,
"hiveMetastoreUrl" : "thrift://127.0.0.1:9083"
} ,
"storage" : {
"useInstanceProfile" : "false" ,
"region" : "us-west-2" ,
"accessKey" : "AKIAIOSFODNN7EXAMPLE" ,
"secretKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" ,
"enableSsl" : "false"
}
}
]
Hive Metastore + Google GCS
"catalogs" : [
{
"name" : "iceberg_hms_gcs" ,
"type" : "iceberg" ,
"metastore" : {
"type" : "HMS" ,
"hiveMetastoreUrl" : "thrift://127.0.0.1:9083"
} ,
"storage" : {
"type" : "GCS" ,
"serviceAccountEmail" : "acc_name@project.iam.gserviceaccount.com" ,
"serviceAccountPrivateKeyId" : "AKIAIOSFODNN7EXAMPLE" ,
"serviceAccountPrivateKey" : "-----BEGIN PRIVATE KEY-----\nabcded\n-----END PRIVATE KEY-----\n"
}
}
]
Hive Metastore + Azure Blob
"catalogs" : [
{
"name" : "iceberg_hms_azblob" ,
"type" : "iceberg" ,
"metastore" : {
"type" : "HMS" ,
"hiveMetastoreUrl" : "thrift://127.0.0.1:9083"
} ,
"storage" : {
"type" : "AzureBlob" ,
"storageAccount" : "account_name" ,
"storageContainer" : "container_name" ,
"sasToken" : "sp=rl&st=2020-12-15T03:19:48Z&se=2024-12-12T11:19:48Z&sv=2022-11-02&sr=c&sig=1"
}
}
]
Hive Metastore + Azure Data Lake Gen2
"catalogs" : [
{
"name" : "iceberg_hms_azgen2" ,
"type" : "iceberg" ,
"metastore" : {
"type" : "HMS" ,
"hiveMetastoreUrl" : "thrift://127.0.0.1:9083"
} ,
"storage" : {
"type" : "AzureDLS2" ,
"clientId" : "000000-avaf-aaaa-bbbb-aba988azfa" ,
"clientSecret" : "EXAMPLEvonefPJabcde" ,
"clientEndpoint" : "https://login.microsoftonline.com/000000-avaf-aaaa-bbbb-aba988azfa/oauth2/token"
}
}
]
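Any of these catalog objects can be placed in the catalogs array of the PuppyGraph schema, exactly as catalog_test was in iceberg.json above; the vertex and edge mappedTableSource entries then reference the chosen catalog by its name field. Below is a sketch of a vertex mapping that uses the Glue + S3 catalog; the schema (database) and table names are placeholders for whatever exists in your own catalog:

{
  "label": "person",
  "mappedTableSource": {
    "catalog": "iceberg_glue_s3",
    "schema": "my_glue_database",
    "table": "person",
    "metaFields": { "id": "id" }
  },
  "attributes": [
    { "type": "String", "name": "name" }
  ]
}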