Connecting to Apache Iceberg

PuppyGraph enables querying the data in Apache Iceberg as graphs.

Prerequisites

In this guide, PuppyGraph connects to an existing Iceberg installation. If you are looking for a quick start without any prior Iceberg installation, please refer to Querying Iceberg Data as a Graph.

A Spark installation is required if you plan to follow Prepare Data (Optional) to create tables and insert sample data to the existing Iceberg installation.

The guide also assumes that PuppyGraph has been deployed at localhost according to one of the guides Launching PuppyGraph from AWS Marketplace or Launching PuppyGraph in Docker.

In this demo, we use the username puppygraph and password puppygraph123.

Prepare Data (Optional)

The guide creates the following two tables, person and referral, in Iceberg. Feel free to skip this step if you would like to query existing tables instead.

person

ID   Age  Name
v1   29   marko
v2   27   vadas

referral

refId  source  referred  weight
e1     v1      v2        0.5

Here is the shell command to start a Spark SQL shell for data preparation. The spark-sql executable is in the bin folder of the Spark directory.

The shell command assumes that the Iceberg data are stored on HDFS at 172.31.19.123:9000 and the Hive Metastore is at 172.31.31.125:9083.

spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0 \
  --conf spark.hadoop.fs.defaultFS=hdfs://172.31.19.123:9000 \
  --conf spark.sql.warehouse.dir=hdfs://172.31.19.123:9000/spark-warehouse \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.puppy_iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.puppy_iceberg.type=hive \
  --conf spark.sql.catalog.puppy_iceberg.uri=thrift://172.31.31.125:9083

Now we can use the following Spark SQL statements to create the tables person and referral in the database onhdfs and insert data. The catalog name is puppy_iceberg, as specified in the shell command above.

CREATE DATABASE puppy_iceberg.onhdfs;
USE puppy_iceberg.onhdfs;
CREATE EXTERNAL TABLE person (id string, age int, name string) USING iceberg;
INSERT INTO person VALUES ('v1', 29, 'marko'), ('v2', 27, 'vadas');
CREATE EXTERNAL TABLE referral (refId string, source string, referred string, weight double) USING iceberg;
INSERT INTO referral VALUES ('e1', 'v1', 'v2', 0.5);

Define the Graph

We need to define the graph before querying it. Create a PuppyGraph schema file iceberg.json:

iceberg.json
{
  "catalogs": [
    {
      "name": "catalog_test",
      "type": "iceberg",
      "metastore": {
        "type": "HMS",
        "hiveMetastoreUrl": "thrift://172.31.31.125:9083"
      }
    }
  ],
  "vertices": [
    {
      "label": "person",
      "mappedTableSource": {
        "catalog": "catalog_test",
        "schema": "onhdfs",
        "table": "person",
        "metaFields": {
          "id": "id"
        }
      },
      "attributes": [
        {
          "type": "Int",
          "name": "age"
        },
        {
          "type": "String",
          "name": "name"
        }
      ]
    }
  ],
  "edges": [
    {
      "label": "knows",
      "mappedTableSource": {
        "catalog": "catalog_test",
        "schema": "onhdfs",
        "table": "referral",
        "metaFields": {
          "id": "refId",
          "from": "source",
          "to": "referred"
        }
      },
      "from": "person",
      "to": "person",
      "attributes": [
        {
          "type": "Double",
          "name": "weight"
        }
      ]
    }
  ]
}

Here are some notes on this schema:

  • A catalog object named catalog_test defines an Iceberg data source. The hiveMetastoreUrl field matches the Hive Metastore URL in Prepare Data (Optional).

  • The labels of the vertices and edges do not have to match the names of the tables in Iceberg. A mappedTableSource object in each vertex and edge type specifies the actual schema (database) name (onhdfs) and table name (person or referral).

  • The mappedTableSource also refers to the necessary columns in the tables.

    • The id field refers to the column storing the ID for vertices (id) and edges (refId).

    • The fields from and to refer to the columns in the table that form the two ends of an edge. The values of these two columns must match the id values of the vertices. In this example, each row in the referral table models an edge in the graph from source to referred.
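This table-to-graph mapping can be illustrated with plain SQL joins: traversing from a person vertex across knows edges is a join of person through referral back to person on the id columns. Below is a minimal sketch using Python's built-in sqlite3 as a hypothetical local stand-in for the two Iceberg tables; PuppyGraph evaluates the equivalent joins against Iceberg itself.

```python
import sqlite3

# In-memory stand-in for the two Iceberg tables defined above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person  (id TEXT, age INTEGER, name TEXT);
    CREATE TABLE referral(refId TEXT, source TEXT, referred TEXT, weight REAL);
    INSERT INTO person   VALUES ('v1', 29, 'marko'), ('v2', 27, 'vadas');
    INSERT INTO referral VALUES ('e1', 'v1', 'v2', 0.5);
""")

# g.V().hasLabel("person").out("knows").values("name") expressed as joins:
# the edge table's source/referred columns must match the vertex ids.
rows = conn.execute("""
    SELECT dst.name
    FROM person src
    JOIN referral e  ON e.source   = src.id
    JOIN person   dst ON e.referred = dst.id
""").fetchall()
print([name for (name,) in rows])  # ['vadas']
```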

Once the schema file iceberg.json is created, upload it to PuppyGraph with the following shell command:

curl -XPOST -H "content-type: application/json" --data-binary @./iceberg.json --user "puppygraph:puppygraph123" localhost:8081/schema
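The upload can also be scripted. The sketch below builds the same request with Python's standard library, assuming the endpoint and credentials from the curl command above; the request is only constructed here, and the embedded schema is a placeholder rather than the full iceberg.json.

```python
import base64
import json
import urllib.request

# Placeholder for the contents of iceberg.json.
schema = {"catalogs": [], "vertices": [], "edges": []}

# HTTP Basic auth header for puppygraph:puppygraph123.
creds = base64.b64encode(b"puppygraph:puppygraph123").decode()

req = urllib.request.Request(
    "http://localhost:8081/schema",
    data=json.dumps(schema).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": f"Basic {creds}"},
    method="POST",
)
# urllib.request.urlopen(req) would submit it, as the curl command does.
print(req.get_header("Authorization"))
```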

Query the Graph

Connect to PuppyGraph at http://localhost:8081 and start the Gremlin Console from the "Query" section:

[PuppyGraph]> console
         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph

Now we are connected to the Gremlin Console and can query the graph. For example, this query finds the names of the people known by someone:

gremlin> g.V().hasLabel("person").out("knows").values("name")
==>vadas
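Other traversals follow the same pattern. For example, assuming the sample data above (a single knows edge with weight 0.5), the edge attribute can be read with:

```
gremlin> g.E().hasLabel("knows").values("weight")
==>0.5
```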

Supported Iceberg Implementations

PuppyGraph supports the following implementations of Iceberg.

Please refer to Data Lake Catalog for detailed parameters for each type of catalog and storage.

Catalog Type      Storage Type
REST Catalog      Amazon S3
REST Catalog      MinIO
AWS Glue          Amazon S3
Hive Metastore    HDFS
Hive Metastore    Amazon S3
Hive Metastore    MinIO
Hive Metastore    Google GCS
Hive Metastore    Azure Blob
Hive Metastore    Azure Data Lake Gen2

A config example for each combination follows.

REST Catalog + Amazon S3

"catalogs": [
  {
    "name": "iceberg_rest_s3",
    "type": "iceberg",
    "metastore": {
      "type": "rest",
      "uri": "http://127.0.0.1:8181"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "enableSsl": "false"
    }
  }
]

REST Catalog + MinIO

"catalogs": [
  {
    "name": "iceberg_rest_minio",
    "type": "iceberg",
    "metastore": {
      "type": "rest",
      "uri": "http://127.0.0.1:8181"
    },
    "storage": {
      "useInstanceProfile": "false",
      "accessKey": "admin",
      "secretKey": "password",
      "enableSsl": "false",
      "endpoint": "http://127.0.0.1:9000",
      "enablePathStyleAccess": "true"
    }
  }
]

AWS Glue + Amazon S3

"catalogs": [
  {
    "name": "iceberg_glue_s3",
    "type": "iceberg",
    "metastore": {
      "type": "glue",
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "enableSsl": "false"
    }
  }
]

Hive Metastore + HDFS

"catalogs": [
  {
    "name": "iceberg_hms_hdfs",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    }
  }
]

Hive Metastore + MinIO

"catalogs": [
  {
    "name": "iceberg_hms_minio",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "useInstanceProfile": "false",
      "accessKey": "admin",
      "secretKey": "password",
      "enableSsl": "false",
      "endpoint": "http://127.0.0.1:9000",
      "enablePathStyleAccess": "true"
    }
  }
]

Hive Metastore + Amazon S3

"catalogs": [
  {
    "name": "iceberg_hms_hdfs",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "useInstanceProfile": "false",
      "region": "us-west-2",
      "accessKey": "AKIAIOSFODNN7EXAMPLE",
      "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "enableSsl": "false"
    }
  }
]

Hive Metastore + Google GCS

"catalogs": [
  {
    "name": "iceberg_hms_gcs",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "type": "GCS",
      "serviceAccountEmail": "acc_name@project.iam.gserviceaccount.com",
      "serviceAccountPrivateKeyId": "AKIAIOSFODNN7EXAMPLE",
      "serviceAccountPrivateKey": "-----BEGIN PRIVATE KEY-----\nabcded\n-----END PRIVATE KEY-----\n"
    }
  }
]

Hive Metastore + Azure Blob

"catalogs": [
  {
    "name": "iceberg_hms_azblob",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "type": "AzureBlob",
      "storageAccount": "account_name",
      "storageContainer": "container_name",
      "sasToken": "sp=rl&st=2020-12-15T03:19:48Z&se=2024-12-12T11:19:48Z&sv=2022-11-02&sr=c&sig=1"
    }
  }
]

Hive Metastore + Azure Data Lake Gen2

"catalogs": [
  {
    "name": "iceberg_hms_azgen2",
    "type": "iceberg",
    "metastore": {
      "type": "HMS",
      "hiveMetastoreUrl": "thrift://127.0.0.1:9083"
    },
    "storage": {
      "type": "AzureDLS2",
      "clientId": "000000-avaf-aaaa-bbbb-aba988azfa",
      "clientSecret": "EXAMPLEvonefPJabcde",
      "clientEndpoint": "https://login.microsoftonline.com/000000-avaf-aaaa-bbbb-aba988azfa/oauth2/token"
    }
  }
]
