Querying Google BigLake Iceberg Tables Data as a Graph

Summary

In this tutorial, you will:

Create Google Cloud BigLake Iceberg tables in a GCS bucket and load them with example data.
Start a PuppyGraph Docker container and connect it to the BigLake tables.
Query the BigLake data as a graph.

Prerequisites

Please ensure that docker is available. The installation can be verified by running:

docker --version

See https://docs.docker.com for more details on Docker.

You also need:

A Google Cloud project with BigLake API enabled.
A GCS bucket and a BigLake catalog with the end-user credential mode. See Create a BigLake Iceberg catalog for instructions. Note that you just need to create the catalog without creating any tables, as the tutorial will create sample tables for you.
A service account with the following roles, and its key file downloaded locally (e.g., /path/to/key.json):
- roles/biglake.editor (BigLake Editor) on the project
- roles/storage.objectUser (Storage Object User) on the Cloud Storage bucket
- roles/serviceusage.serviceUsageConsumer (Service Usage Consumer) on the project
Python 3.9+ installed locally, used for setting up the data preparation environment.

Note

For production (read-only) use, roles/biglake.viewer (BigLake Viewer) and roles/storage.objectViewer (Storage Object Viewer) are sufficient for PuppyGraph to read the data. The BigLake Editor role is for writing sample data in the tutorial.

Data Preparation

Create and activate a Python virtual environment, then install PySpark:

python -m venv venv
source venv/bin/activate
pip install pyspark==4.0.1

Create BigLake Iceberg Tables

Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your service account key file so that the script can authenticate with Google Cloud:

export GOOGLE_APPLICATION_CREDENTIALS=<PATH_TO_KEY_JSON>

Create a file named data.py with the following content, then run it to create tables and insert sample data via the BigLake Iceberg REST catalog:

python data.py

data.py

from pyspark.sql import SparkSession

catalog_name = "sample"
namespace = "modern"
spark = SparkSession.builder.appName("sample") \
  .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.1,org.apache.iceberg:iceberg-gcp-bundle:1.10.1') \
  .config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
  .config(f'spark.sql.catalog.{catalog_name}.type', 'rest') \
  .config(f'spark.sql.catalog.{catalog_name}.uri', 'https://biglake.googleapis.com/iceberg/v1/restcatalog') \
  .config(f'spark.sql.catalog.{catalog_name}.warehouse', 'gs://<MY_BIGLAKE_BUCKET>') \
  .config(f'spark.sql.catalog.{catalog_name}.header.x-goog-user-project', '<MY_PROJECT_ID>') \
  .config(f'spark.sql.catalog.{catalog_name}.rest.auth.type', 'org.apache.iceberg.gcp.auth.GoogleAuthManager') \
  .config(f'spark.sql.catalog.{catalog_name}.io-impl', 'org.apache.iceberg.gcp.gcs.GCSFileIO') \
  .config(f'spark.sql.catalog.{catalog_name}.rest-metrics-reporting-enabled', 'false') \
  .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
  .config('spark.sql.defaultCatalog', catalog_name) \
  .getOrCreate()

spark.sql(f"CREATE NAMESPACE IF NOT EXISTS {namespace};")
spark.sql(f"USE {namespace};")

spark.sql("""
CREATE EXTERNAL TABLE person (
  id string,
  name string,
  age int
) USING iceberg;
""")

spark.sql("""
INSERT INTO person VALUES
  ('v1', 'marko', 29),
  ('v2', 'vadas', 27),
  ('v4', 'josh', 32),
  ('v6', 'peter', 35);
""")

spark.sql("""
CREATE EXTERNAL TABLE software (
  id string,
  name string,
  lang string
) USING iceberg;
""")

spark.sql("""
INSERT INTO software VALUES
  ('v3', 'lop', 'java'),
  ('v5', 'ripple', 'java');
""")

spark.sql("""
CREATE EXTERNAL TABLE created (
  id string,
  from_id string,
  to_id string,
  weight double
) USING iceberg;
""")

spark.sql("""
INSERT INTO created VALUES
  ('e9', 'v1', 'v3', 0.4),
  ('e10', 'v4', 'v5', 1.0),
  ('e11', 'v4', 'v3', 0.4),
  ('e12', 'v6', 'v3', 0.2);
""")

spark.sql("""
CREATE EXTERNAL TABLE knows (
  id string,
  from_id string,
  to_id string,
  weight double
) USING iceberg;
""")

spark.sql("""
INSERT INTO knows VALUES
  ('e7', 'v1', 'v2', 0.5),
  ('e8', 'v1', 'v4', 1.0);
""")

spark.stop()

Note

Replace <MY_BIGLAKE_BUCKET> with your GCS bucket name and <MY_PROJECT_ID> with your Google Cloud project ID.

The above SQL creates the following tables:

personsoftwareknowscreated

id	name	age
v1	marko	29
v2	vadas	27
v4	josh	32
v6	peter	35

id	name	lang
v3	lop	java
v5	ripple	java

id	from_id	to_id	weight
e7	v1	v2	0.5
e8	v1	v4	1.0

id	from_id	to_id	weight
e9	v1	v3	0.4
e10	v4	v5	1.0
e11	v4	v3	0.4
e12	v6	v3	0.2

Deployment

Run PuppyGraph with Docker:

docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
  -v <PATH_TO_KEY_JSON>:/home/ubuntu/key.json \
  -e PUPPYGRAPH_PASSWORD=puppygraph123 \
  -e GOOGLE_APPLICATION_CREDENTIALS=/home/ubuntu/key.json \
  -d --name puppy --rm --pull=always puppygraph/puppygraph:1.0-preview

Warning

Replace <PATH_TO_KEY_JSON> with the absolute path to your service account key file (e.g., /path/to/key.json).

Modeling the Graph

We then define a graph on top of the data tables we just created. Actually, this is the "Modern" graph defined by Apache TinkerPop.

We can model the graph schema in PuppyGraph Web UI in two ways:

Build the graph manually by adding catalogs, nodes and edges step by step.
Upload a schema file

First, log into the PuppyGraph Web UI at http://localhost:8081 with the following credentials if necessary:

Field	Value
Username	`puppygraph`
Password	`puppygraph123`

Build the graph manuallyUpload a schema file

Connecting to BigLake

Click Catalog in the sidebar, then click Add Catalog > Data Lakes > Apache Iceberg.

In the Catalog section, fill in the fields as follows:

Parameter	Value
Catalog type	`Apache Iceberg`
Catalog name	`biglake_test`
Metastore type	`Google BigLake REST`
Endpoint URL	`https://biglake.googleapis.com/iceberg/v1/restcatalog`
GCS Warehouse	`gs://<MY_BIGLAKE_BUCKET>`
GCP User Project	`<MY_PROJECT_ID>`
Storage type	`Get from metastore`

Note

Replace <MY_BIGLAKE_BUCKET> with your GCS bucket name and <MY_PROJECT_ID> with your Google Cloud project ID.

Click Create Catalog.

BigLake Catalog — Google BigLake Iceberg REST Catalog

Creating the Graph Schema

Click Graph in the sidebar. Click Add Nodes from Catalog > From Catalog > biglake_test > modern > person, then click Next.

Now, you are in the Add Node panel. Drag the id bar in ID Columns. Click Next.

Click Enable Local Replication for better performance. Then click Add Node to finish adding the person vertex.

Click Add Edge in the toolbar. Select knows table in the same way. Set both From Node and To Node to person. Select from_id as the From Node ID and to_id as the To Node ID. Drag the id bar in ID Columns. Click Next.

Click Enable Local Replication, then click Add Edge to finish adding the knows edge.

Add the software node in the same way as the person node.

Add the created edge in the same way as the knows edge. Set person as the From Node and software as the To Node. Select from_id for the From node and to_id for the To node.

Click Local in the toolbar. If local tables are not loaded or loading, click Load Data to load the data to local tables.

Create a schema file schema.json:

schema.json

href="#__codelineno-6-1">{ "catalog": [ { "name": "biglake_test", "type": "iceberg", "metastore": { "type": "rest", "uri": "https://biglake.googleapis.com/iceberg/v1/restcatalog", "warehouse": "gs://<MY_BIGLAKE_BUCKET>", "security": "google", "enableIcebergMetaCache": "true", "icebergMetaCacheTTL": "180", "headers": { "x-goog-user-project": "<MY_PROJECT_ID>" } } } ], "node": [ { "dataSourceGroup": { "externalDataSource": { "catalog": "biglake_test", "schema": "modern", "table": "person", "mappedField": [ { "sourceFieldName": "id", "targetFieldName": "id" }, { "sourceFieldName": "name", "targetFieldName": "name" }, { "sourceFieldName": "age", "targetFieldName": "age" } ] }, "localDataSource": { "enabled": true, "localTableName": "local_person", "mappedField": [ { "sourceFieldName": "id", "targetFieldName": "id" }, { "sourceFieldName": "name", "targetFieldName": "name" }, { "sourceFieldName": "age", "targetFieldName": "age" } ] } }, "label": "person", "id": [ { "name": "id", "type": "STRING" } ], "attribute": [ { "name": "name", "type": "STRING" }, { "name": "age", "type": "INT" } ] }, { "dataSourceGroup": { "externalDataSource": { "catalog": "biglake_test", "schema": "modern", "table": "software", "mappedField": [ { "sourceFieldName": "id", "targetFieldName": "id" }, { "sourceFieldName": "name", "targetFieldName": "name" }, { "sourceFieldName": "lang", "targetFieldName": "lang" } ] }, "localDataSource": { "enabled": true, "localTableName": "local_software", "mappedField": [ { "sourceFieldName": "id", "targetFieldName": "id" }, { "sourceFieldName": "name", "targetFieldName": "name" }, { "sourceFieldName": "lang", "targetFieldName": "lang" } ] } }, "label": "software", "id": [ { "name": "id", "type": "STRING" } ], "attribute": [ { "name": "name", "type": "STRING" }, { "name": "lang", "type": "STRING" } ] } ], "edge": [ { "dataSourceGroup": { "externalDataSource": { "catalog": "biglake_test", "schema": "modern", "table": "knows", "mappedField": [ { "sourceFieldName": "id", "targetFieldName": "id" }, { "sourceFieldName": "from_id", "targetFieldName": "from_id" }, { "sourceFieldName": "to_id", "targetFieldName": "to_id" }, { "sourceFieldName": "weight", "targetFieldName": "weight" } ] }, "localDataSource": { "enabled": true, "localTableName": "local_knows", "mappedField": [ { "sourceFieldName": "id", "targetFieldName": "id" }, { "sourceFieldName": "from_id", "targetFieldName": "from_id" }, { "sourceFieldName": "to_id", "targetFieldName": "to_id" }, { "sourceFieldName": "weight", "targetFieldName": "weight" } ] } }, "label": "knows", "fromNodeLabel": "person", "toNodeLabel": "person", "id": [ { "name": "id", "type": "STRING" } ], "fromKey": [ { "name": "from_id", "type": "STRING" } ], "toKey": [ { "name": "to_id", "type": "STRING" } ], "attribute": [ { "name": "from_id", "type": "STRING" }, { "name": "to_id", "type": "STRING" }, { "name": "weight", "type": "DOUBLE" } ] }, { "dataSourceGroup": { "externalDataSource": { "catalog": "biglake_test", "schema": "modern", "table": "created", "mappedField": [ { "sourceFieldName": "id", "targetFieldName": "id" }, { "sourceFieldName": "from_id", "targetFieldName": "from_id" }, { "sourceFieldName": "to_id", "targetFieldName": "to_id" }, { "sourceFieldName": "weight", "targetFieldName": "weight" } ] }, "localDataSource": { "enabled": true, "localTableName": "local_created", "mappedField": [ { "sourceFieldName": "id", "targetFieldName": "id" }, { "sourceFieldName": "from_id", "targetFieldName": "from_id" }, { "sourceFieldName": "to_id", "targetFieldName": "to_id" }, { "sourceFieldName": "weight", "targetFieldName": "weight" } ] } }, "label": "created", "fromNodeLabel": "person", "toNodeLabel": "software", "id": [ { "name": "id", "type": "STRING" } ], "fromKey": [ { "name": "from_id", "type": "STRING" } ], "toKey": [ { "name": "to_id", "type": "STRING" } ], "attribute": [ { "name": "from_id", "type": "STRING" }, { "name": "to_id", "type": "STRING" }, { "name": "weight", "type": "DOUBLE" } ] } ], "localTable": [ { "dataSourceGroup": { "externalDataSource": { "enabled": true, "catalog": "biglake_test", "schema": "modern", "table": "person", "mappedField": [ { "sourceFieldName": "id", "targetFieldName": "id" }, { "sourceFieldName": "name", "targetFieldName": "name" }, { "sourceFieldName": "age", "targetFieldName": "age" } ] } }, "name": "local_person", "column": [ { "name": "id", "type": "STRING" }, { "name": "name", "type": "STRING" }, { "name": "age", "type": "INT" } ], "indexColumn": [ "id" ], "distributeBy": [ { "bucket": { "num": 12, "column": [ "id" ] } } ], "orderBy": [ { "column": "id" } ], "colocateKey": "co_whi0f3_person", "replicationNum": 1 }, { "dataSourceGroup": { "externalDataSource": { "enabled": true, "catalog": "biglake_test", "schema": "modern", "table": "software", "mappedField": [ { "sourceFieldName": "id", "targetFieldName": "id" }, { "sourceFieldName": "name", "targetFieldName": "name" }, { "sourceFieldName": "lang", "targetFieldName": "lang" } ] } }, "name": "local_software", "column": [ { "name": "id", "type": "STRING" }, { "name": "name", "type": "STRING" }, { "name": "lang", "type": "STRING" } ], "indexColumn": [ "id" ], "distributeBy": [ { "bucket": { "num": 12, "column": [ "id" ] } } ], "orderBy": [ { "column": "id" } ], "colocateKey": "co_339xsg_software", "replicationNum": 1 }, { "dataSourceGroup": { "externalDataSource": { "enabled": true, "catalog": "biglake_test", "schema": "modern", "table": "knows", "mappedField": [ { "sourceFieldName": "id", "targetFieldName": "id" }, { "sourceFieldName": "from_id", "targetFieldName": "from_id" }, { "sourceFieldName": "to_id", "targetFieldName": "to_id" }, { "sourceFieldName": "weight", "targetFieldName": "weight" } ] } }, "name": "local_knows", "column": [ { "name": "id", "type": "STRING" }, { "name": "from_id", "type": "STRING" }, { "name": "to_id", "type": "STRING" }, { "name": "weight", "type": "DOUBLE" } ], "indexColumn": [ "id" ], "distributeBy": [ { "bucket": { "num": 12, "column": [ "id" ] } } ], "orderBy": [ { "column": "id" } ], "colocateKey": "co_m3520t_knows", "replicationNum": 1 }, { "dataSourceGroup": { "externalDataSource": { "enabled": true, "catalog": "biglake_test", "schema": "modern", "table": "created", "mappedField": [ { "sourceFieldName": "id", "targetFieldName": "id" }, { "sourceFieldName": "from_id", "targetFieldName": "from_id" }, { "sourceFieldName": "to_id", "targetFieldName": "to_id" }, { "sourceFieldName": "weight", "targetFieldName": "weight" } ] } }, "name": "local_created", "column": [ { "name": "id", "type": "STRING" }, { "name": "from_id", "type": "STRING" }, { "name": "to_id", "type": "STRING" }, { "name": "weight", "type": "DOUBLE" } ], "indexColumn": [ "id" ], "distributeBy": [ { "bucket": { "num": 12, "column": [ "id" ] } } ], "orderBy": [ { "column": "id" } ], "colocateKey": "co_52ejgb_created", "replicationNum": 1 } ] }

Note

Replace <MY_BIGLAKE_BUCKET> with your GCS bucket name and <MY_PROJECT_ID> with your Google Cloud project ID.

In the Web UI, click Graph in the sidebar, then click Upload Schema and select the schema file.

In the Upload Schema panel, you can see the basic information of the graph schema parsed from the file. Choose Cache data only in After Upload option, then click Upload.

Click Local in the toolbar. If local tables are not loaded, click Load Data to load the data to local tables.

Querying the Graph

In PuppyGraph Web UI, click Query in the sidebar. You can run graph queries in openCypher or Gremlin language.

For example, the following query answers the question: "What software was created by people that marko knows?"

openCypher

MATCH path = (p:person)-[:knows]->()-[:created]->()
WHERE p.name = 'marko'
RETURN path;

Gremlin

g.V().hasLabel('person').has("name", "marko")
  .out("knows").out("created").path()

Cleanup

To stop and remove the PuppyGraph Docker container, run:

docker stop puppy

To remove the python virtual environment, simply delete the venv directory:

rm -rf venv