Skip to content

Querying Google BigLake Iceberg Tables Data as a Graph

Summary

In this tutorial, you will:

  • Create Google Cloud BigLake Iceberg tables in a GCS bucket and load them with example data.
  • Start a PuppyGraph Docker container and connect it to the BigLake tables.
  • Query the BigLake data as a graph.

Prerequisites

Please ensure that docker is available. The installation can be verified by running:

docker --version

See https://docs.docker.com for more details on Docker.

You also need:

  • A Google Cloud project with BigLake API enabled.
  • A GCS bucket and a BigLake catalog with the end-user credential mode. See Create a BigLake Iceberg catalog for instructions. Note that you just need to create the catalog without creating any tables, as the tutorial will create sample tables for you.
  • A service account with the following roles, and its key file downloaded locally (e.g., /path/to/key.json):
  • Python 3.9+ installed locally, used for setting up the data preparation environment.

Note

For production (read-only) use, roles/biglake.viewer (BigLake Viewer) and roles/storage.objectViewer (Storage Object Viewer) are sufficient for PuppyGraph to read the data. The BigLake Editor role is for writing sample data in the tutorial.

Data Preparation

▶ Create and activate a Python virtual environment, then install PySpark:

python -m venv venv
source venv/bin/activate
pip install pyspark==4.0.1

Create BigLake Iceberg Tables

▶ Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your service account key file so that the script can authenticate with Google Cloud:

export GOOGLE_APPLICATION_CREDENTIALS=<PATH_TO_KEY_JSON>

▶ Create a file named data.py with the following content, then run it to create tables and insert sample data via the BigLake Iceberg REST catalog:

python data.py
data.py
from pyspark.sql import SparkSession

catalog_name = "sample"
namespace = "modern"
spark = SparkSession.builder.appName("sample") \
  .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.1,org.apache.iceberg:iceberg-gcp-bundle:1.10.1') \
  .config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
  .config(f'spark.sql.catalog.{catalog_name}.type', 'rest') \
  .config(f'spark.sql.catalog.{catalog_name}.uri', 'https://biglake.googleapis.com/iceberg/v1/restcatalog') \
  .config(f'spark.sql.catalog.{catalog_name}.warehouse', 'gs://<MY_BIGLAKE_BUCKET>') \
  .config(f'spark.sql.catalog.{catalog_name}.header.x-goog-user-project', '<MY_PROJECT_ID>') \
  .config(f'spark.sql.catalog.{catalog_name}.rest.auth.type', 'org.apache.iceberg.gcp.auth.GoogleAuthManager') \
  .config(f'spark.sql.catalog.{catalog_name}.io-impl', 'org.apache.iceberg.gcp.gcs.GCSFileIO') \
  .config(f'spark.sql.catalog.{catalog_name}.rest-metrics-reporting-enabled', 'false') \
  .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
  .config('spark.sql.defaultCatalog', catalog_name) \
  .getOrCreate()

spark.sql(f"CREATE NAMESPACE IF NOT EXISTS {namespace};")
spark.sql(f"USE {namespace};")

spark.sql("""
CREATE EXTERNAL TABLE person (
  id string,
  name string,
  age int
) USING iceberg;
""")

spark.sql("""
INSERT INTO person VALUES
  ('v1', 'marko', 29),
  ('v2', 'vadas', 27),
  ('v4', 'josh', 32),
  ('v6', 'peter', 35);
""")

spark.sql("""
CREATE EXTERNAL TABLE software (
  id string,
  name string,
  lang string
) USING iceberg;
""")

spark.sql("""
INSERT INTO software VALUES
  ('v3', 'lop', 'java'),
  ('v5', 'ripple', 'java');
""")

spark.sql("""
CREATE EXTERNAL TABLE created (
  id string,
  from_id string,
  to_id string,
  weight double
) USING iceberg;
""")

spark.sql("""
INSERT INTO created VALUES
  ('e9', 'v1', 'v3', 0.4),
  ('e10', 'v4', 'v5', 1.0),
  ('e11', 'v4', 'v3', 0.4),
  ('e12', 'v6', 'v3', 0.2);
""")

spark.sql("""
CREATE EXTERNAL TABLE knows (
  id string,
  from_id string,
  to_id string,
  weight double
) USING iceberg;
""")

spark.sql("""
INSERT INTO knows VALUES
  ('e7', 'v1', 'v2', 0.5),
  ('e8', 'v1', 'v4', 1.0);
""")

spark.stop()

Note

Replace <MY_BIGLAKE_BUCKET> with your GCS bucket name and <MY_PROJECT_ID> with your Google Cloud project ID.

The above SQL creates the following tables:

id name age
v1 marko 29
v2 vadas 27
v4 josh 32
v6 peter 35
id name lang
v3 lop java
v5 ripple java
id from_id to_id weight
e7 v1 v2 0.5
e8 v1 v4 1.0
id from_id to_id weight
e9 v1 v3 0.4
e10 v4 v5 1.0
e11 v4 v3 0.4
e12 v6 v3 0.2

Deployment

▶ Run PuppyGraph with Docker:

docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
  -v <PATH_TO_KEY_JSON>:/home/ubuntu/key.json \
  -e PUPPYGRAPH_PASSWORD=puppygraph123 \
  -e GOOGLE_APPLICATION_CREDENTIALS=/home/ubuntu/key.json \
  -d --name puppy --rm --pull=always puppygraph/puppygraph:1.0-preview

Warning

Replace <PATH_TO_KEY_JSON> with the absolute path to your service account key file (e.g., /path/to/key.json).

Modeling the Graph

We then define a graph on top of the data tables we just created. Actually, this is the "Modern" graph defined by Apache TinkerPop.

Modern Graph

Modern Graph

We can model the graph schema in PuppyGraph Web UI in two ways:

  • Build the graph manually by adding catalogs, nodes and edges step by step.
  • Upload a schema file

▶ First, log into the PuppyGraph Web UI at http://localhost:8081 with the following credentials if necessary:

Field Value
Username puppygraph
Password puppygraph123

Connecting to BigLake

▶ Click Catalog in the sidebar, then click Add Catalog > Data Lakes > Apache Iceberg.

▶ In the Catalog section, fill in the fields as follows:

Parameter Value
Catalog type Apache Iceberg
Catalog name biglake_test
Metastore type Google BigLake REST
Endpoint URL https://biglake.googleapis.com/iceberg/v1/restcatalog
GCS Warehouse gs://<MY_BIGLAKE_BUCKET>
GCP User Project <MY_PROJECT_ID>
Storage type Get from metastore

Note

Replace <MY_BIGLAKE_BUCKET> with your GCS bucket name and <MY_PROJECT_ID> with your Google Cloud project ID.

▶ Click Create Catalog.

BigLake Catalog
Google BigLake Iceberg REST Catalog

Creating the Graph Schema

▶ Click Graph in the sidebar. Click Add Nodes from Catalog > From Catalog > biglake_test > modern > person, then click Next.

▶ Now, you are in the Add Node panel. Drag the id bar in ID Columns. Click Next.

▶ Click Enable Local Replication for better performance. Then click Add Node to finish adding the person vertex.

▶ Click Add Edge in the toolbar. Select knows table in the same way. Set both From Node and To Node to person. Select from_id as the From Node ID and to_id as the To Node ID. Drag the id bar in ID Columns. Click Next.

▶ Click Enable Local Replication, then click Add Edge to finish adding the knows edge.

▶ Add the software node in the same way as the person node.

▶ Add the created edge in the same way as the knows edge. Set person as the From Node and software as the To Node. Select from_id for the From node and to_id for the To node.

Graph Schema
Graph Schema

▶ Click Local in the toolbar. If local tables are not loaded or loading, click Load Data to load the data to local tables.

▶ Create a schema file schema.json:

schema.json
{
  "catalog": [
    {
      "name": "biglake_test",
      "type": "iceberg",
      "metastore": {
        "type": "rest",
        "uri": "https://biglake.googleapis.com/iceberg/v1/restcatalog",
        "warehouse": "gs://<MY_BIGLAKE_BUCKET>",
        "security": "google",
        "enableIcebergMetaCache": "true",
        "icebergMetaCacheTTL": "180",
        "headers": {
          "x-goog-user-project": "<MY_PROJECT_ID>"
        }
      }
    }
  ],
  "node": [
    {
      "dataSourceGroup": {
        "externalDataSource": {
          "catalog": "biglake_test",
          "schema": "modern",
          "table": "person",
          "mappedField": [
            {
              "sourceFieldName": "id",
              "targetFieldName": "id"
            },
            {
              "sourceFieldName": "name",
              "targetFieldName": "name"
            },
            {
              "sourceFieldName": "age",
              "targetFieldName": "age"
            }
          ]
        },
        "localDataSource": {
          "enabled": true,
          "localTableName": "local_person",
          "mappedField": [
            {
              "sourceFieldName": "id",
              "targetFieldName": "id"
            },
            {
              "sourceFieldName": "name",
              "targetFieldName": "name"
            },
            {
              "sourceFieldName": "age",
              "targetFieldName": "age"
            }
          ]
        }
      },
      "label": "person",
      "id": [
        {
          "name": "id",
          "type": "STRING"
        }
      ],
      "attribute": [
        {
          "name": "name",
          "type": "STRING"
        },
        {
          "name": "age",
          "type": "INT"
        }
      ]
    },
    {
      "dataSourceGroup": {
        "externalDataSource": {
          "catalog": "biglake_test",
          "schema": "modern",
          "table": "software",
          "mappedField": [
            {
              "sourceFieldName": "id",
              "targetFieldName": "id"
            },
            {
              "sourceFieldName": "name",
              "targetFieldName": "name"
            },
            {
              "sourceFieldName": "lang",
              "targetFieldName": "lang"
            }
          ]
        },
        "localDataSource": {
          "enabled": true,
          "localTableName": "local_software",
          "mappedField": [
            {
              "sourceFieldName": "id",
              "targetFieldName": "id"
            },
            {
              "sourceFieldName": "name",
              "targetFieldName": "name"
            },
            {
              "sourceFieldName": "lang",
              "targetFieldName": "lang"
            }
          ]
        }
      },
      "label": "software",
      "id": [
        {
          "name": "id",
          "type": "STRING"
        }
      ],
      "attribute": [
        {
          "name": "name",
          "type": "STRING"
        },
        {
          "name": "lang",
          "type": "STRING"
        }
      ]
    }
  ],
  "edge": [
    {
      "dataSourceGroup": {
        "externalDataSource": {
          "catalog": "biglake_test",
          "schema": "modern",
          "table": "knows",
          "mappedField": [
            {
              "sourceFieldName": "id",
              "targetFieldName": "id"
            },
            {
              "sourceFieldName": "from_id",
              "targetFieldName": "from_id"
            },
            {
              "sourceFieldName": "to_id",
              "targetFieldName": "to_id"
            },
            {
              "sourceFieldName": "weight",
              "targetFieldName": "weight"
            }
          ]
        },
        "localDataSource": {
          "enabled": true,
          "localTableName": "local_knows",
          "mappedField": [
            {
              "sourceFieldName": "id",
              "targetFieldName": "id"
            },
            {
              "sourceFieldName": "from_id",
              "targetFieldName": "from_id"
            },
            {
              "sourceFieldName": "to_id",
              "targetFieldName": "to_id"
            },
            {
              "sourceFieldName": "weight",
              "targetFieldName": "weight"
            }
          ]
        }
      },
      "label": "knows",
      "fromNodeLabel": "person",
      "toNodeLabel": "person",
      "id": [
        {
          "name": "id",
          "type": "STRING"
        }
      ],
      "fromKey": [
        {
          "name": "from_id",
          "type": "STRING"
        }
      ],
      "toKey": [
        {
          "name": "to_id",
          "type": "STRING"
        }
      ],
      "attribute": [
        {
          "name": "from_id",
          "type": "STRING"
        },
        {
          "name": "to_id",
          "type": "STRING"
        },
        {
          "name": "weight",
          "type": "DOUBLE"
        }
      ]
    },
    {
      "dataSourceGroup": {
        "externalDataSource": {
          "catalog": "biglake_test",
          "schema": "modern",
          "table": "created",
          "mappedField": [
            {
              "sourceFieldName": "id",
              "targetFieldName": "id"
            },
            {
              "sourceFieldName": "from_id",
              "targetFieldName": "from_id"
            },
            {
              "sourceFieldName": "to_id",
              "targetFieldName": "to_id"
            },
            {
              "sourceFieldName": "weight",
              "targetFieldName": "weight"
            }
          ]
        },
        "localDataSource": {
          "enabled": true,
          "localTableName": "local_created",
          "mappedField": [
            {
              "sourceFieldName": "id",
              "targetFieldName": "id"
            },
            {
              "sourceFieldName": "from_id",
              "targetFieldName": "from_id"
            },
            {
              "sourceFieldName": "to_id",
              "targetFieldName": "to_id"
            },
            {
              "sourceFieldName": "weight",
              "targetFieldName": "weight"
            }
          ]
        }
      },
      "label": "created",
      "fromNodeLabel": "person",
      "toNodeLabel": "software",
      "id": [
        {
          "name": "id",
          "type": "STRING"
        }
      ],
      "fromKey": [
        {
          "name": "from_id",
          "type": "STRING"
        }
      ],
      "toKey": [
        {
          "name": "to_id",
          "type": "STRING"
        }
      ],
      "attribute": [
        {
          "name": "from_id",
          "type": "STRING"
        },
        {
          "name": "to_id",
          "type": "STRING"
        },
        {
          "name": "weight",
          "type": "DOUBLE"
        }
      ]
    }
  ],
  "localTable": [
    {
      "dataSourceGroup": {
        "externalDataSource": {
          "enabled": true,
          "catalog": "biglake_test",
          "schema": "modern",
          "table": "person",
          "mappedField": [
            {
              "sourceFieldName": "id",
              "targetFieldName": "id"
            },
            {
              "sourceFieldName": "name",
              "targetFieldName": "name"
            },
            {
              "sourceFieldName": "age",
              "targetFieldName": "age"
            }
          ]
        }
      },
      "name": "local_person",
      "column": [
        {
          "name": "id",
          "type": "STRING"
        },
        {
          "name": "name",
          "type": "STRING"
        },
        {
          "name": "age",
          "type": "INT"
        }
      ],
      "indexColumn": [
        "id"
      ],
      "distributeBy": [
        {
          "bucket": {
            "num": 12,
            "column": [
              "id"
            ]
          }
        }
      ],
      "orderBy": [
        {
          "column": "id"
        }
      ],
      "colocateKey": "co_whi0f3_person",
      "replicationNum": 1
    },
    {
      "dataSourceGroup": {
        "externalDataSource": {
          "enabled": true,
          "catalog": "biglake_test",
          "schema": "modern",
          "table": "software",
          "mappedField": [
            {
              "sourceFieldName": "id",
              "targetFieldName": "id"
            },
            {
              "sourceFieldName": "name",
              "targetFieldName": "name"
            },
            {
              "sourceFieldName": "lang",
              "targetFieldName": "lang"
            }
          ]
        }
      },
      "name": "local_software",
      "column": [
        {
          "name": "id",
          "type": "STRING"
        },
        {
          "name": "name",
          "type": "STRING"
        },
        {
          "name": "lang",
          "type": "STRING"
        }
      ],
      "indexColumn": [
        "id"
      ],
      "distributeBy": [
        {
          "bucket": {
            "num": 12,
            "column": [
              "id"
            ]
          }
        }
      ],
      "orderBy": [
        {
          "column": "id"
        }
      ],
      "colocateKey": "co_339xsg_software",
      "replicationNum": 1
    },
    {
      "dataSourceGroup": {
        "externalDataSource": {
          "enabled": true,
          "catalog": "biglake_test",
          "schema": "modern",
          "table": "knows",
          "mappedField": [
            {
              "sourceFieldName": "id",
              "targetFieldName": "id"
            },
            {
              "sourceFieldName": "from_id",
              "targetFieldName": "from_id"
            },
            {
              "sourceFieldName": "to_id",
              "targetFieldName": "to_id"
            },
            {
              "sourceFieldName": "weight",
              "targetFieldName": "weight"
            }
          ]
        }
      },
      "name": "local_knows",
      "column": [
        {
          "name": "id",
          "type": "STRING"
        },
        {
          "name": "from_id",
          "type": "STRING"
        },
        {
          "name": "to_id",
          "type": "STRING"
        },
        {
          "name": "weight",
          "type": "DOUBLE"
        }
      ],
      "indexColumn": [
        "id"
      ],
      "distributeBy": [
        {
          "bucket": {
            "num": 12,
            "column": [
              "id"
            ]
          }
        }
      ],
      "orderBy": [
        {
          "column": "id"
        }
      ],
      "colocateKey": "co_m3520t_knows",
      "replicationNum": 1
    },
    {
      "dataSourceGroup": {
        "externalDataSource": {
          "enabled": true,
          "catalog": "biglake_test",
          "schema": "modern",
          "table": "created",
          "mappedField": [
            {
              "sourceFieldName": "id",
              "targetFieldName": "id"
            },
            {
              "sourceFieldName": "from_id",
              "targetFieldName": "from_id"
            },
            {
              "sourceFieldName": "to_id",
              "targetFieldName": "to_id"
            },
            {
              "sourceFieldName": "weight",
              "targetFieldName": "weight"
            }
          ]
        }
      },
      "name": "local_created",
      "column": [
        {
          "name": "id",
          "type": "STRING"
        },
        {
          "name": "from_id",
          "type": "STRING"
        },
        {
          "name": "to_id",
          "type": "STRING"
        },
        {
          "name": "weight",
          "type": "DOUBLE"
        }
      ],
      "indexColumn": [
        "id"
      ],
      "distributeBy": [
        {
          "bucket": {
            "num": 12,
            "column": [
              "id"
            ]
          }
        }
      ],
      "orderBy": [
        {
          "column": "id"
        }
      ],
      "colocateKey": "co_52ejgb_created",
      "replicationNum": 1
    }
  ]
}

Note

Replace <MY_BIGLAKE_BUCKET> with your GCS bucket name and <MY_PROJECT_ID> with your Google Cloud project ID.

▶ In the Web UI, click Graph in the sidebar, then click Upload Schema and select the schema file.

▶ In the Upload Schema panel, you can see the basic information of the graph schema parsed from the file. Choose Cache data only in After Upload option, then click Upload.

▶ Click Local in the toolbar. If local tables are not loaded, click Load Data to load the data to local tables.

Querying the Graph

In PuppyGraph Web UI, click Query in the sidebar. You can run graph queries in openCypher or Gremlin language.

For example, the following query answers the question: "What software was created by people that marko knows?"

  • openCypher

    MATCH path = (p:person)-[:knows]->()-[:created]->()
    WHERE p.name = 'marko'
    RETURN path;
    

  • Gremlin

    g.V().hasLabel('person').has("name", "marko")
      .out("knows").out("created").path()
    

Query Result
Query Result

Cleanup

  • To stop and remove the PuppyGraph Docker container, run:
docker stop puppy
  • To remove the python virtual environment, simply delete the venv directory:
rm -rf venv