Querying Azure OneLake Data as a Graph

Summary

In this tutorial, you will:

  • Create an Azure OneLake data lake and load example data;
  • Deploy PuppyGraph in Docker and query OneLake data as a graph.

Prerequisites

Before starting, you need OAuth 2.0 client credentials: a Client ID, a Client Secret, and a Tenant ID. Some configurations also require the client (token) endpoint, which has the form https://login.microsoftonline.com/<your-tenant-id>/oauth2/token.
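
Once you have collected these values (the steps below show where to find them), you can optionally sanity-check the credentials by requesting a token from this endpoint. The sketch below assumes the v1.0 token endpoint format shown above and the Azure Storage resource; replace the placeholders with your own values.

# Optional sanity check: request a token using the OAuth 2.0 client credentials grant.
curl -s -X POST "https://login.microsoftonline.com/<your-tenant-id>/oauth2/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=<your-client-id>" \
  -d "client_secret=<your-client-secret>" \
  -d "resource=https://storage.azure.com/"
# A JSON response containing "access_token" indicates the credentials are valid.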

The steps for obtaining the credentials as well as the access to Azure storage are as follows:

▶ Go to https://portal.azure.com/#home

▶ Search for 'App registrations' and click 'New registration'.

App Registration

▶ Give the new app a name; the other settings can be left at their defaults.

App Configuration

▶ Open the app's overview page, where you can find the client ID and tenant ID.

App Overview

▶ Then, on the app overview page, click 'Certificates & secrets'.

Manage Secrets

▶ Click 'New client secret', fill in the description and expiration, then confirm. Record the secret value in a safe place; it is shown only once and will be used as the client secret credential later.

Create Client Secret

Next, grant your app access to Azure Storage:

▶ Search for 'storage accounts' in the portal, open the page, and click 'Create'.

Create Storage Accounts

It is fine to set only the account name and leave the other fields at their defaults.

An administrator can then go to your storage account, open 'Access Control (IAM)', and click 'Add role assignment' to grant your app access to the resource: choose the role Storage Blob Data Contributor (or a higher permission), set 'Assign access to' to 'User, group, or service principal', select your app, and save. Your app will then be able to access the storage through OAuth 2.0.
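
If you prefer the CLI, the same role assignment can be created with the Azure CLI. This is only a sketch; the subscription, resource group, and storage account names below are placeholders you would replace with your own.

# Illustrative only: grant the app 'Storage Blob Data Contributor' on the storage account.
az role assignment create \
  --assignee "<your-client-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"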

Grant Storage Access

If it succeeds, your administrator will see a confirmation as follows:

Success Confirmation

Once your credentials are ready, create your own workspace and lakehouse in Microsoft Fabric. This tutorial uses demo_ws as the example workspace and demo_lh as the example lakehouse. Remember that you also need to grant Fabric access to the app you created in the previous step, as described below.

To prepare the environment in Fabric, please follow the steps below:

▶ Go to https://app.fabric.microsoft.com/, log in (starting a trial if needed), navigate to the portal, click 'New workspace', fill in the name, and confirm.

Create Workspace

▶ Select and enter this workspace ('demo_ws' in this example).

Enter Workspace

▶ Then click 'New item', search for 'lakehouse', and click the tile to create a lakehouse. (Only a name is needed; it is 'demo_lh' in this example.)

Create Lakehouse

The path to the lakehouse now looks like abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse; you will access the OneLake data through this path.

▶ Go back to the workspace page, and click 'Manage access'.

Manage Access

▶ Search for your app name, set the permission to 'Member', and click 'Add'.

Add Member

Now all the setup is ready. You can view the structure of folders and files on the lakehouse page.

Explore Data
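
To confirm outside of Fabric that the app can reach the lakehouse, you can optionally list the Files directory through the OneLake DFS endpoint, which is compatible with the ADLS Gen2 REST API. This is a rough sketch assuming the demo_ws/demo_lh names from this tutorial and an ACCESS_TOKEN obtained with the token request shown earlier; the query parameters follow the ADLS Gen2 "list paths" operation.

# Assumes ACCESS_TOKEN holds a token issued for resource https://storage.azure.com/.
curl -s \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "x-ms-version: 2021-06-08" \
  "https://onelake.dfs.fabric.microsoft.com/demo_ws?resource=filesystem&recursive=false&directory=demo_lh.lakehouse/Files"
# A JSON list of paths indicates the app has access to the lakehouse.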

Lastly, please ensure that docker compose is available. The installation can be verified by running:

docker compose version

See https://docs.docker.com/compose/install/ for Docker Compose installation instructions and https://www.docker.com/get-started/ for more details on Docker.

Accessing the PuppyGraph Web UI requires a browser. However, the tutorial offers alternative instructions for those who wish to exclusively use the CLI.

Deployment

▶ Before we start, we need to prepare some prerequisite files and folders for Docker Compose. Put them all in the folder from which you will run docker compose; an illustrative layout is sketched below.
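
The sketch below shows roughly how the working directory might look once everything from this section is in place. It is only for orientation: the folder name is arbitrary, and you only need the files for the variant (Iceberg or Delta Lake) you choose.

test_onelake/                                  # example working directory
├── docker-compose.yaml                        # Iceberg or Delta Lake variant (see below)
├── schema.json                                # PuppyGraph schema, created later in this tutorial
├── iceberg-spark-config/spark-defaults.conf   # Iceberg variant
├── iceberg-azure-jars/                        # Iceberg variant: JARs from iceberg-azure-download-jars.sh
├── deltalake-spark-config/spark-defaults.conf # Delta Lake variant
├── hive-azure-jars/                           # Delta Lake variant: JARs from hive-azure-download-jars.sh
├── pg_hba.conf                                # Delta Lake variant
├── init_hive.sh                               # Delta Lake variant
├── hive-entrypoint.sh                         # Delta Lake variant
├── hive-site.xml                              # Delta Lake variant
└── postgresql-42.7.3.jar                      # Delta Lake variant: PostgreSQL JDBC driver (see note below)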

iceberg-spark-config/spark-defaults.conf configures Spark to use Iceberg extensions, integrate with the REST Catalog, and access Azure OneLake storage using OAuth credentials.

spark-defaults.conf
# Iceberg Spark Extensions
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

spark.sql.catalog.iceberg_rest=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_rest.type=rest
spark.sql.catalog.iceberg_rest.uri=http://iceberg-rest:8181
spark.sql.catalog.iceberg_rest.io-impl=org.apache.iceberg.hadoop.HadoopFileIO

spark.hadoop.fs.abfss.impl=org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem
spark.hadoop.fs.AbstractFileSystem.abfss.impl=org.apache.hadoop.fs.azurebfs.Abfs

spark.hadoop.fs.azure.account.auth.type.onelake.dfs.fabric.microsoft.com=OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.onelake.dfs.fabric.microsoft.com=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.onelake.dfs.fabric.microsoft.com=<oauth-2.0-client-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.onelake.dfs.fabric.microsoft.com=<oauth-2.0-client-secret>
spark.hadoop.fs.azure.account.oauth2.client.endpoint.onelake.dfs.fabric.microsoft.com=<oauth-2.0-token-endpoint>

spark.master=local[*]

iceberg-azure-jars is a folder that contains the dependency JARs that enable Spark to access Azure OneLake (via the abfss:// protocol), including Azure SDKs, the Hadoop Azure connector, and the Iceberg runtime. To prepare the jars, run the following script inside iceberg-azure-jars:

iceberg-azure-download-jars.sh
#!/bin/bash

# Azure
wget https://repo1.maven.org/maven2/com/azure/azure-core/1.37.0/azure-core-1.37.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-core-http-netty/1.13.3/azure-core-http-netty-1.13.3.jar
wget https://repo1.maven.org/maven2/com/azure/azure-identity/1.8.0/azure-identity-1.8.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-blob/12.31.3/azure-storage-blob-12.31.3.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-common/12.30.2/azure-storage-common-12.30.2.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-file-datalake/12.24.3/azure-storage-file-datalake-12.24.3.jar
wget https://repo1.maven.org/maven2/com/azure/azure-xml/1.2.0/azure-xml-1.2.0.jar

# Iceberg
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-core/1.10.0/iceberg-core-1.10.0.jar
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-azure/1.10.0/iceberg-azure-1.10.0.jar
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.4_2.12/1.10.0/iceberg-spark-runtime-3.4_2.12-1.10.0.jar

# Netty
wget https://repo1.maven.org/maven2/io/netty/netty-buffer/4.1.93.Final/netty-buffer-4.1.93.Final.jar
wget https://repo1.maven.org/maven2/io/netty/netty-common/4.1.93.Final/netty-common-4.1.93.Final.jar
wget https://repo1.maven.org/maven2/io/netty/netty-codec/4.1.93.Final/netty-codec-4.1.93.Final.jar
wget https://repo1.maven.org/maven2/io/netty/netty-handler/4.1.93.Final/netty-handler-4.1.93.Final.jar
wget https://repo1.maven.org/maven2/io/netty/netty-resolver-dns/4.1.93.Final/netty-resolver-dns-4.1.93.Final.jar
wget https://repo1.maven.org/maven2/io/netty/netty-transport/4.1.93.Final/netty-transport-4.1.93.Final.jar

# Others
wget https://repo1.maven.org/maven2/net/minidev/json-smart/2.5.1/json-smart-2.5.1.jar
wget https://repo1.maven.org/maven2/com/nimbusds/lang-tag/1.6/lang-tag-1.6.jar
wget https://repo1.maven.org/maven2/net/minidev/accessors-smart/2.5.1/accessors-smart-2.5.1.jar
wget https://repo1.maven.org/maven2/com/nimbusds/nimbus-jose-jwt/9.37/nimbus-jose-jwt-9.37.jar
wget https://repo1.maven.org/maven2/com/nimbusds/oauth2-oidc-sdk/9.37.3/oauth2-oidc-sdk-9.37.3.jar
wget https://repo1.maven.org/maven2/org/reactivestreams/reactive-streams/1.0.4/reactive-streams-1.0.4.jar
wget https://repo1.maven.org/maven2/io/projectreactor/reactor-core/3.5.8/reactor-core-3.5.8.jar
wget https://repo1.maven.org/maven2/io/projectreactor/netty/reactor-netty-core/1.1.7/reactor-netty-core-1.1.7.jar
wget https://repo1.maven.org/maven2/io/projectreactor/netty/reactor-netty-http/1.1.7/reactor-netty-http-1.1.7.jar
wget https://repo1.maven.org/maven2/net/java/dev/jna/jna/5.13.0/jna-5.13.0.jar
wget https://repo1.maven.org/maven2/net/java/dev/jna/jna-platform/5.13.0/jna-platform-5.13.0.jar
wget https://repo1.maven.org/maven2/com/microsoft/azure/msal4j/1.14.1/msal4j-1.14.1.jar
wget https://repo1.maven.org/maven2/com/nimbusds/content-type/2.2/content-type-2.2.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.3.4/hadoop-azure-3.3.4.jar

The contents of the folder should then look like this:

iceberg-azure-jars
~/test_onelake/iceberg-azure-jars$ ls
accessors-smart-2.5.1.jar                iceberg-azure-1.10.0.jar                   netty-common-4.1.93.Final.jar
azure-core-1.37.0.jar                    iceberg-core-1.10.0.jar                    netty-handler-4.1.93.Final.jar
azure-core-http-netty-1.13.3.jar         iceberg-spark-runtime-3.4_2.12-1.10.0.jar  netty-resolver-dns-4.1.93.Final.jar
azure-identity-1.8.0.jar                 jna-5.13.0.jar                             netty-transport-4.1.93.Final.jar
azure-storage-blob-12.31.3.jar           jna-platform-5.13.0.jar                    nimbus-jose-jwt-9.37.jar
azure-storage-common-12.30.2.jar         json-smart-2.5.1.jar                       oauth2-oidc-sdk-9.37.3.jar
azure-storage-file-datalake-12.24.3.jar  lang-tag-1.6.jar                           reactive-streams-1.0.4.jar
azure-xml-1.2.0.jar                      msal4j-1.14.1.jar                          reactor-core-3.5.8.jar
content-type-2.2.jar                     netty-buffer-4.1.93.Final.jar              reactor-netty-core-1.1.7.jar
hadoop-azure-3.3.4.jar                   netty-codec-4.1.93.Final.jar               reactor-netty-http-1.1.7.jar

pg_hba.conf configures PostgreSQL client authentication rules, allowing the Hive Metastore container to connect to the database using the hive user over the network.

pg_hba.conf
# Allow Hive Metastore container to connect
host    metastore       hive            0.0.0.0/0               md5

# Keep defaults
local   all             all                                     trust
host    all             all             127.0.0.1/32            trust
host    all             all             ::1/128                 trust
local   replication     all                                     trust
host    replication     all             127.0.0.1/32            trust
host    replication     all             ::1/128                 trust

init_hive.sh runs when PostgreSQL starts and creates the hive user and the metastore database if they do not exist.

init_hive.sh
#!/bin/bash
set -e

psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "postgres" <<-EOSQL
DO \$\$
BEGIN
  IF NOT EXISTS (
      SELECT FROM pg_catalog.pg_roles
      WHERE rolname = 'hive') THEN
      CREATE USER hive WITH PASSWORD 'hive';
  END IF;
END
\$\$;
EOSQL

DB_EXISTS=$(psql -U "$POSTGRES_USER" -d "postgres" -tAc "SELECT 1 FROM pg_database WHERE datname='metastore'")
if [ "$DB_EXISTS" != "1" ]; then
    psql -U "$POSTGRES_USER" -d "postgres" -c "CREATE DATABASE metastore OWNER hive;"
fi

hive-entrypoint.sh is the entry script for the Hive Metastore: it waits for PostgreSQL to be ready, checks whether the Hive schema exists, initializes it if missing, and starts the Metastore service.

hive-entrypoint.sh
#!/bin/bash
set -e

echo "Waiting for PostgreSQL to be ready..."
while ! (echo > /dev/tcp/postgres/5432) >/dev/null 2>&1; do
  sleep 2
done

echo "PostgreSQL is ready. Checking Hive schema status..."

if /opt/hive/bin/schematool -dbType postgres -info >/dev/null 2>&1; then
  echo "Hive schema already exists. Skipping initialization."
else
  echo "Hive schema not found. Initializing..."
  /opt/hive/bin/schematool -dbType postgres -initSchema -verbose
fi

echo "Starting Hive Metastore Thrift service..."
exec /opt/hive/bin/hive --service metastore

hive-site.xml defines Hive’s database connection parameters (URL, driver, username, password) to specify where Hive metadata is stored in PostgreSQL.

hive-site.xml
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://postgres:5432/metastore</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
    </property>
</configuration>

deltalake-spark-config/spark-defaults.conf configures Spark to use Delta Lake extensions, integrate with the Hive Metastore, and access Azure OneLake storage using OAuth credentials.

spark-defaults.conf
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

spark.sql.catalogImplementation=hive
spark.hadoop.hive.metastore.uris=thrift://hive-metastore:9083

spark.sql.warehouse.dir=/tmp/local-warehouse

spark.hadoop.fs.abfss.impl=org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem
spark.hadoop.fs.AbstractFileSystem.abfss.impl=org.apache.hadoop.fs.azurebfs.Abfs

spark.hadoop.fs.azure.account.auth.type.onelake.dfs.fabric.microsoft.com=OAuth

spark.hadoop.fs.azure.account.oauth.provider.type.onelake.dfs.fabric.microsoft.com=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.onelake.dfs.fabric.microsoft.com=<oauth-2.0-client-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.onelake.dfs.fabric.microsoft.com=<oauth-2.0-client-secret>
spark.hadoop.fs.azure.account.oauth2.client.endpoint.onelake.dfs.fabric.microsoft.com=<oauth-2.0-token-endpoint>

spark.master=local[*]

hive-azure-jars is a folder that contains the dependency JARs that enable Spark to access Azure OneLake (via the abfss:// protocol) and write Delta Lake tables, including Azure SDKs, the Hadoop Azure connector, and the Delta Lake libraries. To prepare the jars, run the following script inside hive-azure-jars:

hive-azure-download-jars.sh
#!/bin/bash

# Azure SDK components
wget https://repo1.maven.org/maven2/com/azure/azure-core/1.37.0/azure-core-1.37.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-identity/1.8.0/azure-identity-1.8.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-common/12.25.0/azure-storage-common-12.25.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-blob/12.25.0/azure-storage-blob-12.25.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-file-datalake/12.24.0/azure-storage-file-datalake-12.24.0.jar

# Hadoop Azure connector
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.3.4/hadoop-azure-3.3.4.jar

# Delta Lake core components
wget https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.4.0/delta-core_2.12-2.4.0.jar
wget https://repo1.maven.org/maven2/io/delta/delta-storage/2.4.0/delta-storage-2.4.0.jar

The contents of the folder should then look like this:

hive-azure-jars
~/test_onelake$ ls hive-azure-jars

azure-core-1.37.0.jar           azure-storage-common-12.25.0.jar         delta-storage-2.4.0.jar
azure-identity-1.8.0.jar        azure-storage-file-datalake-12.24.0.jar  hadoop-azure-3.3.4.jar
azure-storage-blob-12.25.0.jar  delta-core_2.12-2.4.0.jar
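
The Delta Lake docker-compose.yaml below also mounts a PostgreSQL JDBC driver (postgresql-42.7.3.jar) from the working directory so that the Hive Metastore can connect to PostgreSQL. If you do not already have it, it can be downloaded from Maven Central, for example:

wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.3/postgresql-42.7.3.jar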

▶ Create a file named docker-compose.yaml in your working directory with the following content. Use the first variant for Iceberg and the second for Delta Lake.

docker-compose.yaml (Iceberg)
networks:
  iceberg-net:
    name: iceberg-net

services:
  # Iceberg REST Catalog
  iceberg-rest:
    image: tabulario/iceberg-rest
    container_name: iceberg-rest
    networks:
      - iceberg-net
    ports:
      - "8181:8181"
    environment:
      CATALOG_CATALOG__IMPL: "org.apache.iceberg.jdbc.JdbcCatalog"
      CATALOG_URI: "jdbc:sqlite:/iceberg/iceberg_catalog.db"
      CATALOG_JDBC_USER: "user"
      CATALOG_JDBC_PASSWORD: "password"

      CATALOG_WAREHOUSE: "abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/iceberg/warehouse"
      CATALOG_IO__IMPL: "org.apache.iceberg.azure.adlsv2.ADLSFileIO"
      AZURE_CLIENT_ID: "<oauth-2.0-client-id>"
      AZURE_CLIENT_SECRET: "<oauth-2.0-client-secret>"
      AZURE_TENANT_ID: "<oauth-2.0-tenant-id>"
      AZURE_AUTHORITY_HOST: "https://login.microsoftonline.com/"
    volumes:
      - /home/ubuntu/iceberg-data:/iceberg
      # make sure '/home/ubuntu/iceberg-data' exists;
      # run 'sudo mkdir -p /home/ubuntu/iceberg-data' in advance if needed

  spark-master:
    image: apache/spark:3.4.2
    container_name: spark
    depends_on:
      - iceberg-rest
    environment:
      SPARK_MODE: master
      SPARK_MASTER_HOST: spark-master
    command: >
      /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - "8080:8080"   # Spark UI
      - "7077:7077"   # Spark Master
    volumes:
      - ./iceberg-spark-config:/opt/spark/conf             # spark-defaults.conf
      - ./iceberg-azure-jars:/opt/spark/external-jars      # Azure jar
    networks:
      - iceberg-net

  puppygraph:
    image: puppygraph/puppygraph:stable
    container_name: puppygraph
    networks:
      - iceberg-net
    environment:
      - PUPPYGRAPH_USERNAME=puppygraph
      - PUPPYGRAPH_PASSWORD=puppygraph123
    ports:
      - "8081:8081"
      - "8182:8182"
      - "7687:7687"
    depends_on:
      - iceberg-rest

docker-compose.yaml (Delta Lake)

networks:
  deltacat-net:
    name: deltacat-net
    driver: bridge

volumes:
  pgdata:

services:
  postgres:
    image: postgres:15
    container_name: postgres
    environment:
      POSTGRES_DB: postgres
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./pg_hba.conf:/etc/postgresql/pg_hba.conf
      - ./init_hive.sh:/docker-entrypoint-initdb.d/init_hive.sh
    command: postgres -c 'hba_file=/etc/postgresql/pg_hba.conf'
    networks:
      - deltacat-net

  hive-metastore:
    image: apache/hive:4.0.0
    container_name: hive-metastore
    environment:
      SERVICE_NAME: metastore
      HIVE_CONF_DIR: /opt/hive/conf
    volumes:
      - ./hive-entrypoint.sh:/hive-entrypoint.sh 
      - ./postgresql-42.7.3.jar:/opt/hive/lib/postgresql-42.7.3.jar
      - ./hive-site.xml:/opt/hive/conf/hive-site.xml
    entrypoint: ["/bin/bash", "/hive-entrypoint.sh"] 
    depends_on:
      - postgres
    networks:
      - deltacat-net
    ports:
      - "9083:9083"

  spark-master:
    image: apache/spark:3.4.2
    container_name: spark
    depends_on:
      - hive-metastore
    environment:
      SPARK_MODE: master
      SPARK_MASTER_HOST: spark-master
    command: >
      /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - "8080:8080"   # Spark UI
      - "7077:7077"   # Spark Master
    volumes:
      - ./deltalake-spark-config:/opt/spark/conf      # spark-defaults.conf
      - ./hive-azure-jars:/opt/spark/external-jars    # Azure jar
    networks:
      - deltacat-net

  puppygraph:
    image: puppygraph/puppygraph:stable
    container_name: puppygraph
    networks:
      - deltacat-net
    environment:
      - PUPPYGRAPH_USERNAME=puppygraph
      - PUPPYGRAPH_PASSWORD=puppygraph123
    ports:
      - "8081:8081"
      - "8182:8182"
      - "7687:7687"
    depends_on:
      - hive-metastore

▶ Then run the following command to start the services and PuppyGraph:

docker compose up -d

For the Iceberg setup, the output looks like:

[+] Running 4/4
✔ Network   iceberg-net         Created
✔ Container iceberg-rest        Started
✔ Container puppygraph          Started
✔ Container spark               Started

For the Delta Lake setup:

[+] Running 5/5
✔ Network   deltacat-net        Created
✔ Container postgres            Started
✔ Container hive-metastore      Started
✔ Container spark               Started
✔ Container puppygraph          Started
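
Before continuing, you can optionally confirm that the containers are up and inspect the PuppyGraph startup logs using standard Docker commands:

docker compose ps
docker logs --tail 20 puppygraph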

Data Preparation

This tutorial is designed to be comprehensive and standalone, so it includes steps to populate data in OneLake. In practical scenarios, PuppyGraph can query data directly from your existing OneLake Iceberg/Delta Lake tables.

▶ Run the following commands to start a Spark SQL shell with access to OneLake.

docker exec -it spark bash

cp /opt/spark/external-jars/* /opt/spark/jars/

/opt/spark/bin/spark-sql

The shell prompt looks like this:

spark-sql (default)> 

▶ To create the tables and insert the data for our example dataset, execute the SQL statements below that match the configuration you selected: the first block is for Iceberg, the second for Delta Lake.

Notes for Iceberg: SET CATALOG iceberg_rest; must be run first. If you deleted any data in Fabric before running the following SQL scripts, also clean /home/ubuntu/iceberg-data first. (For Delta Lake this is not mandatory, but in that situation we still recommend cleaning the pgdata volume in advance; see the reset sketch below.)
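
A minimal reset sketch, assuming the paths and volume names from the docker-compose.yaml files above; only do this if you want to discard the local catalog and metastore state:

docker compose down --volumes            # also removes named volumes such as pgdata (Delta Lake setup)
sudo rm -rf /home/ubuntu/iceberg-data    # removes the local Iceberg REST catalog state (Iceberg setup)
docker compose up -d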

SET CATALOG iceberg_rest;

CREATE DATABASE modern;

USE modern;

CREATE EXTERNAL TABLE person (
  id string,
  name string,
  age int
) USING iceberg LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/iceberg/warehouse/modern/person';

INSERT INTO person VALUES
  ('v1', 'marko', 29),
  ('v2', 'vadas', 27),
  ('v4', 'josh', 32),
  ('v6', 'peter', 35);

CREATE EXTERNAL TABLE software (
  id string,
  name string,
  lang string
) USING iceberg LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/iceberg/warehouse/modern/software';

INSERT INTO software VALUES
  ('v3', 'lop', 'java'),
  ('v5', 'ripple', 'java');

CREATE EXTERNAL TABLE created (
  id string,
  from_id string,
  to_id string,
  weight double
) USING iceberg LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/iceberg/warehouse/modern/created';

INSERT INTO created VALUES
  ('e9', 'v1', 'v3', 0.4),
  ('e10', 'v4', 'v5', 1.0),
  ('e11', 'v4', 'v3', 0.4),
  ('e12', 'v6', 'v3', 0.2);

CREATE EXTERNAL TABLE knows (
  id string,
  from_id string,
  to_id string,
  weight double
) USING iceberg LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/iceberg/warehouse/modern/knows';

INSERT INTO knows VALUES
  ('e7', 'v1', 'v2', 0.5),
  ('e8', 'v1', 'v4', 1.0);

For Delta Lake, run the following SQL instead:

CREATE DATABASE modern;

USE modern;

CREATE TABLE person (
  id string,
  name string,
  age int
) USING DELTA LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/deltalake/warehouse/modern/person';

INSERT INTO person VALUES
  ('v1', 'marko', 29),
  ('v2', 'vadas', 27),
  ('v4', 'josh', 32),
  ('v6', 'peter', 35);

CREATE EXTERNAL TABLE software (
  id string,
  name string,
  lang string
) USING DELTA LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/deltalake/warehouse/modern/software';

INSERT INTO software VALUES
  ('v3', 'lop', 'java'),
  ('v5', 'ripple', 'java');

CREATE EXTERNAL TABLE created (
  id string,
  from_id string,
  to_id string,
  weight double
) USING DELTA LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/deltalake/warehouse/modern/created';

INSERT INTO created VALUES
  ('e9', 'v1', 'v3', 0.4),
  ('e10', 'v4', 'v5', 1.0),
  ('e11', 'v4', 'v3', 0.4),
  ('e12', 'v6', 'v3', 0.2);

CREATE EXTERNAL TABLE knows (
  id string,
  from_id string,
  to_id string,
  weight double
) USING DELTA LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/deltalake/warehouse/modern/knows';

INSERT INTO knows VALUES
  ('e7', 'v1', 'v2', 0.5),
  ('e8', 'v1', 'v4', 1.0);

The above SQL creates the following tables:

person:
id  name   age
v1  marko  29
v2  vadas  27
v4  josh   32
v6  peter  35

software:
id  name    lang
v3  lop     java
v5  ripple  java

knows:
id  from_id  to_id  weight
e7  v1       v2     0.5
e8  v1       v4     1.0

created:
id   from_id  to_id  weight
e9   v1       v3     0.4
e10  v4       v5     1.0
e11  v4       v3     0.4
e12  v6       v3     0.2
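
To double-check that the rows actually landed in OneLake, you can optionally query a table back from the same Spark container. Below is a sketch for the Iceberg variant (for Delta Lake, query modern.person through the default Hive catalog instead); it assumes the container is still running and the JARs were copied as in the earlier step.

docker exec -it spark /opt/spark/bin/spark-sql -e "SELECT * FROM iceberg_rest.modern.person ORDER BY id;"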

Modeling a Graph

We then define a graph on top of the data tables we just created. This is the "Modern" graph defined by Apache TinkerPop.

Modern Graph

A schema tells PuppyGraph how to map the OneLake Iceberg/Delta Lake data into a graph. PuppyGraph offers various methods for schema creation; for this tutorial, we have already prepared a schema to save time.

▶ Create a PuppyGraph schema file schema.json with the following content, filling in your OAuth credentials. Use the first variant for Iceberg and the second for Delta Lake:

schema.json (Iceberg)

{
  "catalogs": [
    {
      "name": "onelake_iceberg",
      "type": "iceberg",
      "metastore": {
        "type": "rest",
        "uri": "http://iceberg-rest:8181"
      },
      "storage": {
        "type": "AzureDLS2",
        "clientId": "<oauth-2.0-client-id>",
        "clientSecret": "<oauth-2.0-client-secret>",
        "clientEndpoint": "<oauth-2.0-token-endpoint>"
      }
    }
  ],
  "graph": {
    "vertices": [
      {
        "label": "person",
        "oneToOne": {
          "tableSource": {
            "catalog": "onelake_iceberg",
            "schema": "modern",
            "table": "person"
          },
          "id": {
            "fields": [
              {
                "type": "String",
                "field": "id",
                "alias": "id"
              }
            ]
          },
          "attributes": [
            {
              "type": "Int",
              "field": "age",
              "alias": "age"
            },
            {
              "type": "String",
              "field": "name",
              "alias": "name"
            }
          ]
        }
      },
      {
        "label": "software",
        "oneToOne": {
          "tableSource": {
            "catalog": "onelake_iceberg",
            "schema": "modern",
            "table": "software"
          },
          "id": {
            "fields": [
              {
                "type": "String",
                "field": "id",
                "alias": "id"
              }
            ]
          },
          "attributes": [
            {
              "type": "String",
              "field": "lang",
              "alias": "lang"
            },
            {
              "type": "String",
              "field": "name",
              "alias": "name"
            }
          ]
        }
      }
    ],
    "edges": [
      {
        "label": "knows",
        "fromVertex": "person",
        "toVertex": "person",
        "tableSource": {
          "catalog": "onelake_iceberg",
          "schema": "modern",
          "table": "knows"
        },
        "id": {
          "fields": [
            {
              "type": "String",
              "field": "id",
              "alias": "id"
            }
          ]
        },
        "fromId": {
          "fields": [
            {
              "type": "String",
              "field": "from_id",
              "alias": "from_id"
            }
          ]
        },
        "toId": {
          "fields": [
            {
              "type": "String",
              "field": "to_id",
              "alias": "to_id"
            }
          ]
        },
        "attributes": [
          {
            "type": "Double",
            "field": "weight",
            "alias": "weight"
          }
        ]
      },
      {
        "label": "created",
        "fromVertex": "person",
        "toVertex": "software",
        "tableSource": {
          "catalog": "onelake_iceberg",
          "schema": "modern",
          "table": "created"
        },
        "id": {
          "fields": [
            {
              "type": "String",
              "field": "id",
              "alias": "id"
            }
          ]
        },
        "fromId": {
          "fields": [
            {
              "type": "String",
              "field": "from_id",
              "alias": "from_id"
            }
          ]
        },
        "toId": {
          "fields": [
            {
              "type": "String",
              "field": "to_id",
              "alias": "to_id"
            }
          ]
        },
        "attributes": [
          {
            "type": "Double",
            "field": "weight",
            "alias": "weight"
          }
        ]
      }
    ]
  }
}

schema.json (Delta Lake)

{
  "catalogs": [
    {
      "name": "delta_hms_onelake",
      "type": "deltalake",
      "metastore": {
        "type": "HMS",
        "hiveMetastoreUrl": "thrift://hive-metastore:9083"
      },
      "storage": {
        "type": "AzureDLS2",
        "clientId": "<oauth-2.0-client-id>",
        "clientSecret": "<oauth-2.0-client-secret>",
        "clientEndpoint": "<oauth-2.0-token-endpoint>"
      }
    }
  ],
  "graph": {
    "vertices": [
      {
        "label": "person",
        "oneToOne": {
          "tableSource": {
            "catalog": "delta_hms_onelake",
            "schema": "modern",
            "table": "person"
          },
          "id": {
            "fields": [
              {
                "type": "String",
                "field": "id",
                "alias": "id"
              }
            ]
          },
          "attributes": [
            {
              "type": "Int",
              "field": "age",
              "alias": "age"
            },
            {
              "type": "String",
              "field": "name",
              "alias": "name"
            }
          ]
        }
      },
      {
        "label": "software",
        "oneToOne": {
          "tableSource": {
            "catalog": "delta_hms_onelake",
            "schema": "modern",
            "table": "software"
          },
          "id": {
            "fields": [
              {
                "type": "String",
                "field": "id",
                "alias": "id"
              }
            ]
          },
          "attributes": [
            {
              "type": "String",
              "field": "lang",
              "alias": "lang"
            },
            {
              "type": "String",
              "field": "name",
              "alias": "name"
            }
          ]
        }
      }
    ],
    "edges": [
      {
        "label": "knows",
        "fromVertex": "person",
        "toVertex": "person",
        "tableSource": {
          "catalog": "delta_hms_onelake",
          "schema": "modern",
          "table": "knows"
        },
        "id": {
          "fields": [
            {
              "type": "String",
              "field": "id",
              "alias": "id"
            }
          ]
        },
        "fromId": {
          "fields": [
            {
              "type": "String",
              "field": "from_id",
              "alias": "from_id"
            }
          ]
        },
        "toId": {
          "fields": [
            {
              "type": "String",
              "field": "to_id",
              "alias": "to_id"
            }
          ]
        },
        "attributes": [
          {
            "type": "Double",
            "field": "weight",
            "alias": "weight"
          }
        ]
      },
      {
        "label": "created",
        "fromVertex": "person",
        "toVertex": "software",
        "tableSource": {
          "catalog": "delta_hms_onelake",
          "schema": "modern",
          "table": "created"
        },
        "id": {
          "fields": [
            {
              "type": "String",
              "field": "id",
              "alias": "id"
            }
          ]
        },
        "fromId": {
          "fields": [
            {
              "type": "String",
              "field": "from_id",
              "alias": "from_id"
            }
          ]
        },
        "toId": {
          "fields": [
            {
              "type": "String",
              "field": "to_id",
              "alias": "to_id"
            }
          ]
        },
        "attributes": [
          {
            "type": "Double",
            "field": "weight",
            "alias": "weight"
          }
        ]
      }
    ]
  }
}

▶ Log into PuppyGraph Web UI at http://localhost:8081 with username puppygraph and password puppygraph123.

PuppyGraph Login

▶ Upload the schema by selecting the file schema.json in the Upload Graph Schema JSON block and clicking on Upload.

Upload Schema Page

Once the schema is uploaded, the schema page shows the visualized graph schema as follows.

Visualized Schema

Alternative: Schema Uploading via CLI

▶ Alternatively, run the following command to upload the schema file:

curl -XPOST -H "content-type: application/json" --data-binary @./schema.json --user "puppygraph:puppygraph123" localhost:8081/schema

The response shows that the graph schema has been uploaded successfully:

{"Status":"OK","Message":"Schema uploaded and gremlin server restarted"}

Querying the Graph

In this tutorial we will use the Gremlin query language to query the Graph. Gremlin is a graph query language developed by Apache TinkerPop. Prior knowledge of Gremlin is not necessary to follow the tutorial. To learn more about it, visit https://tinkerpop.apache.org/gremlin.html.

▶ Click on the Query panel on the left side. The Gremlin Query tab offers an interactive environment for querying the graph using Gremlin.

Interactive Gremlin Query Page

Queries are entered on the left side, and the right side displays the graph visualization.

The first query retrieves the properties of the person named "marko".

▶ Copy the following query, paste it in the query input, and click on the run button.

g.V().has("name", "marko").valueMap()

The output is plain text like the following:

Rows: 1
age              29
name             marko

Now let's also leverage the visualization. The next query gets all the software created by people known to "marko".

▶ Copy the following query, paste it in the query input, and click on the run button.

g.V().has("name", "marko")
  .out("knows").out("created").path()

The output is as follows. There are two paths in the result, as "marko" knows "josh", who created "lop" and "ripple".

Interactive Query with Results

Alternative: Querying the graph via CLI

Alternatively, we can query the graph via CLI.

▶ Execute the following command to access the PuppyGraph Gremlin Console:

docker exec -it puppygraph ./bin/console

The welcome screen appears as follows:

  ____                                     ____                          _
 |  _ \   _   _   _ __    _ __    _   _   / ___|  _ __    __ _   _ __   | |__
 | |_) | | | | | | '_ \  | '_ \  | | | | | |  _  | '__|  / _` | | '_ \  | '_ \
 |  __/  | |_| | | |_) | | |_) | | |_| | | |_| | | |    | (_| | | |_) | | | | |
 |_|      \__,_| | .__/  | .__/   \__, |  \____| |_|     \__,_| | .__/  |_| |_|
                 |_|     |_|      |___/                         |_|
Welcome to PuppyGraph!
version: 0.10

puppy-gremlin> 

▶ Run the following queries in the console to query the Graph.

g.V().has("name", "marko").valueMap()

Properties of the person named "marko":

puppy-gremlin> g.V().has("name", "marko").valueMap()
Done! Elapsed time: 0.059s, rows: 1
==>map[age:29 name:marko]

g.V().has("name", "marko").out("knows").out("created").valueMap()

All the software created by the people known to "marko":

puppy-gremlin> g.V().has("name", "marko").out("knows").out("created").valueMap()
Done! Elapsed time: 0.042s, rows: 2
==>map[lang:java name:lop]
==>map[lang:java name:ripple]

▶ To exit PuppyGraph Gremlin Console, enter the command:

:exit

Cleaning up

▶ Run the following command to shut down and remove the services:

docker compose down
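
docker compose down stops and removes the containers but keeps persisted state. If you also want a completely clean slate (assuming the volume and path names used in the compose files above), you can additionally run:

docker compose down --volumes            # also removes named volumes such as pgdata (Delta Lake setup)
sudo rm -rf /home/ubuntu/iceberg-data    # removes the host folder used by the Iceberg REST catalog (Iceberg setup)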

See Also

Please refer to Connecting to Apache Iceberg and Connecting to Delta Lake for connecting to other Iceberg and Delta Lake deployments.

Azure OneLake allows users to convert between the Iceberg and Delta Lake table formats; see Use Iceberg tables with OneLake.