Querying Azure OneLake Data as a Graph
Summary
In this tutorial, you will:
- Create an Azure OneLake data lake and load example data;
- Deploy PuppyGraph in Docker and query OneLake data as a graph.
Prerequisites
Before starting, you need to set up your OAuth 2.0 client credentials, which include the Client ID, Client Secret, and Tenant ID. In some cases a Client Endpoint is also required; it takes the form https://login.microsoftonline.com/<your-tenant-id>/oauth2/token.
The steps for obtaining the credentials and access to Azure Storage are as follows:
Go to https://portal.azure.com/#home
Search 'app registrations' and click on 'New registration'.

App Registration
Give the new app a name; the other settings can be left at their defaults.

App Configuration
On the app overview page, you can find the client ID and tenant ID.

App Overview
Then, on the app overview page, click on 'Certificates & secrets'.

Manage Secrets
Click on 'New client secret', fill in 'Description' and 'Expires', then confirm. Record the secret value in a safe place; it will be used as the credential later.

Create Client Secret
Then grant your app access to Azure Storage:
Search 'storage accounts' in the portal, go to the page and click on 'create'.

Create Storage Accounts
It is sufficient to set only the account name and leave the other fields at their defaults.
The administrator then goes to the storage account, clicks on 'Access Control (IAM)', and clicks on 'Add role assignment' to grant your app access (Role: Storage Blob Data Contributor or higher; Assign access to: User, group, or service principal; Select: your app). After clicking 'Save', your app has access to the storage through OAuth 2.0.

Grant Storage Access
If it succeeds, your administrator will see a confirmation as follows:

Success Confirmation
After your credentials are ready, you need to create your own workspace and lakehouse in Microsoft Fabric. This tutorial uses demo_ws and demo_lh as the example workspace and lakehouse. Remember to grant access in Fabric to the app you created in the previous step.
To prepare the environment in Fabric, please follow the steps below:
Go to https://app.fabric.microsoft.com/, log in, start a trial if needed, navigate to the portal, click 'New workspace', fill in the name, and confirm.

Create Workspace
Select and enter this workspace ('demo_ws' in this example).

Enter Workspace
Then click on 'New Item', search for 'lakehouse', and click the box to create a lakehouse. (Only the name is needed; it is 'demo_lh' in this example.)

Create Lakehouse
Now the path to the lakehouse will look like abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse; you will access the OneLake data through this path.
Go back to the workspace page, and click 'Manage access'.

Manage Access
Search for your app name, set the permission to 'Member', then click 'Add'.

Add Member
Now the setup is complete. You can view the structure of folders and files on the lakehouse page.

Explore Data
Lastly, please ensure that Docker Compose is available. The installation can be verified by running the command below.
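For example, the following command prints the installed Compose version if the plugin is available:
docker compose version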
See https://docs.docker.com/compose/install/ for Docker Compose installation instructions and https://www.docker.com/get-started/ for more details on Docker.
Accessing the PuppyGraph Web UI requires a browser. However, the tutorial offers alternative instructions for those who wish to exclusively use the CLI.
Deployment
Before we start, we need to prepare some prerequisite files and folders for Docker Compose. Put them all together in the root folder where you are going to run
docker compose
iceberg-spark-config/spark-defaults.conf configures Spark to use Iceberg extensions, integrate with the REST Catalog, and access Azure OneLake storage using OAuth credentials.
spark-defaults.conf
# Iceberg Spark Extensions
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.iceberg_rest=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_rest.type=rest
spark.sql.catalog.iceberg_rest.uri=http://iceberg-rest:8181
spark.sql.catalog.iceberg_rest.io-impl=org.apache.iceberg.hadoop.HadoopFileIO
spark.hadoop.fs.abfss.impl=org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem
spark.hadoop.fs.AbstractFileSystem.abfss.impl=org.apache.hadoop.fs.azurebfs.Abfs
spark.hadoop.fs.azure.account.auth.type.onelake.dfs.fabric.microsoft.com=OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.onelake.dfs.fabric.microsoft.com=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.onelake.dfs.fabric.microsoft.com=<oauth-2.0-client-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.onelake.dfs.fabric.microsoft.com=<oauth-2.0-client-secret>
spark.hadoop.fs.azure.account.oauth2.client.endpoint.onelake.dfs.fabric.microsoft.com=<oauth-2.0-token-endpoint>
spark.master=local[*]
iceberg-azure-jars is a folder that contains dependency JARs that enable Spark to access Azure OneLake (via the abfss:// protocol), including Azure SDKs and Hadoop connectors. To prepare the jars, run the following script in iceberg-azure-jars:
iceberg-azure-download-jars.sh
#!/bin/bash
# Azure
wget https://repo1.maven.org/maven2/com/azure/azure-core/1.37.0/azure-core-1.37.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-core-http-netty/1.13.3/azure-core-http-netty-1.13.3.jar
wget https://repo1.maven.org/maven2/com/azure/azure-identity/1.8.0/azure-identity-1.8.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-blob/12.31.3/azure-storage-blob-12.31.3.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-common/12.30.2/azure-storage-common-12.30.2.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-file-datalake/12.24.3/azure-storage-file-datalake-12.24.3.jar
wget https://repo1.maven.org/maven2/com/azure/azure-xml/1.2.0/azure-xml-1.2.0.jar
# Iceberg
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-core/1.10.0/iceberg-core-1.10.0.jar
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-azure/1.10.0/iceberg-azure-1.10.0.jar
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.4_2.12/1.10.0/iceberg-spark-runtime-3.4_2.12-1.10.0.jar
# Netty
wget https://repo1.maven.org/maven2/io/netty/netty-buffer/4.1.93.Final/netty-buffer-4.1.93.Final.jar
wget https://repo1.maven.org/maven2/io/netty/netty-common/4.1.93.Final/netty-common-4.1.93.Final.jar
wget https://repo1.maven.org/maven2/io/netty/netty-codec/4.1.93.Final/netty-codec-4.1.93.Final.jar
wget https://repo1.maven.org/maven2/io/netty/netty-handler/4.1.93.Final/netty-handler-4.1.93.Final.jar
wget https://repo1.maven.org/maven2/io/netty/netty-resolver-dns/4.1.93.Final/netty-resolver-dns-4.1.93.Final.jar
wget https://repo1.maven.org/maven2/io/netty/netty-transport/4.1.93.Final/netty-transport-4.1.93.Final.jar
# Others
wget https://repo1.maven.org/maven2/net/minidev/json-smart/2.5.1/json-smart-2.5.1.jar
wget https://repo1.maven.org/maven2/com/nimbusds/lang-tag/1.6/lang-tag-1.6.jar
wget https://repo1.maven.org/maven2/net/minidev/accessors-smart/2.5.1/accessors-smart-2.5.1.jar
wget https://repo1.maven.org/maven2/com/nimbusds/nimbus-jose-jwt/9.37/nimbus-jose-jwt-9.37.jar
wget https://repo1.maven.org/maven2/com/nimbusds/oauth2-oidc-sdk/9.37.3/oauth2-oidc-sdk-9.37.3.jar
wget https://repo1.maven.org/maven2/org/reactivestreams/reactive-streams/1.0.4/reactive-streams-1.0.4.jar
wget https://repo1.maven.org/maven2/io/projectreactor/reactor-core/3.5.8/reactor-core-3.5.8.jar
wget https://repo1.maven.org/maven2/io/projectreactor/netty/reactor-netty-core/1.1.7/reactor-netty-core-1.1.7.jar
wget https://repo1.maven.org/maven2/io/projectreactor/netty/reactor-netty-http/1.1.7/reactor-netty-http-1.1.7.jar
wget https://repo1.maven.org/maven2/net/java/dev/jna/jna/5.13.0/jna-5.13.0.jar
wget https://repo1.maven.org/maven2/net/java/dev/jna/jna-platform/5.13.0/jna-platform-5.13.0.jar
wget https://repo1.maven.org/maven2/com/microsoft/azure/msal4j/1.14.1/msal4j-1.14.1.jar
wget https://repo1.maven.org/maven2/com/nimbusds/content-type/2.2/content-type-2.2.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.3.4/hadoop-azure-3.3.4.jar
Then the contents of the folder should look like this:
iceberg-azure-jars
~/test_onelake/iceberg-azure-jars$ ls
accessors-smart-2.5.1.jar iceberg-azure-1.10.0.jar netty-common-4.1.93.Final.jar
azure-core-1.37.0.jar iceberg-core-1.10.0.jar netty-handler-4.1.93.Final.jar
azure-core-http-netty-1.13.3.jar iceberg-spark-runtime-3.4_2.12-1.10.0.jar netty-resolver-dns-4.1.93.Final.jar
azure-identity-1.8.0.jar jna-5.13.0.jar netty-transport-4.1.93.Final.jar
azure-storage-blob-12.31.3.jar jna-platform-5.13.0.jar nimbus-jose-jwt-9.37.jar
azure-storage-common-12.30.2.jar json-smart-2.5.1.jar oauth2-oidc-sdk-9.37.3.jar
azure-storage-file-datalake-12.24.3.jar lang-tag-1.6.jar reactive-streams-1.0.4.jar
azure-xml-1.2.0.jar msal4j-1.14.1.jar reactor-core-3.5.8.jar
content-type-2.2.jar netty-buffer-4.1.93.Final.jar reactor-netty-core-1.1.7.jar
hadoop-azure-3.3.4.jar netty-codec-4.1.93.Final.jar reactor-netty-http-1.1.7.jar
pg_hba.conf configures PostgreSQL client authentication rules, allowing the Hive Metastore container to connect to the database using the hive user over the network.
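The exact rules depend on your environment; a minimal illustrative sketch (not necessarily the tutorial's exact file) that trusts local connections and allows password authentication for the hive user over the Docker network could look like this:
pg_hba.conf
# TYPE  DATABASE   USER      ADDRESS       METHOD
local   all        all                     trust
host    all        all       127.0.0.1/32  trust
host    metastore  hive      0.0.0.0/0     md5
host    all        postgres  0.0.0.0/0     md5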
init_hive.sh initializes the Hive Metastore database and user when PostgreSQL starts by creating the hive user and metastore database if they do not exist.
init_hive.sh
#!/bin/bash
set -e

# Create the 'hive' role if it does not already exist.
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "postgres" <<-EOSQL
DO \$\$
BEGIN
  IF NOT EXISTS (
    SELECT FROM pg_catalog.pg_roles
    WHERE rolname = 'hive') THEN
    CREATE USER hive WITH PASSWORD 'hive';
  END IF;
END
\$\$;
EOSQL

# Create the 'metastore' database owned by 'hive' if it does not already exist.
DB_EXISTS=$(psql -U "$POSTGRES_USER" -d "postgres" -tAc "SELECT 1 FROM pg_database WHERE datname='metastore'")
if [ "$DB_EXISTS" != "1" ]; then
  psql -U "$POSTGRES_USER" -d "postgres" -c "CREATE DATABASE metastore OWNER hive;"
fi
hive-entrypoint.sh is the entry script for the Hive Metastore: it waits for PostgreSQL to be ready, checks whether the Hive schema exists, initializes it if missing, and starts the Metastore service.
hive-entrypoint.sh
#!/bin/bash
set -e

echo "Waiting for PostgreSQL to be ready..."
while ! (echo > /dev/tcp/postgres/5432) >/dev/null 2>&1; do
  sleep 2
done

echo "PostgreSQL is ready. Checking Hive schema status..."
if /opt/hive/bin/schematool -dbType postgres -info >/dev/null 2>&1; then
  echo "Hive schema already exists. Skipping initialization."
else
  echo "Hive schema not found. Initializing..."
  /opt/hive/bin/schematool -dbType postgres -initSchema -verbose
fi

echo "Starting Hive Metastore Thrift service..."
exec /opt/hive/bin/hive --service metastore
hive-site.xml defines Hive’s database connection parameters (URL, driver, username, password) to specify where Hive metadata is stored in PostgreSQL.
hive-site.xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://postgres:5432/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
</configuration>
deltalake-spark-config/spark-defaults.conf configures Spark to use Delta Lake extensions, integrate with the Hive Metastore, and access Azure OneLake storage using OAuth credentials.
spark-defaults.conf
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalogImplementation=hive
spark.hadoop.hive.metastore.uris=thrift://hive-metastore:9083
spark.sql.warehouse.dir=/tmp/local-warehouse
spark.hadoop.fs.abfss.impl=org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem
spark.hadoop.fs.AbstractFileSystem.abfss.impl=org.apache.hadoop.fs.azurebfs.Abfs
spark.hadoop.fs.azure.account.auth.type.onelake.dfs.fabric.microsoft.com=OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.onelake.dfs.fabric.microsoft.com=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.onelake.dfs.fabric.microsoft.com=<oauth-2.0-client-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.onelake.dfs.fabric.microsoft.com=<oauth-2.0-client-secret>
spark.hadoop.fs.azure.account.oauth2.client.endpoint.onelake.dfs.fabric.microsoft.com=<oauth-2.0-token-endpoint>
spark.master=local[*]
hive-azure-jars is a folder that contains dependency JARs that enable Spark to access Azure OneLake (via the abfss:// protocol), including Azure SDKs and Hadoop connectors. To prepare the jars, run the following script in hive-azure-jars:
hive-azure-download-jars.sh
#!/bin/bash
# Azure SDK components
wget https://repo1.maven.org/maven2/com/azure/azure-core/1.37.0/azure-core-1.37.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-identity/1.8.0/azure-identity-1.8.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-common/12.25.0/azure-storage-common-12.25.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-blob/12.25.0/azure-storage-blob-12.25.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-file-datalake/12.24.0/azure-storage-file-datalake-12.24.0.jar
# Hadoop Azure connector
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.3.4/hadoop-azure-3.3.4.jar
# Delta Lake core components
wget https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.4.0/delta-core_2.12-2.4.0.jar
wget https://repo1.maven.org/maven2/io/delta/delta-storage/2.4.0/delta-storage-2.4.0.jar
Then the contents of the folder should include the JARs downloaded by the script above, for example (the directory path shown mirrors the earlier Iceberg listing):
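hive-azure-jars
~/test_onelake/hive-azure-jars$ ls
azure-core-1.37.0.jar             azure-storage-file-datalake-12.24.0.jar
azure-identity-1.8.0.jar          delta-core_2.12-2.4.0.jar
azure-storage-blob-12.25.0.jar    delta-storage-2.4.0.jar
azure-storage-common-12.25.0.jar  hadoop-azure-3.3.4.jar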
Create a file named
docker-compose.yaml in your working directory with the following content.
docker-compose.yaml
networks:
  iceberg-net:
    name: iceberg-net

services:
  # Iceberg REST Catalog
  iceberg-rest:
    image: tabulario/iceberg-rest
    container_name: iceberg-rest
    networks:
      - iceberg-net
    ports:
      - "8181:8181"
    environment:
      CATALOG_CATALOG__IMPL: "org.apache.iceberg.jdbc.JdbcCatalog"
      CATALOG_URI: "jdbc:sqlite:/iceberg/iceberg_catalog.db"
      CATALOG_JDBC_USER: "user"
      CATALOG_JDBC_PASSWORD: "password"
      CATALOG_WAREHOUSE: "abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/iceberg/warehouse"
      CATALOG_IO__IMPL: "org.apache.iceberg.azure.adlsv2.ADLSFileIO"
      AZURE_CLIENT_ID: "<oauth-2.0-client-id>"
      AZURE_CLIENT_SECRET: "<oauth-2.0-client-secret>"
      AZURE_TENANT_ID: "<oauth-2.0-tenant-id>"
      AZURE_AUTHORITY_HOST: "https://login.microsoftonline.com/"
    volumes:
      - /home/ubuntu/iceberg-data:/iceberg
      # make sure '/home/ubuntu' exists or
      # run 'sudo mkdir -p /home/ubuntu/iceberg-data' in advance

  spark-master:
    image: apache/spark:3.4.2
    container_name: spark
    depends_on:
      - iceberg-rest
    environment:
      SPARK_MODE: master
      SPARK_MASTER_HOST: spark-master
    command: >
      /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - "8080:8080"   # Spark UI
      - "7077:7077"   # Spark Master
    volumes:
      - ./iceberg-spark-config:/opt/spark/conf          # spark-defaults.conf
      - ./iceberg-azure-jars:/opt/spark/external-jars   # Azure jar
    networks:
      - iceberg-net

  puppygraph:
    image: puppygraph/puppygraph:stable
    container_name: puppygraph
    networks:
      - iceberg-net
    environment:
      - PUPPYGRAPH_USERNAME=puppygraph
      - PUPPYGRAPH_PASSWORD=puppygraph123
    ports:
      - "8081:8081"
      - "8182:8182"
      - "7687:7687"
    depends_on:
      - iceberg-rest
docker-compose.yaml
networks:
  deltacat-net:
    name: deltacat-net
    driver: bridge

volumes:
  pgdata:

services:
  postgres:
    image: postgres:15
    container_name: postgres
    environment:
      POSTGRES_DB: postgres
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./pg_hba.conf:/etc/postgresql/pg_hba.conf
      - ./init_hive.sh:/docker-entrypoint-initdb.d/init_hive.sh
    command: postgres -c 'hba_file=/etc/postgresql/pg_hba.conf'
    networks:
      - deltacat-net

  hive-metastore:
    image: apache/hive:4.0.0
    container_name: hive-metastore
    environment:
      SERVICE_NAME: metastore
      HIVE_CONF_DIR: /opt/hive/conf
    volumes:
      - ./hive-entrypoint.sh:/hive-entrypoint.sh
      - ./postgresql-42.7.3.jar:/opt/hive/lib/postgresql-42.7.3.jar
      - ./hive-site.xml:/opt/hive/conf/hive-site.xml
    entrypoint: ["/bin/bash", "/hive-entrypoint.sh"]
    depends_on:
      - postgres
    networks:
      - deltacat-net
    ports:
      - "9083:9083"

  spark-master:
    image: apache/spark:3.4.2
    container_name: spark
    depends_on:
      - hive-metastore
    environment:
      SPARK_MODE: master
      SPARK_MASTER_HOST: spark-master
    command: >
      /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - "8080:8080"   # Spark UI
      - "7077:7077"   # Spark Master
    volumes:
      - ./deltalake-spark-config:/opt/spark/conf        # spark-defaults.conf
      - ./hive-azure-jars:/opt/spark/external-jars      # Azure jar
    networks:
      - deltacat-net

  puppygraph:
    image: puppygraph/puppygraph:stable
    container_name: puppygraph
    networks:
      - deltacat-net
    environment:
      - PUPPYGRAPH_USERNAME=puppygraph
      - PUPPYGRAPH_PASSWORD=puppygraph123
    ports:
      - "8081:8081"
      - "8182:8182"
      - "7687:7687"
    depends_on:
      - hive-metastore
Then run the following command to start the services (Iceberg or Delta Lake, depending on which docker-compose.yaml you chose) and PuppyGraph:
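A typical invocation, run from the directory containing the chosen docker-compose.yaml:
docker compose up -d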
Data Preparation
This tutorial is designed to be comprehensive and standalone, so it includes steps to populate data in OneLake. In practical scenarios, PuppyGraph can query data directly from your existing OneLake Iceberg/Delta Lake tables.
Run the following command to start a Spark SQL shell to access OneLake.
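For example, the shell can be opened inside the Spark container defined above (the container name spark comes from the compose file); the Azure JARs mounted at /opt/spark/external-jars may need to be added to the driver classpath, e.g.:
docker exec -it spark /opt/spark/bin/spark-sql --driver-class-path '/opt/spark/external-jars/*'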
The shell will be like this:
To create the necessary tables and insert data for our example dataset, execute the following SQL statements in the shell, choosing the variant that matches your configuration:
Notes for Iceberg:
SET CATALOG iceberg_rest; must be run first.
If you deleted any data in Fabric before running the following SQL scripts, make sure to clean /home/ubuntu/iceberg-data first as well. (For Delta Lake this is not mandatory, but we still recommend cleaning the pgdata volume in advance in this situation.)
SET CATALOG iceberg_rest;
CREATE DATABASE modern;
USE modern;
CREATE EXTERNAL TABLE person (
id string,
name string,
age int
) USING iceberg LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/iceberg/warehouse/modern/person';
INSERT INTO person VALUES
('v1', 'marko', 29),
('v2', 'vadas', 27),
('v4', 'josh', 32),
('v6', 'peter', 35);
CREATE EXTERNAL TABLE software (
id string,
name string,
lang string
) USING iceberg LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/iceberg/warehouse/modern/software';
INSERT INTO software VALUES
('v3', 'lop', 'java'),
('v5', 'ripple', 'java');
CREATE EXTERNAL TABLE created (
id string,
from_id string,
to_id string,
weight double
) USING iceberg LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/iceberg/warehouse/modern/created';
INSERT INTO created VALUES
('e9', 'v1', 'v3', 0.4),
('e10', 'v4', 'v5', 1.0),
('e11', 'v4', 'v3', 0.4),
('e12', 'v6', 'v3', 0.2);
CREATE EXTERNAL TABLE knows (
id string,
from_id string,
to_id string,
weight double
) USING iceberg LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/iceberg/warehouse/modern/knows';
INSERT INTO knows VALUES
('e7', 'v1', 'v2', 0.5),
('e8', 'v1', 'v4', 1.0);
CREATE DATABASE modern;
USE modern;
CREATE TABLE person (
id string,
name string,
age int
) USING DELTA LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/deltalake/warehouse/modern/person';
INSERT INTO person VALUES
('v1', 'marko', 29),
('v2', 'vadas', 27),
('v4', 'josh', 32),
('v6', 'peter', 35);
CREATE EXTERNAL TABLE software (
id string,
name string,
lang string
) USING DELTA LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/deltalake/warehouse/modern/software';
INSERT INTO software VALUES
('v3', 'lop', 'java'),
('v5', 'ripple', 'java');
CREATE EXTERNAL TABLE created (
id string,
from_id string,
to_id string,
weight double
) USING DELTA LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/deltalake/warehouse/modern/created';
INSERT INTO created VALUES
('e9', 'v1', 'v3', 0.4),
('e10', 'v4', 'v5', 1.0),
('e11', 'v4', 'v3', 0.4),
('e12', 'v6', 'v3', 0.2);
CREATE EXTERNAL TABLE knows (
id string,
from_id string,
to_id string,
weight double
) USING DELTA LOCATION 'abfss://demo_ws@onelake.dfs.fabric.microsoft.com/demo_lh.lakehouse/Files/deltalake/warehouse/modern/knows';
INSERT INTO knows VALUES
('e7', 'v1', 'v2', 0.5),
('e8', 'v1', 'v4', 1.0);
The above SQL creates the following tables:
person:
| id | name | age |
|---|---|---|
| v1 | marko | 29 |
| v2 | vadas | 27 |
| v4 | josh | 32 |
| v6 | peter | 35 |

software:
| id | name | lang |
|---|---|---|
| v3 | lop | java |
| v5 | ripple | java |

knows:
| id | from_id | to_id | weight |
|---|---|---|---|
| e7 | v1 | v2 | 0.5 |
| e8 | v1 | v4 | 1.0 |

created:
| id | from_id | to_id | weight |
|---|---|---|---|
| e9 | v1 | v3 | 0.4 |
| e10 | v4 | v5 | 1.0 |
| e11 | v4 | v3 | 0.4 |
| e12 | v6 | v3 | 0.2 |
Modeling a Graph
We then define a graph on top of the data tables we just created. This is the "Modern" graph defined by Apache TinkerPop.

Modern Graph
A schema tells PuppyGraph how to map the OneLake Iceberg/Delta Lake data into a graph. PuppyGraph offers various methods for schema creation; for this tutorial, we have already prepared a schema to save time.
Create a PuppyGraph schema file
schema.json with the following content:
schema.json
{
"catalogs": [
{
"name": "onelake_iceberg",
"type": "iceberg",
"metastore": {
"type": "rest",
"uri": "http://iceberg-rest:8181"
},
"storage": {
"type": "AzureDLS2",
"clientId": "<oauth-2.0-client-id>",
"clientSecret": "<oauth-2.0-client-secret>",
"clientEndpoint": "<oauth-2.0-token-endpoint>"
}
}
],
"graph": {
"vertices": [
{
"label": "person",
"oneToOne": {
"tableSource": {
"catalog": "onelake_iceberg",
"schema": "modern",
"table": "person"
},
"id": {
"fields": [
{
"type": "String",
"field": "id",
"alias": "id"
}
]
},
"attributes": [
{
"type": "Int",
"field": "age",
"alias": "age"
},
{
"type": "String",
"field": "name",
"alias": "name"
}
]
}
},
{
"label": "software",
"oneToOne": {
"tableSource": {
"catalog": "onelake_iceberg",
"schema": "modern",
"table": "software"
},
"id": {
"fields": [
{
"type": "String",
"field": "id",
"alias": "id"
}
]
},
"attributes": [
{
"type": "String",
"field": "lang",
"alias": "lang"
},
{
"type": "String",
"field": "name",
"alias": "name"
}
]
}
}
],
"edges": [
{
"label": "knows",
"fromVertex": "person",
"toVertex": "person",
"tableSource": {
"catalog": "onelake_iceberg",
"schema": "modern",
"table": "knows"
},
"id": {
"fields": [
{
"type": "String",
"field": "id",
"alias": "id"
}
]
},
"fromId": {
"fields": [
{
"type": "String",
"field": "from_id",
"alias": "from_id"
}
]
},
"toId": {
"fields": [
{
"type": "String",
"field": "to_id",
"alias": "to_id"
}
]
},
"attributes": [
{
"type": "Double",
"field": "weight",
"alias": "weight"
}
]
},
{
"label": "created",
"fromVertex": "person",
"toVertex": "software",
"tableSource": {
"catalog": "onelake_iceberg",
"schema": "modern",
"table": "created"
},
"id": {
"fields": [
{
"type": "String",
"field": "id",
"alias": "id"
}
]
},
"fromId": {
"fields": [
{
"type": "String",
"field": "from_id",
"alias": "from_id"
}
]
},
"toId": {
"fields": [
{
"type": "String",
"field": "to_id",
"alias": "to_id"
}
]
},
"attributes": [
{
"type": "Double",
"field": "weight",
"alias": "weight"
}
]
}
]
}
}
schema.json
{
"catalogs": [
{
"name": "delta_hms_onelake",
"type": "deltalake",
"metastore": {
"type": "HMS",
"hiveMetastoreUrl": "thrift://hive-metastore:9083"
},
"storage": {
"type": "AzureDLS2",
"clientId": "<oauth-2.0-client-id>",
"clientSecret": "<oauth-2.0-client-secret>",
"clientEndpoint": "<oauth-2.0-token-endpoint>"
}
}
],
"graph": {
"vertices": [
{
"label": "person",
"oneToOne": {
"tableSource": {
"catalog": "delta_hms_onelake",
"schema": "modern",
"table": "person"
},
"id": {
"fields": [
{
"type": "String",
"field": "id",
"alias": "id"
}
]
},
"attributes": [
{
"type": "Int",
"field": "age",
"alias": "age"
},
{
"type": "String",
"field": "name",
"alias": "name"
}
]
}
},
{
"label": "software",
"oneToOne": {
"tableSource": {
"catalog": "delta_hms_onelake",
"schema": "modern",
"table": "software"
},
"id": {
"fields": [
{
"type": "String",
"field": "id",
"alias": "id"
}
]
},
"attributes": [
{
"type": "String",
"field": "lang",
"alias": "lang"
},
{
"type": "String",
"field": "name",
"alias": "name"
}
]
}
}
],
"edges": [
{
"label": "knows",
"fromVertex": "person",
"toVertex": "person",
"tableSource": {
"catalog": "delta_hms_onelake",
"schema": "modern",
"table": "knows"
},
"id": {
"fields": [
{
"type": "String",
"field": "id",
"alias": "id"
}
]
},
"fromId": {
"fields": [
{
"type": "String",
"field": "from_id",
"alias": "from_id"
}
]
},
"toId": {
"fields": [
{
"type": "String",
"field": "to_id",
"alias": "to_id"
}
]
},
"attributes": [
{
"type": "Double",
"field": "weight",
"alias": "weight"
}
]
},
{
"label": "created",
"fromVertex": "person",
"toVertex": "software",
"tableSource": {
"catalog": "delta_hms_onelake",
"schema": "modern",
"table": "created"
},
"id": {
"fields": [
{
"type": "String",
"field": "id",
"alias": "id"
}
]
},
"fromId": {
"fields": [
{
"type": "String",
"field": "from_id",
"alias": "from_id"
}
]
},
"toId": {
"fields": [
{
"type": "String",
"field": "to_id",
"alias": "to_id"
}
]
},
"attributes": [
{
"type": "Double",
"field": "weight",
"alias": "weight"
}
]
}
]
}
}
Log into PuppyGraph Web UI at http://localhost:8081 with username
puppygraph and password puppygraph123.

PuppyGraph Login
Upload the schema by selecting the file
schema.json in the Upload Graph Schema JSON block and clicking on Upload.

Upload Schema Page
Once the schema is uploaded, the schema page shows the visualized graph schema as follows.

Visualized Schema
Alternative: Schema Uploading via CLI
Alternatively, run the following command to upload the schema file:
curl -XPOST -H "content-type: application/json" --data-binary @./schema.json --user "puppygraph:puppygraph123" localhost:8081/schema
The response shows that the graph schema has been uploaded successfully:
Querying the Graph
In this tutorial we will use the Gremlin query language to query the Graph. Gremlin is a graph query language developed by Apache TinkerPop. Prior knowledge of Gremlin is not necessary to follow the tutorial. To learn more about it, visit https://tinkerpop.apache.org/gremlin.html.
Click on the
Query panel on the left side. The Gremlin Query tab offers an interactive environment for querying the graph using Gremlin.

Interactive Gremlin Query Page
Queries are entered on the left side, and the right side displays the graph visualization.
The first query retrieves the properties of the person named "marko".
Copy the following query, paste it in the query input, and click on the run button.
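A traversal along these lines returns marko's properties (a sketch in standard Gremlin; the exact query may differ):
g.V().has("name", "marko").valueMap()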
The output is plain text like the following:
Now let's also leverage the visualization. The next query gets all the software created by people known to "marko".
Copy the following query, paste it in the query input, and click on the run button.
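One way to express this traversal in Gremlin (again a sketch, based on the schema defined above):
g.V().has("name", "marko").out("knows").out("created").path()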
The output is as follows. There are two paths in the result as "marko" knows "josh" who created "lop" and "ripple".

Interactive Query with Results
Alternative: Querying the graph via CLI
Alternatively, we can query the graph via CLI.
Execute the following command to access the PuppyGraph Gremlin Console.
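Assuming the console ships inside the PuppyGraph container (container name puppygraph from the compose file), the command looks something like:
docker exec -it puppygraph ./bin/console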
The welcome screen appears as follows:
____ ____ _
| _ \ _ _ _ __ _ __ _ _ / ___| _ __ __ _ _ __ | |__
| |_) | | | | | | '_ \ | '_ \ | | | | | | _ | '__| / _` | | '_ \ | '_ \
| __/ | |_| | | |_) | | |_) | | |_| | | |_| | | | | (_| | | |_) | | | | |
|_| \__,_| | .__/ | .__/ \__, | \____| |_| \__,_| | .__/ |_| |_|
|_| |_| |___/ |_|
Welcome to PuppyGraph!
version: 0.10
puppy-gremlin>
Run the following queries in the console to query the Graph.
Properties of the person named "marko":
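For example, using the same sketch traversal as in the Web UI section:
puppy-gremlin> g.V().has("name", "marko").valueMap()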
To exit the PuppyGraph Gremlin Console, enter the command:
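In a standard Gremlin Console the exit command is the following (assumed to apply here as well):
:exit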
Cleaning up
Run the following command to shut down and remove the services:
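Run it from the directory containing docker-compose.yaml; add --volumes if you also want to remove named volumes such as pgdata:
docker compose down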
See Also
Please refer to Connecting to Apache Iceberg and Connecting to Delta Lake for other ways of connecting to Iceberg and Delta Lake.
Azure OneLake allows users to convert between the Iceberg and Delta Lake formats; see Use Iceberg tables with OneLake.