Schema

The PuppyGraph schema is a JSON file that defines the graph structure. It includes definitions for the catalog, node (vertex), and edge.

How to write a schema file

PuppyGraph requires the user to write a schema JSON file and upload it to PuppyGraph.

Success

You can use the schema and demo data provided by PuppyGraph for a quick demo.

To get started quickly, follow the steps below.

1. Identify the catalog (data source) of PuppyGraph

PuppyGraph is a query engine that can query data from data lakes (Hive, Iceberg, Hudi, Delta Lake) or from JDBC-supported databases. Identify the catalogs that hold the graph data for PuppyGraph.

2. Write the catalog schema

Depending on the type of data lake or JDBC database, follow the user manual guide for the corresponding catalog.

Apache Iceberg: Connecting to Apache Iceberg
Apache Hudi: Connecting to Apache Hudi
Delta Lake: Connecting to Delta Lake
MySQL: Connecting to MySQL
PostgreSQL: Connecting to PostgreSQL
For detailed specifications of the catalog schema fields, refer to Data Lake Catalog and JDBC Catalog.

Info

Using a JDBC connection may impact the performance of PuppyGraph.

3. Identify the graph nodes (vertices) and edges

Each node (vertex) or edge type must map to a single table or view in the catalog.
Each node (vertex) and edge mapping must have a primary key. PuppyGraph uses it as the ID in Gremlin queries.
Each edge table must have foreign keys or IDs that reference the "from" and "to" node (vertex) tables.
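The requirements above can be checked mechanically before a schema is uploaded. The following is a minimal sketch, assuming the schema layout used in the example on this page (`vertices`, `edges`, and `mappedTableSource.metaFields`). The function name `check_meta_fields` is hypothetical and is not part of PuppyGraph.

```python
def check_meta_fields(schema: dict) -> list[str]:
    """Return a list of problems found in the vertex and edge mappings.

    Assumes the layout shown in this page's example schema; this is an
    illustrative helper, not a PuppyGraph API.
    """
    problems = []
    # Vertices need an "id" metafield; edges need "id", "from", and "to".
    required_by_kind = (("vertices", ("id",)), ("edges", ("id", "from", "to")))
    for kind, required in required_by_kind:
        for element in schema.get(kind, []):
            label = element.get("label", "<unlabeled>")
            source = element.get("mappedTableSource", {})
            # Each vertex or edge type must map to exactly one table or view.
            if not source.get("table"):
                problems.append(f"{kind}/{label}: no mapped table")
            meta = source.get("metaFields", {})
            for field in required:
                if not meta.get(field):
                    problems.append(f"{kind}/{label}: missing metaField '{field}'")
    return problems
```

An empty list means every vertex and edge mapping names a table and supplies the ID fields it needs.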

Success

If the data source does not have a primary key or another field that can be used as the ID, you can generate an ID column from row numbers, or create a view that adds row numbers and use the view as the data source.
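The row-number approach can be sketched as follows. This example uses Python's built-in `sqlite3` module purely as a stand-in database so that it runs anywhere; the exact SQL depends on your catalog's dialect. The table `person_raw`, its columns, and the function name are hypothetical.

```python
import sqlite3


def create_person_view_with_ids(conn: sqlite3.Connection) -> None:
    """Create a view that adds a row-number ID to a table without a key."""
    # ROW_NUMBER() needs a deterministic ORDER BY so that the same row
    # gets the same ID each time the view is read.
    # Window functions require SQLite 3.25 or newer.
    conn.execute(
        """
        CREATE VIEW v_person AS
        SELECT ROW_NUMBER() OVER (ORDER BY last_name, first_name) AS id,
               first_name,
               last_name
        FROM person_raw
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person_raw (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO person_raw VALUES (?, ?)",
    [("Alan", "Turing"), ("Ada", "Lovelace")],
)
create_person_view_with_ids(conn)
rows = conn.execute(
    "SELECT id, first_name, last_name FROM v_person ORDER BY id"
).fetchall()
```

IDs generated this way stay stable only while the underlying rows and the ordering columns are unchanged, so this approach suits static or append-in-order data best.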

Success

If the data source has only edges, you can create a node (vertex) table by extracting the IDs from the edge data, or create a view that selects the IDs from the edges.
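Deriving nodes (vertices) from an edge table can be sketched in the same way. Again, `sqlite3` is only a stand-in for the real data source, and the SQL may need adjusting for your catalog. The table and column names mirror the example schema on this page, while the function name is hypothetical.

```python
import sqlite3


def create_vertex_view_from_edges(conn: sqlite3.Connection) -> None:
    """Create a vertex view containing every ID that appears on an edge."""
    # "from" and "to" are SQL keywords, so they are quoted as identifiers.
    # UNION (not UNION ALL) removes duplicate IDs.
    conn.execute(
        """
        CREATE VIEW v_person AS
        SELECT "from" AS id FROM e_person_knows_person
        UNION
        SELECT "to" AS id FROM e_person_knows_person
        """
    )


edge_conn = sqlite3.connect(":memory:")
edge_conn.execute(
    'CREATE TABLE e_person_knows_person (id INTEGER, "from" INTEGER, "to" INTEGER)'
)
edge_conn.executemany(
    "INSERT INTO e_person_knows_person VALUES (?, ?, ?)",
    [(1, 10, 20), (2, 20, 30)],
)
create_vertex_view_from_edges(edge_conn)
vertex_ids = [
    row[0] for row in edge_conn.execute("SELECT id FROM v_person ORDER BY id")
]
```

A view built this way carries only the ID, so the resulting node (vertex) type has no attributes beyond it.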

Info

Using views as PuppyGraph data source may impact PuppyGraph performance.

4. Write the schema for the nodes (vertices) and edges

Write the node (vertex) schema following the specification Node (Vertex).
Write the edge schema following the specification Edge.

Example

{
  "catalogs": [
    {
      "name": "ldbc",
      "type": "iceberg",
      "metastore": {
        "type": "glue",
        "useInstanceProfile": "false",
        "region": "us-east-1",
        "accessKey": "******",
        "secretKey": "******"
      },
      "storage": {
        "useInstanceProfile": "false",
        "region": "us-east-1",
        "accessKey": "******",
        "secretKey": "******",
        "enableSsl": "false"
      }
    }
  ],
  "vertices": [
    {
      "label": "Person",
      "mappedTableSource": {
        "catalog": "ldbc",
        "schema": "ldbc_sf1",
        "table": "v_person",
        "metaFields": {
          "id": "id"
        }
      },
      "attributes": [
        {
          "name": "creationDate",
          "type": "DateTime"
        },
        {
          "name": "firstName",
          "type": "String"
        },
        {
          "name": "lastName",
          "type": "String"
        }
      ]
    }
  ],
  "edges": [
    {
      "label": "Knows",
      "mappedTableSource": {
        "catalog": "ldbc",
        "schema": "ldbc_sf1",
        "table": "e_person_knows_person",
        "metaFields": {
          "id": "id",
          "from": "from",
          "to": "to"
        }
      },
      "from": "Person",
      "to": "Person",
      "attributes": [
        {
          "name": "creationDate",
          "type": "DateTime"
        }
      ]
    }
  ]
}

The example schema file above defines an Apache Iceberg catalog as the graph data source, with its metastore located in AWS Glue.

The schema defines a single node (vertex) type named Person. This Person data maps to the Iceberg table ldbc.ldbc_sf1.v_person. Every PuppyGraph node (vertex) mandates an id metafield, which corresponds to the id field in the Iceberg table. The Person node (vertex) has three queryable attributes: creationDate of type DateTime, and firstName and lastName, both of type String.

The schema also defines a single edge type named Knows. This Knows data maps to the Iceberg table ldbc.ldbc_sf1.e_person_knows_person. Every PuppyGraph edge requires three metafields: id, from, and to, which here map directly to the fields id, from, and to in the Iceberg table. Every PuppyGraph edge must also define the node (vertex) types it connects. In this example, Knows connects from node (vertex) Person to node (vertex) Person. The Knows edge also has one queryable attribute, creationDate, of type DateTime.
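Because each edge names the node (vertex) types it connects, a quick consistency check is to confirm that every edge's `from` and `to` label is defined under `vertices`. The sketch below assumes the layout of the example schema above. The function name `undefined_endpoints` is hypothetical, and whether PuppyGraph itself reports this condition at upload time is not covered here.

```python
import json


def undefined_endpoints(schema_text: str) -> list[str]:
    """List edge endpoints whose vertex label is not defined in the schema."""
    schema = json.loads(schema_text)
    # Collect every vertex label declared under "vertices".
    labels = {vertex["label"] for vertex in schema.get("vertices", [])}
    problems = []
    for edge in schema.get("edges", []):
        # The top-level "from"/"to" keys of an edge hold vertex labels.
        for end in ("from", "to"):
            if edge.get(end) not in labels:
                problems.append(f"{edge['label']}.{end} -> {edge.get(end)!r}")
    return problems
```

For the example schema, where Knows connects Person to Person and Person is defined, the function returns an empty list.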