
Schema

The PuppyGraph schema is a JSON file that defines the graph structure. It contains definitions for the catalogs, vertices, and edges.

How to write a schema file

PuppyGraph requires the user to provide a schema JSON file and upload it to PuppyGraph.
For a quick demo, you can use the schema and demo data provided by PuppyGraph.
For a quick start, follow the steps below.

1. Identify the catalog (data source) of PuppyGraph

PuppyGraph is a query engine that can query data from data lakes (Hive, Iceberg, Hudi, Delta Lake) or other JDBC-supported databases. Identify the catalogs that contain the graph data for PuppyGraph.

2. Write the catalog schema

Based on the type of data lake or JDBC database, follow the user manual guide for each catalog.
  1. Apache Iceberg: Connecting to Apache Iceberg
For detailed specifications of the catalog fields, refer to Data Lake Catalog and JDBC Catalog.
A JDBC connection may impact the performance of PuppyGraph.

3. Identify the graph vertices and edges

  1. Each vertex or edge type must map to a single table or view in the catalog.
  2. Each vertex and edge table must have a primary key. PuppyGraph uses it as the ID in Gremlin queries.
  3. Each edge table must have foreign keys or IDs that reference the "from" and "to" vertex tables.
If the data source has no primary key or other field that can serve as the ID, you can generate an ID column from row numbers, or create a view with row numbers and use the view as the data source.
If the data source contains only edges, you can create a vertex table by extracting the vertex IDs from the edge data, or create a view that selects the IDs from the edges.
Using views as a PuppyGraph data source may impact PuppyGraph performance.
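
The two workarounds above can be illustrated on in-memory rows. The Python sketch below only shows the logic; in practice the ID column, table, or view would be created in the data source itself. The row layout and the helper names `add_row_number_ids` and `vertex_ids_from_edges` are hypothetical and not part of PuppyGraph.

```python
from typing import Dict, Iterable, List


def add_row_number_ids(rows: Iterable[Dict], id_field: str = "id") -> List[Dict]:
    """Return copies of the rows with a 1-based row number stored as the ID."""
    return [{**row, id_field: n} for n, row in enumerate(rows, start=1)]


def vertex_ids_from_edges(edges: Iterable[Dict],
                          from_field: str = "from",
                          to_field: str = "to") -> List:
    """Collect the distinct vertex IDs referenced by the edges, in first-seen order."""
    seen = {}
    for edge in edges:
        seen.setdefault(edge[from_field], None)
        seen.setdefault(edge[to_field], None)
    return list(seen)


# Hypothetical edge rows that have no ID column of their own.
edges = [
    {"from": 10, "to": 20},
    {"from": 20, "to": 30},
    {"from": 10, "to": 30},
]

# Workaround 1: generate an ID for each edge from its row number.
edges_with_ids = add_row_number_ids(edges)

# Workaround 2: derive the vertex rows from the IDs that appear in the edges.
vertex_rows = [{"id": v} for v in vertex_ids_from_edges(edges)]
```

Row-number IDs are only stable while the row order is stable, so this approach fits best when the underlying data is loaded in a fixed order.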

4. Write the schema for the vertices and edges

  1. Write the vertex schema following the Vertex specification.
  2. Write the edge schema following the Edge specification.

Example

{
  "catalogs": [
    {
      "name": "ldbc",
      "type": "iceberg",
      "metastore": {
        "type": "glue",
        "useInstanceProfile": "false",
        "region": "us-east-1",
        "accessKey": "******",
        "secretKey": "******"
      },
      "storage": {
        "useInstanceProfile": "false",
        "region": "us-east-1",
        "accessKey": "******",
        "secretKey": "******",
        "enableSsl": "false"
      }
    }
  ],
  "vertices": [
    {
      "label": "Person",
      "mappedTableSource": {
        "catalog": "ldbc",
        "schema": "ldbc_sf1",
        "table": "v_person",
        "metaFields": {
          "id": "id"
        }
      },
      "attributes": [
        {
          "name": "creationDate",
          "type": "DateTime"
        },
        {
          "name": "firstName",
          "type": "String"
        },
        {
          "name": "lastName",
          "type": "String"
        }
      ]
    }
  ],
  "edges": [
    {
      "label": "Knows",
      "mappedTableSource": {
        "catalog": "ldbc",
        "schema": "ldbc_sf1",
        "table": "e_person_knows_person",
        "metaFields": {
          "id": "id",
          "from": "from",
          "to": "to"
        }
      },
      "from": "Person",
      "to": "Person",
      "attributes": [
        {
          "name": "creationDate",
          "type": "DateTime"
        }
      ]
    }
  ]
}
The example schema file above defines an Apache Iceberg catalog as the graph data source, with its metastore located in AWS Glue.
The schema defines a single vertex type named Person. This Person data maps to the Iceberg table ldbc.ldbc_sf1.v_person. Every PuppyGraph vertex requires an id metafield, which here corresponds to the id field in the Iceberg table. The Person vertex has three queryable attributes: creationDate of type DateTime, and firstName and lastName, both of type String.
The schema also defines a single edge type named Knows. This Knows data maps to the Iceberg table ldbc.ldbc_sf1.e_person_knows_person. Every PuppyGraph edge requires three metafields: id, from, and to, which here map directly to the fields id, from, and to in the Iceberg table. Every PuppyGraph edge must also define the vertex types it connects; in this example, Knows connects from vertex Person to vertex Person. The Knows edge also has one queryable attribute, creationDate, of type DateTime.
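
Before uploading a schema, it can help to check the rules described above locally. The following Python sketch is not PuppyGraph's own validation. The function `check_schema` is a hypothetical helper that checks only what this page states: each vertex needs an id metafield, each edge needs id, from, and to metafields, and each edge must connect vertex labels defined in the same file. It also checks that every mapped table refers to a catalog named in the file, which is an assumption about how the catalog field is meant to be used. The embedded schema is a trimmed copy of the example, with the catalog credentials omitted.

```python
import json
from typing import Dict, List


def check_schema(schema: Dict) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    catalogs = {c["name"] for c in schema.get("catalogs", [])}
    vertex_labels = {v["label"] for v in schema.get("vertices", [])}

    def check_source(kind, element, required_meta):
        # The mapped table must use a known catalog and declare its metafields.
        source = element.get("mappedTableSource", {})
        if source.get("catalog") not in catalogs:
            problems.append(
                f"{kind} {element['label']}: unknown catalog {source.get('catalog')!r}")
        meta = source.get("metaFields", {})
        for field in required_meta:
            if field not in meta:
                problems.append(
                    f"{kind} {element['label']}: missing metafield {field!r}")

    for vertex in schema.get("vertices", []):
        check_source("vertex", vertex, ["id"])

    for edge in schema.get("edges", []):
        check_source("edge", edge, ["id", "from", "to"])
        # Both ends of the edge must name a vertex label defined in the file.
        for end in ("from", "to"):
            if edge.get(end) not in vertex_labels:
                problems.append(
                    f"edge {edge['label']}: {end!r} refers to undefined vertex {edge.get(end)!r}")

    return problems


schema = json.loads("""
{
  "catalogs": [{"name": "ldbc", "type": "iceberg"}],
  "vertices": [{
    "label": "Person",
    "mappedTableSource": {
      "catalog": "ldbc", "schema": "ldbc_sf1", "table": "v_person",
      "metaFields": {"id": "id"}
    }
  }],
  "edges": [{
    "label": "Knows",
    "mappedTableSource": {
      "catalog": "ldbc", "schema": "ldbc_sf1", "table": "e_person_knows_person",
      "metaFields": {"id": "id", "from": "from", "to": "to"}
    },
    "from": "Person",
    "to": "Person"
  }]
}
""")

problems = check_schema(schema)
print(problems or "schema passes the basic checks")
```

A check like this catches mislabeled edge endpoints and missing metafields early, but it cannot confirm that the referenced tables and columns actually exist in the catalog.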