Schema
The PuppyGraph schema is a JSON file that defines the graph structure. It includes definitions for catalogs, vertices, and edges.
How to write a schema file
PuppyGraph requires the user to write a schema JSON file and upload it to PuppyGraph.
Success
You can use the schema and demo data provided by PuppyGraph for a quick demo.
For a quick start, follow the steps below.
1. Identify the catalog (data source) of PuppyGraph
PuppyGraph is a query engine that can query data from data lakes (Hive, Iceberg, Hudi, Delta Lake) or other JDBC-supported databases. Identify the catalogs that hold the graph data for PuppyGraph.
2. Write the catalog schema
Depending on the type of data lake or JDBC database, follow the user manual guide for the corresponding catalog.
- Apache Iceberg: Connecting to Apache Iceberg
- Apache Hudi: Connecting to Apache Hudi
- Delta Lake: Connecting to Delta Lake
- MySQL: Connecting to MySQL
- PostgreSQL: Connecting to PostgreSQL
For detailed specifications of the catalog schema fields, refer to Data Lake Catalog and JDBC Catalog.
Info
JDBC connections may impact the performance of PuppyGraph.
3. Identify the graph vertices and edges.
- Each vertex or edge type needs to map to a single table or view from the catalog.
- Each vertex or edge mapping needs a primary key. PuppyGraph uses it as the ID for Gremlin queries.
- The edge tables need foreign keys or IDs that point to the "from" and "to" vertex tables.
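The requirements above can be checked locally before uploading a schema. The following is a minimal sketch, not part of PuppyGraph: the function name `missing_meta_fields` is hypothetical, and the field names (`mappedTableSource`, `metaFields`) are assumed to follow the example schema shown later on this page.

```python
# Required metaFields per element kind, as listed in the steps above.
REQUIRED_VERTEX_FIELDS = {"id"}
REQUIRED_EDGE_FIELDS = {"id", "from", "to"}


def missing_meta_fields(schema):
    """Return (kind, label, missing_fields) for every incomplete mapping."""
    problems = []
    for kind, required in (("vertices", REQUIRED_VERTEX_FIELDS),
                           ("edges", REQUIRED_EDGE_FIELDS)):
        for element in schema.get(kind, []):
            meta = element.get("mappedTableSource", {}).get("metaFields", {})
            missing = sorted(required - set(meta))
            if missing:
                problems.append((kind, element.get("label"), missing))
    return problems


# Hypothetical schema fragment: the edge mapping forgets its "to" column.
example = {
    "vertices": [
        {"label": "Person",
         "mappedTableSource": {"table": "v_person",
                               "metaFields": {"id": "id"}}}
    ],
    "edges": [
        {"label": "Knows",
         "mappedTableSource": {"table": "e_person_knows_person",
                               "metaFields": {"id": "id", "from": "from"}}}
    ],
}

print(missing_meta_fields(example))
```

An empty list means every vertex mapping names an ID column and every edge mapping names ID, "from", and "to" columns. The check says nothing about whether those columns actually exist in the underlying tables.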
Success
If the data source does not have a primary key or another field that can be used as the ID, you can generate an ID column from row numbers, or create a view with row numbers and use the view as the data source.
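As an illustration of the row-number approach, here is a small sketch that uses an in-memory SQLite database as a stand-in for the real data source. The table and column names (`person_raw`, `v_person`, `first_name`, `last_name`) are hypothetical, and the exact SQL for your data lake or database may differ. `ROW_NUMBER()` needs SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table with no usable key column.
    CREATE TABLE person_raw (first_name TEXT, last_name TEXT);
    INSERT INTO person_raw VALUES ('Ada', 'Lovelace'), ('Alan', 'Turing');

    -- View that adds a generated id taken from the row number.
    CREATE VIEW v_person AS
    SELECT ROW_NUMBER() OVER (ORDER BY last_name, first_name) AS id,
           first_name,
           last_name
    FROM person_raw;
""")

rows = conn.execute(
    "SELECT id, first_name, last_name FROM v_person ORDER BY id"
).fetchall()
print(rows)
```

IDs generated this way depend on the ordering in the `OVER` clause, so they can shift when rows are added or removed. Persisting the generated column in a table avoids that if stable IDs matter.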
Success
If the data source has only edges, you can create a vertex table by extracting the vertex IDs from the edge data, or create a view that selects the IDs from the edges.
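A sketch of the view-based variant, again with in-memory SQLite standing in for the real data source. The names `e_knows`, `from_id`, `to_id`, and `v_person` are hypothetical. `UNION` (rather than `UNION ALL`) removes duplicates, so each vertex ID appears once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical edge-only data: each row links two vertex IDs.
    CREATE TABLE e_knows (id INTEGER, from_id INTEGER, to_id INTEGER);
    INSERT INTO e_knows VALUES (1, 10, 20), (2, 20, 30), (3, 10, 30);

    -- Vertex view: every distinct ID that appears on either end of an edge.
    CREATE VIEW v_person AS
    SELECT from_id AS id FROM e_knows
    UNION
    SELECT to_id AS id FROM e_knows;
""")

vertex_ids = [row[0] for row in conn.execute("SELECT id FROM v_person ORDER BY id")]
print(vertex_ids)
```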
Info
Using views as a PuppyGraph data source may impact PuppyGraph performance.
4. Write the schema for the vertices and edges.
- Write the vertex schema following the Vertex specification.
- Write the edge schema following the Edge specification.
Example
{
  "catalogs": [
    {
      "name": "ldbc",
      "type": "iceberg",
      "metastore": {
        "type": "glue",
        "useInstanceProfile": "false",
        "region": "us-east-1",
        "accessKey": "******",
        "secretKey": "******"
      },
      "storage": {
        "useInstanceProfile": "false",
        "region": "us-east-1",
        "accessKey": "******",
        "secretKey": "******",
        "enableSsl": "false"
      }
    }
  ],
  "vertices": [
    {
      "label": "Person",
      "mappedTableSource": {
        "catalog": "ldbc",
        "schema": "ldbc_sf1",
        "table": "v_person",
        "metaFields": {
          "id": "id"
        }
      },
      "attributes": [
        {
          "name": "creationDate",
          "type": "DateTime"
        },
        {
          "name": "firstName",
          "type": "String"
        },
        {
          "name": "lastName",
          "type": "String"
        }
      ]
    }
  ],
  "edges": [
    {
      "label": "Knows",
      "mappedTableSource": {
        "catalog": "ldbc",
        "schema": "ldbc_sf1",
        "table": "e_person_knows_person",
        "metaFields": {
          "id": "id",
          "from": "from",
          "to": "to"
        }
      },
      "from": "Person",
      "to": "Person",
      "attributes": [
        {
          "name": "creationDate",
          "type": "DateTime"
        }
      ]
    }
  ]
}
The example schema file above defines an Apache Iceberg catalog as the graph data source, with its metastore located in AWS Glue.
The schema defines a single vertex type named Person. This Person data maps to the Iceberg table ldbc.ldbc_sf1.v_person. Every PuppyGraph vertex mandates an id metafield, which corresponds to the id field in the Iceberg table. The Person vertex has three queryable attributes: creationDate of type DateTime, and firstName and lastName, both of type String.
The schema also defines a single edge type named Knows. This Knows data maps to the Iceberg table ldbc.ldbc_sf1.e_person_knows_person. Every PuppyGraph edge requires three metafields: id, from, and to, which map directly to the fields id, from, and to in the Iceberg table. Every PuppyGraph edge also needs to define the vertex types it connects. In this example, Knows connects from vertex Person to vertex Person. The Knows edge also has one queryable attribute, creationDate, of type DateTime.
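The cross-references described above (each mapping names a catalog, and each edge names the vertex types it connects) can be sanity-checked with a few lines of Python before uploading. This is a hedged sketch, not PuppyGraph's own validation: the helper `dangling_references` is hypothetical, the rule that every referenced catalog must be listed under "catalogs" is an assumption drawn from the example, and the catalog entry is trimmed for brevity.

```python
import json


def dangling_references(schema):
    """Return messages for catalogs or vertex labels that are used but not defined."""
    catalogs = {c["name"] for c in schema.get("catalogs", [])}
    vertex_labels = {v["label"] for v in schema.get("vertices", [])}
    problems = []
    # Every vertex and edge mapping should point at a catalog from "catalogs".
    for element in schema.get("vertices", []) + schema.get("edges", []):
        catalog = element["mappedTableSource"]["catalog"]
        if catalog not in catalogs:
            problems.append(f"{element['label']}: unknown catalog '{catalog}'")
    # Every edge endpoint should name a vertex label from "vertices".
    for edge in schema.get("edges", []):
        for end in ("from", "to"):
            if edge[end] not in vertex_labels:
                problems.append(
                    f"{edge['label']}: '{end}' names undefined vertex '{edge[end]}'"
                )
    return problems


# Trimmed copy of the example schema (metastore and storage settings omitted).
schema = json.loads("""
{
  "catalogs": [{"name": "ldbc", "type": "iceberg"}],
  "vertices": [
    {"label": "Person",
     "mappedTableSource": {"catalog": "ldbc", "schema": "ldbc_sf1",
                           "table": "v_person",
                           "metaFields": {"id": "id"}}}
  ],
  "edges": [
    {"label": "Knows",
     "mappedTableSource": {"catalog": "ldbc", "schema": "ldbc_sf1",
                           "table": "e_person_knows_person",
                           "metaFields": {"id": "id", "from": "from", "to": "to"}},
     "from": "Person", "to": "Person"}
  ]
}
""")

ok_result = dangling_references(schema)

# Break one reference to show what a report looks like.
schema["edges"][0]["to"] = "Company"
broken_result = dangling_references(schema)
print(ok_result, broken_result)
```

For the unmodified example the list is empty. After the "to" endpoint is changed to a label that no vertex defines, the function reports that one edge.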