Data Sources and Local Tables

Every node, edge, and local table in a PuppyGraph schema declares where its rows come from through a dataSourceGroup. A data source group offers three source options (point at an external catalog table, read from a locally cached table, or combine multiple tables), plus an optional column-to-graph-field mapping for when source columns don't line up with graph fields.

Each section below shows the corresponding Schema Builder UI flow alongside the JSON.

Picking a source

Option	Use it when
`externalDataSource`	The simplest and most common case: rows live in a catalog table and you query them directly.
`localDataSource`	You want PuppyGraph to cache the data on its compute nodes for faster reads. The cache is defined as a local table.
`unionDataSource`	A single node or edge type spans multiple tables, for example a foreign key referenced from several tables, or one logical type partitioned across sources.

A dataSourceGroup can have all three sub-blocks defined at once, but only one is enabled at a time. This makes it easy to switch a node from external to local (or back) by toggling flags rather than rewriting the schema.
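
As a sketch of the toggle described above, the fragment below defines both an external and a local source on the same element, with only the local one enabled. It reuses the postgres_data catalog and person_local names from the examples on this page. Whether a disabled block must carry every field is not stated here, so treat the exact shape as an assumption.

```json
"dataSourceGroup": {
  "externalDataSource": {
    "enabled": false,
    "catalog": "postgres_data",
    "schema": "modern",
    "table": "person"
  },
  "localDataSource": {
    "enabled": true,
    "localTableName": "person_local"
  }
}
```

Flipping the two enabled flags switches the element back to the external table without touching the rest of the schema.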

Pointing at a catalog table

This is the default for new nodes and edges. Rows are read directly from the source table at query time.

When you add a node or edge from a catalog table and choose External only in the local-replication step, the data source group has just an externalDataSource.

"dataSourceGroup": {
  "externalDataSource": {
    "enabled": true,
    "catalog": "postgres_data",
    "schema": "modern",
    "table": "person"
  }
}

Field	Description
`enabled`	Whether to read from this source.
`catalog`	Catalog name, as defined in the schema's `catalog[]`. See Connecting for catalog setup.
`schema`	Schema (database) name within the catalog.
`table`	Table name within the schema.
`mappedField`	Optional source-to-graph field map. See Mapping source columns to graph fields.
`whereClause`	Optional filter expression that filters source rows before they become graph elements. See Modeling denormalized tables.
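
A minimal sketch of an external source with a filter. The whereClause value here assumes a SQL-style predicate over source columns, and the age column is borrowed from the person table used elsewhere on this page. Check Modeling denormalized tables for the exact expression syntax.

```json
"dataSourceGroup": {
  "externalDataSource": {
    "enabled": true,
    "catalog": "postgres_data",
    "schema": "modern",
    "table": "person",
    "whereClause": "age >= 18"
  }
}
```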

Reading from a local table

Use this when you want PuppyGraph to cache the source data and serve queries from the cache.

There are two ways to wire a node or edge to a local table:

Choose Cache and switch in the local-replication step when adding a node or edge from a catalog table. The wizard creates the local table behind the scenes, loads it, and switches the data source over.
Choose Add as Node or Add as Edge on a local table in the catalog tree (local tables appear in their own section at the top). The new element reads from that local table directly.

"dataSourceGroup": {
  "localDataSource": {
    "enabled": true,
    "localTableName": "person_local"
  }
}

Field	Description
`enabled`	Whether to read from this source.
`localTableName`	Name of a local table defined in the schema's `localTable[]`.
`mappedField`	Optional source-to-graph field map. See Mapping source columns to graph fields.
`whereClause`	Optional filter expression that filters local-table rows before they become graph elements. See Modeling denormalized tables.

The local table itself is defined separately, under the schema's top-level localTable[] array. See Defining a local table.

Combining tables with a union

A union data source treats multiple input tables as a single source. Each input is a (catalog, schema, table) tuple, and dedupKey[] lists the graph fields used as GROUP BY keys to deduplicate across inputs.

In the Add Node or Add Edge wizard, click Add Source Table to add another input. Configure each input's column-to-field mappings (drag, rename, retype) just as you would for a single source. The wizard merges the column lists across inputs.

"dataSourceGroup": {
  "unionDataSource": {
    "enabled": true,
    "input": [
      {
        "catalog": "postgres_data",
        "schema":  "modern",
        "table":   "person_a",
        "mappedField": [
          { "sourceFieldName": "user_id", "targetFieldName": "id"   },
          { "sourceFieldName": "user_nm", "targetFieldName": "name" }
        ]
      },
      {
        "catalog": "postgres_data",
        "schema":  "modern",
        "table":   "person_b",
        "mappedField": [
          { "sourceFieldName": "id",   "targetFieldName": "id"   },
          { "sourceFieldName": "name", "targetFieldName": "name" }
        ]
      }
    ],
    "dedupKey": ["id"]
  }
}

Field	Description
`enabled`	Whether to read from this source.
`input[]`	One entry per input table: `catalog`, `schema`, `table`, optional `mappedField[]`, and optional `whereClause`.
`dedupKey[]`	Target field names to use as `GROUP BY` keys. Other target fields are aggregated.

Use a union when the same conceptual entity appears in more than one table and you want to query it as a single type. For example, a Person node whose ID can appear in any of several event tables, with no dedicated person table of its own, can be modeled as a union over those event tables.
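
The Person case could be sketched roughly as follows. The table names login_event and purchase_event, and their columns user_id and buyer_id, are hypothetical and chosen only for illustration. Each input maps its own ID column onto the shared id field, and dedupKey[] collapses repeated IDs into one Person.

```json
"dataSourceGroup": {
  "unionDataSource": {
    "enabled": true,
    "input": [
      {
        "catalog": "postgres_data",
        "schema":  "modern",
        "table":   "login_event",
        "mappedField": [
          { "sourceFieldName": "user_id", "targetFieldName": "id" }
        ]
      },
      {
        "catalog": "postgres_data",
        "schema":  "modern",
        "table":   "purchase_event",
        "mappedField": [
          { "sourceFieldName": "buyer_id", "targetFieldName": "id" }
        ]
      }
    ],
    "dedupKey": ["id"]
  }
}
```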

Mapping source columns to graph fields

When source column names don't match the graph field names you want to expose, use mappedField[] to translate. Each entry has a sourceFieldName (column in the source table) and a targetFieldName (the graph field, such as an id, attribute, fromKey, or toKey column on the node, edge, or local table).

In the column wizard, click any column name to rename it. The new name becomes the graph field. The wizard tracks both the original and new names and emits the corresponding mappedField[] entries automatically.

"mappedField": [
  { "sourceFieldName": "user_id", "targetFieldName": "id"   },
  { "sourceFieldName": "user_nm", "targetFieldName": "name" }
]

mappedField[] is available on externalDataSource, localDataSource, and each unionDataSource.input[].
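
For instance, a local data source can carry its own mapping. This sketch assumes a hypothetical local table named user_local whose columns are user_id and user_nm. Neither the table nor its columns are defined on this page.

```json
"dataSourceGroup": {
  "localDataSource": {
    "enabled": true,
    "localTableName": "user_local",
    "mappedField": [
      { "sourceFieldName": "user_id", "targetFieldName": "id"   },
      { "sourceFieldName": "user_nm", "targetFieldName": "name" }
    ]
  }
}
```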

Defining a local table

A local table caches rows from one or more catalog tables onto PuppyGraph's compute nodes. Once loaded, queries against any node or edge backed by the local table read from the cache instead of the external source. Reads from the cache are typically faster and place no query-time load on the external system.

A local table is defined in the schema's top-level localTable[] array and referenced by name from a localDataSource.

A minimal local table

To create a local table inline, choose Cache and switch or Cache only in the local-replication step when adding a node or edge. To define one explicitly, click any catalog table in the tree and choose Add as Local Table. Either path produces the same localTable[] entry.

"localTable": [
  {
    "name": "person_local",
    "dataSourceGroup": {
      "externalDataSource": {
        "enabled": true,
        "catalog": "postgres_data",
        "schema": "modern",
        "table": "person"
      }
    },
    "column": [
      { "name": "id",   "type": "STRING" },
      { "name": "name", "type": "STRING" },
      { "name": "age",  "type": "INT" }
    ]
  }
]

A node or edge then reads the local table by name:

"dataSourceGroup": {
  "localDataSource": {
    "enabled": true,
    "localTableName": "person_local"
  }
}

Local table fields

Field	Description
`name`	Local table name. Referenced by `localDataSource.localTableName`. Required.
`dataSourceGroup`	The external source to load data from. Required for tables that load from a catalog.
`column[]`	Columns in the local table. Required.
`indexColumn[]`	Columns to index for faster lookups. Optional.
`distributeBy[]`	Distribution across compute nodes. See Distribution. Optional.
`orderBy[]`	Sort order used for local-table data layout. Optional.
`partitionBy`	Partition configuration. See Partitioning. Optional.
`replicationNum`	Number of replicas across compute nodes. Defaults to the cluster setting. Optional.
`storageMedium`	`"SSD"` or `"HDD"`. Optional.
`keyType`	`DUPLICATE_KEY` (default), `PRIMARY_KEY`, or `UNIQUE_KEY`. See Key types. Optional.
`enableColocate`, `colocateKey`	Colocation configuration. Optional.
`extraProperties`	Additional engine properties as a map. Optional.

The Schema Builder configures the most common fields (name, columns, data source). The remaining fields (distribution, partitioning, key types, colocation) are typically set by editing the JSON directly.

Distribution

distributeBy[] controls how data is hashed across compute nodes:

"distributeBy": [
  { "bucket": { "num": 8, "column": ["id"] } }
]

num is the number of buckets, and column[] is the hash key. Choose a column with high cardinality and even value distribution to avoid hot spots.
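
To show where distributeBy[] sits, here is the person_local table from A minimal local table with a bucket distribution added as a sibling of column[]. The bucket count of 8 is carried over from the fragment above and is illustrative, not a sizing recommendation.

```json
"localTable": [
  {
    "name": "person_local",
    "dataSourceGroup": {
      "externalDataSource": {
        "enabled": true,
        "catalog": "postgres_data",
        "schema": "modern",
        "table": "person"
      }
    },
    "column": [
      { "name": "id",   "type": "STRING" },
      { "name": "name", "type": "STRING" },
      { "name": "age",  "type": "INT" }
    ],
    "distributeBy": [
      { "bucket": { "num": 8, "column": ["id"] } }
    ]
  }
]
```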

Partitioning

partitionBy partitions a local table by one or more time-derived columns. This enables partition pruning at query time and partition-level retention.

"partitionBy": {
  "partitionColumn": [
    {
      "column": "event_time",
      "partitionTimeUnit": "DAY",
      "partitionInterval": 1
    }
  ],
  "partitionRetentionCount": 90
}

Field	Description
`partitionColumn[].column`	Column to partition by.
`partitionColumn[].partitionTimeUnit`	One of `YEAR`, `MONTH`, `DAY`, `HOUR`, `MINUTE`.
`partitionColumn[].partitionInterval`	Interval multiplier. For example, `10` with `MINUTE` produces 10-minute partitions. Default `1`.
`partitionColumn[].timestampScaleFactor`	If the column is a numeric Unix timestamp, divide by this scale factor first. Optional.
`partitionColumn[].dateFormat`	If the column is a string, parse it with `str_to_date(col, dateFormat)` first. Optional.
`partitionRetentionCount`	Keep only this many most-recent partitions. Older partitions are dropped. Optional.
`partitionRetentionCondition`	Alternative to count-based retention: a condition expression. Optional.

Use partitionRetentionCount or partitionRetentionCondition, not both.
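
As a worked variation, the sketch below partitions on a numeric timestamp in 10-minute steps. It assumes a hypothetical event_time_ms column holding Unix time in milliseconds, and that a timestampScaleFactor of 1000 is the right divisor to bring it to seconds. Verify the expected unit before relying on it. With 10-minute partitions there are 6 per hour, so a retention count of 144 keeps roughly the most recent 24 hours.

```json
"partitionBy": {
  "partitionColumn": [
    {
      "column": "event_time_ms",
      "partitionTimeUnit": "MINUTE",
      "partitionInterval": 10,
      "timestampScaleFactor": 1000
    }
  ],
  "partitionRetentionCount": 144
}
```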

Key types

keyType controls update semantics:

Key type	Behavior
`DUPLICATE_KEY` (default)	Append-only. Duplicate rows are kept. No `UPDATE` or `DELETE`.
`PRIMARY_KEY`	Supports `UPDATE` and `DELETE` of rows identified by the key columns.
`UNIQUE_KEY`	Upsert: a row with an existing key replaces the previous row.

For most analytics workloads, DUPLICATE_KEY is the right choice. Use PRIMARY_KEY or UNIQUE_KEY when you need to mutate rows in place.

Loading data

Loading data into a local table is triggered from the Local Table Management page in the Web UI, or as part of the cache-and-switch or cache-only flow when adding a node or edge in the Schema Builder. See Managing the Graph for the management UI and automation APIs.