Cluster Deployment
Overview
This article describes how to deploy PuppyGraph in a cluster mode with two main focuses:
- Horizontal scalability: Getting better performance by adding more nodes.
- High availability: Ensuring that the system remains operational even if some nodes fail.
The cluster deployment in this documentation is based on Kubernetes, a popular container orchestration platform, but any other orchestration platform can be used as well.
Cluster Architecture
A PuppyGraph cluster consists of two types of nodes: leader nodes and compute nodes.
For production deployments, it is recommended to have at least three (3) leader nodes and at least three (3) compute nodes.
Leader Nodes
Leader nodes are responsible for managing the cluster and coordinating the compute nodes. They handle the leader election process, monitor the health of the cluster, and manage the cluster state.
Compute Nodes
Compute nodes are responsible for executing the queries and processing the data. They communicate with the leader nodes to get the cluster state and execute the queries.
Setting up Leader Nodes
Configuration of Leader Nodes
Basic Leader Environment Variables
These environment variables are required for a single leader node to work correctly.
| Variable Name | Description |
|---|---|
| `LEADER` | Identifies the node as a leader. Set to `true` for leader nodes. |
| `COMPUTE_NODE_COUNT` | Number of compute nodes in the cluster. Used by the leader to determine if the cluster is ready to serve requests. |
| `PRIORITY_IP_CIDR` | IP range for priority IPs for leader nodes. Should match the Kubernetes cluster CIDR. |
| `STORAGE_PATH_ROOT` | Root path for persistent storage. |
Leader Cluster Environment Variables
These environment variables are required for leader nodes to work correctly in a cluster.
| Variable Name | Description |
|---|---|
| `POD_NAME` | Name of the current leader node. Used by PuppyGraph to detect when to start a new cluster or wait for others. |
| `CLUSTER_ID` | Specifies the cluster ID. All leader nodes in the same cluster must use the same ID to avoid conflicts. |
| `LEADER_SERVICE_ADDRESS` | Address or name of the leader service. |
| `LEADER_NODE_COUNT` | Number of leader nodes in the cluster. Defines the minimum number required to form a cluster and continue startup. |
| `LEADER_NODE_DOMAIN_NAME` | Domain name of the current leader node. Allows leader nodes to communicate with each other. |
| `AUTHENTICATION_JWT_SECRETKEY` | Secret key for JWT authentication. It must be 16, 24, or 32 bytes, selecting AES-128, AES-192, or AES-256 modes respectively; a 32-byte key is recommended. |
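One way to produce a 32-character key for `AUTHENTICATION_JWT_SECRETKEY` is with `openssl` (this assumes the key is read as a plain ASCII string, so 32 characters correspond to 32 bytes):

```shell
# Generate a random 32-character hex string (32 bytes of ASCII) suitable for AES-256.
openssl rand -hex 16
```

Store the generated value in a Secret rather than committing it to version control.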
Additional Cluster Startup Timeout Configuration
These environment variables are used to configure the startup process of the leader nodes. They define the timeouts for various stages of the startup process.
| Variable Name | Description |
|---|---|
| `CLUSTER_STARTUPTIMEOUT` | Max time the startup process of a leader node can take before being considered failed. |
| `CLUSTER_POTENTIALLEADERTIMEOUT` | Max time a leader node can take to discover other potential leaders before starting a new cluster. |
| `CLUSTER_OTHERLEADERTIMEOUT` | Max time a leader node waits for the leader election to complete before it's considered failed. Should be larger than `CLUSTER_POTENTIALLEADERTIMEOUT`. |
Leader Service
A leader service needs to be configured to allow leader nodes to communicate with each other.
The API service port (default 8081) needs to be exposed for this purpose.
It can also be configured to allow clients to connect to any leader nodes.
In this case, the Gremlin service port (default 8182) or openCypher service port over Bolt (default 7687) needs to be exposed if clients need to connect to the leader nodes directly.
Example Leader Service Configuration
```yaml
apiVersion: v1
kind: Service
metadata:
  name: puppygraph-leader-service
spec:
  clusterIP: None
  selector:
    app: puppygraph-leader
  ports:
    - name: rest
      protocol: TCP
      port: 8081
      targetPort: 8081
    - name: gremlin
      protocol: TCP
      port: 8182
      targetPort: 8182
    - name: cypher
      protocol: TCP
      port: 7687
      targetPort: 7687
```
Leader Nodes Startup Process
When a leader node starts, it will attempt to discover existing leader nodes through the leader service to check if a cluster is already formed. If it finds other leader nodes, it will register itself with the cluster and join it.
If a leader node cannot find any other leader nodes within the timeout defined by `CLUSTER_POTENTIALLEADERTIMEOUT` (30 seconds by default), it will initiate a new cluster only if it is the first leader node to start. Eligibility to initiate a new cluster is determined by the `POD_NAME` environment variable: only the leader node whose `POD_NAME` ends with `-0` is considered the first leader node to start. This prevents multiple leader nodes from starting new clusters at the same time and causing a split-brain situation.
The cluster will then wait for a certain number of leader and compute nodes to join. The number of leader nodes required to form a cluster is defined by the `LEADER_NODE_COUNT` environment variable, and the number of compute nodes by the `COMPUTE_NODE_COUNT` environment variable. Once the required numbers of leader and compute nodes are registered, the cluster will begin serving requests.
The whole leader startup process has a timeout defined by `CLUSTER_STARTUPTIMEOUT` (24 hours by default). If a leader node cannot finish the startup process within this timeout, its startup fails and it must be restarted manually.
Readiness Probe
A readiness probe needs to be configured to ensure that the leader nodes are ready to communicate with the rest of the cluster on startup.
The default readiness probe for the leader nodes checks that the /healthz path on port 8081 of the PuppyGraph server returns a 200 status code. On each leader startup, it will use the leader service to discover any existing cluster members in order to decide whether it should join the existing cluster or start a new one.
Example Readiness Probe Configuration
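The default probe described above can be expressed as the following pod spec fragment; it matches the probe used in the full example configuration at the end of this article:

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```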
Setting up Compute Nodes
Configuration of Compute Nodes
| Environment Variable | Description |
|---|---|
| `COMPUTE_NODE` | Indicates that the node is a compute node. Set to `true`. |
| `COMPUTE_NODE_LEADER_ADDRESS` | Domain name of the leader service. Used by compute nodes to connect and join the cluster. |
| `COMPUTE_NODE_DOMAIN_NAME` | Domain name of the compute node. Used by the leader node to identify and communicate with the node. |
| `PRIORITY_IP_CIDR` | IP range for priority IPs. Should match the Kubernetes cluster CIDR. |
| `STORAGE_PATH_ROOT` | Root path for persistent storage. |
| `SCRATCH_PATH_ROOT` | Root path for temporary storage. |
Compute Node Startup Process
When a compute node starts, it will attempt to discover the leader nodes through the leader service using COMPUTE_NODE_LEADER_ADDRESS. If it finds any leader nodes, it will register itself with the cluster and join it.
Leader nodes will then use the registered address to communicate with the compute nodes.
Cluster Management
High Availability (HA)
PuppyGraph's high availability (HA) consists of two parts: Leader nodes HA and Compute nodes HA.
Leader Nodes HA
Leader nodes consist of one leader and multiple followers. Followers continuously synchronize metadata with the leader and monitor each other's health. When the leader node becomes unavailable, followers automatically elect a new leader among themselves, ensuring continuous availability of cluster management.
Leader nodes use the 2n + 1 deployment formula for HA, where n represents the number of allowable failed nodes. For example, to tolerate 1 failed leader node, at least 3 leader nodes are required; to tolerate 2 failed leader nodes, at least 5 leader nodes are required, and so on.
The minimum required leader and compute nodes are configured by environment variables:
| Environment Variable | Description | Default |
|---|---|---|
| `LEADER_NODE_COUNT` | Minimum number of leader nodes required to form a cluster | 3 |
| `COMPUTE_NODE_COUNT` | Minimum number of compute nodes required to form a cluster | 3 |
Enabling HA
To enable high availability, LEADER_NODE_COUNT must be set to 3 or more. Set it to 1 only for single-leader (non-HA) deployments. Note that a count of 3 is typically sufficient for HA, as the actual resilience of your services is determined by the number of replicas rather than the environment variable LEADER_NODE_COUNT. For production deployments, it is recommended to have at least 3 leader nodes to provide high availability.
The leader nodes store the cluster information in the mounted STORAGE_PATH_ROOT. If a leader node is restarted with existing cluster information, it will try to re-register itself with the cluster.
Compute Nodes HA
Compute nodes HA is implemented through multi-replica data storage, configured by the environment variable DATAACCESS_DATA_REPLICATIONNUM with a default value of 3. This means each data block is stored as 3 replicas on different compute nodes.
Leader nodes continuously monitor compute nodes. When a compute node is unavailable, the leader nodes will detect it and mark it as unhealthy. When a compute node becomes unavailable, data replicas on other nodes continue serving queries. The cluster automatically detects the replica count and restores it to the configured value through data replication.
After a compute node is restarted, it will try to register itself to the leader nodes again. It will also report the existing data state to the leader nodes, and the leader nodes will accordingly update the data state.
Replication vs HA Nodes
The replication factor (DATAACCESS_DATA_REPLICATIONNUM) is independent from the number of HA nodes. For example, you can have 5 compute nodes with a replication factor of 3.
Persistent Volumes and Local Cache
Leader Nodes
For leader nodes, persistent volumes are used to store cluster metadata. If a leader node pod is deleted and recreated, it will retain its metadata from the persistent volume as long as the volume itself is not deleted.
Compute Nodes
For compute nodes, persistent volumes are used to store the local cache data.
When you delete a compute node via the Web UI, the local cache data on that node is automatically migrated to other compute nodes before removal. This ensures data integrity and maintains the replica count specified by DATAACCESS_DATA_REPLICATIONNUM.
If a compute node fails unexpectedly (not deleted via Web UI), the cluster automatically detects the replica count mismatch and redistributes data replicas to other available compute nodes to restore the configured replication factor.
Cluster Stop and Restart
Stopping the Cluster
If you are using the Helm chart, stopping the cluster is done by uninstalling the Helm release.
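As a sketch, assuming a release named `puppygraph` installed in namespace `puppygraph` (both hypothetical names):

```shell
# Uninstall the Helm release; persistent volumes are left in place.
helm uninstall puppygraph -n puppygraph
```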
Persistent Volumes
Uninstalling the Helm release does not delete the associated persistent volumes. The metadata and local cache data remain available for the next cluster restart.
If you no longer need the persistent volumes, you can manually delete them.
Restarting the Cluster
To restart the cluster, reinstall the Helm chart with your configuration.
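For example, assuming a hypothetical chart reference `puppygraph/puppygraph-cluster` and your existing values file:

```shell
# Reinstall the chart with the same values to restart the cluster.
helm install puppygraph puppygraph/puppygraph-cluster -f values.yaml -n puppygraph
```

Because the persistent volumes survived the uninstall, leader metadata and compute local caches are reused on startup.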
Restarting a Single Leader/Compute Node
Sometimes you may need to restart a single leader/compute node, for example, if it is stuck. You can do this by deleting the corresponding pod using kubectl, and Kubernetes will automatically recreate it.
To delete the persistent volumes as well, you can run kubectl delete pvc followed by kubectl delete pod. This ensures the associated PV is removed after the pod terminates, allowing the pod to start with a clean volume upon restart.
```shell
# Delete PVC first (won't actually be removed until the pod releases the volume)
kubectl delete pvc <pvc-name> -n <namespace>

# Then delete Pod to trigger recreation with a fresh PV
kubectl delete pod <pod-name> -n <namespace>
```
Cluster Scaling
Cluster scaling consists of two parts: scaling up (adding nodes) and scaling down (removing nodes).
Scaling Up
For multi-leader deployments, to scale up leader or compute nodes, directly modify the replicas in the corresponding StatefulSet and apply changes.
For compute node scaling, the persistent local cache is not affected. New compute nodes automatically rebalance data upon startup.
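Using names from the example configuration at the end of this article, scaling compute nodes from 3 to 5 might look like this (the namespace is a placeholder):

```shell
# Increase the compute StatefulSet replica count; new pods join and rebalance automatically.
kubectl scale statefulset puppygraph-compute --replicas=5 -n <namespace>
```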
Single Leader to HA Upgrade
If you are currently running a single leader deployment and want to scale up leader nodes to enable HA, the process is different:
1. First, set `LEADER_NODE_COUNT` to 3 and apply changes.
2. Then, scale up by setting `replicas` and apply changes again.

Note: This process involves downtime until the multi-leader deployment is complete.
HA Scaling
In HA deployments, scaling up or down has no downtime and services remain available throughout.
After applying changes, you can verify pod status (for example, with `kubectl get pods`) and confirm in the Web UI.
Scaling Down
Scaling down requires separate procedures for compute nodes and leader nodes.
Scaling Down Compute Nodes
1. Delete compute node(s) with the highest index via the Web UI.
    - Navigate to the cluster management page in the Web UI.
    - Select the compute node(s) with the highest index (e.g., delete `puppygraph-compute-3` and `puppygraph-compute-4` when scaling from 5 to 3 replicas, considering 0-based indexing).
    - Click delete to remove the nodes.
2. Wait for deletion and local cache data transfer to complete.
    - Refresh the Web UI to verify the nodes have been removed.
    - Note that this step takes relatively longer, as local cache data automatically migrates to the remaining nodes.
3. Update StatefulSet replicas.
    - Modify `replicas` to the target value in your StatefulSet configuration.
4. Apply changes.
5. Delete persistent volumes of removed pods.
    - Manually delete the persistent volumes associated with the removed compute nodes.
    - This ensures the pods start from a clean PV if the cluster is scaled up again in the future.
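With the example configuration at the end of this article, StatefulSet PVCs follow Kubernetes' `<claim>-<pod>` naming (the claim template is named `data`), so cleanup after scaling from 5 to 3 replicas might look like this (the namespace is a placeholder):

```shell
# Remove the PVCs of the deleted compute nodes so future scale-ups start from clean volumes.
kubectl delete pvc data-puppygraph-compute-3 data-puppygraph-compute-4 -n <namespace>
```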
Pod Status After UI Deletion
After deleting nodes in the Web UI, the corresponding pods will remain in Running status. This is expected behavior; the pods will be removed when you update `replicas` and apply changes.
Scaling Down Leader Nodes
The process is similar to compute nodes, with one important exception regarding the elected leader.
Steps:

1. Delete follower leader nodes with the highest indices via the Web UI.
    - Navigate to the Web UI.
    - Delete follower nodes starting from the highest index.
2. If the elected leader is among the nodes to be deleted:
    - Use `kubectl delete pod` to delete the elected leader pod. Replace `<statefulset-name>` with your actual StatefulSet name (e.g., `puppygraph-leader`) and `<n>` with the pod ordinal.
    - This triggers leader election, selecting a new leader from among the remaining followers.
    - The original leader pod will be recreated by Kubernetes and become a follower.
    - Now you can delete this follower via the Web UI.
3. Wait for the Web UI to show completion.
    - Refresh the page to verify all nodes are deleted.
4. Update StatefulSet replicas.
    - Modify `replicas` to the target value in your StatefulSet configuration.
5. Apply changes.
6. Delete persistent volumes of removed pods.
    - Manually delete the persistent volumes associated with the removed leader nodes.
    - This ensures the pods start from a clean PV if the cluster is scaled up again in the future.
Version Upgrade
Pre-Upgrade Support Contact
Before starting any upgrade, contact PuppyGraph support to verify version compatibility and any specific migration requirements.
Image Upgrade
Upgrading PuppyGraph to a new version involves updating container images for leader and compute pods. You need to upgrade them separately and the order of upgrade matters: upgrade the compute nodes first, then upgrade the leader nodes.
To upgrade the images, set the image tag in the StatefulSet or Helm values and apply changes.
If using the Helm chart, you can:

- Set `compute.image.tag` to the new version and apply the change to upgrade compute nodes.
- Set `leader.image.tag` to the new version and apply the change to upgrade leader nodes.

Note that `compute.image.tag` and `leader.image.tag` are more specific settings for compute and leader nodes, respectively, and they can override `image.tag`. If neither is explicitly set, they default to the value of `image.tag`.
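With the Helm chart, the two-step upgrade might be sketched as follows (the release, chart, and version values are placeholders):

```shell
# Upgrade compute nodes first...
helm upgrade <release> <chart> --reuse-values --set compute.image.tag=<new-version>
# ...verify in the Web UI, then upgrade leader nodes.
helm upgrade <release> <chart> --reuse-values --set leader.image.tag=<new-version>
```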
After applying each change, verify the pod status and confirm in the Web UI.
Upgrade Behavior
During the upgrade or downgrade process, all affected pods for leader nodes and compute nodes will be recreated via a rolling update by default. The local cache is not affected.
Since PuppyGraph is not a database itself, there are no data migration requirements for PuppyGraph data or local cache.
Whether the cluster can provide services during upgrade or downgrade depends on the specific versions, but generally services remain available.
Helm Chart Upgrade
If you are upgrading the PuppyGraph Helm chart itself with already running PuppyGraph clusters, you can review the release notes, check the chart changes and contact PuppyGraph support to verify version compatibility and any specific upgrade requirements.
Chart vs Image Upgrade
Upgrading the Helm chart is different from upgrading the PuppyGraph image version. PuppyGraph image upgrades are done through modifying values.yaml in the chart. Chart upgrades may not change the PuppyGraph version or even trigger pod rolling updates. However, the chart and image still evolve together, so the settings in the chart should be compatible with the corresponding image versions. Therefore, as long as the chart and image versions are aligned and up to date, they should remain compatible.
Example configuration
Here is an example configuration for deploying PuppyGraph in a cluster mode with three leader nodes and three compute nodes.
It depends on a Secret named `secret-puppygraph` that stores the username (`PUPPYGRAPH_USERNAME`) and password (`PUPPYGRAPH_PASSWORD`) for the PuppyGraph server.
The configuration also includes an empty ConfigMap `configmap-puppygraph` that can be used to store additional configuration files if needed. The ConfigMap is mounted at `/etc/config/puppygraph/` in the leader and compute nodes.
Example PuppyGraph Cluster Configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap-puppygraph
data: {}
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: puppygraph-leader
  labels:
    app: puppygraph-leader
spec:
  serviceName: puppygraph-leader-service
  podManagementPolicy: "Parallel"
  replicas: 3
  selector:
    matchLabels:
      app: puppygraph-leader
  template:
    metadata:
      labels:
        app: puppygraph-leader
    spec:
      containers:
        - name: puppygraph-leader
          image: puppygraph/puppygraph:stable
          imagePullPolicy: Always
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CLUSTER_ID
              value: "1"
            - name: PUPPYGRAPH_USERNAME
              valueFrom:
                secretKeyRef:
                  name: secret-puppygraph
                  key: PUPPYGRAPH_USERNAME
            - name: PUPPYGRAPH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: secret-puppygraph
                  key: PUPPYGRAPH_PASSWORD
            - name: LEADER
              value: "true"
            - name: LEADER_SERVICE_ADDRESS
              value: "puppygraph-leader-service"
            - name: LEADER_NODE_COUNT
              value: "3"
            - name: LEADER_NODE_DOMAIN_NAME
              value: "$(POD_NAME).puppygraph-leader-service.$(NAMESPACE).svc.cluster.local"
            - name: AUTHENTICATION_JWT_SECRETKEY
              value: "<your-secret-key-here>"
            - name: COMPUTE_NODE_COUNT
              value: "3"
            - name: PRIORITY_IP_CIDR
              value: "172.31.0.0/16"
            - name: DATAACCESS_DATA_REPLICATIONNUM
              value: "3"
            - name: STORAGE_PATH_ROOT
              value: "/data/storage"
            - name: SCRATCH_PATH_ROOT
              value: "/data/scratch"
            - name: CLUSTER_STARTUPTIMEOUT
              value: "10m"
          volumeMounts:
            - name: volume-config-puppygraph
              mountPath: /etc/config/puppygraph/
              readOnly: true
            - name: data
              mountPath: /data
          resources:
            requests:
              cpu: "16"
              memory: "64Gi"
            limits:
              memory: "64Gi"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10
      volumes:
        - name: volume-config-puppygraph
          configMap:
            name: configmap-puppygraph
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-sc
        resources:
          requests:
            storage: 200Gi
---
apiVersion: v1
kind: Service
metadata:
  name: puppygraph-leader-service
spec:
  clusterIP: None
  selector:
    app: puppygraph-leader
  ports:
    - name: rest
      protocol: TCP
      port: 8081
      targetPort: 8081
    - name: gremlin
      protocol: TCP
      port: 8182
      targetPort: 8182
    - name: cypher
      protocol: TCP
      port: 7687
      targetPort: 7687
---
apiVersion: v1
kind: Service
metadata:
  name: puppygraph-cluster-proxy
spec:
  type: LoadBalancer
  selector:
    app: puppygraph-leader
  ports:
    - name: rest
      protocol: TCP
      port: 8081
      targetPort: 8081
    - name: gremlin
      protocol: TCP
      port: 8182
      targetPort: 8182
    - name: cypher
      protocol: TCP
      port: 7687
      targetPort: 7687
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: puppygraph-compute
  labels:
    app: puppygraph-compute
spec:
  serviceName: puppygraph-compute-service
  podManagementPolicy: "Parallel"
  replicas: 3
  selector:
    matchLabels:
      app: puppygraph-compute
  template:
    metadata:
      labels:
        app: puppygraph-compute
    spec:
      containers:
        - name: puppygraph-compute
          image: puppygraph/puppygraph:stable
          imagePullPolicy: Always
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: PUPPYGRAPH_USERNAME
              valueFrom:
                secretKeyRef:
                  name: secret-puppygraph
                  key: PUPPYGRAPH_USERNAME
            - name: PUPPYGRAPH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: secret-puppygraph
                  key: PUPPYGRAPH_PASSWORD
            - name: COMPUTE_NODE
              value: "true"
            - name: COMPUTE_NODE_LEADER_ADDRESS
              value: "puppygraph-leader-service"
            - name: COMPUTE_NODE_DOMAIN_NAME
              value: "$(POD_NAME).puppygraph-compute-service.$(NAMESPACE).svc.cluster.local"
            - name: PRIORITY_IP_CIDR
              value: "172.31.0.0/16"
            - name: STORAGE_PATH_ROOT
              value: "/data/storage"
            - name: SCRATCH_PATH_ROOT
              value: "/data/scratch"
          volumeMounts:
            - name: data
              mountPath: /data
          resources:
            requests:
              cpu: "16"
              memory: "64Gi"
            limits:
              memory: "64Gi"
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-sc
        resources:
          requests:
            storage: 30Gi
---
apiVersion: v1
kind: Service
metadata:
  name: puppygraph-compute-service
spec:
  clusterIP: None
  selector:
    app: puppygraph-compute
  ports:
    - name: rest
      protocol: TCP
      port: 8081
      targetPort: 8081
```
Troubleshooting Common Issues
What should I do if pods are stuck in CrashLoopBackOff?
Download logs via the Web UI and check for startup errors or missing configuration. Ensure environment variables are set correctly. Restart the cluster after fixing the issues.
Why are persistent volumes not attaching?
Verify storage class and PVC binding. If using cloud services, ensure the storage provisioner is configured correctly.
Why is the cluster not forming properly?
Wait for all pods to enter the Running status. If leader or compute node pods remain not ready after an extended period, download logs via the Web UI for detailed error information. You can also refer to Restarting a Single Leader/Compute Node to restart the corresponding pod.
What should I do about high memory or CPU usage?
Tune resource limits or investigate query patterns. Alternatively, add more compute nodes to distribute the load.
Why is the service not reachable?
Verify ingress or service configuration. If the issue occurs during cluster updates, try refreshing the page. During rolling updates, connections to the previous leader pod may be interrupted as it restarts. Refreshing will reconnect to another leader node pod.