Cluster Deployment
Overview
This article describes how to deploy PuppyGraph in cluster mode with two main goals:
- Horizontal scalability: Getting better performance by adding more nodes.
- High availability: Ensuring that the system remains operational even if some nodes fail.
The cluster deployment in this documentation is based on Kubernetes, a popular container orchestration platform, but other orchestration platforms can be used as well.
Cluster Architecture
A PuppyGraph cluster consists of two types of nodes: leader nodes and compute nodes.
Leader Nodes
Leader nodes are responsible for managing the cluster and coordinating the compute nodes. They handle the leader election process, monitor the health of the cluster, and manage the cluster state.
Compute Nodes
Compute nodes are responsible for executing queries and processing data. They communicate with the leader nodes to obtain the cluster state and the queries to execute.
Setting up Leader Nodes
Configuration of Leader Nodes
Basic Leader Environment Variables
These environment variables are required for a single leader node to work correctly.
Variable Name | Description |
---|---|
`LEADER` | Identifies the node as a leader. Set to `true` for leader nodes. |
`COMPUTE_NODE_COUNT` | Number of compute nodes in the cluster. Used by the leader to determine if the cluster is ready to serve requests. |
`PRIORITY_IP_CIDR` | IP range for priority IPs for leader nodes. Should match the Kubernetes cluster CIDR. |
`STORAGE_PATH_ROOT` | Root path for persistent storage. |
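For reference, these variables might appear in a leader container spec roughly as follows; this is only a sketch, with values mirroring the full example at the end of this article:

env:
  - name: LEADER
    value: "true"
  - name: COMPUTE_NODE_COUNT
    value: "3"
  - name: PRIORITY_IP_CIDR
    value: "172.31.0.0/16"   # should match your Kubernetes cluster CIDR
  - name: STORAGE_PATH_ROOT
    value: "/data"           # persistent storage is mounted at this path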
Leader Cluster Environment Variables
These environment variables are required for leader nodes to work correctly in a cluster.
Variable Name | Description |
---|---|
`POD_NAME` | Name of the current leader node. Used by PuppyGraph to detect when to start a new cluster or wait for others. |
`CLUSTER_ID` | Specifies the cluster ID. All leader nodes in the same cluster must use the same ID to avoid conflicts. |
`LEADER_SERVICE_ADDRESS` | Address or name of the leader service. |
`LEADER_NODE_COUNT` | Number of leader nodes in the cluster. Defines the minimum number required to form a cluster and continue startup. |
`LEADER_NODE_DOMAIN_NAME` | Domain name of the current leader node. Allows leader nodes to communicate with each other. |
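In a StatefulSet these variables are typically wired up with the Kubernetes downward API, as in the full example at the end of this article. A sketch (`NAMESPACE` is injected the same way as `POD_NAME`):

env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name        # stable pod name, e.g. puppygraph-leader-0
  - name: NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: CLUSTER_ID
    value: "1"
  - name: LEADER_SERVICE_ADDRESS
    value: "puppygraph-leader-service"
  - name: LEADER_NODE_COUNT
    value: "3"
  - name: LEADER_NODE_DOMAIN_NAME
    value: "$(POD_NAME).puppygraph-leader-service.$(NAMESPACE).svc.cluster.local"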
Additional Cluster Startup Timeout Configuration
These environment variables are used to configure the startup process of the leader nodes. They define the timeouts for various stages of the startup process.
Variable Name | Description |
---|---|
`CLUSTER_STARTUPTIMEOUT` | Max time the startup process of a leader node can take before being considered failed. |
`CLUSTER_POTENTIALLEADERTIMEOUT` | Max time a leader node can take to discover other potential leaders before starting a new cluster. |
`CLUSTER_OTHERLEADERTIMEOUT` | Max time a leader node waits for the leader election to complete before it's considered failed. Should be larger than `CLUSTER_POTENTIALLEADERTIMEOUT`. |
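A sketch of how these timeouts could be set; the `10m` value for `CLUSTER_STARTUPTIMEOUT` comes from the full example below, while the other two values are illustrative assumptions (only the relative ordering is documented):

env:
  - name: CLUSTER_STARTUPTIMEOUT
    value: "10m"   # default is 5 minutes
  - name: CLUSTER_POTENTIALLEADERTIMEOUT
    value: "30s"   # illustrative; default is 30 seconds
  - name: CLUSTER_OTHERLEADERTIMEOUT
    value: "60s"   # illustrative; must be larger than CLUSTER_POTENTIALLEADERTIMEOUT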
Leader Service
A leader service needs to be configured to allow leader nodes to communicate with each other.
The API service port (default `8081`) needs to be exposed for this purpose.
It can also be configured to allow clients to connect to any leader node. In this case, the Gremlin service port (default `8182`) or the openCypher service port over Bolt (default `7687`) needs to be exposed if clients need to connect to the leader nodes directly.
Example Leader Service Configuration
apiVersion: v1
kind: Service
metadata:
  name: puppygraph-leader-service
spec:
  clusterIP: None
  selector:
    app: puppygraph-leader
  ports:
    - name: rest
      protocol: TCP
      port: 8081
      targetPort: 8081
    - name: gremlin
      protocol: TCP
      port: 8182
      targetPort: 8182
    - name: cypher
      protocol: TCP
      port: 7687
      targetPort: 7687
Leader Nodes Startup Process
When a leader node starts, it will attempt to discover existing leader nodes through the leader service to check if a cluster is already formed. If it finds other leader nodes, it will register itself with the cluster and join it.
If a leader node cannot find any other leader nodes within the timeout defined by `CLUSTER_POTENTIALLEADERTIMEOUT` (30 seconds by default), it will only initiate a new cluster if it is the first leader node to start. Eligibility to initiate a new cluster is determined by the `POD_NAME` environment variable: only the leader node whose `POD_NAME` ends with `-0` is considered the first leader node to start. This prevents multiple leader nodes from starting new clusters at the same time and causing a split-brain situation.
The cluster will then wait for sufficient leader and compute nodes to join. The number of leader nodes required to form a cluster is defined by the `LEADER_NODE_COUNT` environment variable, and the number of compute nodes is defined by the `COMPUTE_NODE_COUNT` environment variable. Once the required numbers of leader and compute nodes have registered, the cluster will begin serving requests.
The whole leader startup process has a timeout defined by `CLUSTER_STARTUPTIMEOUT` (5 minutes by default). If a leader node cannot finish the startup process within this timeout, it will exit with an error.
Setting up Compute Nodes
Configuration of Compute Nodes
Environment Variable | Description |
---|---|
`COMPUTE_NODE` | Indicates that the node is a compute node. Set to `true`. |
`COMPUTE_NODE_LEADER_ADDRESS` | Domain name of the leader service. Used by compute nodes to connect and join the cluster. |
`COMPUTE_NODE_DOMAIN_NAME` | Domain name of the compute node. Used by the leader node to identify and communicate with the node. |
`PRIORITY_IP_CIDR` | IP range for priority IPs. Should match the Kubernetes cluster CIDR. |
`STORAGE_PATH_ROOT` | Root path for persistent storage. |
`SCRATCH_PATH_ROOT` | Root path for temporary storage. |
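For reference, a compute container's environment might look roughly like this; a sketch mirroring the full example below, except `SCRATCH_PATH_ROOT`, which is not shown there, so its path here is an assumption:

env:
  - name: COMPUTE_NODE
    value: "true"
  - name: COMPUTE_NODE_LEADER_ADDRESS
    value: "puppygraph-leader-service"
  - name: COMPUTE_NODE_DOMAIN_NAME
    value: "$(POD_NAME).puppygraph-compute-service.$(NAMESPACE).svc.cluster.local"
  - name: PRIORITY_IP_CIDR
    value: "172.31.0.0/16"
  - name: STORAGE_PATH_ROOT
    value: "/data"
  - name: SCRATCH_PATH_ROOT
    value: "/scratch"   # assumed path for temporary storage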
Compute Node Startup Process
When a compute node starts, it will attempt to discover the leader nodes through the leader service using `COMPUTE_NODE_LEADER_ADDRESS`. If it finds any leader nodes, it will register itself with the cluster and join it.
Leader nodes will then use the registered address to communicate with the compute nodes.
Cluster Management
Failover and Recovery
Once a cluster is formed, the leader nodes will continuously monitor each other as well as the compute nodes. When a leader node is unhealthy, the cluster service will remain unaffected because the other leader nodes will continue to provide the service. Additionally, the remaining leader nodes will attempt to elect a new "primary" leader from among themselves to manage the cluster.
The leader nodes will store the cluster information in the mounted `STORAGE_PATH_ROOT`. If a leader node is restarted with existing cluster information, it will try to re-register itself with the cluster. If the current cluster does not match the stored cluster information, the leader node will exit with an error.
When a compute node is not available, the leader nodes will detect it and mark it as unhealthy. The compute node will be removed from the cluster after a timeout defined by `COMPUTE_NODE_UNHEALTHY_TIMEOUT` (default 5 minutes).
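If you need a different eviction window, here is a minimal sketch, assuming this timeout is set as a leader environment variable and uses the same duration format as the other timeout variables:

env:
  - name: COMPUTE_NODE_UNHEALTHY_TIMEOUT
    value: "10m"   # assumption: duration format as for CLUSTER_STARTUPTIMEOUT; default is 5 minutes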
After a compute node is restarted, it will try to register itself with the leader nodes again. It will also report its existing data state to the leader nodes, and the leader nodes will update the data state accordingly.
Adding and Removing Nodes
To add a new leader or compute node to the cluster, simply deploy a new instance of the corresponding StatefulSet, for example by increasing its replica count as sketched below. The new node will automatically discover the existing cluster through the leader service and join it.
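For example, to grow the compute tier from three to four nodes, you could raise the replica count of the compute StatefulSet from the full example below; a sketch showing only the changed field:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: puppygraph-compute
spec:
  replicas: 4   # was 3; the new pod discovers and joins the cluster via the leader service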
To remove a node from the cluster intentionally, first delete the corresponding leader or compute node in the UI. After that, you can clean up the pod and the persistent volume claim (PVC) associated with the removed node to free up the storage.
Readiness probe
A readiness probe needs to be configured to ensure that the leader nodes are ready to communicate with the rest of the cluster on startup.
The default readiness probe for the leader nodes checks that the `:8081/healthz` endpoint of the PuppyGraph server returns a `200` status code. On each leader startup, the node will use the leader service to discover any existing cluster members in order to decide whether it should join the existing cluster or start a new one.
Example Readiness Probe Configuration
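A minimal probe definition matching the description above; it is the same probe used on the leader container in the full example at the end of this article:

readinessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10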
Example configuration
Here is an example configuration for deploying PuppyGraph in cluster mode with three leader nodes and three compute nodes.
It depends on a secret named `secret-puppygraph` that stores the username (`PUPPYGRAPH_USERNAME`) and password (`PUPPYGRAPH_PASSWORD`) for the PuppyGraph server.
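If that secret does not exist yet, it can be created with a manifest along these lines; the credential values are placeholders that must be replaced:

apiVersion: v1
kind: Secret
metadata:
  name: secret-puppygraph
type: Opaque
stringData:
  PUPPYGRAPH_USERNAME: "puppygraph"   # placeholder username
  PUPPYGRAPH_PASSWORD: "changeme"     # placeholder password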
Example PuppyGraph Cluster Configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: puppygraph-leader
  labels:
    app: puppygraph-leader
spec:
  serviceName: puppygraph-leader-service
  podManagementPolicy: "Parallel"
  replicas: 3
  selector:
    matchLabels:
      app: puppygraph-leader
  template:
    metadata:
      labels:
        app: puppygraph-leader
    spec:
      containers:
        - name: puppygraph-leader
          image: puppygraph/puppygraph:stable
          imagePullPolicy: Always
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CLUSTER_ID
              value: "1"
            - name: PUPPYGRAPH_USERNAME
              valueFrom:
                secretKeyRef:
                  name: secret-puppygraph
                  key: PUPPYGRAPH_USERNAME
            - name: PUPPYGRAPH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: secret-puppygraph
                  key: PUPPYGRAPH_PASSWORD
            - name: LEADER
              value: "true"
            - name: LEADER_SERVICE_ADDRESS
              value: "puppygraph-leader-service"
            - name: LEADER_NODE_COUNT
              value: "3"
            - name: LEADER_NODE_DOMAIN_NAME
              value: "$(POD_NAME).puppygraph-leader-service.$(NAMESPACE).svc.cluster.local"
            - name: COMPUTE_NODE_COUNT
              value: "3"
            - name: PRIORITY_IP_CIDR
              value: "172.31.0.0/16"
            - name: DATAACCESS_DATA_REPLICATIONNUM
              value: "3"
            - name: STORAGE_PATH_ROOT
              value: "/data"
            - name: CLUSTER_STARTUPTIMEOUT
              value: "10m"
          volumeMounts:
            - name: volume-config-puppygraph
              mountPath: /etc/config/puppygraph/
              readOnly: true
            - name: data
              mountPath: /data
          resources:
            requests:
              cpu: "16"
              memory: "64Gi"
            limits:
              memory: "64Gi"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10
      volumes:
        - name: volume-config-puppygraph
          configMap:
            name: configmap-puppygraph
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-sc
        resources:
          requests:
            storage: 200Gi
---
apiVersion: v1
kind: Service
metadata:
  name: puppygraph-leader-service
spec:
  clusterIP: None
  selector:
    app: puppygraph-leader
  ports:
    - name: rest
      protocol: TCP
      port: 8081
      targetPort: 8081
    - name: gremlin
      protocol: TCP
      port: 8182
      targetPort: 8182
    - name: cypher
      protocol: TCP
      port: 7687
      targetPort: 7687
---
apiVersion: v1
kind: Service
metadata:
  name: puppygraph-cluster-proxy
spec:
  type: NodePort
  selector:
    app: puppygraph-leader
  ports:
    - name: rest
      protocol: TCP
      port: 8081
      targetPort: 8081
      nodePort: 30081
    - name: gremlin
      protocol: TCP
      port: 8182
      targetPort: 8182
      nodePort: 30182
    - name: cypher
      protocol: TCP
      port: 7687
      targetPort: 7687
      nodePort: 30687
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: puppygraph-compute
  labels:
    app: puppygraph-compute
spec:
  serviceName: puppygraph-compute-service
  podManagementPolicy: "Parallel"
  replicas: 3
  selector:
    matchLabels:
      app: puppygraph-compute
  template:
    metadata:
      labels:
        app: puppygraph-compute
    spec:
      containers:
        - name: puppygraph-compute
          image: puppygraph/puppygraph:stable
          imagePullPolicy: Always
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: PUPPYGRAPH_USERNAME
              valueFrom:
                secretKeyRef:
                  name: secret-puppygraph
                  key: PUPPYGRAPH_USERNAME
            - name: PUPPYGRAPH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: secret-puppygraph
                  key: PUPPYGRAPH_PASSWORD
            - name: COMPUTE_NODE
              value: "true"
            - name: COMPUTE_NODE_LEADER_ADDRESS
              value: "puppygraph-leader-service"
            - name: COMPUTE_NODE_DOMAIN_NAME
              value: "$(POD_NAME).puppygraph-compute-service.$(NAMESPACE).svc.cluster.local"
            - name: PRIORITY_IP_CIDR
              value: "172.31.0.0/16"
            - name: STORAGE_PATH_ROOT
              value: "/data"
          volumeMounts:
            - name: data
              mountPath: /data
          resources:
            requests:
              cpu: "16"
              memory: "64Gi"
            limits:
              memory: "64Gi"
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-sc
        resources:
          requests:
            storage: 30Gi
---
apiVersion: v1
kind: Service
metadata:
  name: puppygraph-compute-service
spec:
  clusterIP: None
  selector:
    app: puppygraph-compute
  ports:
    - name: rest
      protocol: TCP
      port: 8081
      targetPort: 8081