Cluster Deployment
Overview
This article describes how to deploy PuppyGraph in a cluster mode with two main focuses:
- Horizontal scalability: Getting better performance by adding more nodes.
- High availability: Ensuring that the system remains operational even if some nodes fail.
The cluster deployment in this documentation is based on Kubernetes, a popular container orchestration platform, but any other orchestration platform can be used as well.
Cluster Architecture
A PuppyGraph cluster consists of two types of nodes: leader nodes and compute nodes.
For production deployments, it is recommended to have at least three (3) leader nodes and at least three (3) compute nodes.
Leader Nodes
Leader nodes are responsible for managing the cluster and coordinating the compute nodes. They handle the leader election process, monitor the health of the cluster, and manage the cluster state.
Compute Nodes
Compute nodes are responsible for executing the queries and processing the data. They communicate with the leader nodes to get the cluster state and execute the queries.
Setting up Leader Nodes
Configuration of Leader Nodes
Basic Leader Environment Variables
These environment variables are required for a single leader node to work correctly.
| Variable Name | Description |
|---|---|
| `LEADER` | Identifies the node as a leader. Set to `true` for leader nodes. |
| `COMPUTE_NODE_COUNT` | Number of compute nodes in the cluster. Used by the leader to determine if the cluster is ready to serve requests. |
| `PRIORITY_IP_CIDR` | IP range for priority IPs for leader nodes. Should match the Kubernetes cluster CIDR. |
| `STORAGE_PATH_ROOT` | Root path for persistent storage. |
Leader Cluster Environment Variables
These environment variables are required for leader nodes to work correctly in a cluster.
| Variable Name | Description |
|---|---|
| `POD_NAME` | Name of the current leader node. Used by PuppyGraph to detect when to start a new cluster or wait for others. |
| `CLUSTER_ID` | Specifies the cluster ID. All leader nodes in the same cluster must use the same ID to avoid conflicts. |
| `LEADER_SERVICE_ADDRESS` | Address or name of the leader service. |
| `LEADER_NODE_COUNT` | Number of leader nodes in the cluster. Defines the minimum number required to form a cluster and continue startup. |
| `LEADER_NODE_DOMAIN_NAME` | Domain name of the current leader node. Allows leader nodes to communicate with each other. |
| `AUTHENTICATION_JWT_SECRETKEY` | Secret key for JWT authentication. It must be 16, 24, or 32 bytes, selecting AES-128, AES-192, or AES-256 modes respectively; a 32-byte key is recommended. |
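One way to produce a 32-character key for `AUTHENTICATION_JWT_SECRETKEY` is with `openssl` (this assumes the key is read as a plain ASCII string, so 32 characters correspond to 32 bytes):

```shell
# Generate a random 32-character hex string (32 bytes of ASCII) suitable for AES-256.
openssl rand -hex 16
```

Store the generated value in a Secret rather than committing it to version control.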
Additional Cluster Startup Timeout Configuration
These environment variables are used to configure the startup process of the leader nodes. They define the timeouts for various stages of the startup process.
| Variable Name | Description |
|---|---|
| `CLUSTER_STARTUPTIMEOUT` | Max time the startup process of a leader node can take before being considered failed. |
| `CLUSTER_POTENTIALLEADERTIMEOUT` | Max time a leader node can take to discover other potential leaders before starting a new cluster. |
| `CLUSTER_OTHERLEADERTIMEOUT` | Max time a leader node waits for the leader election to complete before it's considered failed. Should be larger than `CLUSTER_POTENTIALLEADERTIMEOUT`. |
Leader Service
A leader service needs to be configured to allow leader nodes to communicate with each other.
The API service port (default 8081) needs to be exposed for this purpose.
It can also be configured to allow clients to connect to any leader nodes.
In this case, the Gremlin service port (default 8182) or openCypher service port over Bolt (default 7687) needs to be exposed if clients need to connect to the leader nodes directly.
Example Leader Service Configuration
```yaml
apiVersion: v1
kind: Service
metadata:
  name: puppygraph-leader-service
spec:
  clusterIP: None
  selector:
    app: puppygraph-leader
  ports:
    - name: rest
      protocol: TCP
      port: 8081
      targetPort: 8081
    - name: gremlin
      protocol: TCP
      port: 8182
      targetPort: 8182
    - name: cypher
      protocol: TCP
      port: 7687
      targetPort: 7687
```
Leader Nodes Startup Process
When a leader node starts, it will attempt to discover existing leader nodes through the leader service to check if a cluster is already formed. If it finds other leader nodes, it will register itself with the cluster and join it.
If a leader node cannot find any other leader nodes within the timeout defined by `CLUSTER_POTENTIALLEADERTIMEOUT` (30 seconds by default), it will initiate a new cluster only if it is the first leader node to start. Eligibility to initiate a new cluster is determined by the `POD_NAME` environment variable: only the leader node whose `POD_NAME` ends with `-0` is considered the first leader node to start. This prevents multiple leader nodes from starting new clusters at the same time and causing a split-brain situation.
The cluster will then wait for a certain number of leader and compute nodes to join. The number of leader nodes required to form a cluster is defined by the `LEADER_NODE_COUNT` environment variable, and the number of compute nodes by the `COMPUTE_NODE_COUNT` environment variable. Once the required numbers of leader and compute nodes are registered, the cluster will begin serving requests.
The whole leader startup process has a timeout defined by `CLUSTER_STARTUPTIMEOUT` (24 hours by default). If a leader node cannot finish the startup process within this timeout, its startup fails and it must be restarted manually.
Readiness Probe
A readiness probe needs to be configured to ensure that the leader nodes are ready to communicate with the rest of the cluster on startup.
The default readiness probe for the leader nodes checks that the /healthz path on port 8081 of the PuppyGraph server returns a 200 status code. On each leader startup, it will use the leader service to discover any existing cluster members in order to decide whether it should join the existing cluster or start a new one.
Example Readiness Probe Configuration
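The default probe described above can be expressed as the following pod spec fragment; it matches the probe used in the full example configuration at the end of this article:

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```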
Setting up Compute Nodes
Configuration of Compute Nodes
| Environment Variable | Description |
|---|---|
| `COMPUTE_NODE` | Indicates that the node is a compute node. Set to `true`. |
| `COMPUTE_NODE_LEADER_ADDRESS` | Domain name of the leader service. Used by compute nodes to connect and join the cluster. |
| `COMPUTE_NODE_DOMAIN_NAME` | Domain name of the compute node. Used by the leader node to identify and communicate with the node. |
| `PRIORITY_IP_CIDR` | IP range for priority IPs. Should match the Kubernetes cluster CIDR. |
| `STORAGE_PATH_ROOT` | Root path for persistent storage. |
| `SCRATCH_PATH_ROOT` | Root path for temporary storage. |
Compute Node Startup Process
When a compute node starts, it will attempt to discover the leader nodes through the leader service using COMPUTE_NODE_LEADER_ADDRESS. If it finds any leader nodes, it will register itself with the cluster and join it.
Leader nodes will then use the registered address to communicate with the compute nodes.
Cluster Management
High Availability (HA)
PuppyGraph's high availability (HA) consists of two parts: Leader nodes HA and Compute nodes HA.
Leader Nodes HA
Leader nodes consist of one leader and multiple followers. Followers continuously synchronize metadata with the leader and monitor each other's health. When the leader node becomes unavailable, followers automatically elect a new leader among themselves, ensuring continuous availability of cluster management.
Leader nodes use the 2n + 1 deployment formula for HA, where n represents the number of allowable failed nodes. For example, to tolerate 1 failed leader node, at least 3 leader nodes are required; to tolerate 2 failed leader nodes, at least 5 leader nodes are required, and so on.
The minimum required leader and compute nodes are configured by environment variables:
| Environment Variable | Description | Default |
|---|---|---|
| `LEADER_NODE_COUNT` | Minimum number of leader nodes required to form a cluster | 3 |
| `COMPUTE_NODE_COUNT` | Minimum number of compute nodes required to form a cluster | 3 |
Enabling HA
To enable high availability, LEADER_NODE_COUNT must be set to 3 or more. Set it to 1 only for single-leader (non-HA) deployments. Note that a count of 3 is typically sufficient for HA, as the actual resilience of your services is determined by the number of replicas rather than the environment variable LEADER_NODE_COUNT. For production deployments, it is recommended to have at least 3 leader nodes to provide high availability.
The leader nodes store the cluster information in the mounted STORAGE_PATH_ROOT. If a leader node is restarted with existing cluster information, it will try to re-register itself with the cluster.
Compute Nodes HA
Compute nodes HA is implemented through multi-replica data storage, configured by the environment variable DATAACCESS_DATA_REPLICATIONNUM with a default value of 3. This means each data block is stored as 3 replicas on different compute nodes.
Leader nodes continuously monitor compute nodes. When a compute node is unavailable, the leader nodes will detect it and mark it as unhealthy. When a compute node becomes unavailable, data replicas on other nodes continue serving queries. The cluster automatically detects the replica count and restores it to the configured value through data replication.
After a compute node is restarted, it will try to register itself to the leader nodes again. It will also report the existing data state to the leader nodes, and the leader nodes will accordingly update the data state.
Replication vs HA Nodes
The replication factor (DATAACCESS_DATA_REPLICATIONNUM) is independent from the number of HA nodes. For example, you can have 5 compute nodes with a replication factor of 3.
Persistent Volumes and Local Cache
Leader Nodes
For leader nodes, persistent volumes are used to store cluster metadata. If a leader node pod is deleted and recreated, it will retain its metadata from the persistent volume as long as the volume itself is not deleted.
Compute Nodes
For compute nodes, persistent volumes are used to store the local cache data.
When you delete a compute node via the Web UI, the local cache data on that node is automatically migrated to other compute nodes before removal. This ensures data integrity and maintains the replica count specified by DATAACCESS_DATA_REPLICATIONNUM.
If a compute node fails unexpectedly (not deleted via Web UI), the cluster automatically detects the replica count mismatch and redistributes data replicas to other available compute nodes to restore the configured replication factor.
Cluster Stop and Restart
Stopping the Cluster
If you are using the Helm chart, stopping the cluster is done by uninstalling the Helm release.
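As a sketch, assuming a release named `puppygraph` installed in namespace `puppygraph` (both hypothetical names):

```shell
# Uninstall the Helm release; persistent volumes are left in place.
helm uninstall puppygraph -n puppygraph
```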
Persistent Volumes
Uninstalling the Helm release does not delete the associated persistent volumes. The metadata and local cache data remain available for the next cluster restart.
If you no longer need the persistent volumes, you can manually delete them.
Restarting the Cluster
To restart the cluster, reinstall the Helm chart with your configuration.
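For example, assuming a hypothetical chart reference `puppygraph/puppygraph-cluster` and your existing values file:

```shell
# Reinstall the chart with the same values to restart the cluster.
helm install puppygraph puppygraph/puppygraph-cluster -f values.yaml -n puppygraph
```

Because the persistent volumes survived the uninstall, leader metadata and compute local caches are reused on startup.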
Restarting a Single Leader/Compute Node
Sometimes you may need to restart a single leader/compute node, for example, if it is stuck. You can do this by deleting the corresponding pod using kubectl, and Kubernetes will automatically recreate it.
To delete the persistent volumes as well, you can run kubectl delete pvc followed by kubectl delete pod. This ensures the associated PV is removed after the pod terminates, allowing the pod to start with a clean volume upon restart.
```shell
# Delete PVC first (won't actually be removed until the pod releases the volume)
kubectl delete pvc <pvc-name> -n <namespace>

# Then delete Pod to trigger recreation with a fresh PV
kubectl delete pod <pod-name> -n <namespace>
```
Cluster Scaling
Cluster scaling consists of two parts: scaling up (adding nodes) and scaling down (removing nodes).
Scaling Up
For multi-leader deployments, to scale up leader or compute nodes, directly modify the replicas in the corresponding StatefulSet and apply changes.
For compute node scaling, the persistent local cache is not affected. New compute nodes automatically rebalance data upon startup.
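Using names from the example configuration at the end of this article, scaling compute nodes from 3 to 5 might look like this (the namespace is a placeholder):

```shell
# Increase the compute StatefulSet replica count; new pods join and rebalance automatically.
kubectl scale statefulset puppygraph-compute --replicas=5 -n <namespace>
```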
Single Leader to HA Upgrade
If you are currently running a single leader deployment and want to scale up leader nodes to enable HA, the process is different:
1. First, set `LEADER_NODE_COUNT` to 3 and apply changes.
2. Then, scale up by setting `replicas` and apply changes again.

Note: This process involves downtime until the multi-leader deployment is complete.
HA Scaling
In HA deployments, scaling up or down has no downtime and services remain available throughout.
After applying changes, you can verify pod status (for example, with `kubectl get pods`) and confirm in the Web UI.
Scaling Down
Scaling down requires separate procedures for compute nodes and leader nodes.
Scaling Down Compute Nodes
1. Delete compute node(s) with the highest index via the Web UI.
    - Navigate to the cluster management page in the Web UI.
    - Select the compute node(s) with the highest index (e.g., delete `puppygraph-compute-3` and `puppygraph-compute-4` when scaling from 5 to 3 replicas, considering 0-based indexing).
    - Click delete to remove the nodes.
2. Wait for deletion and local cache data transfer to complete.
    - Refresh the Web UI to verify the nodes have been removed.
    - Note that this step takes relatively longer, as local cache data automatically migrates to the remaining nodes.
3. Update StatefulSet replicas.
    - Modify `replicas` to the target value in your StatefulSet configuration.
4. Apply changes.
5. Delete persistent volumes of removed pods.
    - Manually delete the persistent volumes associated with the removed compute nodes.
    - This ensures the pods start from a clean PV if the cluster is scaled up again in the future.
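With the example configuration at the end of this article, StatefulSet PVCs follow Kubernetes' `<claim>-<pod>` naming (the claim template is named `data`), so cleanup after scaling from 5 to 3 replicas might look like this (the namespace is a placeholder):

```shell
# Remove the PVCs of the deleted compute nodes so future scale-ups start from clean volumes.
kubectl delete pvc data-puppygraph-compute-3 data-puppygraph-compute-4 -n <namespace>
```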
Pod Status After UI Deletion
After deleting nodes in the Web UI, the corresponding pods will remain in Running status. This is expected behavior; the pods will be removed when you update `replicas` and apply changes.
Scaling Down Leader Nodes
The process is similar to compute nodes, with one important exception regarding the elected leader.
Steps:

1. Delete follower leader nodes with the highest indices via the Web UI.
    - Navigate to the Web UI.
    - Delete follower nodes starting from the highest index.
2. If the elected leader is among the nodes to be deleted:
    - Use `kubectl delete pod` to delete the elected leader pod. Replace `<statefulset-name>` with your actual StatefulSet name (e.g., `puppygraph-leader`) and `<n>` with the pod ordinal.
    - This triggers leader election, selecting a new leader from among the remaining followers.
    - The original leader pod will be recreated by Kubernetes and become a follower.
    - Now you can delete this follower via the Web UI.
3. Wait for the Web UI to show completion.
    - Refresh the page to verify all nodes are deleted.
4. Update StatefulSet replicas.
    - Modify `replicas` to the target value in your StatefulSet configuration.
5. Apply changes.
6. Delete persistent volumes of removed pods.
    - Manually delete the persistent volumes associated with the removed leader nodes.
    - This ensures the pods start from a clean PV if the cluster is scaled up again in the future.
Version Upgrade
Pre-Upgrade Support Contact
Before starting any upgrade, contact PuppyGraph support to verify version compatibility and any specific migration requirements.
Image Upgrade
Upgrading PuppyGraph to a new version involves updating container images for leader and compute pods. You need to upgrade them separately and the order of upgrade matters: upgrade the compute nodes first, then upgrade the leader nodes.
To upgrade the images, set the image tag in the StatefulSet or Helm values and apply changes.
If using the Helm chart, you can:

- Set `compute.image.tag` to the new version and apply the change to upgrade compute nodes.
- Set `leader.image.tag` to the new version and apply the change to upgrade leader nodes.

Note that `compute.image.tag` and `leader.image.tag` are more specific settings for compute and leader nodes, respectively, and they can override `image.tag`. If neither is explicitly set, they default to the value of `image.tag`.
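With the Helm chart, the two-step upgrade might be sketched as follows (the release, chart, and version values are placeholders):

```shell
# Upgrade compute nodes first...
helm upgrade <release> <chart> --reuse-values --set compute.image.tag=<new-version>
# ...verify in the Web UI, then upgrade leader nodes.
helm upgrade <release> <chart> --reuse-values --set leader.image.tag=<new-version>
```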
After applying each change, verify the pod status and confirm in the Web UI.
Upgrade Behavior
During the upgrade or downgrade process, all affected pods for leader nodes and compute nodes will be recreated via a rolling update by default. The local cache is not affected.
Since PuppyGraph is not a database itself, there are no data migration requirements for PuppyGraph data or local cache.
Whether the cluster can provide services during upgrade or downgrade depends on the specific versions, but generally services remain available.
Helm Chart Upgrade
If you are upgrading the PuppyGraph Helm chart itself with already running PuppyGraph clusters, you can review the release notes, check the chart changes and contact PuppyGraph support to verify version compatibility and any specific upgrade requirements.
Chart vs Image Upgrade
Upgrading the Helm chart is different from upgrading the PuppyGraph image version. PuppyGraph image upgrades are done through modifying values.yaml in the chart. Chart upgrades may not change the PuppyGraph version or even trigger pod rolling updates. However, the chart and image still evolve together, so the settings in the chart should be compatible with the corresponding image versions. Therefore, as long as the chart and image versions are aligned and up to date, they should remain compatible.
Example configuration
Here is an example configuration for deploying PuppyGraph in a cluster mode with three leader nodes and three compute nodes.
It depends on a Secret named `secret-puppygraph` that stores the username (`PUPPYGRAPH_USERNAME`) and password (`PUPPYGRAPH_PASSWORD`) for the PuppyGraph server.
The configuration also includes an empty ConfigMap `configmap-puppygraph` that can be used to store additional configuration files if needed. The ConfigMap is mounted at `/etc/config/puppygraph/` in the leader and compute nodes.
Example PuppyGraph Cluster Configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap-puppygraph
data: {}
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: puppygraph-leader
  labels:
    app: puppygraph-leader
spec:
  serviceName: puppygraph-leader-service
  podManagementPolicy: "Parallel"
  replicas: 3
  selector:
    matchLabels:
      app: puppygraph-leader
  template:
    metadata:
      labels:
        app: puppygraph-leader
    spec:
      containers:
        - name: puppygraph-leader
          image: puppygraph/puppygraph:stable
          imagePullPolicy: Always
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CLUSTER_ID
              value: "1"
            - name: PUPPYGRAPH_USERNAME
              valueFrom:
                secretKeyRef:
                  name: secret-puppygraph
                  key: PUPPYGRAPH_USERNAME
            - name: PUPPYGRAPH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: secret-puppygraph
                  key: PUPPYGRAPH_PASSWORD
            - name: LEADER
              value: "true"
            - name: LEADER_SERVICE_ADDRESS
              value: "puppygraph-leader-service"
            - name: LEADER_NODE_COUNT
              value: "3"
            - name: LEADER_NODE_DOMAIN_NAME
              value: "$(POD_NAME).puppygraph-leader-service.$(NAMESPACE).svc.cluster.local"
            - name: AUTHENTICATION_JWT_SECRETKEY
              value: "<your-secret-key-here>"
            - name: COMPUTE_NODE_COUNT
              value: "3"
            - name: PRIORITY_IP_CIDR
              value: "172.31.0.0/16"
            - name: DATAACCESS_DATA_REPLICATIONNUM
              value: "3"
            - name: STORAGE_PATH_ROOT
              value: "/data/storage"
            - name: SCRATCH_PATH_ROOT
              value: "/data/scratch"
            - name: CLUSTER_STARTUPTIMEOUT
              value: "10m"
          volumeMounts:
            - name: volume-config-puppygraph
              mountPath: /etc/config/puppygraph/
              readOnly: true
            - name: data
              mountPath: /data
          resources:
            requests:
              cpu: "16"
              memory: "64Gi"
            limits:
              memory: "64Gi"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10
      volumes:
        - name: volume-config-puppygraph
          configMap:
            name: configmap-puppygraph
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-sc
        resources:
          requests:
            storage: 200Gi
---
apiVersion: v1
kind: Service
metadata:
  name: puppygraph-leader-service
spec:
  clusterIP: None
  selector:
    app: puppygraph-leader
  ports:
    - name: rest
      protocol: TCP
      port: 8081
      targetPort: 8081
    - name: gremlin
      protocol: TCP
      port: 8182
      targetPort: 8182
    - name: cypher
      protocol: TCP
      port: 7687
      targetPort: 7687
---
apiVersion: v1
kind: Service
metadata:
  name: puppygraph-cluster-proxy
spec:
  type: LoadBalancer
  selector:
    app: puppygraph-leader
  ports:
    - name: rest
      protocol: TCP
      port: 8081
      targetPort: 8081
    - name: gremlin
      protocol: TCP
      port: 8182
      targetPort: 8182
    - name: cypher
      protocol: TCP
      port: 7687
      targetPort: 7687
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: puppygraph-compute
  labels:
    app: puppygraph-compute
spec:
  serviceName: puppygraph-compute-service
  podManagementPolicy: "Parallel"
  replicas: 3
  selector:
    matchLabels:
      app: puppygraph-compute
  template:
    metadata:
      labels:
        app: puppygraph-compute
    spec:
      containers:
        - name: puppygraph-compute
          image: puppygraph/puppygraph:stable
          imagePullPolicy: Always
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: PUPPYGRAPH_USERNAME
              valueFrom:
                secretKeyRef:
                  name: secret-puppygraph
                  key: PUPPYGRAPH_USERNAME
            - name: PUPPYGRAPH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: secret-puppygraph
                  key: PUPPYGRAPH_PASSWORD
            - name: COMPUTE_NODE
              value: "true"
            - name: COMPUTE_NODE_LEADER_ADDRESS
              value: "puppygraph-leader-service"
            - name: COMPUTE_NODE_DOMAIN_NAME
              value: "$(POD_NAME).puppygraph-compute-service.$(NAMESPACE).svc.cluster.local"
            - name: PRIORITY_IP_CIDR
              value: "172.31.0.0/16"
            - name: STORAGE_PATH_ROOT
              value: "/data/storage"
            - name: SCRATCH_PATH_ROOT
              value: "/data/scratch"
          volumeMounts:
            - name: data
              mountPath: /data
          resources:
            requests:
              cpu: "16"
              memory: "64Gi"
            limits:
              memory: "64Gi"
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-sc
        resources:
          requests:
            storage: 30Gi
---
apiVersion: v1
kind: Service
metadata:
  name: puppygraph-compute-service
spec:
  clusterIP: None
  selector:
    app: puppygraph-compute
  ports:
    - name: rest
      protocol: TCP
      port: 8081
      targetPort: 8081
```
Troubleshooting Common Issues
What should I do if pods are stuck in CrashLoopBackOff?
Download logs via the Web UI and check for startup errors or missing configuration. Ensure environment variables are set correctly. Restart the cluster after fixing the issues.
Why are persistent volumes not attaching?
Verify storage class and PVC binding. If using cloud services, ensure the storage provisioner is configured correctly.
Why is the cluster not forming properly?
Wait for all pods to enter the Running status. If leader or compute node pods remain not ready after an extended period, download logs via the Web UI for detailed error information. You can also refer to Restarting a Single Leader/Compute Node to restart the corresponding pod.
What should I do about high memory or CPU usage?
Tune resource limits or investigate query patterns. Alternatively, add more compute nodes to distribute the load.
Why is the service not reachable?
Verify ingress or service configuration. If the issue occurs during cluster updates, try refreshing the page. During rolling updates, connections to the previous leader pod may be interrupted as it restarts. Refreshing will reconnect to another leader node pod.