Monitoring
PuppyGraph makes it easy to keep an eye on your system's health and performance. With built-in health check endpoints and detailed metrics collection, you can quickly monitor how your PuppyGraph instances are running and catch any issues early.
Health Check
PuppyGraph provides a health check endpoint to monitor the status and readiness of PuppyGraph instances. This is particularly useful for orchestration platforms like Kubernetes and load balancers to determine if a PuppyGraph instance is ready to accept traffic.
Endpoint
The health check endpoint is accessible at :8081/healthz.
When healthy, the endpoint returns 200 OK. If the server is not ready or experiencing issues, the endpoint may return a non-200 status code or be unavailable.
Usage
While the server is starting up, the health check endpoint may not be immediately available. Configure your health check with appropriate retries and timeouts to give the server time to initialize.
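The retry-and-timeout advice above can be sketched as a small polling helper for scripts that need to wait for an instance outside Docker or Kubernetes. This is a minimal sketch, not part of PuppyGraph: the function names `is_healthy` and `wait_until_healthy` are hypothetical, and the default retry count, interval, and timeout simply mirror the values used in the configuration examples on this page.

```python
import http.client
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeouts, and non-2xx responses (HTTPError)
        # are all treated as "not ready yet".
        return False


def wait_until_healthy(url, retries=3, interval=10.0, timeout=5.0,
                       check=is_healthy, sleep=time.sleep):
    """Poll `url` up to `retries` times, pausing `interval` seconds between attempts.

    `check` and `sleep` are injectable so the loop can be exercised without
    a running server.
    """
    for attempt in range(1, retries + 1):
        if check(url, timeout):
            return True
        if attempt < retries:
            sleep(interval)
    return False
```

For a local instance, a call such as `wait_until_healthy("http://localhost:8081/healthz", retries=6)` would block until the endpoint returns 200 OK or the attempts are exhausted.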
Docker Container Health Check
For Docker deployments, you can configure a health check in your docker-compose.yml:
services:
  puppygraph:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8081/healthz"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 60s
Kubernetes readiness probe (cluster deployments)
The health check endpoint is commonly used in Kubernetes deployments as a readiness probe. This ensures that traffic is only routed to PuppyGraph instances that are fully initialized and ready to process requests.
In cluster deployments with multiple nodes, each node exposes its own health check endpoint. The cluster management component and load balancers monitor the health of each node and stop routing traffic to nodes that are marked unhealthy, helping to maintain overall system stability.
Example Kubernetes readiness probe configuration:
readinessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 3
For more information about cluster deployment, see Cluster Deployment.
Metrics
PuppyGraph supports the Prometheus / OpenMetrics format for metrics collection.
Specification
Endpoint
Once enabled, the metrics endpoint becomes accessible at :8081/metrics.
Available metrics
| Metric Name | Description |
|---|---|
| puppy_gremlin_errors_count | Gremlin server error count |
| puppy_gremlin_errors_fifteenminuterate | Gremlin server mean error rate over 15 minutes |
| puppy_gremlin_errors_fiveminuterate | Gremlin server mean error rate over 5 minutes |
| puppy_gremlin_errors_meanrate | Gremlin server mean error rate |
| puppy_gremlin_errors_oneminuterate | Gremlin server mean error rate over 1 minute |
| puppy_gremlin_op_eval_count | Gremlin server op eval count |
| puppy_gremlin_op_eval_fifteenminuterate | Gremlin server op eval mean rate over 15 minutes |
| puppy_gremlin_op_eval_fiveminuterate | Gremlin server op eval mean rate over 5 minutes |
| puppy_gremlin_op_eval_max | Gremlin server op eval maximum time cost |
| puppy_gremlin_op_eval_mean | Gremlin server op eval mean time cost |
| puppy_gremlin_op_eval_meanrate | Gremlin server op eval mean rate |
| puppy_gremlin_op_eval_min | Gremlin server op eval minimum time cost |
| puppy_gremlin_op_eval_oneminuterate | Gremlin server op eval mean rate over 1 minute |
| puppygraph_client_status | Liveness of clients, including gotty, bolt, and notebook |
| puppygraph_gremlin_server_status | Liveness of the Gremlin server |
| puppygraph_node_alive | Liveness of nodes in the PuppyGraph cluster |
Metrics whose names start with puppy_gremlin are available when metrics collection for the Gremlin server is enabled.
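To spot-check which of the metrics above an instance is actually reporting, the text returned by the metrics endpoint can be filtered by name prefix. The following is a simplified sketch, not a full OpenMetrics parser: `parse_metrics` is a hypothetical helper, and the sample input is invented for illustration, so real output will contain different values and may include labels.

```python
def parse_metrics(text, prefix="puppy"):
    """Collect samples whose name starts with `prefix` from Prometheus text output.

    Returns a dict mapping "<name>" or "<name>{labels}" to a float value.
    Simplification: the label block is assumed to end at the last "}" on the line.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines such as HELP/TYPE
        if "}" in line:
            head, _, tail = line.rpartition("}")
            key, fields = head + "}", tail.split()
        else:
            parts = line.split()
            key, fields = parts[0], parts[1:]
        if not fields or not key.startswith(prefix):
            continue
        try:
            samples[key] = float(fields[0])  # fields[1], if present, is a timestamp
        except ValueError:
            continue  # ignore lines whose value is not numeric
    return samples


# Illustrative input only; these values are made up.
SAMPLE = """\
# illustrative comment line
puppy_gremlin_errors_count 0
puppygraph_gremlin_server_status 1
unrelated_metric{label="a b"} 3
"""

puppy_samples = parse_metrics(SAMPLE)
```

If no key beginning with `puppy_gremlin` appears in the result for a real instance, Gremlin server metrics are probably not enabled.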
Configuration
| Environment Variable | Default Value | Description |
|---|---|---|
| METRICS_ENABLED | false | Enables the metrics endpoint. |
| METRICS_AUTH_ENABLED | true | Enables basic auth (using PuppyGraph credentials) for the metrics endpoint. |
| GREMLINSERVER_METRICS_ENABLED | false | Enables metrics collection for the Gremlin server. |
Prometheus Integration
Use the following Prometheus scrape configuration as a template for collecting PuppyGraph metrics. Replace the placeholder values to match your environment.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: puppygraph
    metrics_path: /metrics
    static_configs:
      - targets: ["<PUPPYGRAPH_HOST>:8081"]
    basic_auth:
      username: <YOUR_PUPPYGRAPH_USERNAME>
      password: <YOUR_PUPPYGRAPH_PASSWORD>
- Set <PUPPYGRAPH_HOST> to the address where PuppyGraph exposes /metrics.
- Replace <YOUR_PUPPYGRAPH_USERNAME> and <YOUR_PUPPYGRAPH_PASSWORD> with valid PuppyGraph credentials.
- Adjust the scrape intervals to match your monitoring requirements.
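Before pointing Prometheus at an instance, it can help to confirm that the host and credentials work by issuing the same basic-auth request by hand. The sketch below only builds the request; `build_metrics_request` is a hypothetical helper, and plain HTTP on port 8081 is assumed because that is what the other examples on this page use.

```python
import base64
import urllib.request


def build_metrics_request(host, username=None, password=None, port=8081):
    """Build a GET request for /metrics, adding HTTP basic auth when credentials are given."""
    req = urllib.request.Request(f"http://{host}:{port}/metrics")
    if username is not None and password is not None:
        # Basic auth is "Basic " + base64("username:password").
        token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
        req.add_header("Authorization", f"Basic {token}")
    return req
```

Passing the result to `urllib.request.urlopen(req, timeout=5)` sends the request. An authentication error here suggests the credentials in the scrape configuration need correcting; omit the username and password when metrics authentication is disabled.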
Datadog Integration
Here's an example of how to integrate PuppyGraph's Prometheus endpoint with Datadog for monitoring.
Run Datadog agent
Follow the Datadog guide "Start the Datadog Agent with Docker" to start the Datadog Agent.
If metrics authentication is enabled, you need to set PUPPYGRAPH_USERNAME and PUPPYGRAPH_PASSWORD as environment variables in the Datadog agent container:
docker run -d --name dd-agent \
  -e PUPPYGRAPH_USERNAME="<YOUR_PUPPYGRAPH_USERNAME>" \
  -e PUPPYGRAPH_PASSWORD="<YOUR_PUPPYGRAPH_PASSWORD>" \
  -e DD_API_KEY="<DATADOG_API_KEY>" \
  -e DD_SITE="<YOUR_DATADOG_SITE>" \
  -e DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  -v /var/lib/docker/containers:/var/lib/docker/containers:ro \
  gcr.io/datadoghq/agent:latest
Run PuppyGraph with Datadog labels
Add labels to the PuppyGraph container so that Datadog can discover it and scrape its metrics.
For example, if you are running PuppyGraph with Docker, you can add the following arguments to the docker run command:
With metrics authentication enabled:
-l com.datadoghq.ad.check_names='["openmetrics"]' \
-l com.datadoghq.ad.init_configs='[{}]' \
-l com.datadoghq.ad.instances="[{\"openmetrics_endpoint\":\"http://%%host%%:8081/metrics\",\"username\":\"%%env_PUPPYGRAPH_USERNAME%%\",\"password\":\"%%env_PUPPYGRAPH_PASSWORD%%\",\"namespace\":\"puppy\",\"metrics\":[\".*\"]}]" \
With metrics authentication disabled: