The caliper-endpoint can become overwhelmed when its Horizontal Pod Autoscaler (HPA) settings are not sufficient for the rate of events being posted to it, or when a large, sudden influx of events arrives. In either case it cannot write to PubSub and keep up with the incoming events at the same time. When this happens, the caliper-endpoint may stop responding to health checks or event postings, and the following error appears in the caliper-endpoint logs:

WORKER TIMEOUT (pid:297062)
Worker with pid 297062 was terminated due to signal 9

To resolve this error, update the HPA settings for the caliper-endpoint. This process is described in the Scale the UDP doc.


To find a good starting point for your minReplicas and maxReplicas values, view the scaling events for the caliper-endpoint in StackDriver logging with the following query:

logName="projects/PROJECT_ID/logs/events"
jsonPayload.reason="ScalingReplicaSet"
jsonPayload.involvedObject.name=~"caliper-endpoint-listener"

The results show the lowest and highest replica counts that the caliper-endpoint was scaled to, and you can base your settings on those. We recommend setting the min to the halfway point between the lowest and highest observed counts, and the max well above the highest observed count.
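The recommendation above can be sketched as a small Python helper. This is an illustration under stated assumptions, not part of the UDP tooling: the function names are hypothetical, the event messages are assumed to follow the usual Kubernetes form "Scaled up replica set NAME to N", and the headroom factor of 2 is only one possible reading of "well above" that you should adjust for your own traffic.

```python
import math
import re

# Assumed message format for ScalingReplicaSet events, for example:
#   "Scaled up replica set caliper-endpoint-listener-abc123 to 10"
#   "Scaled down replica set caliper-endpoint-listener-abc123 to 2 from 10"
_SCALE_RE = re.compile(r"Scaled (?:up|down) replica set \S+ to (\d+)")


def replica_counts(messages):
    """Extract the target replica count from each scaling event message.

    Messages that do not match the assumed format are skipped.
    """
    counts = []
    for message in messages:
        match = _SCALE_RE.search(message)
        if match:
            counts.append(int(match.group(1)))
    return counts


def suggest_hpa_bounds(counts, headroom=2.0):
    """Return (min_replicas, max_replicas) suggested from observed counts.

    min_replicas is the halfway point between the lowest and highest
    observed counts, rounded up. max_replicas is the highest observed
    count multiplied by `headroom` (an assumed factor, not a documented
    value), rounded up.
    """
    if not counts:
        raise ValueError("no replica counts to analyse")
    low, high = min(counts), max(counts)
    min_replicas = math.ceil((low + high) / 2)
    max_replicas = math.ceil(high * headroom)
    return min_replicas, max_replicas


if __name__ == "__main__":
    # Hypothetical messages copied from the query results.
    sample = [
        "Scaled up replica set caliper-endpoint-listener-abc123 to 4",
        "Scaled up replica set caliper-endpoint-listener-abc123 to 10",
        "Scaled down replica set caliper-endpoint-listener-abc123 to 2 from 10",
    ]
    print(suggest_hpa_bounds(replica_counts(sample)))
```

Treat the result as a starting point only. After you apply the new values, check whether the WORKER TIMEOUT errors stop under your real event volume.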
