Resolving the Mystery of Zombie Node Services in Kubernetes

We recently found ourselves embroiled in a perplexing issue with our Node applications deployed using Docker on Kubernetes.

Some services were spontaneously restarting, while others remained running but were dead to any incoming requests. Let's unravel this enigma, from its discovery to its resolution, in the hope that our experience can help others in a similar bind.

The Mystery Unfolds

Our Node services began displaying some eyebrow-raising behaviour:

  • A small but notable number of services started restarting without any discernible cause.
  • Others, while ostensibly operational and still listening on their assigned ports, were completely unresponsive to requests.

This erratic behaviour directly jeopardized our system's stability and efficiency.

Intertwined Clues and a Haphazard Diagnosis

Unraveling this situation wasn't straightforward. Even as we delved deep into logs, configurations, and Kubernetes event streams, the solution remained elusive. It was during this quest that a coworker, somewhat serendipitously, spotted the anomaly: despite being unresponsive, the Node application was still actively listening on its designated port.

File permissions turned out to be at the heart of the issue. Piecing together disparate clues from the documentation and combining them with our observations, the trail led us to the npm update notifier. Its role? To periodically check for new npm versions and notify users of any updates. Herein lay the twist: to record when it last checked, the notifier attempts to write to a status file on disk. But given our Docker setup, where strict write permissions were enforced and images ran as non-root, this write invariably failed.

Instead of failing gracefully or throwing a conspicuous error, the application did something rather unexpected: it clung to its port, listening but coldly ignoring any incoming requests. In many instances, this behaviour misled Kubernetes into presuming the pod was healthy.

To validate our growing suspicions, we tinkered with the notifier's status file. When we artificially adjusted the file's timestamp, a pattern emerged: the service faltered whenever the status file was more than seven days old.
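If you want to reproduce the check, back-dating the file with touch is enough. This is only a minimal sketch: the location of the notifier's status file varies by npm version and home directory, so the path below is purely illustrative, and the "8 days ago" syntax assumes GNU coreutils.

  # Make the notifier's status file look more than seven days old
  touch -d "8 days ago" ~/.config/configstore/update-notifier-npm.json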

Original Dockerfile
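We can't reproduce the service's exact Dockerfile here, so the sketch below is a representative reconstruction; the base image, port, and start command are placeholders. The detail that matters is near the end: the image runs as an unprivileged user on a filesystem with strict write permissions, which is what doomed the notifier's write.

  # Representative sketch, not the literal original; the image tag, port and
  # start command are placeholders.
  FROM node:16-alpine

  WORKDIR /app

  # Install dependencies
  COPY package*.json ./
  RUN npm ci

  # Copy the application source
  COPY . .

  # Strict permissions: run as the unprivileged "node" user, so runtime
  # writes outside explicitly allowed paths fail
  USER node

  EXPOSE 3000
  CMD ["npm", "start"]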

Decoding Deprecated Flags

With the problem cornered, the solution seemed imminent. In the past, we had quelled the update notifier using the NO_UPDATE_NOTIFIER flag. Yet npm 7 threw a wrench in our plans by retiring this flag, a change subtly tucked away in the release notes.

Grand Solution

Our diligence led us to a lifeline: NPM_CONFIG_UPDATE_NOTIFIER. Setting this environment variable to false disarmed the update notifier and its contentious disk write attempts.

Recognizing the scale of our challenge:

  • We armed our teams with guidance and context, tailored to their varied Docker proficiency, so they could fix the issue in their services.
  • We tracked adoption of the new flag across our services.

Modified Dockerfile
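As before, this is a sketch rather than our literal file; the one substantive change from the original is the ENV line that switches the notifier off.

  # Representative sketch; the meaningful change is the ENV line below.
  FROM node:16-alpine

  WORKDIR /app

  # Disable npm's update notifier so it never attempts the failing disk write
  ENV NPM_CONFIG_UPDATE_NOTIFIER=false

  # Install dependencies
  COPY package*.json ./
  RUN npm ci

  # Copy the application source
  COPY . .

  USER node

  EXPOSE 3000
  CMD ["npm", "start"]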

Reflective Conclusion

Our takeaways?

  • Even subtle changes in third-party utilities can spiral into monumental challenges.
  • Release notes, no matter how mundane, deserve meticulous scrutiny.
  • Ensuring robust security, as we did with our Docker configurations, can sometimes spotlight lurking issues that might otherwise remain camouflaged.

To our peers navigating the vast seas of Node, Docker, and Kubernetes, remember the little savior: NPM_CONFIG_UPDATE_NOTIFIER. It may be diminutive, but it’s powerful enough to ward off an avalanche of issues!

Here's to fewer mysteries and more seamless coding!