by Shahar Kedar | March 22, 2014

Naught: Zero Downtime for Node.js Applications

Service downtime is harmful to most technology businesses, especially those whose services must be constantly available. Downtime has many causes, such as hardware failures and network issues. In today’s web-scale world, application deployment is one of the main causes. This is particularly common in organizations that practice Continuous Delivery, in which developers deploy their code at unprecedented speed. Since there is always a good chance that new code contains errors, frequent application changes carry a high risk of service malfunction.

The Traditional “Zero Downtime Deployment”

Most organizations deal with this problem by placing all of their servers behind a cluster of load balancers and deploying their code to a few servers at a time. Each group of servers is removed from the load balancer while the new code is deployed and tested on it. Once all the servers in the group have been updated, they are returned to the main cluster. This process is repeated until the new code has been deployed to the entire cluster, which ensures that several servers remain operational at all times. This traditional way of avoiding downtime during deployment is sometimes called “zero downtime deployment” or “rolling deployment.” Unfortunately, the process is far from ideal, and businesses encounter several problems that stem from it.

The first problem is the immediate capacity reduction that occurs during deployment. Once a group of servers is temporarily removed from the cluster, it no longer serves traffic. For systems without sufficient redundancy, this capacity loss is harmful: it can degrade performance and even cause downtime for some users, because the system cannot handle its usual workload with reduced resources.
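To make the capacity loss concrete, here is a minimal sketch that models a rolling deployment as a sequence of batches and reports how many servers are still serving traffic during each one. The function name and the assumption that every server contributes equal capacity are illustrative only.

```javascript
// Model a rolling deployment: servers are taken out of rotation in batches,
// and each batch returns to the cluster before the next one is removed.
// Assumes (for illustration) that every server contributes equal capacity.
function rollingDeploymentCapacity(totalServers, batchSize) {
  if (!Number.isInteger(totalServers) || totalServers < 1 ||
      !Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('totalServers and batchSize must be positive integers');
  }

  const steps = [];
  for (let start = 0; start < totalServers; start += batchSize) {
    // The last batch may be smaller than batchSize.
    const outOfRotation = Math.min(batchSize, totalServers - start);
    const serving = totalServers - outOfRotation;
    steps.push({
      batch: steps.length + 1,
      serving: serving,
      capacity: serving / totalServers, // fraction of normal capacity
    });
  }
  return steps;
}
```

For example, `rollingDeploymentCapacity(10, 3)` describes four batches. During each of the first three, only 7 of the 10 servers are serving, so the cluster has to absorb its full load with 70% of its usual capacity.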

Secondly, rolling deployment is hard to implement on certain platforms. In Node.js, zero downtime deployment tools are not as widely known or adopted as on other platforms, such as Ruby on Rails, where Capistrano and Unicorn are used. Most Node.js developers are simply unaware that such tools exist, or are afraid to use them in production. For this reason, removing servers, updating them, and bringing them back to the cluster can be very complex on Node.js and requires much more effort than the regular deployment process.

An additional issue is the complication that arises when the newly deployed code contains an error. Dealing with malfunctioning code on a server that has been removed from its cluster is no easy task. All of these disadvantages lead to a high probability of performance issues and downtime, and over the last several years modern software development organizations have discovered that “zero downtime deployment” does not live up to its name in the web-scale world.

Here Comes Naught

One way to eliminate downtime during the deployment process, while avoiding the problems of rolling deployment, is to use an excellent open source package called Naught. Naught creates a new worker instance and deploys the new code onto it while the existing cluster is still fully operational. Only after the new worker is deployed and tested does Naught replace an existing instance in the cluster with it. This ensures a smooth deployment with no downtime at any point in the process. Additionally, since no server goes out of commission without being immediately replaced, the system experiences no capacity reduction. Naught simply adds capacity to the cluster during deployment and removes it when it is no longer required. Code errors are handled easily as well, because the new instance is linked only after it has started successfully. At BigPanda we make sure to self-test every worker before it notifies the Naught master that it is ready. This allows us to almost completely eliminate the risk of starting a damaged worker.
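The replacement order described above can be sketched as a small simulation. This is a simplified model of the idea, not Naught's actual implementation: `replaceWorkers`, the `online` flag, and the `startWorker` callback are hypothetical names chosen for illustration.

```javascript
// Simplified model: start each new worker alongside the old ones, and retire
// an old worker only after its replacement has reported that it is ready.
// `startWorker` is a hypothetical callback returning { id, online }.
function replaceWorkers(oldWorkers, startWorker) {
  let active = oldWorkers.slice();
  let minActive = active.length;
  let peakActive = active.length;
  let failed = 0;

  for (const oldWorker of oldWorkers) {
    const fresh = startWorker();
    if (!fresh.online) {
      // The new worker never became ready, so the old one keeps serving.
      failed += 1;
      continue;
    }
    active.push(fresh); // capacity is temporarily one worker higher
    peakActive = Math.max(peakActive, active.length);
    active = active.filter((worker) => worker !== oldWorker); // retire the old worker
    minActive = Math.min(minActive, active.length);
  }

  return { active, minActive, peakActive, failed };
}
```

In this model the number of active workers never falls below the original count. It rises by one while a new worker and its predecessor overlap, and a worker that fails to come online leaves the old one in place.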

This solution is simple to use and to integrate, and it requires very few changes to the code. You simply tell the Naught master process that your new worker is up, and Naught takes care of the rest, replacing an old worker with the new one. Naught offers a few extra features as well, such as automatic instance resuscitation and the ability to run several instances simultaneously.
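As a rough sketch of what that integration might look like, the worker below sends an `'online'` message to its parent process once it is listening, and exits cleanly when it receives a `'shutdown'` message, following the convention described in Naught's README. The `selfTest` callback and the helper names are hypothetical, and this is not BigPanda's actual code.

```javascript
const http = require('http');

// Report readiness to the master, but only if the self-test passes.
// `selfTest` is a hypothetical check supplied by the application.
// `send` is normally process.send, which Node.js defines only when the
// process was started with an IPC channel (as a Naught worker is).
function notifyOnline(selfTest, send) {
  if (!selfTest()) {
    return false; // never report a worker that failed its own check
  }
  if (typeof send === 'function') {
    send('online'); // readiness message, per Naught's README
  }
  return true;
}

function startWorker(port, selfTest) {
  const server = http.createServer((req, res) => {
    res.end('Hello World');
  });

  server.listen(port, () => {
    const send = process.send ? process.send.bind(process) : undefined;
    notifyOnline(selfTest, send);
  });

  // The master asks an old worker to stop by sending 'shutdown'.
  process.on('message', (message) => {
    if (message === 'shutdown') {
      // Stop accepting new connections, then exit.
      server.close(() => process.exit(0));
    }
  });

  return server;
}
```

An entry file would call something like `startWorker(process.env.PORT, mySelfTest)`, where `mySelfTest` is whatever health check the application defines. Assuming the Naught CLI is installed, the file would be launched with `naught start` and replaced by new code with `naught deploy`.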

Downtime is a major obstacle for businesses that want to constantly deliver excellent service. Here at BigPanda, we cannot compromise on having our product up and running 24/7. Our clients rely on our product to manage their systems’ troubleshooting process by automating and tracking every one of their incidents at all times. That’s why, for us, downtime is not an option. Using Naught in production enables us to ensure that our service is always there, all day, every day.

Shahar Kedar - Infrastructure Team Lead

With over 10 years of experience, I’ve been doing everything from hands-on programming to complete system architecture. As infrastructure team leader at BigPanda (and before that at Thomson Reuters) I’m responsible for designing and building the IT and software infrastructure that makes products tick. My passions in life (in this order): my wife and son, gourmet food, cinema and code as craft.