For NOC Engineers

Technical posts aimed directly at TP, NOC operations, or the practical use of BP.

ansible-exec: ansible-playbook wrapper for executing playbooks

August 26th, 2014 | Blog

Ansible is a great automation tool. We use it for server provisioning, application deployments, and running maintenance scripts. One problem it does have, however, is how inconvenient it is to run playbooks compared to regular shell scripts. Write and run enough Ansible playbooks, and eventually you’ll get tired of the repetitive typing.

Naught: Zero Downtime for Node.js Applications

March 22nd, 2014 | Blog

Service downtime is harmful to most technology businesses, especially those that require their services to be constantly available. Downtime has many causes, such as hardware failures and network issues. In today’s web-scale world, application deployment is one of the main causes of such downtime. This is particularly common in organizations practicing Continuous Delivery, in which developers deploy their code at an unprecedented speed. Since there is always a good chance that new code contains errors, frequent application changes carry a high risk of service malfunction.

4 Ways to Combat Non-Actionable Alerts

April 23rd, 2014 | Blog

Many alerts place an unnecessary burden on Ops teams instead of helping them to solve issues. The main problem is that most alerts are not actionable enough:

  • They point to issues that don’t require a response
  • They lack critical information, forcing you to spend time searching for more insights in order to gauge their urgency

Stop Managing Ops Incidents with Jira or Zendesk

May 2nd, 2014 | Blog

In many ways, incident management for devops is similar to typical issue tracking processes: it facilitates coordination and collaboration on daily tasks. For this reason, tools such as Jira, Zendesk, and even email are often used for incident management. But incident management faces one unique challenge that sets it apart from other issue tracking processes. In addition to human-operated workflows, incident management also relies heavily on machine-driven workflows. Unfortunately, traditional issue trackers and ticketing systems cannot accommodate this with their current product mechanics.

Building a Fast Ops Incident Dashboard

April 14th, 2014 | Blog

Few things damage productivity as much as waiting. Waiting forces us to context switch, disrupts our creative momentum and eliminates our ability to experiment. Whether we are deploying a new service or troubleshooting a problem, waiting puts a heavy tax on efficient work. 

Easy Modeling of Distributed Production with Vagrant & Ansible

July 14th, 2014 | Blog

Modeling your production environment correctly is very important for development. Developers need to be able to run and test their code locally for the development process to be efficient, and this often requires replicating production infrastructure on their local machines. The basic solution is a simple Vagrant box containing all your infrastructure and application code, like the one we mentioned in our Devbox post.