Over the past couple of months, the BigPanda product team has been hard at work enhancing key tools to help you better manage your incidents. Here are a few of the recent updates designed to help you best leverage BigPanda to fit your needs:
Wondering what the BigPanda product team has been up to lately? In our new regular blog series, we’ll provide you with everything you need to know about new product features, upgrades, integrations, and more! Here are a few of the latest additions you may not have discovered yet:
You’ve solved your noisy alert problem with BigPanda. Now solve your noisy ChatOps problem with BigPanda and HipChat, thanks to HipChat’s new integrations platform, HipChat Connect.
If every incident update were to push a new message, your Ops chat rooms would quickly become more crowded than O’Malley’s Pub on St. Paddy’s Day. BigPanda now integrates with HipChat via HipChat Connect, so you can not only view the status of BigPanda incidents in HipChat, but also view incident details with links to relevant actions in the glance view beside the chat room.
We’re happy to announce that BigPanda now integrates with Catchpoint! Catchpoint is a popular cloud-based monitoring tool used by ops teams to measure availability and performance for synthetic transactions and real user web sessions. By integrating with BigPanda, Catchpoint customers can now aggregate all of their monitoring alerts in one place, intelligently clustering them to reduce alert noise and spot critical issues faster.
Ansible is a great automation tool. We use it for server provisioning, application deployments, and running maintenance scripts. One problem it does have, however, is how (in)convenient it is to run playbooks compared with regular shell scripts. Write and run enough Ansible playbooks, and eventually you’ll get tired of the repetitive typing your fingers have to do.
Modeling your production environment correctly is very important for development. Developers need to be able to run and test their code locally for the development process to be efficient, and this often requires replicating production infrastructure on their local machines. The basic solution is a simple Vagrant box containing all your infrastructure and application code, like the one we mentioned in our Devbox post.
One of the first things we do right after installing Nagios is set up email notifications. Without that, how would you know when something went wrong?
In many ways, incident management for devops is similar to typical issue tracking processes: it facilitates coordination and collaboration of daily tasks. For this reason, tools such as Jira, Zendesk, and even email are often used as solutions for incident management. But incident management faces one unique challenge that makes it different from other issue tracking processes. In addition to human-operated workflows, incident management also relies heavily on machine-driven workflows. Unfortunately, traditional issue trackers and ticketing systems cannot accommodate this with their current product mechanics.
Many alerts place an unnecessary burden on Ops teams instead of helping them to solve issues. The main problem is that most alerts are not actionable enough:
Few things damage productivity as much as waiting. Waiting forces us to context switch, disrupts our creative momentum and eliminates our ability to experiment. Whether we are deploying a new service or troubleshooting a problem, waiting puts a heavy tax on efficient work.
Service downtime is harmful to most technology businesses, especially those that require their services to be constantly available. Downtime has many causes, such as hardware failures and network issues. In today’s web-scale world, application deployment is one of the main reasons for such downtime. This is particularly common in organizations practicing Continuous Delivery, in which developers deploy their code at an unprecedented speed. Since there is always a good chance that the new code contains errors, the frequency of application changes carries a high risk of service malfunction.