Reliability Engineering : 3rd Pillar of DevOps

Reliability : the ability of a system or component to function under stated conditions for a specified period of time. It includes :

- Availability
- Performance
- Security

These should not be part of non-functional requirements. 

Key Areas of DevOps

1. Extending delivery to production
2. Extending feedback from operations to development (operate for design)
3. Embedding development into operations
4. Embedding operations into development

Dev comes from the school and Ops comes from the street. Reliability engineering = design for operate + operate for design

According to the "Site Reliability Engineering" book by Google, the development team handles 100% of the operational workload at first and, once the service reaches maturity and stability, 5% of it.

Design for operation
  • "Design Patterns" by the Gang of Four covers software design, design patterns, and architecture. The comparable design pattern book for stability is "Release It!" by Michael Nygard. Another good resource is "The Twelve-Factor App".
  • Failure at integration points : Hystrix is an open source library by Netflix that wraps a call to an integration point with a circuit breaker.
  • Config state should be kept separate from app code and stored in environment variables.
  • Factorish is a GitHub project for moving a legacy app to a 12-factor app.
  • Follow
  • Replication avoids single points of failure.
  • Performance testing can be part of the build pipeline.
  • Use profiler tools and APM (Application Performance Management) tools to locate performance bottlenecks.
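
The circuit breaker idea mentioned above for integration points can be sketched in a few lines. Hystrix itself is a Java library, and the Python below is not its API. It is only a minimal illustration of the pattern. The names CircuitBreaker, CircuitOpenError, max_failures, reset_timeout, and clock are all hypothetical, and the thresholds are arbitrary.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped (hypothetical name)."""


class CircuitBreaker:
    """Minimal circuit breaker sketch; not the Hystrix API."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # failures allowed before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of hammering a broken integration point.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example breaker.call(fetch_inventory, item_id) where fetch_inventory is whatever function talks to the integration point, and treat CircuitOpenError as a signal to serve a fallback.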

Operate for Design


Monitoring metrics
  1. service performance & uptime
  2. software components metrics 
  3. system metrics (time series metrics about host) 
  4. app metrics
  5. performance
  6. security
    1. System security : Bad TLS/SSL settings, open ports, system configuration problems
    2. Application security : XSS/SQL injection, custom events like password reset, invalid logins, new account creation. 
    3. Anomalies 
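
As a rough illustration of turning a custom security event such as invalid logins into an anomaly signal, the sketch below counts events in a sliding time window and flags when the count exceeds a threshold. The class name EventRateMonitor, the threshold, and the window size are assumptions made for the example. Real monitoring tools use more sophisticated detection.

```python
from collections import deque


class EventRateMonitor:
    """Flags when more than `threshold` events fall within `window_seconds`."""

    def __init__(self, threshold=5, window_seconds=60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, timestamp):
        """Record one event; return True if the rate now looks anomalous."""
        self.timestamps.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while timestamp - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold


# Usage: feed in the time (in seconds) of each invalid login.
monitor = EventRateMonitor(threshold=3, window_seconds=10.0)
flags = [monitor.record(t) for t in (0.0, 1.0, 2.0, 3.0)]
# Only the fourth event pushes the count above the threshold.
```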

Monitoring Tools

1. Legacy tools : Nagios, Zabbix
2. Simple endpoint monitoring : Pingdom
3. System and metric monitoring : Datadog, Netuitive, Ruxit, and Librato
4. Full application performance management tools : New Relic and AppDynamics
5. Open source tools : Graphite, Grafana, StatsD, Ganglia, InfluxDB, OpenTSDB
6. Open source solutions similar to Nagios : Icinga, Sensu
7. Container monitoring open source tools : Prometheus, Sysdig
8. security monitoring tools

Avoid collecting too many metrics.


5 principles of logging

1. Do not collect logs that you will not use.
2. Retain log data for as long as regulatory authorities require it to be retained.
3. Log all you can, but alert only when action is needed. Define log levels.
4. Application availability and security matter more than logging availability and security.
5. Logs change, in both format and content. Let everyone take ownership of their logs and keep them in a centralized system.
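
Principle 3 can be illustrated with Python's standard logging module: one handler keeps everything, while a second handler with a higher level receives only what someone must act on. This is a minimal sketch. ListHandler is a hypothetical in-memory stand-in for a real log store or paging system, and the logger name "payments" and the messages are made up.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted records in memory; stands in for a real sink."""

    def __init__(self, level=logging.NOTSET):
        super().__init__(level)
        self.messages = []

    def emit(self, record):
        self.messages.append(self.format(record))


def build_logger(name="payments"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let every level reach the handlers
    logger.propagate = False        # keep the example isolated from the root logger
    logger.handlers.clear()

    archive = ListHandler(logging.DEBUG)  # log all you can
    alerts = ListHandler(logging.ERROR)   # alert only when action is needed
    formatter = logging.Formatter("%(levelname)s %(name)s %(message)s")
    for handler in (archive, alerts):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger, archive, alerts


logger, archive, alerts = build_logger()
logger.debug("cache miss for key %s", "user:42")
logger.info("password reset requested")
logger.error("payment gateway unreachable")
# archive now holds all three records; alerts holds only the ERROR record.
```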

Books related to Logging

1. Logging and Log Management
2. The Practice of Cloud System Administration
3. Web Operations

Log Management Tools

1. Legacy tool : Splunk
2. Open source ELK stack = Elasticsearch + Logstash + Kibana
3. SaaS incident management tools : PagerDuty and VictorOps
4. Open source :
6. Command dispatcher tools : Rundeck

In addition to monitoring, metrics, and logging, there are a few more tools for feedback : the incident command system, blameless postmortems, and transparent uptime.

