Reliability Engineering : 3rd Pillar of DevOps

Reliability : the ability of a system or component to function under stated conditions for a specified period of time. It includes :

- Availability
- Performance
- Security

These should not be relegated to non-functional requirements.

Key Areas of DevOps

1. Extending delivery to production
2. Extending feedback from operations to development (operate for design)
3. Embedding development into operations
4. Embedding operations into development

Dev comes from the school and Ops comes from the street. Reliability engineering = design for operate + operate for design

According to Google's "Site Reliability Engineering" book, the development team handles 100% of the operational workload at first and, once the service reaches maturity and stability, 5% of it.

Design for operation

"Design Patterns" by the Gang of Four covers software design, design patterns, and architecture. A similar design-pattern book for stability is "Release It!" by Michael Nygard. Another good resource is "The Twelve-Factor App" : https://12factor.net/
Failure at integration points : Hystrix is an open source library by Netflix that wraps a call to an integration point with a circuit breaker.
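
The circuit breaker idea can be sketched in a few lines. This is not Hystrix's API; it is a minimal illustration, and the class name, thresholds, and injectable clock are all assumptions made for the example. After a number of consecutive failures the breaker "opens" and fails fast, then lets one trial call through once a timeout has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    # Hypothetical defaults; tune them per integration point.
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast instead of hitting the remote system.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each remote call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is a hypothetical function) keeps a failing dependency from tying up the caller.
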
Config should be kept separate from app code and stored in environment variables.
Factorish is a GitHub project for moving a legacy app to a twelve-factor app.
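
As a rough sketch of config in environment variables, the function below reads settings from the environment instead of hard-coding them. The variable names (`DATABASE_URL`, `LOG_LEVEL`, `PORT`) and the defaults are hypothetical, chosen only for illustration.

```python
import os


def load_config(environ=os.environ):
    """Build a settings dict from environment variables (names are hypothetical)."""
    return {
        # Required: raises KeyError at startup if it is missing.
        "database_url": environ["DATABASE_URL"],
        # Optional values fall back to assumed defaults.
        "log_level": environ.get("LOG_LEVEL", "INFO"),
        "port": int(environ.get("PORT", "8080")),
    }
```

The same build can then run in every environment, with only the variables changing between them.
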
Follow https://martinfowler.com/
Replication avoids single points of failure.
Performance testing can be part of build pipeline.
Use profiler tools and APM (Application Performance Management) tools to locate performance bottlenecks.

Operate for Design

Monitoring

Monitoring metrics

service performance & uptime
software components metrics
system metrics (time series metrics about host)
app metrics
performance
security

System security : bad TLS/SSL settings, open ports, system configuration problems
Application security : XSS/SQL injection, custom events such as password resets, invalid logins, and new account creation.
Anomalies

Monitoring Tools

1. Legacy tools : Nagios, Zabbix
2. Simple endpoint monitoring : Pingdom
3. System and metric monitoring : Datadog, Netuitive, Ruxit, and Librato
4. Full application performance management tools : New Relic and AppDynamics
5. Open source tools : Graphite, Grafana, StatsD, Ganglia, InfluxDB, OpenTSDB, metrics.dropwizard.io
6. Open source solutions similar to Nagios : Icinga, Sensu
7. Open source container monitoring tools : Prometheus, Sysdig
8. Security monitoring tools
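
To make app metrics concrete, here is a minimal sketch of emitting a custom event (for example an invalid login) in the StatsD plain-text format over UDP. It uses only the standard library rather than a StatsD client package; the metric names are hypothetical, and the host and port assume a StatsD daemon listening locally on its conventional port 8125.

```python
import socket


def format_counter(name, value=1):
    """StatsD counter line, e.g. 'app.logins.invalid:1|c'."""
    return f"{name}:{value}|c"


def format_timer(name, millis):
    """StatsD timer line in milliseconds, e.g. 'app.request:120|ms'."""
    return f"{name}:{millis}|ms"


def send_metric(payload, host="127.0.0.1", port=8125):
    # UDP is fire-and-forget, so a missing daemon does not slow the app down.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))
```

A call such as `send_metric(format_counter("app.logins.invalid"))` would sit at the point in the application where the event happens.
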

Metrics
Avoid collecting too many metrics.

Logging

5 principles of logging

1. Do not collect logs that you will not use.
2. Retain log data for as long as regulatory authorities require.
3. Log all you can, but alert only when action is needed. Define log levels.
4. Application availability and security matter more than logging availability and security.
5. Logs change, in both format and content. Let everyone take ownership of their own logs and keep them in a centralized system.
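
Principle 3 can be sketched with Python's standard `logging` module: everything is logged, but only records at ERROR or above reach the alerting path. The `AlertHandler` class is hypothetical and just collects messages in a list; a real one would forward them to a paging or incident tool.

```python
import logging


class AlertHandler(logging.Handler):
    """Hypothetical alert sink: keeps only records that need human action."""

    def __init__(self):
        super().__init__(level=logging.ERROR)  # ignore anything below ERROR
        self.alerts = []

    def emit(self, record):
        self.alerts.append(self.format(record))


def build_logger(name="app"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # log all you can
    logger.propagate = False
    logger.handlers.clear()

    stream = logging.StreamHandler()  # full log stream, every level
    stream.setLevel(logging.DEBUG)
    alert = AlertHandler()            # alert only when action is needed

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    for handler in (stream, alert):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger, alert
```

With this setup, `logger.info(...)` lands only in the log stream, while `logger.error(...)` lands in both the stream and the alert list.
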

Books related to Logging

1. Logging and Log Management
2. The Practice of Cloud System Administration
3. Web Operations

Log Management Tools

1. Legacy tool : Splunk
2. Open source ELK stack = Elasticsearch + Logstash + Kibana
3. SaaS incident management tools : PagerDuty and VictorOps
4. Open source : flapjack.io
5. statuspage.io
6. Command dispatcher tools : Rundeck

In addition to monitoring, metrics, and logging, there are a few more tools for feedback : the incident command system, blameless postmortems, and transparent uptime.
