Kubernetes


1. Design
=========

API : Primitives (Building Blocks) to

1. deploy
2. maintain / manage
3. scale

containerized apps.

1.1 Pod
=======

* Scheduling unit
* Pod = 1+ co-located containers, their associated data volumes, and options for how the container(s) should run
* All containers inside a pod start in parallel
* A pod has a unique IP within the cluster.
* Can be managed via the Kubernetes API or by a controller, through the kubelet.
* A pod exposes a collective API as primitives to the kubelet.  https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#pod-v1-core
* ephemeral and disposable
* A pod provides a way to set env variables, mount storage, and feed other information to a container.
* Phases : Pending, Running, Succeeded, Failed (CrashLoopBackOff is a container-state reason, not a pod phase)
* A pod is like an implementation of the "composite container pattern" (see the sketch below)
** a pod can have zero or more sidecar containers; Istio adds one sidecar container to each pod.
** a pod can have zero or more ambassador containers; an ambassador proxies a local connection (to 127.0.0.1) towards the outside world.
** a pod can have zero or more adapter containers; an adapter standardizes the output.

Summary : A computer is a collection of resources: processing, memory, disk, and network interfaces. In K8s, the pod is the new computer.
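
A minimal sketch of the composite/sidecar pattern as a pod manifest (all names and images are illustrative assumptions, not from these notes):

apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar        # illustrative name
  labels:
    app: web
spec:
  containers:
  - name: web                   # main application container
    image: nginx:1.21
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  - name: log-tailer            # sidecar: reads the shared log volume
    image: busybox:1.36
    command: ["sh", "-c", "tail -F /var/log/app/access.log"]
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  volumes:
  - name: logs                  # emptyDir shared by both containers
    emptyDir: {}

Both containers share the pod's network namespace and the "logs" volume, which is what makes the sidecar/adapter patterns work.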

Pod Implementation

* "pause container" is a parent container for rest of the container within pod. 
* "pause container" gets IP address, within cluster. 
* Other containers share network namespace, ipc, pid namespace and access to storage with pause container. 
* Containers within pod can communicate using
- localhost (with port cordination amount containers)
- System V semaphone IPC
- POSIX shared memory etc. 
* "pause container" also reap all zombie processes created by child containers. 

1.2 Labels, Selectors and Namespaces
====================================

Labels

* Key-value pairs
* attached to objects such as pods and nodes
* grouping mechanism

Selectors

1. Equality-based selectors (=, ==, !=)
2. Set-based selectors (in, notin, exists)

Selectors come in two types (see the sketch below):

1. Label selectors
2. Field selectors
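
For illustration, a hedged sketch of both label-selector styles as they appear inside manifests (label keys and values are assumed examples):

# Equality-based selector, e.g. in a Service spec
selector:
  app: web
  tier: frontend

# Set-based selector, e.g. in a Deployment/ReplicaSet spec
selector:
  matchLabels:
    app: web
  matchExpressions:
  - key: tier
    operator: In                # also: NotIn, Exists, DoesNotExist
    values: [frontend, cache]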

Namespace

* Multiple virtual clusters backed by the same physical cluster.
* Used to divide cluster resources among multiple users via resource quotas.
* K8s starts with 3 initial namespaces:
1. "default"
2. "kube-system" for objects created by the K8s system
3. "kube-public" reserved for cluster usage; readable by anyone.
* A namespace is basically a non-overlapping set of K8s objects.
* Object names must be unique within a namespace.
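
A namespace is itself just another object; a minimal sketch (the name is an assumed example):

apiVersion: v1
kind: Namespace
metadata:
  name: team-a                  # objects created with metadata.namespace: team-a
                                # must have names unique within this namespace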


1.3 Controllers
===============

* Controllers are watch-loops.
* They manage a set of pods as per "Labels and Selectors".
* The reconciliation loop drives the cluster from its actual state toward the desired state, querying the API server to do so.
* Benefits
1. App reliability
2. Scaling
3. Load balancing

Examples: 

1. Replication controller: scales up and down, maintaining the correct number of pods. It facilitates horizontal scaling and ensures that Pods are resilient in case of host or application failures. If a container goes down or a host becomes unavailable, Pods are restarted on different hosts as necessary to maintain the target number of replicas. It has since been replaced by the Deployment controller and ReplicaSet.

2. Deployment controller: declarative updates (YAML file) for pods and ReplicaSets. It updates the PodTemplateSpec, so a new ReplicaSet is created with the new version of the pod; if the rollout is not OK, it rolls back to the old ReplicaSet. A Deployment ensures that resources such as (1) IP addresses and (2) storage are available, then deploys a ReplicaSet (see the sketch after this list).

ReplicaSet controller: it deploys and restarts pods until the requested number of replicas is running.

3. DaemonSet controller: runs one pod per node, so a specific pod can run on every node; "nodeSelector" can restrict which nodes.
4. Job controller
5. Endpoints controller: joins services and pods together.
6. Namespace controller
7. Service accounts and token controller: for access management.
8. Node controller: manages worker-node state.
9. StatefulSet: manages the deployment and scaling of a set of pods, and provides guarantees about the (1) ordering and (2) uniqueness of these pods. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its pods.
10. TTL controller: cleans up completed jobs.
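
A minimal Deployment sketch tying the Deployment and ReplicaSet controllers together (name, labels, and image are assumed examples); applying it creates a ReplicaSet, which in turn maintains the pods:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                     # illustrative name
spec:
  replicas: 3                   # desired number of pod replicas
  selector:
    matchLabels:
      app: web
  template:                     # PodTemplateSpec; editing it triggers a
    metadata:                   # rollout to a new ReplicaSet
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.21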

* Kinds of controllers:
- ReplicaSets
- Deployments
- DaemonSets
- Jobs
- Services



1.4 Services
============

* A set of pods working together, e.g. one tier of a multi-tier application.
* The set is defined by labels and selectors.
* Kubernetes discovers pods based on services.
* A service is like a flexible and scalable agent which connects resources together.
* A service round-robins requests between the pods; it is a load balancer and front-end for a collection of Pods.
* Services are the external point of contact for container workloads, accessible via an internal DNS server.
* A Service's IP address remains stable and can be exposed to the outside world via an Ingress. It abstracts away the number of Pods, as well as the virtual IP address of each Pod, which can change as Pods are scheduled to different cluster hosts.
* A service handles
- access policies for inbound requests,
- resource control,
- security.
* A service uses selectors (see the sketch below)
- equality-based: =, ==, !=
- set-based: in, notin, exists
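
A minimal Service sketch (name and labels are assumed examples); it selects pods by label and round-robins traffic across them:

apiVersion: v1
kind: Service
metadata:
  name: web-svc                 # stable DNS name inside the cluster
spec:
  type: ClusterIP               # default; NodePort/LoadBalancer expose it externally
  selector:
    app: web                    # equality-based selector over pod labels
  ports:
  - port: 80                    # the Service's own port
    targetPort: 80              # container port on the selected pods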

Service Discovery

* NodePort
* Load Balancer
* Ingress

NodePort and LoadBalancer are Service types; an Ingress is a separate resource that routes external traffic to Services.

Load Balancers:

* HAProxy, 
* Traefik, 
* F5
* nginx
* Cisco
* Avi

2. Architecture
===============

* Master-slave

The master node is controlled via kubectl, the CLI for K8s.
kubectl reads a kubeconfig file that stores server information and the authentication information needed to access the API server.
For production, use a cluster of at least 3 nodes.

A master node in production has add-ons like
- a DNS service
- cluster-level logging, e.g. via the 3rd-party tool Fluentd, which filters, buffers, and routes log messages
- resource monitoring


2.1 C-plane
===========

2.1.1 etcd
==========

* B+tree key-value data store (distributed)
* Based on the Raft consensus algorithm: https://web.stanford.edu/~ouster/cgi-bin/papers/raft-atc14
* Rather than changing an entry in place, it appends the modified entry at the end and marks the previous copy for future removal; removal happens via a compaction process.
* Runs as a single DB or as a leader plus follower DBs; the members communicate with each other to determine the leader.
* Holds the configuration data of the cluster, e.g. ConfigMaps.
* Represents the overall state of the cluster.
* Other components monitor changes at etcd, since etcd provides a reliable watch query.
* It stores: job scheduling info, pod details, storage information, cluster state, subnets, ConfigMaps, Secrets, etc.
* It can also store ThirdPartyResources (since replaced by CustomResourceDefinitions). Suppose there is a 3rd-party resource named "cron-tab.alpha.ianlewis.org" with version v1 in the default namespace; the corresponding custom controller can access it with an HTTP GET to

http://localhost:8001/apis/alpha.ianlewis.org/v1/namespaces/default/crontabs
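
For reference, a hedged sketch of what such a (since-deprecated) ThirdPartyResource object looked like; the name comes from the note above, the rest is an assumption based on the old TPR API:

apiVersion: extensions/v1beta1
kind: ThirdPartyResource
metadata:
  name: cron-tab.alpha.ianlewis.org   # <kind>.<domain>; yields kind CronTab
description: "A crontab-style custom resource"
versions:
- name: v1                            # served at .../apis/alpha.ianlewis.org/v1/...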

2.1.2 API server
================

* JSON over HTTP
* Validates REST requests and updates the API objects' state in etcd
* Performs CRUD operations on K8s object data in etcd
* So clients can configure workloads and containers across the worker nodes

2.1.3 Scheduler
===============

* pluggable
* matches resource "supply" to workload "demand"
* selects a node to run each pod

* inputs (see the sketch at the end of this section)
- resource availability
- resource utilization
- resource requirement
- QoS
- affinity requirements
- anti-affinity requirements
- data locality 
- policy
- user specification 

* supports the use of user-defined custom schedulers

* Workload patterns
- Replica Sets and Deployments
- StatefulSets for services (formerly named PetSets)
- DaemonSets
- Jobs (run to completion) 
- Cron Jobs

* "pod start" and "pod stop" hook
* "Reschedular" for guaranteed scheduling 

2.1.4 controller manager
========================
* A controller is a daemon that constantly compares the desired state of the cluster (as recorded in etcd) with the actual state, and then takes the necessary corrective action: an Observe - Diff - Act cycle.
* The controller manager manages the different non-terminating control loops which regulate the state of the Kubernetes cluster.
* Controllers use the Watch API for add/delete/modify events on K8s objects at the API server.
* The controller manager should be reachable from the K8s worker nodes of the cluster.
* It is the process that runs (1) the DaemonSet controller, (2) the Replication controller, and many more, as per section 1.3.
* It communicates with the API server to create, update, and delete (1) pods, (2) service endpoints, (3) etc.

* "ReplicaSets" are a "low-level" type in K8s; "DaemonSets" and "Deployments" are "high-level" types.

The cloud-specific loops of kube-controller-manager have been split out into cloud-controller-manager, which interacts with 3rd-party tools for cluster management and reporting. Each kubelet must then be started with the --cloud-provider=external setting passed to the binary.


2.2 Kubernetes Node (worker node OR minion node)
================================================

2.2.1 : CRI Container Runtime Interface (container tooling)
===========================================================

- cri-o: OCI conformant runtimes.
- rktlet: the rkt container runtime.
- frakti: hypervisor-based container runtimes.

- docker CRI shim.

2.2.2 : CNI Container Network Interface
=======================================

The container runtime offloads IP assignment to CNI. CNI has various plugins (a config sketch follows this list):

- Loopback
- Bridge
- MACvlan
- IPvlan
- 3rd-party plugins
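
A hedged sketch of a bridge-plugin CNI config (values are assumed examples). On disk it normally lives as JSON under /etc/cni/net.d/; it is shown in YAML form here, which carries the same structure:

cniVersion: "0.3.1"
name: mynet
type: bridge                    # the bridge plugin from the list above
bridge: cni0                    # Linux bridge the pods attach to
isGateway: true
ipMasq: true
ipam:
  type: host-local              # host-local IPAM hands out IPs from this subnet
  subnet: 10.22.0.0/16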

2.2.3 Kubelet (K8S Node Agent) 
==============================

* Heartbeats the health of the node.
* Communicates with the API server to see whether any pods are to be run on this node.
* If yes, it executes the pod's containers via the container engine.
* Mounts and runs pod Secrets, ConfigMaps, and volumes. Volumes are scoped to the pod.
* Reports pod and node state back to the API server (/ master node) after health checks.
* It consumes a PodSpec, a YAML file that describes a pod, delivered via the API server, an HTTP endpoint, or a file (see the static-pod sketch below).
* It is effectively a "pod controller".
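
A hedged sketch of the "file" delivery path: a static-pod manifest dropped into the kubelet's manifest directory (path and names are assumed examples); the kubelet runs it directly, bypassing the scheduler:

# e.g. saved as /etc/kubernetes/manifests/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx:1.21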

2.2.4 Kube-proxy
================

Kube-proxy listens to the API server for each service-endpoint creation and deletion, and sets up routes accordingly.

* Network proxy + load balancer
* Routes to containers based on IP + port
* Adds iptables rules to connect the node IP address and the cluster IP address.
* Runs as a process on every worker node.
* 3 modes
1. User-space mode: monitors Services and Endpoints, using a random high-numbered port to proxy traffic.
2. iptables mode
3. ipvs mode: it will replace iptables.

The master node communicates with the kubelet, and the end user communicates with kube-proxy.

2.2.5 cAdvisor
==============

Agent to collect resource usage. 

2.2.6 supervisord
=================

Restarts components as and when needed.
It monitors the kubelet and other docker processes.

2.2.7 kube-dns
==============

It resolves Kubernetes service DNS names to IP addresses. 

* High Availability HAProxy auto configuration and auto service discovery for Kubernetes. https://github.com/AdoHe/kube2haproxy 


To get started : kubernetes.io

2.3 Kubernetes architecture in a nutshell
=========================================

The K8s system consists of an unbounded number of independent, asynchronous control loops.

These control loops read and write from/to a schematized resource store, the source of truth.

That schematized resource store is the etcd database. It stores the K8s objects.

The etcd database is accessed by the API server only; the API server is like a wrapper around the K8s database.

Change events on K8s objects flow (through the API server's watch mechanism) to informers; an informer resides in each controller.

This model has proven to be very resilient, evolvable, and extensible.

A pod's life has no cycle; it is just binary: scheduled or not scheduled. The kubelet on the target worker node gets a notification when a pod is scheduled on that node.

The kubelet assigns the pod-creation task to the container runtime engine.

Task of container runtime engine (e.g. Docker engine):
- “container” image is loaded 
- network is assigned (using CNI plugin)
- configs are mapped 
- entry point is called

K8s Installation
================

kubeadm is a tool to install K8s on any cloud.

1. Install docker.
2. Run 'kubeadm init'. Get the join token.
3. On each worker node run 'kubeadm join' along with the join token, so all nodes join the cluster.
4. Pod n/w requirements:
4.1 All containers can communicate with all containers, without NAT.
4.2 All nodes can communicate with all containers, without NAT.
4.3 The IP that a container sees for itself is the same IP that everyone else sees for that container.


kops installs a K8s cluster on AWS; Azure and GCP have similar tools.

Logging and Monitoring
======================

Logstash, Fluentd, or Filebeat, running in a pod, ship the logs to Elasticsearch / Kibana.



* cAdvisor collects container usage statistics; it runs per node.
* Heapster runs as a pod in the cluster. It collects data from the kubelet on each node; the kubelet in turn collects from cAdvisor. Heapster groups all information by pod, with the relevant labels.
* Prometheus is a framework for application metrics; it is a time-series DB.

All of the above 3 tools send data to Grafana for visualization.


Enterprise tools : Datadog, Riverbed

Authentication and Authorization
================================

Users:
1. Normal users: users in LDAP or SSO.
2. Service accounts
* Managed by the Kube API server
* Bound to a specific namespace
* Their credentials are managed in Secrets

A user record carries:
1. Username
2. UID
3. Group: used for authorization.
4. Extra fields

Popular authentication

1. x509 client certs. This is the default; the CA lives within the K8s cluster.

By default, the main Kubernetes API server is configured with --client-ca-file=/etc/kubernetes/ssl/ca.pem. API servers use this CA certificate to verify client authentication.

How to generate Client certificate (authenticated by API server CA)?
--------------------------------------------------------------------
1.1. Create a private key for your user. In this example, we will name the file manoj.key:
              openssl genrsa -out manoj.key 2048

1.2. Create a certificate signing request manoj.csr using the private key you just created (manoj.key in this example). Make sure you specify your username and group in the -subj section (CN is for the username and O for the group). As previously mentioned, we will use manoj as the name and bitnami as the group:
              openssl req -new -key manoj.key -out manoj.csr -subj "/CN=manoj/O=bitnami"

1.3. Locate your Kubernetes cluster certificate authority (CA). This will be responsible for approving the request and generating the necessary certificate to access the cluster API. Its location is normally /etc/kubernetes/ssl/. Check that the files ca.pem and ca-key.pem exist in that location.

1.4. Generate the final certificate manoj.crt by approving the certificate signing request, manoj.csr, you made earlier. Make sure you substitute the CA location (/etc/kubernetes/ssl/ here) with the location of your own cluster CA. In this example, the certificate will be valid for 500 days:

openssl x509 -req -in manoj.csr -CA /etc/kubernetes/ssl/ca.pem -CAkey /etc/kubernetes/ssl/ca-key.pem -CAcreateserial -out manoj.crt -days 500

2. static token files (bearer token authentication)
3. OpenID connect
4. Webhook mode
5. basic authentication

Popular authorization

1. ABAC: Attribute-Based Access Control. Access is based on (policy) attributes of
- users
- resources
- objects
- environments, etc.
2. RBAC: Role-Based Access Control. Verbs include list, get, and watch. The important objects, defined with YAML, are: Roles, ClusterRoles, RoleBindings, and ClusterRoleBindings.
3. Webhook


RBAC/ABAC can only be applied to users already defined/added via the authentication process.


A RoleBinding binds a Role to (1) a user, (2) a group, or (3) service accounts (see the sketch below).
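
A minimal RBAC sketch using the "manoj" user from the certificate example above (role name and namespace are assumed examples):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader              # illustrative role name
rules:
- apiGroups: [""]               # "" means the core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: default
  name: read-pods
subjects:
- kind: User
  name: manoj                   # CN from the client certificate
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io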