K8s GW API


Examples: 

Istio, Kong, Envoy Gateway, Gloo, Traefik, kgateway, Contour, NGINX Gateway Fabric, and many more as per https://gateway-api.sigs.k8s.io/implementations/#gateway-controller-implementation-status

Protocols: gRPC, HTTP/2, and WebSockets

The structure of a Kubernetes Custom Resource Definition (CRD) manifest is referred to as an API, because it defines the structure of the corresponding resource in the Kubernetes control plane

Migration from ingress https://gateway-api.sigs.k8s.io/guides/migrating-from-ingress/#migrating-from-ingress

extension points

Ingress has 2 extension points

1. annotations

2. the resource backend (pointing a backend at a custom resource)


primary extension points in GW API:


1. External references


1.1 HTTP Route Filter

1.2 Backend Object Reference

1.3 Secret Object Reference

Here the Gateway API resource points at an external resource ('external reference')
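As an illustration, a Secret Object Reference is what a Gateway listener uses to point at a TLS certificate Secret. A minimal sketch (all names below are placeholders):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
spec:
  gatewayClassName: example-class
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      certificateRefs:      # SecretObjectReference
      - kind: Secret
        name: example-com-tls
```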

2. Custom implementations


e.g. the RegularExpression type of 'HTTP Path Match', whose matching behaviour is implementation-specific


3. Policies

A "Policy Attachment" is a specific type of metaresource

Here, the Policy resource references the Gateway API resource (the reverse direction of external references)

  • GW API is not API GW
  • GAMMA (Gateway API for Mesh Management and Administration) initiative
  • A 'waypoint proxy' is a proxy server deployed inside the mesh for E-W traffic. It can serve (1) all destination services in a namespace, (2) a few selected service(s) within a namespace, or (3) destination services from multiple namespaces. 
3.1 Traffic Policy
* It has a transformation field to manipulate headers and payload, for both request and response. Templates are written in the Inja templating language

* It has an ai field for prompt enrichment. 

  ai:
    promptEnrichment:
      prepend:
      - role: SYSTEM
        content: "Parse the unstructured text into CSV format."
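The snippet above, embedded in a complete TrafficPolicy and attached to a route via targetRefs, might look like the sketch below. The apiVersion and route name are assumptions; check the kgateway docs for your version.

```yaml
apiVersion: gateway.kgateway.dev/v1alpha1
kind: TrafficPolicy
metadata:
  name: openai-prompt-guard
spec:
  targetRefs:                 # attach the policy to an HTTPRoute
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai-route        # hypothetical route name
  ai:
    promptEnrichment:
      prepend:
      - role: SYSTEM
        content: "Parse the unstructured text into CSV format."
```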

1. GatewayClass 

- It is cluster-scoped, so it has no namespace

- Annotations on the GatewayClass are used for vendor-specific configuration

- It defines controller capabilities
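A minimal GatewayClass might look like this (the controllerName value is a placeholder; each implementation documents its own):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: example-class         # cluster-scoped: no namespace field
spec:
  controllerName: example.com/gateway-controller
```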

2. Gateway

- Each Gateway defines one or more listeners, which are the ingress points to the cluster

- You can control which services can be connected to this listener (allowedRoutes) by way of their namespace — this defaults to the same namespace as the Gateway 

- Advanced features like 

-- request mirroring, 

-- direct response injection, 

-- fine-grained traffic metrics, and

-- traffic split

- In Istio APIs, a Gateway configures an existing gateway Deployment/Service that has been deployed. In the Gateway APIs, the Gateway resource both configures and deploys a gateway

- One can attach an HPA and a PodDisruptionBudget to the gateway deployment. 
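A Gateway with one HTTP listener, restricting attached routes to its own namespace, could be sketched as (names are placeholders):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gateway
  namespace: infra
spec:
  gatewayClassName: example-class
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    hostname: "*.example.com"
    allowedRoutes:
      namespaces:
        from: Same      # default: only Routes in this namespace may attach
```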

3. HTTP Route: 

- Matches any combination of hostname, path, header values, HTTP method, and query parameters:

  • Paths (e.g., /headers, /status/*)
  • Headers (e.g., User-Agent: Mobile)
  • Query Parameters (e.g., ?version=beta)
  • Methods (e.g., GET, POST)
You can also define multiple matching criteria and even combine them with an AND or OR operator.

- The hostname (optional) on the HTTP Route must match the hostname at Gateway->Listener->hostname

- The Gateway to use is referenced by name and namespace in parentRefs

- The backendRefs define the service to route the request to for this match

- advanced pattern matching and filtering on arbitrary headers as well as paths.
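Putting the pieces together, an HTTP Route that combines a path match AND a header match within a single match entry might look like this (hostnames and names are placeholders):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: beta-route
spec:
  parentRefs:               # the Gateway to attach to
  - name: shared-gateway
    namespace: infra
  hostnames:
  - app.example.com         # must match a listener hostname
  rules:
  - matches:
    - path:                 # both conditions must hold (AND)
        type: PathPrefix
        value: /status
      headers:
      - name: User-Agent
        value: Mobile
    backendRefs:
    - name: status-svc
      port: 8080
```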

1. RequestRedirect : E.g. Redirect HTTP traffic to HTTPS

2. URLRewrite

3. <Request|Response>HeaderModifier

4. RequestMirror

5. CORS

6. ExtensionRef for custom filter. E.g. DirectResponse

    filters:
    - type: ExtensionRef
      extensionRef:
        name: direct-response
        group: gateway.kgateway.dev
        kind: DirectResponse
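Filter 1 above (redirecting HTTP traffic to HTTPS) can be sketched as an HTTPRoute rule:

```yaml
rules:
- filters:
  - type: RequestRedirect
    requestRedirect:
      scheme: https        # rewrite the scheme
      statusCode: 301      # permanent redirect
```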

- In the Istio VirtualService, all protocols are configured within a single resource. In the Gateway APIs, each protocol type has its own resource, such as HTTPRoute and TCPRoute.

- Traffic splitting is done by specifying multiple backendRefs entries, each with a weight
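For example, a 90/10 canary split between two backends (service names are placeholders):

```yaml
backendRefs:
- name: app-v1
  port: 8080
  weight: 90      # 90% of traffic
- name: app-v2
  port: 8080
  weight: 10      # 10% of traffic
```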

- timeout, retry, sessionPersistence. Session persistence (i.e. sticky sessions or strong session affinity) ensures that a client's requests are consistently routed to the same backend instance for the duration of a session, based on a cookie or a header
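A sketch of all three on one HTTPRoute rule; retry and sessionPersistence are experimental-channel fields, so verify support in your implementation and version:

```yaml
rules:
- timeouts:
    request: 10s            # fail the request after 10 seconds
  retry:                    # experimental
    codes: [503]
    attempts: 3
    backoff: 1s
  sessionPersistence:       # experimental; cookie-based stickiness
    sessionName: session-a
    type: Cookie
  backendRefs:
  - name: app-svc
    port: 8080
```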

- Route and Gateway can be in different namespaces. If the Gateway is defined with

    allowedRoutes:
      namespaces:
        from: Same

then the Route and the Gateway must be in the same namespace. We can also allow a group of namespaces, selected by label, using a selector on the Gateway resource: 

    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          matchLabels:
            self-serve-ingress: "true"

* 4. TLS Route

5. GRPCRoute

* 6. TCPRoute

* = not yet v1 (GA)

Details: https://gateway-api.sigs.k8s.io/reference/spec/


If you are using a service mesh, it would be highly desirable to use the same API resources to configure both ingress traffic routing and internal traffic, similar to the way Istio uses VirtualService to configure route rules for both. Fortunately, the Kubernetes Gateway API is working to add this support. Although not as mature as the Gateway API for ingress traffic, an effort known as the Gateway API for Mesh Management and Administration (GAMMA) initiative is underway to make this a reality and Istio intends to make Gateway API the default API for all of its traffic management in the future.

https://gateway-api.sigs.k8s.io/mesh/


The Gateway controller handles north-south traffic; the mesh controller handles east-west traffic

7. ReferenceGrant: for cross-namespace references. 
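For example, to let HTTPRoutes in one namespace reference Services in another (namespace and route names are placeholders):

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-routes-to-backend
  namespace: backend-ns      # created in the *target* namespace
spec:
  from:                      # who may reference
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    namespace: frontend-ns
  to:                        # what they may reference
  - group: ""
    kind: Service
```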

8. Inference Extension

K8s offers the following mechanisms to optimize GPU usage:

- time slicing, 

- Multi-Instance GPU (MIG) partitioning, 

- virtual GPUs, and 

- NVIDIA MPS 

for concurrent processing.

Effective GPU utilization depends not only on

- hardware allocation, but also on

- how inference requests are routed across model-serving instances, and 

- how inference requests are load-balanced across model-serving instances. 

Simple load-balancing strategies often fall short in handling AI workloads effectively, leading to suboptimal GPU usage and increased latency.

Inference requests vs. traditional web traffic

- They often take much longer to process: sometimes several seconds (or even minutes!) rather than milliseconds 

- They have significantly larger payloads (e.g. with RAG, multi-turn chats, etc.). A single request can consume an entire GPU, making scheduling decisions far more impactful than those for standard API workloads; requests may need to queue while others are being processed.

AI Models are stateful

- They maintain in-memory caches, such as the KV cache for prompt tokens

- They load fine-tuned adapters like LoRA to customize responses for a specific user/organisation. 

So routing decisions are based on

- current state (in-memory caches, adapters)

- available memory, and 

- request queue depth.

So inference-aware routing is achieved through

8.1 Inference Model

- maps a user-facing model name to backend model(s)

- traffic splitting between fine-tuned adapters

- priority based on real-time interaction vs. best-effort batch jobs
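A sketch of an Inference Model from the Gateway API Inference Extension; this is an alpha API, so the apiVersion, field names, and criticality values below are assumptions that may change between releases:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot           # user-facing model name
  criticality: Critical        # real-time vs. best-effort priority
  poolRef:
    name: llm-pool             # links to an InferencePool
  targetModels:                # split between fine-tuned adapters
  - name: llama-3-base
    weight: 80
  - name: llama-3-lora-support
    weight: 20
```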

8.2 Inference Pool

- It is for platform operators managing model-serving infrastructure.

- a group of model-serving instances 

- specialized backend service for AI workloads.

- It manages

-- inference-aware endpoint selection, 

-- intelligent routing decisions based on real-time metrics such as 

--- request queue depth and 

--- GPU memory availability.

* The Inference Pool is referenced from HTTP Route->backendRefs

* The Inference Model has a poolRef to link it with an Inference Pool

The Inference Pool has an extensionRef (EPP = Endpoint Picker). If the Inference Pool is named xyz, then the extensionRef is "xyz-endpoint-picker". It is similar to a K8s Service, as it also has a selector and a target port
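A sketch of an Inference Pool named xyz with its Endpoint Picker reference; again an alpha API, so field names are assumptions to be checked against your release:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: xyz
spec:
  selector:                    # like a Service selector
    app: vllm-llama-3
  targetPortNumber: 8000       # like a Service target port
  extensionRef:
    name: xyz-endpoint-picker  # the EPP, named "<pool>-endpoint-picker"
```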

9. DirectResponse

10. Backend

For external endpoints (destinations outside the cluster, e.g. external hostnames or static IPs)

