ML Inferencing pipeline
NIM (NVIDIA Inference Microservices) is an off-the-shelf inferencing framework.
NIM is about deciding which model profile to use, based on: the number of GPUs, the GPU type, the performance criterion (throughput vs. latency), and the floating-point precision library. NIM can autodetect the hardware.
A GenAI application with RAG is made up of many services.
The NIM Operator deploys a RAG application through custom resources (CRs):
1. NIMCache (backed by a PVC)
2. NIMService
3. NIMPipeline, so that all the services can scale up together.
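As an illustration, a NIMService CR might look roughly like the sketch below. The field names and apiVersion here are assumptions for illustration, not a verified schema; check the NIM Operator CRDs for the real one.

```yaml
# Hypothetical NIMService CR -- field names are illustrative assumptions.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llm-nim
spec:
  nimCache:
    name: llm-nim-cache   # a NIMCache CR backed by a PVC
  replicas: 1
```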
NIM monitoring and autoscaling use Prometheus. Key metrics:
1. Hardware (GPU) utilization
2. Inter-token latency
3. Time to first token
4. Requests per second
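The two latency metrics above can be computed from token arrival timestamps. A minimal sketch, assuming we record wall-clock times per streamed token (the function name and numbers are illustrative, not part of any NIM API):

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute time-to-first-token and mean inter-token latency (seconds)."""
    ttft = token_times[0] - request_start
    # Gaps between consecutive token arrivals.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "itl": itl}

# Example: request at t=0, tokens arriving at 0.25 s, 0.30 s, 0.35 s, 0.45 s.
m = latency_metrics(0.0, [0.25, 0.30, 0.35, 0.45])
```

These are exactly the per-request numbers an SLA (and the autoscaler below it) would be stated against.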
Monitoring of NIM
SLA targets, e.g. a 2-second response time or 15 concurrent chat users, are the inputs for autoscaling.
The NIM monitoring operator chooses the relevant metrics from the many metrics NIM exposes.
Autoscaling
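A minimal sketch of an HPA-style scaling decision, turning an observed metric and an SLA target into a desired replica count. The ratio-based rule and all numbers here are illustrative assumptions, not the NIM Operator's actual policy:

```python
import math

def desired_replicas(current: int, observed_itl: float, target_itl: float,
                     min_r: int = 1, max_r: int = 8) -> int:
    """HPA-style ratio scaling: replicas grow with observed/target latency."""
    ratio = observed_itl / target_itl
    return max(min_r, min(max_r, math.ceil(current * ratio)))

# If inter-token latency is 0.09 s against a 0.05 s SLA with 2 replicas,
# the service is scaled up; if latency is at or under target, it holds or shrinks.
desired_replicas(2, 0.09, 0.05)
```

The `max_r` cap mirrors the usual HPA guard against runaway scale-ups when a metric spikes.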
In the sample chat application, a Milvus vector database is needed, and the RAG playground is the frontend service.