Inference is the process of running a trained model to produce outputs, as opposed to training it. For agents, every interaction involves at least one inference call, so inference characteristics directly shape agent behavior and cost.
Considerations
- Latency requirements: agent loops chain many model calls, so per-call latency compounds
- Cost per request: driven largely by input and output token counts
- Hardware requirements: memory and compute needed to serve the model
- Batching strategies: grouping requests raises throughput, usually at some latency cost
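To make the cost consideration concrete, here is a minimal sketch of a per-request cost estimate. The prices and the function name are assumptions for illustration; real pricing varies by provider and model.

```python
# Hypothetical per-token prices; real pricing varies by provider and model.
PRICE_PER_1K_INPUT = 0.003   # USD per 1K input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.015  # USD per 1K output tokens (assumed)

def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one inference request."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# A typical agent turn: long context in, short reply out.
print(round(cost_per_request(8000, 500), 4))  # → 0.0315
```

Note the asymmetry: for agents, accumulated context (input tokens) often dominates cost, which is one reason context management matters.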
Optimization
- Model quantization: lower-precision weights (e.g., 8-bit or 4-bit) to cut memory use and speed up inference
- Speculative decoding: a small draft model proposes tokens that the large model verifies in parallel
- Caching: reuse earlier work, from KV caches to full-response caches
- Smaller models for simple tasks: route easy requests to cheaper, faster models
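The simplest form of response caching can be sketched in a few lines: memoize the inference call on its inputs so identical requests never hit the model twice. `run_model` here is a hypothetical stand-in for a real inference client.

```python
from functools import lru_cache

calls = 0  # counts how many times the "model" actually runs

def run_model(model: str, prompt: str) -> str:
    """Stand-in for a real inference call (hypothetical)."""
    global calls
    calls += 1
    return f"{model} response to: {prompt}"

@lru_cache(maxsize=1024)
def cached_generate(model: str, prompt: str) -> str:
    # Identical (model, prompt) pairs are served from the cache.
    return run_model(model, prompt)

cached_generate("small-model", "What is 2+2?")
cached_generate("small-model", "What is 2+2?")  # cache hit
print(calls)  # → 1
```

Exact-match caching like this only pays off for deterministic, repeated prompts; KV caching, by contrast, operates inside the model to reuse attention state across tokens and works even for novel requests.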