Course
digicode: MLPERF
ML Performance Engineering: Inference Optimization & Deployment
Take your deep learning models to the next level! Optimize inference with TensorRT and ONNX for maximum speed. Learn professional model tuning, live monitoring, and cost-effective deployment for cloud and edge systems.
Duration
2 days
Price
1'850.–
Course documents
Digital courseware & code repositories
Course facts
- Performance Analysis: Systematically profiling ML pipelines and distinguishing compute-bound from memory-bound bottlenecks
- Inference Acceleration: Implementing techniques such as layer fusion and precision tuning with NVIDIA TensorRT and ONNX for maximum speed
- Efficient Resource Management: Applying quantization (INT8/FP16) to significantly reduce VRAM consumption without compromising model accuracy
- Production-Ready Pipelines: Developing optimized data loading strategies with NVIDIA DALI
- Stage-Parallel Execution: Designing pipelines that run multiple models in series and preventing latency spikes through efficient memory management and asynchronous processing
- Automated Performance Testing: Building robust CI/CD pipelines for ML models that capture variation in execution times, define baselines, and reliably detect statistically significant performance regressions
- Scalable Deployment: Reducing cost and hardware footprint through dynamic batching and monitoring during operation
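The INT8/FP16 quantization mentioned above can be illustrated without any framework. The following is a minimal, framework-free sketch of symmetric INT8 quantization; in practice TensorRT or ONNX Runtime performs this via calibration, and all names here are illustrative.

```python
# Minimal sketch of symmetric INT8 post-training quantization.
# Framework-free illustration; real deployments use TensorRT or
# ONNX Runtime calibration. All names here are illustrative.

def quantize_int8(values):
    """Map floats to int8 using a single symmetric scale factor."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0                      # int8 range: -127..127
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return [x * scale for x in q]

weights = [0.02, -1.3, 0.7, 1.27, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(scale, 5), round(max_err, 5))
```

The same scale-and-round idea underlies per-tensor and per-channel quantization schemes; the course covers how engines pick these scales from calibration data.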
This MLOps course offers a technical deep dive into the optimization of deep learning models.
Key topics:
- Frameworks & Formats: A Deep Dive into ONNX and the NVIDIA Ecosystem
- Inference Engines: Integrating NVIDIA TensorRT and Building Optimized Engines
- Data Loading: Accelerating Pipelines with NVIDIA DALI
- Techniques: Quantization, Layer Fusion, and Precision Tuning (Note: The core concepts covered are directly applicable to LLM inference)
- Infrastructure & Scheduling: GPU vs. CPU, CUDA fundamentals, and static vs. dynamic batching to handle variable API workloads in production
- End-to-End Pipelines: Orchestrating cascaded models (multi-model inference), optimizing pre- and post-processing, and avoiding CPU/GPU bottlenecks
- Deployment & Edge AI: Architectural differences between cloud GPUs and resource-constrained edge devices, containerization (Docker), and API integration (e.g., REST via Flask/FastAPI)
- Testing & CI/CD: Local vs. automated benchmarks. Strategies for dealing with hardware noise, warm-ups, and «flaky» performance tests
- Readiness & Observability: Profiling, performance analysis, and setting up live monitoring for inference during operation (e.g., with metrics via Grafana)
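The static vs. dynamic batching topic above can be sketched with nothing but the standard library: requests are queued, and a worker fuses them into one forward pass once the batch is full or a short wait expires. This is an illustrative sketch only (names like `batch_worker` and the 5 ms window are assumptions); production systems typically rely on a server-side batcher such as the one built into NVIDIA Triton.

```python
# Illustrative sketch of server-side dynamic batching with asyncio.
# Names and parameters (max_batch, max_wait) are assumptions, not a
# real serving API; Triton's dynamic batcher does this in production.
import asyncio

async def batch_worker(queue, run_model, max_batch=8, max_wait=0.005):
    """Collect requests until the batch is full or the wait window closes."""
    while True:
        first = await queue.get()            # block for the first request
        batch = [first]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + max_wait
        while len(batch) < max_batch:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        inputs = [x for x, _ in batch]
        for (_, fut), out in zip(batch, run_model(inputs)):
            fut.set_result(out)              # one fused pass, fan out results

async def infer(queue, x):
    """Client side: enqueue a request and await its individual result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    model = lambda xs: [x * 2 for x in xs]   # stand-in for a GPU model
    worker = asyncio.create_task(batch_worker(queue, model))
    results = await asyncio.gather(*(infer(queue, i) for i in range(10)))
    worker.cancel()
    return results

print(asyncio.run(main()))
```

The trade-off discussed in the course is visible in the two knobs: a larger `max_batch` improves GPU throughput, while a shorter `max_wait` caps the latency added by waiting for stragglers.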
- Practice-oriented mix: The course combines theoretical concepts with intensive hands-on workshops
- Hands-on Lab: Participants work directly in Python environments on real-world optimization challenges (e.g., optimizing a computer vision model for edge or cloud hardware)
- Case Studies: Discussion of best practices and pitfalls from the trainer’s real-world deployment scenarios
- Interactive Benchmarking: Live profiling of models so that the effect of each optimization step is immediately measurable
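The kind of benchmarking practiced in the lab can be sketched with the standard library alone: discard warm-up iterations, collect per-run latencies, and compare a percentile against a stored baseline. The function names, the p95 metric, and the 10% tolerance are illustrative choices, not the course's prescribed setup.

```python
# Stdlib-only sketch of a latency benchmark with warm-up runs and a
# percentile-based regression check; names and thresholds are illustrative.
import statistics
import time

def benchmark(fn, warmup=5, runs=50):
    """Return per-run latencies in ms, discarding warm-up iterations."""
    for _ in range(warmup):                  # warm caches, JITs, GPU clocks
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return samples

def p95(samples):
    """95th-percentile latency: the last of 19 cut points for n=20."""
    return statistics.quantiles(samples, n=20)[-1]

def regressed(samples, baseline_p95_ms, tolerance=0.10):
    """Flag a regression if p95 exceeds the baseline by more than 10%."""
    return p95(samples) > baseline_p95_ms * (1 + tolerance)

lat = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"p95 = {p95(lat):.3f} ms")
```

Using a high percentile instead of the mean makes the check robust to the hardware noise and outliers that the course's «flaky» performance-test strategies address.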
ML Engineers, AI Engineers, Software Architects, DevOps Engineers, Backend Developers, and Data Scientists with a focus on deployment.
- Confident use of Python
- Experience with deep learning frameworks (focus on PyTorch)
- Basic understanding of ML model architectures
- Basic understanding of asynchronous processing
- Basic knowledge of Linux
- Basic understanding of REST APIs and containerization (Docker)