Course Outline

Foundations of Agentic Systems in Production

  • Agentic architectures: loops, tools, memory, and orchestration layers
  • Lifecycle of agents: development, deployment, and continuous operation
  • Challenges of production-scale agent management

Infrastructure and Deployment Models

  • Deploying agents in containerized and cloud environments
  • Scaling patterns: horizontal vs vertical scaling, concurrency, and throttling
  • Multi-agent orchestration and workload balancing

Monitoring and Observability

  • Key metrics: latency, success rate, memory usage, and agent call depth
  • Tracing agent activity and call graphs
  • Instrumenting observability using Prometheus, OpenTelemetry, and Grafana

Logging, Auditing, and Compliance

  • Centralized logging and structured event collection
  • Compliance and auditability in agentic workflows
  • Designing audit trails and replay mechanisms for debugging

Performance Tuning and Resource Optimization

  • Reducing inference overhead and optimizing agent orchestration cycles
  • Model caching and lightweight embeddings for faster retrieval
  • Load testing and stress scenarios for AI pipelines

Cost Control and Governance

  • Understanding agent cost drivers: API calls, memory, compute, and external integrations
  • Tracking agent-level costs and implementing chargeback models
  • Automation policies to prevent agent sprawl and idle resource consumption

CI/CD and Rollout Strategies for Agents

  • Integrating agent pipelines into CI/CD systems
  • Testing, versioning, and rollback strategies for iterative agent updates
  • Progressive rollouts and safe deployment mechanisms

Failure Recovery and Reliability Engineering

  • Designing for fault tolerance and graceful degradation
  • Retry, timeout, and circuit breaker patterns for agent reliability
  • Incident response and post-mortem frameworks for AI operations

Capstone Project

  • Build and deploy an agentic AI system with full monitoring and cost tracking
  • Simulate load, measure performance, and optimize resource usage
  • Present final architecture and monitoring dashboard to peers

Summary and Next Steps

Requirements

  • Strong understanding of MLOps and production machine learning systems
  • Experience with containerized deployments (Docker/Kubernetes)
  • Familiarity with cloud cost optimization and observability tools

Audience

  • MLOps engineers
  • Site Reliability Engineers (SREs)
  • Engineering managers overseeing AI infrastructure

Duration: 21 hours

Delivery Options

Private Group Training

Our identity is rooted in delivering exactly what our clients need.

  • Pre-course call with your trainer
  • Customisation of the learning experience to achieve your goals:
    • Bespoke outlines
    • Practical hands-on exercises containing data and scenarios recognisable to the learners
  • Training scheduled on a date of your choice
  • Delivered online, onsite/classroom, or hybrid by experts sharing real-world experience

Private group prices: RRP from £5,700 for online delivery, based on a group of 2 delegates, plus £1,800 per additional delegate (excluding any certification or exam costs). We recommend a maximum group size of 12 for most learning events.

Contact us for an exact quote and to hear about our latest promotions.

