Course Outline

EXO Infrastructure as Code

  • Overview of EXO deployment patterns: single-node, multi-node, and RDMA clusters
  • Automating dependency installation (Xcode, uv, Node.js, Rust) with configuration management
  • Using Nix flakes for reproducible EXO builds and developer environments
  • Writing Ansible playbooks or shell scripts for unattended cluster provisioning

Reproducible Builds and CI Integration

  • Pinning dependencies and building the dashboard in CI pipelines
  • Running EXO smoke tests in GitHub Actions or GitLab CI runners
  • Creating golden images and snapshot-based rollback workflows for macOS and Linux VMs
  • Versioning custom model cards alongside application code

Cluster Discovery and Networking Automation

  • Configuring mDNS and static DNS for reliable libp2p node discovery
  • Automating network profile creation and Thunderbolt bridge management on macOS
  • Using custom namespaces (EXO_LIBP2P_NAMESPACE) to separate dev, staging, and prod clusters
  • Firewall rules and network segmentation for multi-tenant environments

Storage and Model Lifecycle Management

  • Designing EXO_MODELS_DIRS and EXO_MODELS_READ_ONLY_DIRS strategies
  • Mounting NFS or SAN shares as read-only model repositories for fast provisioning
  • Garbage collection of stale caches and versioned weight retention policies
  • Automating model pre-downloads and health checks before rolling updates

Monitoring and Alerting

  • Shipping EXO logs to centralized logging (ELK, Loki, or Splunk)
  • Building Grafana dashboards from EXO_TRACING_ENABLED output
  • Alerting on cluster membership changes, OOM events, and inference latency spikes
  • Correlating macmon hardware telemetry with model performance regressions

Update, Rollback, and Disaster Recovery

  • Staging EXO binary updates in a canary node before fleet-wide rollout
  • Model-level rollback: switching between quantized versions without re-downloading
  • Backing up and restoring cluster state, custom namespaces, and cached weights
  • Documenting recovery runbooks for total cluster rebuild scenarios

Security Hardening and Compliance

  • Applying TLS at the reverse proxy layer (nginx, traefik) for the dashboard and API
  • Implementing API rate limiting and IP whitelisting for EXO endpoints
  • Isolating clusters with VLANs and zero-trust network policies
  • Auditing access and maintaining an inventory of deployed models and versions

Requirements

  • Experience with DevOps practices (CI/CD, IaC, container orchestration)
  • Familiarity with macOS or Linux system administration and package management
  • Understanding of networking, DNS, and storage concepts

Audience

  • DevOps engineers
  • Infrastructure architects
  • SREs responsible for on-premises AI workloads

Duration: 21 hours

Custom Corporate Training

Training solutions designed exclusively for businesses.

  • Customised Content: We adapt the syllabus and practical exercises to the real goals and needs of your project.
  • Flexible Schedule: Dates and times adapted to your team's agenda.
  • Format: Online (live), In-company (at your offices), or Hybrid.

Investment

Price per private group for live online training: from £4,800 + VAT.

Contact us for an exact quote and to hear about our latest promotions.
