
JOB DESCRIPTION


Responsibilities:

  • Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity.
  • Design and implement monitoring systems including availability, latency and other salient metrics.
  • Assist in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of millions of external customers and high-traffic internal workloads.
  • Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
  • Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
  • Build and maintain cost optimization systems for large-scale AI infrastructure, focusing on accelerator (GPU/TPU/Trainium) utilization and efficiency.

You may be a good fit if you:

  • Have extensive experience with distributed systems observability and monitoring at scale
  • Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
  • Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
  • Are comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
  • Have experience with chaos engineering and systematic resilience testing
  • Can effectively bridge the gap between ML engineers and infrastructure teams
  • Have excellent communication skills

Strong candidates may also:

  • Have experience operating large-scale model training infrastructure or serving infrastructure (>1000 GPUs)
  • Have experience with one or more ML hardware accelerators (e.g., GPUs, TPUs, Trainium)
  • Understand ML-specific networking optimizations such as RDMA and InfiniBand
  • Have expertise in AI-specific observability tools and frameworks
  • Understand ML model deployment strategies and their reliability implications
  • Have contributed to open-source infrastructure or ML tooling


Salary

Competitive

Monthly based

Location

London, England, United Kingdom

Job Overview
Job Posted:
1 week ago
Job Expires:
3w 2d
Job Type
Full Time
Job Role
Engineer
Education
Bachelor Degree
Experience
3+ Years
Slots...
1
