
Autonomous Vehicle Company Wayve Ends GPU Scheduling ‘Horror’

Run:AI
Analytics & Modeling - Machine Learning
Application Infrastructure & Middleware - Data Exchange & Integration
Automotive
Discrete Manufacturing
Product Research & Development
Autonomous Transport Systems
Machine Condition Monitoring
Cloud Planning, Design & Implementation Services
Data Science Services
Wayve, a London-based company developing artificial intelligence software for self-driving cars, faced a significant challenge with its GPU resources. Its Fleet Learning Loop, a continuous cycle of data collection, curation, model training, re-simulation, and licensing of models before deployment into the fleet, consumed a large share of GPU capacity. Yet although nearly 100 percent of GPU resources were allocated to researchers, less than 45 percent were actually utilized. Because GPUs were statically assigned to individual researchers, a researcher's idle GPUs could not be used by anyone else. This created the illusion that GPUs for model training were at capacity even as many sat idle.
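The gap between allocation and utilization under static assignment can be illustrated with a small sketch. The team names and per-team numbers below are hypothetical; only the overall pattern (full allocation, under-45-percent utilization) comes from the case study.

```python
# Hypothetical numbers illustrating static GPU assignment: every GPU is
# allocated to some team, so the cluster looks "at capacity", but idle
# GPUs owned by one team cannot be borrowed by another.
gpus_assigned = {"team_a": 8, "team_b": 8, "team_c": 8}  # static quotas
gpus_busy = {"team_a": 8, "team_b": 1, "team_c": 1}      # actual usage

total = sum(gpus_assigned.values())
used = sum(gpus_busy.values())
print(f"allocated: {total}/{total} (100%)")          # appears fully booked
print(f"utilized:  {used}/{total} ({used/total:.0%})")  # 10/24, about 42%
```

Even though team_b and team_c together hold 14 idle GPUs, team_a cannot use them, so its extra jobs queue while the cluster sits mostly idle.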
Wayve is a London-based company developing artificial intelligence software for self-driving cars. Its approach to autonomous driving does not rely on expensive sensing equipment; instead, Wayve focuses on developing greater intelligence for better autonomous driving in dense urban areas. The company's primary GPU compute consumption comes from Fleet Learning Loop production training: the product baseline is trained on the full dataset over many epochs, then continually re-trained as new data is collected through iterations of the Fleet Learning Loop.
Wayve turned to Run:ai to solve its GPU resource and scheduling issues. Run:ai removed silos and eliminated static allocation of resources, creating pools of shared GPUs that let teams access more GPUs, run more workloads, and increase productivity. Wayve researchers submit jobs to the system every day, regardless of team; jobs are queued and launched automatically by the Run:ai system when GPUs become available. Run:ai's dedicated batch scheduler, which runs on Kubernetes, provides features crucial for managing deep learning (DL) workloads: advanced queuing and quotas, priority and policy management, automatic preemption, multi-node training, and more. The result was cluster utilization of over 80 percent and a significant increase in the number of jobs running.
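The core queuing behavior described above, jobs from any team entering one shared queue and launching automatically as pooled GPUs free up, can be sketched in a few lines. This is a minimal illustration of the general batch-scheduling idea, not Run:ai's actual implementation; the class, job names, and GPU counts are all hypothetical.

```python
# Minimal sketch of shared-pool batch scheduling: one queue for all teams,
# jobs launched (highest priority, then submission order) when enough
# pooled GPUs are free. Not Run:ai's implementation.
import heapq

class PoolScheduler:
    def __init__(self, total_gpus):
        self.free = total_gpus
        self.queue = []       # heap of (priority, submit_order, name, gpus)
        self.order = 0
        self.running = {}     # name -> gpus held

    def submit(self, name, gpus, priority=0):
        heapq.heappush(self.queue, (priority, self.order, name, gpus))
        self.order += 1
        self._launch()

    def finish(self, name):
        self.free += self.running.pop(name)  # return GPUs to the pool
        self._launch()

    def _launch(self):
        # launch queued jobs while the head of the queue fits in free GPUs
        while self.queue and self.queue[0][3] <= self.free:
            _, _, name, gpus = heapq.heappop(self.queue)
            self.free -= gpus
            self.running[name] = gpus

sched = PoolScheduler(total_gpus=8)
sched.submit("train-a", 4)   # starts immediately
sched.submit("train-b", 4)   # starts immediately, pool now full
sched.submit("train-c", 4)   # queued: no free GPUs
sched.finish("train-a")      # frees 4 GPUs, train-c launches automatically
print(sorted(sched.running))  # ['train-b', 'train-c']
```

A production scheduler layers quotas, policies, and automatic preemption on top of this basic loop, but the key shift is the same: no job waits on one team's private GPUs while another team's GPUs sit idle.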
Wayve's GPU utilization increased from less than 45% to over 80%.
The number of jobs running on Wayve's system increased significantly.
Wayve's teams were able to access more GPUs and run more workloads, increasing overall productivity.