
Autonomous Vehicle Company Wayve Ends GPU Scheduling ‘Horror’

Run:AI
Analytics & Modeling - Machine Learning
Application Infrastructure & Middleware - Data Exchange & Integration
Automotive
Discrete Manufacturing
Product Research & Development
Autonomous Transport Systems
Machine Condition Monitoring
Cloud Planning, Design & Implementation Services
Data Science Services
Wayve, a London-based company developing artificial intelligence software for self-driving cars, faced a significant challenge with its GPU resources. Its Fleet Learning Loop, a continuous cycle of data collection, curation, model training, re-simulation, and licensing of models before deployment into the fleet, consumed a large share of GPU capacity. Yet although nearly 100 percent of GPU resources were allocated to researchers, less than 45 percent were actually utilized. Because GPUs were statically assigned to individual researchers, a researcher's idle GPUs could not be used by anyone else. This created the illusion that GPUs for model training were at capacity even as many sat idle.
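The gap between allocation and utilization under static assignment can be illustrated with a small sketch. The team names and per-team numbers below are hypothetical; only the overall pattern (full allocation, under-45-percent utilization) comes from the case study.

```python
# Hypothetical numbers illustrating static GPU assignment: every GPU is
# allocated to some team, so the cluster looks "at capacity", but idle
# GPUs owned by one team cannot be borrowed by another.
gpus_assigned = {"team_a": 8, "team_b": 8, "team_c": 8}  # static quotas
gpus_busy = {"team_a": 8, "team_b": 1, "team_c": 1}      # actual usage

total = sum(gpus_assigned.values())
used = sum(gpus_busy.values())
print(f"allocated: {total}/{total} (100%)")          # appears fully booked
print(f"utilized:  {used}/{total} ({used/total:.0%})")  # 10/24, about 42%
```

Even though team_b and team_c together hold 14 idle GPUs, team_a cannot use them, so its extra jobs queue while the cluster sits mostly idle.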
Wayve is a London-based company developing artificial intelligence software for self-driving cars. Its approach to autonomous driving does not rely on expensive sensing equipment; instead, Wayve focuses on developing greater intelligence for better autonomous driving in dense urban areas. The company's primary GPU compute consumption comes from Fleet Learning Loop production training: the product baseline is trained on the full dataset over many epochs, then continually re-trained as new data is collected through iterations of the Fleet Learning Loop.
Wayve turned to Run:ai to solve its GPU resource and scheduling issues. Run:ai removed silos and eliminated static allocation of resources, creating pools of shared GPUs that let teams access more GPUs, run more workloads, and increase productivity. Wayve researchers submit jobs to the system every day, regardless of team; jobs are queued and launched automatically by the Run:ai system when GPUs become available. Run:ai's dedicated batch scheduler, which runs on Kubernetes, provides features crucial for managing deep learning (DL) workloads: advanced queuing and quotas, priority and policy management, automatic preemption, multi-node training, and more. The result was cluster utilization of over 80 percent and a significant increase in the number of jobs running.
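The core queuing behavior described above, jobs from any team entering one shared queue and launching automatically as pooled GPUs free up, can be sketched in a few lines. This is a minimal illustration of the general batch-scheduling idea, not Run:ai's actual implementation; the class, job names, and GPU counts are all hypothetical.

```python
# Minimal sketch of shared-pool batch scheduling: one queue for all teams,
# jobs launched (highest priority, then submission order) when enough
# pooled GPUs are free. Not Run:ai's implementation.
import heapq

class PoolScheduler:
    def __init__(self, total_gpus):
        self.free = total_gpus
        self.queue = []       # heap of (priority, submit_order, name, gpus)
        self.order = 0
        self.running = {}     # name -> gpus held

    def submit(self, name, gpus, priority=0):
        heapq.heappush(self.queue, (priority, self.order, name, gpus))
        self.order += 1
        self._launch()

    def finish(self, name):
        self.free += self.running.pop(name)  # return GPUs to the pool
        self._launch()

    def _launch(self):
        # launch queued jobs while the head of the queue fits in free GPUs
        while self.queue and self.queue[0][3] <= self.free:
            _, _, name, gpus = heapq.heappop(self.queue)
            self.free -= gpus
            self.running[name] = gpus

sched = PoolScheduler(total_gpus=8)
sched.submit("train-a", 4)   # starts immediately
sched.submit("train-b", 4)   # starts immediately, pool now full
sched.submit("train-c", 4)   # queued: no free GPUs
sched.finish("train-a")      # frees 4 GPUs, train-c launches automatically
print(sorted(sched.running))  # ['train-b', 'train-c']
```

A production scheduler layers quotas, policies, and automatic preemption on top of this basic loop, but the key shift is the same: no job waits on one team's private GPUs while another team's GPUs sit idle.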
Wayve's GPU utilization increased from less than 45% to over 80%.
The number of jobs running on Wayve's system increased significantly.
Wayve's teams were able to access more GPUs and run more workloads, increasing overall productivity.