
Location: 301, Union Street, Central Business District, Belltown, Seattle, King County, Washington, 98101, United States
The EC2 UltraServer Availability team is a high-performing engineering organization responsible for maintaining high availability of NVIDIA-based ML infrastructure at scale. We manage repair and recovery workflows for GB200 and GB300 UltraServers end to end, from initial problem detection through completed recovery. Our team drives operational excellence through continuous improvement of problem detection, repair efficacy, and customer impact mitigation. We work closely with hardware engineering, data center operations, and EC2 service teams to ensure reliable, efficient recovery of critical ML compute capacity. This is a high-impact role leading a two-pizza team of talented engineers solving complex technical challenges in one of Amazon's fastest-growing infrastructure domains.
As a Software Development Engineer II, you will design, build, and maintain cloud-based repair and recovery workflows for NVIDIA GB200 / GB300 UltraServers. This role orchestrates repair and recovery operations from impairment detection through completed recovery, requiring expertise in AWS services, system architecture, and cross-functional collaboration with Capacity Management, Hardware Engineering, and Datacenter Operations to manage AI/ML infrastructure.
This is a hands-on position in which you will own everything end to end: requirements gathering, designs, design reviews, implementations, code reviews, incremental feature launches, operations, mentoring, and continuous improvement. You will work in environments where the technology strategy is defined but the solution design is not, building solutions that are stable, logical, testable, and efficient while making independent trade-off decisions.
The interview process typically includes a technical deep-dive, system design discussions, and team fit assessments to ensure alignment with our engineering standards and culture.
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status. Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit amazon.jobs for more information.
Work model: On-site
Skills: AWS, System Architecture, Software Development, Software Programming Language, Source Control Management, Continuous Deployments, Testing, Operational Excellence, Hardware Integration, Software Integration.
Education: Bachelor's degree in computer science or equivalent.