
2110, Griswold Lane, Austin, Travis County, Texas, 78703, United States
The Trainium Manufacturing, Quality and Reliability (MQR) Team is part of AWS Annapurna Labs, a wholly owned subsidiary of Amazon focused on developing custom silicon and servers. Annapurna Labs designs cutting-edge AI platforms, including the Nitro, Graviton, Inferentia, and Trainium families of processors, for the world's largest cloud services provider.
Machine Learning Annapurna (MLA) functions as a vertically integrated team, bringing together software, firmware, hardware, and silicon design within a single organization. The Training Servers and Systems organization under MLA encompasses Hardware Development, Software Development, Fleet Ops Systems, and Manufacturing, Quality, and Reliability. This role sits within the MQR team, where we drive the definition, execution, and testing of key product aspects alongside an experienced cross-disciplinary staff and external partners.
As a Senior Reliability Engineer, you will work in an open, collaborative peer environment to conceive and design infrastructure technologies. You will work closely with internal interdisciplinary teams and outside partners to drive product definition and execution in manufacturing. Your mission is to ensure the reliability of future technologies by leading the identification and validation of product and component risks, working with design teams to mitigate them, and defining the test methodology and coverage required to assure product reliability.
In this role, you will provide technical leadership and mentor engineers while performing deep dives into technologies aligned with the product roadmap. You will be responsible for defining reliability tests implemented during manufacturing, driving process improvements to address reliability issues, and performing reliability predictions for failure mechanisms in products under development and those in the field. Additionally, you will work with multiple vendors and ODMs to standardize component manufacturing and reliability expectations.
If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit amazon.jobs for more information.
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status. Our inclusive culture empowers Amazonians to deliver the best results for our customers.
Los Angeles County Applicants: Job duties for this position include working safely and cooperatively with other employees, supervisors, and staff; adhering to standards of excellence despite stressful conditions; communicating effectively and respectfully; and following all federal, state, and local laws and Company policies. Criminal history may have a direct, adverse, and negative relationship with some of the material job duties of this position, including the ability to adhere to company policies, exercise sound judgment, effectively manage stress, work safely and respectfully with others, exhibit trustworthiness and professionalism, and safeguard business operations and the Company's reputation. Pursuant to the Los Angeles County Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.
Work model: On-site
Austin, Texas
Experience working in a fast-paced environment similar to a high-tech start-up; Reliability modeling and materials characterization experience; Master's Degree or PhD in Reliability Engineering or related field; Demonstrated ability to uncover systemic issues prior to new product introduction; Working understanding of server subcomponents (CPU, GPU, memory, HDD, SSD, motherboard, thermal system, peripherals, etc.); Experience in analysis, test plan development, and test procedure development for server compute platforms or other high-tech hardware
Skills: Machine Learning, Reliability Engineering, Reliability Statistics, Reliability Tests, Reliability Modeling, Materials Characterization, Test Plan Development, Test Procedure Development, Server Compute Platforms, High-Tech Hardware.
Education: Bachelor's degree in Electrical or Mechanical Engineering, Engineering Technology, or Reliability Engineering required; Master's Degree in Reliability Engineering or related field preferred; PhD in Reliability Engineering or related field preferred.