Nathan Stanley
Software Engineer with 10 years of experience
Australian currently in Seattle, WA

I build distributed LLM training systems and ML infrastructure at Amazon. I turn ambiguous problems into concrete tools and frameworks that engineers and scientists love to use. Currently working at the scale of 10k H200 GPUs on AWS. Always looking for my next challenge.
Amazon
Post-Training on Customer Data
Amazon SFAI · 2025-2026 · YouTube
“Make Rufus learn from how customers actually use it.”
I took a broad directive to improve Rufus using real customer interactions and turned it into three parallel workstreams: building a secure training environment on AWS, developing training applications using Verl and Megatron-LM for 500B+ parameter models, and curating datasets from real customer traffic. I lead all three efforts and coordinate with science teams to pull it together. Today, a team of scientists uses this framework, infrastructure, and data pipeline to iterate on training recipes for different Rufus applications.

LLM Training Infrastructure
Amazon SFAI · 2024-2025 · YouTube
“Our training runs keep failing. Fix it.”
Starting from a vague mandate to "improve training efficiency," I systematically debugged reliability issues across the entire stack for large-scale pretraining runs on 5,000+ GPUs. I built observability tooling, identified failure modes at the container, scheduler, and hardware levels, and implemented fixes that improved availability from 85% to 99%. These infrastructure patterns were adopted broadly across Amazon through contributions to AWS Batch, EC2, and an internal GPU pooling platform.

Rufus Studio
Amazon SFAI · 2023-2024
“Prompt changes take too long. Do something about it.”
Tasked with improving prompt engineering velocity, I designed and built an internal platform that mirrors production with the ability to edit prompts deep in the inference stack. The tool provides a full-featured IDE experience for prompt template editing with live evaluation. I led a team of 6 engineers to build it, reached 300+ weekly active users in the first month, and reduced prompt deployment time from 17 days to 3 days. Rufus Studio has since become the canonical platform for all Rufus internal tooling.

Distribution Center Technology
Amazon DCTech · 2021-2022
“We're launching in 4 months and nothing can handle the load.”
As scalability lead for Amazon Grocery distribution centers, I discovered that critical APIs couldn't meet TPS targets due to architectural bottlenecks in downstream services. I led a cross-team war room, redesigned loading patterns using parallel fanout and caching, and achieved 20x throughput and 8x latency improvements across 15 APIs. The parallel loading library I built was adopted by 15+ teams across Amazon. I also designed a CQRS-based location recommendation system to handle long-term scale.

Non-Prime Customer Experience
Amazon Retail · 2019-2020
“How do we convert more shoppers into Prime members?”
Led large-scale A/B experimentation on Amazon product pages to optimize the shopping experience for non-Prime customers. Designed and analyzed experiments across millions of daily sessions, identifying high-impact changes to pricing display, shipping messaging, and conversion funnels. The changes I drove generated over $50M in annualized profit through improved conversion rates and Prime subscription growth.
A/B testing button spacing is not for me
MiClub
MiMembership
2017-2019 · Website
“Golf clubs need modern membership software.”
Over two years, I worked with two other developers to build the premier golf membership management system in Australia. We solved complex challenges including smart entity-based search, design system consistency, ORM performance tuning, MySQL InnoDB optimizations, offline mode support, and integrations with Golf Australia's handicapping systems. The platform now serves 500+ clubs.

Pace of Play
2019 · Website
“Clubs can't figure out who's causing slow play.”
Hearing that clubs consistently struggled to maintain pace of play and identify slowdowns, I invented a novel solution. We tracked players via the scoring app's GPS and built a custom algorithm to identify slow players. The frontend resembled a video player over Google Maps where administrators could scrub through any day's timeline and observe player movement. GPS smoothing algorithms provided a high-quality signal, and summary reports helped clubs address problematic patterns.

“Digital scoring apps are about to be allowed in competition.”
When Golf Australia announced they would allow digital scoring in official competitions, I convinced the company to build our own mobile app. The app reached the top 10 in the Australian App Store sports category and surged in popularity during COVID-19 in 2020, when golf became one of the few permitted outdoor activities.

Golf Platform Performance
2015-2019 · Website
“The legacy system is slow. Make it fast.”
As platform lead, I profiled and optimized MiClub's 2M+ line legacy golf management codebase across the full stack: database engine configuration, query profiling, index optimization, webapp threading, and page load performance. I also built a performance tracker for MiScore to monitor the quality and speed of our OCR scorecard scanning system, reducing scan times from 5 seconds to under 1 second.

Side Projects
Crewly
Website
A web app for planning daily boat trips. Track crew members, manage guest lists, and check weather conditions all in one place. Built to simplify the logistics of coordinating group outings on the water.

Ripper
Stealth AI project for real estate agents
Coming soon.
Let's see what that $200 Claude subscription can really do
Experience
Education
M.Eng. Software & Machine Learning
University of Western Australia · Distinction (First Class Honours)
B.Eng. Software Engineering & Computer Science
University of Western Australia · Distinction (First Class Honours)
High School (Western Australia) · ATAR 99.90 (top 0.1% across math and science)