Site Reliability Engineer - Core Storage Infrastructure


• Build tooling to improve the automation of operations. This includes automatic failure detection and remediation, application deployment, OS/Kernel/JVM/Firmware deployment, capacity planning, and fleet management.

• Diagnose and troubleshoot complex distributed systems handling millions of queries per second and petabytes of data, and develop solutions that have a significant impact at our massive scale.

• Collaborate with SWE teams to sustain and optimize the availability, reliability, and performance of production services.

• Collaborate with diverse hardware, software, and networking teams throughout the company to design next-generation distributed storage platforms.

• Troubleshoot issues across the entire stack: hardware, software, application, and network.

• Participate in a 24x7 on-call rotation.


• 5+ years of experience managing services in a distributed, internet-scale *nix environment.

• Practical knowledge of at least one programming language (Python, Go, Ruby, Perl).

• Demonstrable knowledge of Linux operating system internals, TCP/IP, filesystems, and disk/storage technologies.

• Familiarity with systems management tools (Puppet, Chef, Capistrano, Ansible, etc.).

• Hands-on operational experience managing JVM services.

• Ability to prioritize tasks and work independently.

• Track record of practical problem solving, and excellent communication and documentation skills.

• BS degree in Computer Science or Engineering, or equivalent experience.


Corporate / senior

North America

Seattle, United States





