• Build tooling to improve the automation of operations. This includes automatic failure detection and remediation, application deployment, OS/Kernel/JVM/Firmware deployment, capacity planning, and fleet management.
• Diagnose and troubleshoot complex distributed systems handling millions of queries per second and petabytes of data, and develop solutions that have a significant impact at our massive scale.
• Collaborate with SWE teams to sustain and optimize the availability, reliability, and performance of production services.
• Collaborate with diverse hardware, software, and networking teams throughout the company to design next-generation distributed storage platforms.
• Troubleshoot issues across the entire stack: hardware, software, application, and network.
• Participate in a 24x7 on-call rotation.
• 5+ years of experience managing services in a distributed, internet-scale *nix environment.
• Practical knowledge of at least one programming language (Python, Go, Ruby, Perl).
• Demonstrable knowledge of Linux operating system internals, TCP/IP, filesystems, and disk/storage technologies.
• Familiarity with systems management tools (Puppet, Chef, Capistrano, Ansible, etc.).
• Hands-on operational experience managing JVM services.
• Ability to prioritize tasks and work independently.
• Track record of practical problem solving, plus excellent communication and documentation skills.
• BS degree in Computer Science or Engineering, or equivalent experience.
Corporate / senior
Seattle, United States