Staff Site Reliability Engineer - New York

Job Description:

Compensation Range: 200,000 USD - 350,000 USD

Key responsibilities

Site Reliability Engineers (SREs) fill the mission-critical role of ensuring that our complex, large-scale systems are healthy, monitored, automated, and designed to scale. You will use your background in software engineering, combined with experience as an operations generalist, to work closely with our development teams from the early stages of design all the way through identifying and resolving production issues. The ideal candidate will be passionate about applying a software engineering approach to the operations problem space, which requires deep knowledge of our platforms and their various use cases. You are both a generalist, capable of picking up and working with multiple disparate systems, and an expert, able to dive deep into specific topics and quickly master them.

Serve as a primary point of contact responsible for the overall health, performance, and capacity of our business-facing platforms, e.g. our Kafka service
Gain deep knowledge of our complex platforms, business applications, and use-cases
Assist in the rollout and deployment of new platforms or features to facilitate rapid iteration and continuous improvement
Develop tools to improve our ability to rapidly deploy and effectively monitor and maintain custom applications or services in a large-scale Linux environment
Work closely with development teams to ensure that platforms are designed with “operability” and “usability” in mind
Function well in a fast-paced, rapidly changing environment
Participate in a 24x7 rotation for second-tier escalations

Qualifications

B.S. (M.S. preferred, and Ph.D. a plus) in Computer Science, Engineering, Physics, or Mathematics
Developer background with experience in two or more of C++, Java, Python, or Node.js
5+ years in a Linux-based large-scale systems role
Experience managing container orchestration platforms such as Kubernetes
Experience building self-service APIs and tuning, sharding, and partitioning systems to auto-manage platforms at scale
Knowledge of most of these: data structures, relational and non-relational data-stores, networking, Linux internals, file systems, distributed systems, and related topics
Experience in containerizing applications and services a plus
Experience using AWS or GCP at scale a plus
Experience with random fault injection (Chaos Engineering) and building self-healing capabilities into platforms a plus
Commits to Kafka source code would be a huge plus
Strong interpersonal communication skills and ability to work well in a diverse, team-focused environment with other SREs, SWEs, product managers, etc.

To apply, please send your updated CV, notice period, and expected compensation to fazila.khan@talentbea.com