Lead Site Reliability Engineer

at BitMEX
Location: San Francisco, CA, United States
Date Posted: August 8, 2019
Category: Engineering, Software Development
Job Type: Office · Full-time

Description

The Site Reliability Team consists of hybrid systems and software engineers who take ownership of managing large-scale infrastructure while improving its reliability and automation. SREs are integrated within the DevOps team, and we're looking for engineers who want to develop, maintain, and scale infrastructure software.

How You'll Make an Impact

As one of the first contributors to our SRE team, you will be in a position to influence our core DevOps culture and to greatly improve the velocity, reliability, and confidence with which we manage the key components of our trading platform infrastructure, which handles $5B in trading volume each day.

  • Build and lead an SRE team responsible for BitMEX’s uptime and performance
  • Drive efficiencies in systems and processes: capacity planning, configuration management, performance tuning, monitoring, and root cause analysis
  • Work with development partners to shape the architecture, design, and implementation of new and existing systems to enhance their reliability, performance, efficiency, and scalability
  • Ensure all key services are measured and monitored, and that they raise alerts when needed

Required skills and qualifications

  • 5+ years in a similar Site Reliability Engineer role
  • BS or MS in Computer Science or a related technical discipline, or equivalent practical experience
  • Experience working with critical low-latency real-time data / API pipelines
  • Experience working in Terraform / Chef / Kubernetes environments hosted on AWS
  • Experience with Linux and a good understanding of its fundamentals and internals: the user/kernel-space divide, cgroups, filesystems (incl. ZFS), modern memory management, threads and processes, etc.
  • Solid practical understanding of large-scale distributed systems, including multi-tier architectures, application security, monitoring, and storage systems (incl. key-value stores and distributed file systems)
  • Proven knowledge of the TCP/IP stack, latency considerations, internet routing, and load balancing
  • Ability to drive projects end-to-end and excel with minimal technical supervision, while working proactively within reliability constraints
  • Capacity to multitask and give equal attention to a variety of functions while under pressure

Local candidates only. We cannot accept remote applicants at this time. 
