Site Reliability Engineer

at BitMEX
Location San Francisco, CA, United States
Date Posted August 8, 2019
Category Engineering, Software Development
Job Type Office · Full time

Description

The Site Reliability Team consists of hybrid systems and software engineers who take ownership of managing large-scale infrastructure while improving its reliability and automation. SREs are integrated within the DevOps team, and we're looking for engineers who want to be part of developing, scaling, and maintaining infrastructure software.

How You'll Make an Impact

As one of the first contributors to our SRE team, you will be in a position to influence our core DevOps culture and greatly improve the velocity, reliability, and confidence with which we manage the key components of our trading platform infrastructure, which handles $5B worth of trade volume each day.

  • Help the company scale our electronic trading platform, and define capacity planning to anticipate and prepare for growth
  • Automate deployment & configuration processes across the board
  • Develop reliability frameworks and tools for use by all engineers
  • Lead incident response / analysis & share on-call for BitMEX’s most critical systems

Required skills and qualifications

  • BS or MS in Computer Science or a related technical discipline. Equivalent practical experience is a reasonable substitute.
  • Demonstrable knowledge of Terraform / Chef
  • Demonstrable programming skills in Go, and an ability to pick up new ones
  • Demonstrable knowledge of container technologies, and container orchestrators such as Kubernetes
  • Experience in the Linux environment and a good understanding of its fundamentals and internals: the user/kernel-space divide, cgroups, filesystems (incl. ZFS), modern memory management, threads and processes, etc.
  • Solid understanding of large-scale distributed systems in practice, including multi-tier architectures, application security, monitoring and storage systems (incl. k/v stores, and distributed fs stores)
  • Proven knowledge of the TCP/IP stack, latency considerations, internet routing and load balancing
  • Ability to drive projects end-to-end and excel with minimal technical supervision, while embracing reliability constraints and working proactively
  • Capacity to multitask and give equal attention to a variety of functions while under pressure

Local candidates only. We cannot accept remote applicants at this time. 
