Site Reliability Engineer - DevOps
HotDoc provides software that helps patients manage their personal health and their relationship with their doctor. Our goal is to create the best healthcare experience possible for everyone in Australia. We’re well on our way, with over 12,000 doctors and more than 1 million patients each month using our software.
We’re a diverse, ethics-driven team of about 70 people working out of our plant-filled office with the best terrace in the Melbourne CBD. A handful of our team work remotely from Perth, the Gold Coast and New Zealand.
As our Site Reliability Engineer, you’ll be responsible for ensuring that our infrastructure and tooling are secure, reliable and performant as our platform and team expand.
We’re looking for someone who takes pride in building robust and deterministic infrastructure that brings out the best in the product and team. You care about infrastructure as code, developer ergonomics and security.
We Value People Who
- Are empathetic and care about doing right by others
- Take ownership of problems and their solutions
- Are prepared to speak up and question the status quo
- Don’t take themselves too seriously
- Have experience in, and enjoy, working in an agile, fast-paced environment
- Are continuously learning and curious
- As engineers, prefer to influence what gets built by understanding the user’s problems and opportunities
- Have an interest in healthcare, or a personal link to the healthcare industry
You Will
- Write code and processes to configure and administer Kubernetes clusters running on AWS.
- Own our High Availability and Disaster Recovery processes and testing.
- Assist other engineers with resource definitions and deployment processes.
- Take part in maintenance upgrades to PostgreSQL and Redis.
- Review security and compliance processes, such as static vulnerability analysis and intrusion detection.
- Continuously improve our CI, CD and internal tooling to minimise friction for the team.
- Administer monitoring and alerting with Prometheus, the ELK stack, and CloudWatch.
- Maintain our on-call runbooks, train the team in incident response, and participate in the on-call roster.
Skills and Experience
- Capacity planning, performance, efficiency and latency optimisation
- Kubernetes, Docker, Helm
- Prometheus, Alertmanager and Grafana, or a similar stack
- Experience with Unix/Linux systems (e.g. inodes, system calls)
- Experience with CI systems such as Buildkite (a plus)
- PostgreSQL, including SQL performance tuning and troubleshooting (a plus)
- Working knowledge of Ruby on Rails (a plus)
What We Offer
- Influence over how and why things are done
- Regular work from home days encouraged
- Learning resources (books, etc.) paid for
- Domestic conference attendance paid for
- Team retreats to Phillip Island, including the satisfaction of beating the CEO at mini-golf
- Spacious terrace overlooking the Yarra, plus a fine selection of beverages on tap, fresh fruit, and snacks
- Subsidy for gym membership, personal training, or yoga.
Who should apply, and how?
We celebrate diversity and welcome applicants of all kinds and from all backgrounds. If you think you have the skills required, please submit a short cover letter and an up-to-date CV.