• Work with partners to shape the architecture, design, and implementation of new and existing systems to enhance their reliability, efficiency and scalability.
• Assume the role of Incident Commander in high-priority incidents, getting hands-on when required to reduce the TTR.
• Apply resiliency engineering disciplines to prevent incidents from recurring: drive incident response, analysis and remediation.
• Ensure that critical services have a properly configured monitoring and alerting setup and that operational hygiene is applied to guarantee their continuity.
• Design, write and maintain software to improve the performance of services and the connected operational profile.
• Actively develop and maintain the Observability Platform that teams at TomTom rely on to monitor their services and aid them during incident response.
• Be part of a collaborative environment, working together across many teams to ensure that systems are performing as well as possible.
• Shift left operations and support the growing autonomy of our DevOps teams.
• Support the definition of the SRE strategy and roadmap.
• 5+ years of working experience in a production environment, covering software and system engineering.
• 3+ years of production experience operating Linux systems on cloud or bare metal, covering infrastructure as code, configuration management and monitoring.
• Extensive experience designing, developing, troubleshooting and evolving large-scale distributed systems.
• Proficient in Java, C++, Python or any other modern programming language.
• Good understanding of Unix/Linux systems internals (e.g. memory management, file systems, threads and processes, system calls).
• Good understanding of networking protocols and theory (e.g. TCP/IP, UDP, DNS, HTTP/HTTPS).
• Experience working with AWS, Azure or a similar cloud environment at scale.
• Excellent written and oral communication skills, and the ability to collaborate successfully with technical and non-technical stakeholders.
• Ability to establish successful mentorship relationships with colleagues, expressing technical leadership without pulling rank and role-modeling the SRE principles.
• Business acumen, ability to prioritize high ROI work, strong sense of ownership.
• Experience working with Kubernetes and Prometheus in production.
• Experience operating mission-critical SaaS workloads in large-scale cloud infrastructure.
• Expert-level certifications for AWS, Azure, GCP, Kubernetes, etc.
Want to help shape the future of mobility? By joining our Service Platform Product Unit, you will support Product Development and IT Service Management to ensure we get the most out of our tools and services. You will be an essential part of the entire operation, keeping everything moving so we can continue innovating on a global, real-time scale.
After you apply
1. First call: If your application matches the role, then it’s time to put a voice to the name! We’ll call you to set up an interview.
2. First interview: In this interview, we want to know more about you: what excites you about location technology and how you can help us solve global challenges.
3. Online assessment: We’ll send you an assignment. Use your expertise to show us what you’ve got.
4. Second interview: We’ll dive into your potential role, showing you how you’ll fit into your team and contribute to our vision.
5. The final decision: Cue the fireworks, because we’ll start the onboarding!