Ivan Kovalev

Site Reliability Engineer

About Me

I am a Site Reliability Engineer (SRE) passionate about designing scalable, reliable, and efficient systems. I am mostly interested in self-hosting and bare-metal deployments. This is my personal website, where I’ll showcase my work, experiments, and thoughts over time.

Work Experience

Optimizely – Senior Site Reliability Engineer

At Optimizely, I manage and deploy critical services across two cloud environments: Google Cloud Platform (GCP) and Amazon Web Services (AWS). My responsibilities include managing the full lifecycle of our cloud databases and developing Terraform modules that streamline provisioning, configuration, and monitoring.

By defining key metrics and implementing robust monitoring strategies, I ensure issues are detected proactively and performance stays at the required level. One of my most notable achievements was migrating a high-traffic CDN, which handles over 80,000 requests per second, from one cloud provider to another without any downtime. This project exemplified my commitment to reliability, scalability, and smooth operational transitions.

ING – Site Reliability Engineer

During my time at ING, I played a key role in maintaining and optimizing a self-hosted machine learning platform running on Kubernetes. This involved ensuring the platform met defined Service Level Agreements (SLAs) and Service Level Objectives (SLOs), as well as establishing and refining our incident management processes.

Beyond the day-to-day operations, I promoted the use of open-source solutions and best practices, enabling the team to leverage a wider ecosystem of tools and technologies. Through these efforts, the ML platform operated more reliably, ensuring timely, accurate insights that supported ING’s data-driven decisions.

Booking.com – Site Reliability Engineer

At Booking.com, I maintained critical infrastructure responsible for internal authentication and authorization. By enhancing the reliability of these systems, I ensured seamless access management for thousands of internal users.

I also introduced automated integration tests using GitLab CI, improving code quality and facilitating smoother deployments. Furthermore, I implemented rate limiting for distributed applications in Kubernetes to manage high-volume traffic effectively. Together, these initiatives increased operational efficiency and minimized downtime.

VK – Site Reliability Engineer

I began my professional career at VK, where I focused on ensuring the reliability, scalability, and high availability of various large-scale services. My portfolio included the social network Moi Mir, as well as DonationAlerts, Boosty, and VK Pay.

My responsibilities covered the full lifecycle of the infrastructure: from installing operating systems on bare-metal servers to configuring them for production use. I was deeply involved in monitoring system health, troubleshooting performance bottlenecks, and taking part in on-call rotations to address critical incidents swiftly. During periods of high load, I ensured services remained stable and efficient.

One of my most significant achievements was the successful deployment of MySQL Orchestrator. This solution streamlined our database operations, making it possible to relocate servers between data centers without data inconsistency or downtime. As a result, we improved our system’s resilience against server outages and network failures, ensuring uninterrupted service for millions of users.

My Public Projects

Nothing here yet, but I’ll update this section with my public projects soon.