Our map-making platform can introduce over one billion changes on a monthly basis to the world map repository. In order to ensure accuracy, we check every change against strict quality rules before going live. Currently, we have defined around 3,000 business checks. Every change is traceable in terms of operations performed, as well as accountability.
Additionally, our system delivers data via its read interface to the tools that transform our internal map representations to formats required by our business partners. All of this happens in real time. A lot of elements go into our platform and require heavy processing, which means that even small system outages can be costly and inconvenient. In this blog post, we explain an important part of the architecture that helps us minimize downtime risk: monitoring.
Our monitoring system consists of three main pillars:
Prometheus with Alert manager
ELK stack (Elasticsearch/Logstash/Kibana)
Prometheus collects metrics that are indicated by numeric values, such as CPU usage and data. These metrics are transferred with the attached metadata, such as the instance name and network interface identifier.
We mainly use ELK for gathering application logs containing detailed information about processed transactions, time taken and operation result.
CloudWatch acts as a supplementary tool that provides information on our AWS managed services. This infrastructure collects around 75 million metrics and logs per hour. We also extract other useful information from this data stream, including signals, which help with automatic traffic shaping, preventing downtime, fulfilling SLAs and sometimes, unfortunately, waking up on-call team members.
When our system was introduced into production, our monitoring capabilities were limited – we knew that we had downtime, however, finding the root cause required considerable time and manual work. As time went by, we gained more experience and could react faster, but we wanted to do even better. A closer look at the system revealed that most of our problems were coming from overloaded database instances.
Heavy queries required greater processing power (both in terms of CPU and I/O) causing increased response times, timeouts and, as a result, spikes in the failure rate. Our first response was to develop a component called "database throttler", which is used to detect overloaded instances and limit the number of queries that are executed. We implemented this by limiting the number of available connections assigned to a given database user. It improved our system’s stability and predictability.
Then, we discovered another possible limitation of the database throttler to improve on – our granularity, which was working on a user level. In our system, the database user is assigned per functional cluster (basic cluster division is latency bound vs throughput bound, the second category is writeable vs read-only). Part of the systems dedicated for tools ingesting map data were used by dozens of different software components. A single tool could cause overload of a database instance, which ultimately throttled the whole functional cluster. Thus, we wanted something that could give us control on a more fine-grained level.
API management for accountability
We decided to introduce API management. It usually enables AAA (Authorization, Authentication and Accountability) implementation. In our case, we were mostly interested in accountability. Our customers received dedicated endpoints and, as soon as their API call reached our backend, it was already provided with information regarding the request originator. With significant refactoring effort, we carried that information across all our webservice layers. Finally, this metadata trace was attached to the query executed on the database engine.
With the help of Prometheus, we collected that data to find out:
How much database processing time queries of a given endpoint took (approximately).
How many queries were originated by a given endpoint.
This information reflected the ratio of processing power taken by a given endpoint on a particular database and it became a good indicator of the origin of overloaded instances.
Thanks to the above features and collected data, we were able to detect overloaded database instances and track down the endpoint that was most likely the root cause. Now, we needed something that would let us influence the given endpoint’s behavior. Here, we used one of the standard features of API management – rate limiting, which was dynamically adjusted. Additionally, we have introduced the exponential backoff feature in our API client library, which slows down the pace of sent requests whenever the endpoint limit is reached.
Together with the dynamically set request quotas per endpoint – based on Prometheus alerts – we had more control over the flood of heavy queries. In short, it works in a way that whenever a database is overloaded, the endpoint with the biggest ratio in the total executed query time is identified and its quota on API management is decreased, resulting in a slower pace of incoming requests.
A side effect of this approach, we have learned, is that sometimes, even if the given endpoint was throttled down to one request per minute, it still overloaded the database. How is that possible? Our system is highly parallelized – this means that a single request sent by a client is forked into dozens of sub-requests towards other components, such as web services and databases. Identification of such “killer requests” helped us find and improve weak spots in our software. We have also implemented a handler that is triggered by the overloaded database alert, which terminates queries abusing the system. This is used as a last resort when all other measures are not helping.
The dynamics of our system
One important element to keep in mind is that overloaded databases might not necessarily be caused by a flaw in the software’s design. Our system is very dynamic and is capable of adjusting its backend performance to fulfill the demand, so higher loads on certain parts of the system can also mean that we need to scale it out.
Throttling, in this case, results in slowing down the incoming requests flow to the part of the system that is currently adding processing capacity Ultimately, this allows us to operate without impacting the success rate. Databases have significant scaling inertia (compared to web services) and temporarily withholding the stream of requests buys us time to operate more efficiently.
Ensuring balance and stability
Another function of our traffic shaping feature is setting request quotas to the default level. When a given endpoint is not used for some time (usually a couple of hours) its quota is set to a safe level that will protect the backend against an uncontrolled burst of requests. "Flattening" the incoming traffic also allows us to scale out if that is required.
So far, all described cases have been about throttling down the endpoint. If we took this approach in every case, sooner or later, there would be no incoming traffic to the system. Fortunately, there is also a mechanism to upgrade the quota for the client.
Thanks to the data we have collected, the monitoring system knows which endpoints' requests are suffering from being limited. In cases where a given endpoint is not causing overload of any backend's component, a signal is triggered and the proper handler upgrades the quota.
Both parts of the monitoring backend that are responsible for limit regulation of requests work independently. In this way, the overall system constantly looks for a balance where demand, backend utilization and stability are kept at an optimal level.
At the time of writing this blog, the next generation of traffic shaping was released, which introduces improvements in quota limits adjustments for particular endpoints.