Observability For Quality

Observability is crucial to fast development and innovation at Infobip's scale. Without accessible, fast, and accurate insight into the state of each component and service, we cannot ensure the quality of service that our customers expect.

We rely on tools that are widely used in observability stacks across the industry. Because of Infobip's global presence and scale, we also constantly experiment with new and better solutions to support our growth.

Observability at Infobip includes:

  • Metrics monitoring (VictoriaMetrics clusters and multiple Prometheus instances with a total of 300M active series);

  • Events monitoring (Grafana Faro with 85M daily events);

  • Logging (Azure Data Explorer with 1PB of logs);

  • Communication logs (OpenSearch clusters with 100B documents).

The most common metrics cover service level indicators (SLIs) such as traffic, error rate, latency, and saturation.
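As a rough illustration of how these four SLIs can be derived from raw request data, here is a minimal Python sketch. The `Request` record, the `compute_slis` helper, the choice of the 99th percentile, and the "in-flight requests over capacity" definition of saturation are assumptions made for this example; they do not describe how Infobip's services actually export metrics.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP-style status code (hypothetical record shape)


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list; q is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def compute_slis(requests, window_s, in_flight, capacity):
    """Summarize one observation window into the four SLIs."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]
    return {
        # Traffic: requests per second over the window.
        "traffic_rps": total / window_s,
        # Error rate: share of requests that failed server-side.
        "error_rate": errors / total if total else 0.0,
        # Latency: 99th percentile of request duration.
        "latency_p99_ms": percentile(durations, 99) if total else 0.0,
        # Saturation: how full the service is relative to its capacity.
        "saturation": in_flight / capacity,
    }


sample = [Request(20, 200), Request(35, 200), Request(500, 503), Request(40, 200)]
slis = compute_slis(sample, window_s=2, in_flight=30, capacity=40)
print(slis)
```

In a real deployment these values would normally come from counters and histograms scraped by the metrics system rather than from in-memory lists; the sketch only shows what each indicator measures.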

Based on this data, every component or service has alerts that help it meet the SLAs defined by the owning team. Alerts are primarily created in Alertmanager/Prometheus and sent to OpsGenie. In OpsGenie, each team defines on-call schedules, notification policies, and similar settings according to its way of working.
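To make the alerting step more concrete, the following Python sketch imitates the behavior of the `for` clause in a Prometheus alerting rule: a breached threshold first puts the alert into a pending state, and the alert fires only once the breach has lasted for the configured duration. The `AlertRule` class, the rule name, and the 1% threshold are hypothetical values chosen for illustration, not rules used at Infobip, and delivery to OpsGenie is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    threshold: float    # alert condition: value is above this limit
    for_seconds: float  # how long the breach must persist before firing
    breach_started: Optional[float] = None

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if value <= self.threshold:
            # Condition no longer holds: reset the pending timer.
            self.breach_started = None
            return "inactive"
        if self.breach_started is None:
            self.breach_started = now
        if now - self.breach_started >= self.for_seconds:
            return "firing"
        return "pending"


# Hypothetical rule: error rate above 1% for five minutes.
rule = AlertRule(name="HighErrorRate", threshold=0.01, for_seconds=300)

# (timestamp in seconds, observed error rate)
observations = [(0, 0.02), (120, 0.03), (300, 0.05), (360, 0.001)]
states = [rule.evaluate(value, now) for now, value in observations]
print(states)
```

Requiring the condition to hold for a period before firing is a common way to avoid paging the on-call engineer for short spikes that resolve on their own.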

Visualizations are another valuable tool that helps engineers understand the state of their systems. The most used visualization tool at Infobip is Grafana, with many data sources and over 4250 dashboards.

Because our business use cases are specific and our scale is large, it was challenging to find a commercial "silver bullet". So, in addition to the tools mentioned above, we created our own alerting tool to help us cover blind spots:

  • Based on InfluxDB and Elasticsearch;

  • Integrated with OpsGenie;

  • Ability to create alerts based on any fixed or dynamic threshold and anomaly detection;

  • Predefined diagnostic pages for each alert type.
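The difference between the three alert conditions listed above (fixed threshold, dynamic threshold, anomaly detection) can be sketched in a few lines of Python. This is a generic illustration under simple assumptions (a mean plus `k` standard deviations for the dynamic limit and a z-score test for anomalies); the function names and parameters are hypothetical, and the internal tool may use entirely different methods.

```python
from statistics import mean, stdev


def breaches_fixed(value, limit):
    """Fixed threshold: the limit is a constant chosen by the team."""
    return value > limit


def dynamic_limit(history, k=3.0):
    """Dynamic threshold: the limit follows recent behavior (mean + k * stdev).

    Requires at least two points in `history`.
    """
    return mean(history) + k * stdev(history)


def is_anomaly(value, history, z_limit=3.0):
    """Anomaly detection via z-score: flags unusual values in either direction."""
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history makes any change unusual.
        return value != history[0]
    return abs(value - mean(history)) / sigma > z_limit


# Hypothetical recent values of a traffic metric (messages per second).
history = [100, 102, 98, 101, 99, 100, 103, 97]

limit = dynamic_limit(history)
print(f"dynamic limit: {limit:.1f}")
print("fixed breach at 104 with limit 150:", breaches_fixed(104, 150))
print("dynamic breach at 120:", 120 > limit)
print("anomaly at 90 (sudden drop):", is_anomaly(90, history))
```

Note that the z-score check is two-sided, so it also catches a sudden drop in traffic, which an upper threshold alone would miss.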
