Observability For Quality

Observability is crucial to fast development and innovation at Infobip's scale. Without accessible, fast, and accurate insight into the state of each component or service, it is impossible to ensure the quality of service that our customers expect.

We rely on tools that are widely used in observability stacks across the industry. Because of Infobip's global presence and scale, we also constantly experiment with new and better solutions to support our growth.

Observability at Infobip includes:

  • Metrics monitoring (VictoriaMetrics clusters, multiple Prometheus instances with a total of 100M active series);

  • Events monitoring (New Relic with 85+TB monthly ingest);

  • Logging (Graylog with 200+ instances);

  • Communication logs (Elasticsearch clusters with 130+B documents).

The most common metrics cover service level indicators (SLIs) such as traffic, error rate, latency, and saturation.
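As an illustration of what these four indicators measure, here is a minimal Python sketch that derives them from a batch of request samples. It is not how the metrics are actually collected at Infobip; the `Request` record, the `compute_slis` function, and the `in_flight`/`capacity` inputs are hypothetical names chosen for the example, and saturation is approximated here as in-flight requests divided by an assumed capacity.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """One observed request. The record shape is hypothetical."""
    duration_ms: float
    ok: bool


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[rank - 1]


def compute_slis(
    requests: Sequence[Request],
    window_s: float,
    in_flight: int,
    capacity: int,
) -> dict:
    """Compute traffic, error rate, latency, and saturation for one window."""
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    latency: Optional[float] = (
        percentile([r.duration_ms for r in requests], 95) if total else None
    )
    return {
        "traffic_rps": total / window_s,                 # requests per second
        "error_rate": failed / total if total else 0.0,  # fraction of failures
        "latency_p95_ms": latency,                       # 95th percentile
        "saturation": in_flight / capacity,              # assumed definition
    }


sample = [Request(12.0, True), Request(48.0, True), Request(250.0, False)]
print(compute_slis(sample, window_s=1.0, in_flight=3, capacity=20))
```

In a real stack these values would normally be computed by the metrics backend from counters and histograms rather than from raw request lists; the sketch only makes the definitions concrete.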

Based on this data, every component or service has alerts that help it meet the SLAs defined by the team that owns it. Alerts are primarily created in Alertmanager/Prometheus, sometimes in New Relic or Graylog, and sent to OpsGenie. In OpsGenie, each team defines its own on-call schedules, notification policies, and so on, according to its way of working.

Visualizations are another valuable tool that helps engineers better understand the state of their systems. The most used visualization tool at Infobip is Grafana, with many data sources and over 2,650 dashboards.

Because our business use cases are so specific at this scale, it was challenging to find a "commercial silver bullet". In addition to all the tools mentioned above, we therefore created our own alerting tool to help us cover blind spots:

  • Based on InfluxDB and Elasticsearch;

  • Integrated with OpsGenie;

  • Ability to create alerts based on any fixed or dynamic threshold, or on anomaly detection;

  • Predefined diagnostic pages for each alert type.
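
The difference between a fixed and a dynamic threshold can be sketched in a few lines of Python. This is not the implementation of the in-house tool, whose internals are not described here; the function names and the "mean plus k standard deviations" rule are assumptions picked for illustration, and that rule also serves as a very simple stand-in for anomaly detection.

```python
from statistics import mean, pstdev
from typing import Sequence


def breaches_fixed(value: float, threshold: float) -> bool:
    """Fixed threshold: alert when the value exceeds a constant limit."""
    return value > threshold


def dynamic_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Dynamic threshold: mean of recent values plus k standard deviations.

    This rule is an assumption for the sketch, not the tool's actual logic.
    """
    return mean(history) + k * pstdev(history)


def breaches_dynamic(value: float, history: Sequence[float], k: float = 3.0) -> bool:
    """Alert when the value is unusually high relative to recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate normal behaviour
    return value > dynamic_threshold(history, k)


recent = [100.0, 102.0, 98.0, 101.0, 99.0]  # hypothetical recent samples
latest = 110.0
print("fixed:", breaches_fixed(latest, threshold=150.0))
print("dynamic:", breaches_dynamic(latest, recent))
```

A value such as 110 stays well under a fixed limit of 150 but stands out against a history that hovers around 100, which is the kind of blind spot a dynamic threshold is meant to cover.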
