Infobip Engineering Handbook

Observability For Quality

Last updated 1 year ago

Observability is crucial for fast development and innovation at Infobip's scale. Without accessible, fast, and accurate insight into the state of each component or service, we cannot ensure the quality of service that our customers expect.

We use tools that are common in observability stacks across the industry. Because of Infobip's global presence and scale, we constantly experiment with new and better solutions to support our growth.

Observability at Infobip includes:

  • Metrics monitoring (VictoriaMetrics clusters, multiple Prometheus instances with a total of 100M active series);

  • Events monitoring (New Relic with 85+ TB monthly ingest);

  • Logging (Graylog with 200+ instances);

  • Communication logs (Elasticsearch clusters with 130+ billion documents).

The most common metrics cover service level indicators (SLIs) such as traffic, error rate, latency, and saturation.
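
As a rough illustration of how these four indicators relate to raw request data, here is a minimal Python sketch. It is not Infobip's implementation: the `Request` record, the `compute_slis` function, and the `capacity_rps` parameter are hypothetical names introduced only for this example, and saturation is approximated as observed traffic divided by an assumed capacity.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one handled request."""
    duration_ms: float
    ok: bool


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile of a non-empty sequence (q in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def compute_slis(
    requests: Sequence[Request],
    window_seconds: float,
    capacity_rps: float,  # assumed maximum sustainable requests per second
) -> Dict[str, float]:
    """Summarize one observation window into the four indicators."""
    if not requests:
        raise ValueError("at least one request is required")
    traffic = len(requests) / window_seconds
    failed = sum(1 for r in requests if not r.ok)
    return {
        "traffic_rps": traffic,
        "error_rate": failed / len(requests),
        "latency_p95_ms": percentile([r.duration_ms for r in requests], 95),
        "saturation": traffic / capacity_rps,
    }
```

In a real stack these values would normally come from the metrics backend rather than from in-process lists; the sketch only shows what each indicator measures.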

Based on this data, every component or service has alerts that help it meet the SLAs defined by the owning team. Alerts are primarily created in Alertmanager/Prometheus, sometimes in New Relic and Graylog, and sent to OpsGenie. In OpsGenie, each team defines its own on-call schedules, notification policies, and so on, according to its way of working.

Visualizations are another valuable tool that helps engineers understand the state of their systems. The most used visualization tool at Infobip is Grafana, with many data sources and over 2650 dashboards.

Because we have specific business use cases at such a scale, it was challenging to find a "commercial silver bullet". So, in addition to all the tools mentioned above, we created our own alerting tool to help us cover blind spots:

  • Based on InfluxDB and Elasticsearch;

  • Integrated with OpsGenie;

  • Ability to create alerts based on any fixed or dynamic threshold and anomaly detection;

  • Predefined diagnostic pages for each alert type.
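
The difference between fixed thresholds, dynamic thresholds, and anomaly detection can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the internal tool: all function names are hypothetical, the dynamic threshold is assumed to be "mean plus k standard deviations of recent history", and anomaly detection is reduced to a simple two-sided z-score check.

```python
from statistics import mean, stdev
from typing import Sequence


def breaches_fixed(value: float, limit: float) -> bool:
    """Fixed threshold: alert when the value exceeds a constant limit."""
    return value > limit


def dynamic_limit(history: Sequence[float], k: float = 3.0) -> float:
    """Dynamic threshold derived from recent history (needs >= 2 points)."""
    return mean(history) + k * stdev(history)


def breaches_dynamic(value: float, history: Sequence[float], k: float = 3.0) -> bool:
    """Alert when the value exceeds the history-based limit."""
    return value > dynamic_limit(history, k)


def is_anomaly(value: float, history: Sequence[float], z_limit: float = 3.0) -> bool:
    """Two-sided z-score check: flags unusually high or unusually low values."""
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any deviation at all is unusual.
        return value != center
    return abs(value - center) / spread > z_limit
```

A fixed limit suits hard constraints such as a latency budget, while the history-based checks adapt to each service's normal behavior. Unlike the one-sided dynamic threshold, the z-score check also catches sudden drops, for example traffic that falls far below its usual level.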
