About the customer

  • Our client was a leading trading exchange in India, serving its customers with a variety of trading products and commodities, including Intra Day, Reverse/Forward Auction, and Standard Auction.

  • Our client kept trading open for end users 24 hours a day, seven days a week, with high liquidity even at midnight and 100% accuracy on processing time.

Exchange Monitoring Setup

Industry

Energy Sector

Technologies

Prometheus, Kubernetes, AWS, Grafana

Project Duration

2 Months

Project Details

  • Bigoh was responsible for setting up a monitoring system for the exchange. The system was supposed to capture logs from various systems and raise an alert whenever logs were missed or an error log was reported.

Problem

  • The client had clusters set up in Kubernetes, with nodes in each cluster, and wanted to know the health of individual nodes as well as the clusters.
  • The client wanted a monitoring mechanism that notified the respective stakeholders in near real time of any failed logs, along with the resolution steps used previously.
  • The client was required to go through a compliance process initiated by the authorized regulatory body to ensure that all logs were captured and that there was no downtime in the system.
  • The client's workflow was divided into several stages, each of which had a predetermined grace period. The client requested a notification when a particular step's grace period was about to expire.
  • Also, since trading was involved, each activity had to be completed within a specific time frame. A delay of even a millisecond had to trigger a warning or failure notification.
  • The client wanted monitoring at the network layer, covering the switches and virtual machines on which the applications were running.
  • The client needed application monitoring for both the server side and the client apps, such as the client-side rendered Angular app and the mobile app.
  • The client was keen to track network traces for the end consumer to determine API latency and heartbeat, and to obtain an API service map showing the time taken in interactions between different microservices.
  • The client also wanted to determine the turnaround time for a particular set of operations, using the logs triggered from the consumer-facing application.
  • The client wanted to view the health and metrics (such as slow query logs and CPU utilization) of multiple products in a single unified platform.
  • Incident mapping and tracing were much needed by the client, so that RCA could draw on when a similar incident had occurred in the past.
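
The grace-period requirement can be pictured with a minimal Python sketch. It is an illustration only, not the rule engine used in the project: the `Stage` type, the `warn_before` margin, and the status names `ok`, `warning`, and `failure` are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Stage:
    """One workflow stage (hypothetical shape, assumed for illustration)."""
    name: str
    started_at: datetime
    grace_period: timedelta
    warn_before: timedelta  # assumed margin: how early to warn before expiry


def grace_status(stage: Stage, now: datetime) -> str:
    """Return 'ok', 'warning', or 'failure' for a stage at time `now`."""
    deadline = stage.started_at + stage.grace_period
    if now >= deadline:
        # The grace period has run out.
        return "failure"
    if now >= deadline - stage.warn_before:
        # The grace period is about to expire.
        return "warning"
    return "ok"
```

In a setup like the one described, the returned status would decide whether a notification goes out to the stakeholders and with what severity.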

Our Approach

  • With the help of Promtel, we covered the monitoring of Kubernetes and the clusters under it.
  • Loki was used to capture the logs triggered by the applications on both the server end and the client end.
  • Business rules were written so that, once a grace period was over, a notification was sent over Teams.
  • To deliver notifications in Teams, we configured Teams connectors in Grafana and created a Teams group to which the notifications were delivered.
  • A well-defined log structure was created. Loki captured the logs, and custom code handled the type of log that was triggered (if an API responded with a 4XX or 5XX status, a notification was triggered).
  • Prometheus was used for application and network monitoring, capturing logs from both the consumer and infrastructure ends.
  • With Prometheus, we were able to view the end user's network trace, as well as records such as API latency and the service map. We could also see how long it took for one of the workflows to complete.
  • Grafana was used to plot charts and graphs, and it was linked to various data sources from which the data was obtained.
  • A unified dashboard was created to show the health of various products as line graphs (the data shown included slow query logs, CPU utilization, etc.).
  • The whole setup was designed for real-time monitoring: there was no delay, and all alerts were triggered in real time.
  • The client was able to view the turnaround time for a specific activity, as network traces for a particular consumer machine were recorded in real time on remote dashboards.
  • Zipkin was used to track the API service map, API latency, and the amount of time it took to communicate with other microservices.
  • The client was also able to view the incident mapping report. Logs were captured for recurring incidents, which allowed the operator to arrive at a solution quickly.
  • For security, edge security and OWASP Top 10 practices were followed.
  • Due to the use of Promtel, the respective teams could easily see the health of each cluster and node in the dashboard.
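
As an illustration of the log-type handling described above, the following Python sketch decides whether a structured log line should raise a notification. It is a sketch under stated assumptions, not the custom code from the project: the JSON log format, the `status` and `path` field names, and the mapping of 4XX to `warning` and 5XX to `critical` are all hypothetical.

```python
import json
from typing import Optional


def alert_for_log(line: str) -> Optional[dict]:
    """Return an alert payload for a 4XX/5XX log line, otherwise None.

    Assumes each log line is a JSON object with a 'status' field
    (hypothetical format chosen for this example).
    """
    try:
        record = json.loads(line)
        status = int(record["status"])
    except (ValueError, KeyError, TypeError):
        # Not valid JSON, or no usable status code: nothing to alert on here.
        return None

    if 400 <= status <= 499:
        severity = "warning"   # assumed severity for client errors
    elif 500 <= status <= 599:
        severity = "critical"  # assumed severity for server errors
    else:
        return None

    return {
        "severity": severity,
        "status": status,
        "path": record.get("path", "unknown"),
    }
```

A payload returned by this function could then be handed to whichever notification channel is configured; any other response status yields `None` and no notification.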

Benefits

  • The client successfully passed the audit conducted by the regulatory body in only two attempts.
  • Incidents of system downtime were close to zero.
  • Trade loss was less than 1.5 percent, because notifications and alerts were triggered on time.