How We Improved Our Notification System

Teyyihan Aksu
Published in Trendyol Tech · 8 min read · Dec 15, 2023


Introduction

Let’s start by getting to know each other. We are the Order Management System team for Trendyol Instant Delivery. We manage orders’ lifecycles and take action based on them. There are multiple ways to affect an order’s lifecycle; here we are going to look at one of them specifically: our Picker friends.

Pickers are the workers at the market who accept, prepare, and hand the order to the courier. They also do other things, like calling customers if some items are missing, notifying us that certain items are out of stock, and so on. That’s why we’ve developed a mobile application just for Pickers to use. We simply call it the Picker App.

When there is an order, it needs to be accepted and prepared. But we need to tell the Pickers to do it. How can we notify them? They can’t just refresh the page forever. We needed a notification system.

Our First Approach

Since our team has good knowledge of the Spring Framework, Kafka, and Couchbase, we thought we could use these technologies to solve our problem. We created a backend application that listens for our order lifecycle events (e.g. created, cancelled, delivered) and sends a notification request to Firebase Cloud Messaging. Thus, the Picker App receives a push notification and Pickers are notified. Let’s take a look at the architecture.

First, there are multiple notifications that need to be sent for multiple different events, so our application listens to multiple Kafka topics and sends a request to FCM for each notification.
Before sending a notification, it inserts a record to lock that notification. Since we run this app with multiple replicas, there is a small chance that more than one consumer will try to send the same notification. The lock ensures that only one replica actually sends it.
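A minimal sketch of what such a listener could look like with Spring Kafka and Kotlin (the event class, topic name, and the NotificationLock/FcmSender collaborators are illustrative placeholders, not our production code):

import org.springframework.kafka.annotation.KafkaListener
import org.springframework.stereotype.Component

// Hypothetical event payload and collaborators, only to illustrate the shape of the first approach.
data class OrderCreatedEvent(val orderNumber: String, val storeId: String)

interface NotificationLock { fun tryLock(key: String): Boolean } // inserts a lock record; false if it already exists
interface FcmSender { fun sendToStoreTopic(storeId: String, title: String, body: String) }

@Component
class OrderCreatedListener(
    private val lock: NotificationLock,
    private val fcm: FcmSender
) {
    // One listener per domain topic; the topic name here is made up.
    @KafkaListener(topics = ["order-created"])
    fun onOrderCreated(event: OrderCreatedEvent) {
        // Only the replica that wins the lock asks FCM to push, so duplicates are avoided.
        if (lock.tryLock("order-created-${event.orderNumber}")) {
            fcm.sendToStoreTopic(event.storeId, "New Order", "You've got a new order: ${event.orderNumber}")
        }
    }
}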

Multiple Pickers can work at the same store, and we need to send the same notification to all Pickers in that store. There were two ways we could deliver notifications:

  1. We could store FCM tokens that we would receive from the Picker App and we would send the notification to each Picker individually.
  2. We could subscribe Picker apps to an FCM topic and send a single notification to that topic; every Picker app subscribed to it would receive the notification.

We decided that the second option suited us better, since we wanted to send the same notification to all Pickers at the same store. So we subscribed Picker apps to topics related to their stores. For example, say Picker1 and Picker2 work at Store A and Picker3 works at Store B. The Picker apps of Picker1 and Picker2 would subscribe to a topic like “store-a”, and Picker3’s would subscribe to a topic like “store-b”.
With that, we could send a single notification to a topic and notify multiple Pickers.
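With the Firebase Admin SDK, this roughly looks like the sketch below (topic names and token lists are illustrative, and exact builder methods may differ slightly between SDK versions):

import com.google.firebase.messaging.FirebaseMessaging
import com.google.firebase.messaging.Message
import com.google.firebase.messaging.Notification

// Subscribe every Picker device at a store to that store's topic, e.g. "store-a".
fun subscribeStorePickers(storeTopic: String, pickerTokens: List<String>) {
    FirebaseMessaging.getInstance().subscribeToTopic(pickerTokens, storeTopic)
}

// A single message sent to the topic fans out to every subscribed Picker app.
fun notifyStore(storeTopic: String, title: String, body: String) {
    val message = Message.builder()
        .setTopic(storeTopic)
        .setNotification(Notification.builder().setTitle(title).setBody(body).build())
        .build()
    FirebaseMessaging.getInstance().send(message)
}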

Problems With This Approach

We used this approach for months and saw some areas that could be improved and some that needed to be completely changed.

The first problem was that we received many complaints from Pickers who did not get notifications while their colleagues at the same store did. We suspected that the Picker app somehow couldn’t subscribe to the topic, but that was not the case, because the problem happened only sometimes, not always. We also couldn’t blame our backend for failing to send the notification to FCM, since some Pickers at the same store did receive it.

The other problem was the scalability and traceability of the overall system. That was mostly about backend architecture.

Sending a single notification to FCM could take up to 1 second, and since we were using Kafka, our parallelism was directly tied to our Kafka topics’ partition counts. We could have implemented a complex async consumer mechanism without losing the error/retry mechanism, but it would have been either not fully reliable or too complex.

Another way to solve the scalability problem was simply increasing the partition count of the Kafka topics. But we were listening to domain events, and multiple consumers need those events, not just our Notification System. We would have been increasing the partition counts for the sake of a single consumer, which means our Support Domain would directly impact decision-making in the Main Domain. We didn’t like this idea.

So we decided to refactor our system.

The New Approach

With the problems mentioned above in mind, we refactored two main things: first, the way we deliver notifications, and second, the overall backend architecture.

Changing the way we deliver notifications was mostly straightforward. We now send FCM tokens from the Picker App to the backend, and the backend stores these tokens for future notification sends.

There are two ways to invalidate tokens. First, the Picker App can simply call the /unsubscribe endpoint. Second, the next time the notification system tries to send a notification to the target app, if FCM responds with UNREGISTERED, we remove that FCM token from our database.

The second way to invalidate tokens
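A minimal sketch of that second path with the Firebase Admin SDK (the TokenRepository abstraction is a placeholder for wherever the tokens are stored; this style of error-code handling is available in recent Admin SDK versions):

import com.google.firebase.messaging.FirebaseMessaging
import com.google.firebase.messaging.FirebaseMessagingException
import com.google.firebase.messaging.Message
import com.google.firebase.messaging.MessagingErrorCode

// Placeholder for wherever the FCM tokens live; not the real repository.
interface TokenRepository { fun delete(fcmToken: String) }

fun sendToToken(tokens: TokenRepository, fcmToken: String, message: Message) {
    try {
        FirebaseMessaging.getInstance().send(message)
    } catch (e: FirebaseMessagingException) {
        // FCM says this token is no longer registered, so stop targeting that device.
        if (e.messagingErrorCode == MessagingErrorCode.UNREGISTERED) {
            tokens.delete(fcmToken)
        } else {
            throw e
        }
    }
}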

Changes in the Architecture

We wanted the system to be more flexible: when a new channel needs to trigger notifications, it should extend the system rather than change it. So the first decision was to put all notifications into one Kafka topic. This way we could route every notification there, no matter where it came from, and a single consumer (with multiple replicas) would do nothing but send those notifications to FCM.
We used Kafka, Couchbase, and the Couchbase Kafka Connector to achieve this.

New architecture in a nutshell
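To give a rough idea of that single consumer, here is a sketch (the topic name, types, and the PushSender wrapper are illustrative; the real contract model is shown just below):

import org.springframework.kafka.annotation.KafkaListener
import org.springframework.stereotype.Component

// Simplified shape of the contract model; field names follow the JSON example below.
data class Destination(val address: String, val type: String) // TOKEN or TOPIC
data class NotificationEvent(
    val uniqueId: String,
    val destination: Destination,
    val title: String,
    val message: String,
    val overdueTime: Long
)

// Placeholder for the component that talks to FCM.
interface PushSender { fun send(event: NotificationEvent) }

@Component
class NotificationConsumer(private val sender: PushSender) {

    // Every notification, no matter which channel produced it, lands on this one topic
    // (a JSON deserializer for NotificationEvent is assumed to be configured).
    @KafkaListener(topics = ["picker-notifications"]) // illustrative topic name
    fun onNotification(event: NotificationEvent) {
        if (System.currentTimeMillis() > event.overdueTime) return // skip outdated notifications
        sender.send(event)
    }
}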

Since notifications come from multiple sources, we needed to specify a contract model for them. With this contract model, any source can route its notifications to the single consumer explained above. The contract model looks like this:

{
  "uniqueId": "(fcmToken)-(eventID)",
  "destination": {
    "address": "(fcmToken or FCM topic name)",
    "type": "(TOKEN or TOPIC)"
  },
  "userId": 123456,
  "title": "New Order 🔔",
  "message": "You've got new order incoming! You can start preparing the order 1000000123",
  "args": {
    "orderNumber": "1000000123",
    "packageId": "10000000053",
    "deeplink": "typa://Order/Packages/1000000123"
  },
  "orderNumber": "1000000123",
  "eventType": "INSTANT_PACKAGE_CREATED",
  "auditInfo": {
    "correlationId": "8cef46a1-5a45-4754-8614-63562f994001",
    "agentName": "instant-ff-listener",
    "executorUser": ""
  },
  "overdueTime": 1702032991306
}

Let me explain the fields one by one:

  1. uniqueId is the key we use for the locking mechanism to prevent duplicate notifications from being sent. I’ll mention it later in detail.
  2. destination specifies where the notification will be sent. It can go directly to one device via an FCM token or to an FCM topic.
  3. userId is for tracing. We can track which notifications were sent to a specific user.
  4. title and message are the push notification’s title and body texts.
  5. args is mapped directly to the FCM push notification’s data segment.
  6. orderNumber is for tracing. We can track which notifications were sent for a specific order.
  7. eventType is used to determine notification channels on both Android and iOS.
  8. auditInfo is where we keep our tracing variables. I’ll explain how we trace later in detail.
  9. overdueTime is used to prevent sending outdated notifications. We don’t want to notify our workers 1 hour after the order is created. The current time window for that is 5 minutes.

This whole document is stored in Couchbase and then published to Kafka using the Couchbase Kafka Connector (the outbox pattern).
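For illustration only, a source-connector configuration for this outbox flow could look roughly like the snippet below. The exact property names depend on the kafka-connect-couchbase version you run, so treat this as an assumption-laden sketch rather than a drop-in config:

name=notification-outbox-source
connector.class=com.couchbase.connect.kafka.CouchbaseSourceConnector
couchbase.seed.nodes=couchbase-host
couchbase.bucket=notifications
couchbase.username=connect-user
couchbase.password=******
# Publish the stored notification documents as raw JSON to the unified notification topic.
couchbase.source.handler=com.couchbase.connect.kafka.handler.source.RawJsonSourceHandler
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
couchbase.topic=picker-notifications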

Locking

Since we no longer send the actual notification in the same transaction (we only store the manifest of that notification), we needed to refactor our locking mechanism. If we locked notifications as before, the consumer that actually sends the notification could still consume the same message twice, and nothing would stop it from sending that notification twice. So we moved the locking into the consumer that listens for notification events (the event above).

As mentioned earlier, we send push notifications to multiple users for one action. For example, an order is created and we send push notifications to multiple devices. So we need a locking key for every device. We decided to use “(eventId)-(fcmToken)” as our locking key.

That way we can prevent the same notifications from being sent multiple times.
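A minimal sketch of that lock with the Couchbase Java SDK (the collection wiring and the TTL value are placeholders):

import com.couchbase.client.core.error.DocumentExistsException
import com.couchbase.client.java.Collection as CouchbaseCollection
import com.couchbase.client.java.json.JsonObject
import com.couchbase.client.java.kv.InsertOptions
import java.time.Duration

// Tries to claim "(eventId)-(fcmToken)" as a lock document. Only the consumer whose
// insert succeeds sends the push; everyone else hits DocumentExistsException and skips.
fun tryLock(locks: CouchbaseCollection, eventId: String, fcmToken: String): Boolean {
    val lockKey = "$eventId-$fcmToken"
    return try {
        locks.insert(
            lockKey,
            JsonObject.create().put("lockedAt", System.currentTimeMillis()),
            // A short expiry keeps lock documents from piling up; this TTL is a made-up example.
            InsertOptions.insertOptions().expiry(Duration.ofMinutes(10))
        )
        true
    } catch (e: DocumentExistsException) {
        false // another replica already locked (and will send) this notification
    }
}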

Tracing

There is definitely a need for tracing to see which notifications were sent and which were not. As mentioned before, we have a few fields for auditing; among them, the most important one is correlationId. We put correlationId into the MDC in our notification-system service and write plenty of logs with orderNumber, userId, and more. We can then search the logs by orderNumber, userId, uniqueId, or, most importantly, correlationId. By doing so, we can see a notification’s entire lifecycle: What triggered it? How many devices will receive it? Did we send it successfully? We can find answers to all of this.
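A small sketch of the MDC part (the helper and the field names are illustrative):

import org.slf4j.LoggerFactory
import org.slf4j.MDC

private val log = LoggerFactory.getLogger("notification-tracing")

// Puts the correlationId into the MDC while one notification is processed, so every
// log line written along the way carries it automatically.
fun <T> withCorrelationId(correlationId: String, block: () -> T): T {
    MDC.put("correlationId", correlationId)
    return try {
        block()
    } finally {
        MDC.remove("correlationId") // don't leak the id to the next message handled on this thread
    }
}

// Usage: wrap the handling of a consumed notification event.
fun handle(correlationId: String, orderNumber: String, userId: Long) =
    withCorrelationId(correlationId) {
        log.info("Sending notification orderNumber={} userId={}", orderNumber, userId)
        // ... build the FCM message and send it ...
    }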

Further Scalability Improvements

The next thing we needed to focus on was scalability. Since we stopped consuming multiple Kafka topics and started consuming only one, we needed to increase the partition count of that topic. We did some load tests and decided to go with 64 partitions. Currently, we don’t scale our application up to that many replicas, but we can whenever it becomes necessary.

But we can’t increase the partition count to infinity. Our infrastructure guys wouldn’t like that. So if we ever need to scale our throughput further, there are some things we can do:

  1. We can consume a batch of messages and process them in parallel. Since we are using Kafka, we’d need to disable auto-ack mode and commit the offsets ourselves. But what if some messages in the batch fail and others don’t? In that case, we would publish the failed messages to an error/retry topic and commit the offset anyway. This way we can increase our throughput while keeping the error/retry mechanism, which is important.
  2. We can consume the messages one by one as usual, but hand them off to a worker pool, for example an ExecutorService in Java or a Dispatcher in Kotlin Coroutines, and process them asynchronously. With this approach we must pay attention to two things. The first is back pressure: we’d need to handle it ourselves, otherwise we would keep accepting new notifications faster than we can send them. The second is shutting down the app gracefully; otherwise we would accept notifications but exit before sending them. A sketch of this option follows the list.
  3. Our use case requires us to keep notifications in order. But if yours doesn’t, you can use technologies other than Kafka to scale up your consumers without impacting your messaging infrastructure.
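As an example of the second option, a worker pool built on Kotlin coroutines could look like the sketch below (the sizes, types, and FCM call are placeholders, not our production code):

import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// A bounded channel gives us back pressure (submit suspends when workers fall behind),
// and closing the channel on shutdown lets workers drain what they already accepted.
class NotificationWorkerPool(
    private val send: suspend (String) -> Unit, // placeholder for the actual FCM call
    workers: Int = 16,
    capacity: Int = 256
) {
    private val queue = Channel<String>(capacity)
    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.IO)
    private val jobs = List(workers) {
        scope.launch {
            for (notification in queue) send(notification) // each worker pulls from the shared queue
        }
    }

    // Called from the Kafka consumer; suspends when the queue is full (back pressure).
    suspend fun submit(notification: String) = queue.send(notification)

    // Graceful shutdown: stop accepting new work, then wait for in-flight notifications.
    fun shutdown() = runBlocking {
        queue.close()
        jobs.forEach { it.join() }
    }
}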

We’ve seen significantly better feedback about notification delivery from our users. The codebase became cleaner and easier to develop on, and the system is now more extensible.

About Us

We’re building a team of the brightest minds in our industry. Interested in joining us? Visit the pages below to learn more about our open positions.
