How We Built a Scalable Email & SMS System with RabbitMQ

When you’re running a SaaS platform that sends thousands of transactional emails and SMS messages daily—appointment reminders, invoices, estimates, lead notifications—you quickly realize that “just send it” doesn’t scale.

Here’s how we evolved our messaging architecture from direct API calls to a queue-based system that handles failures gracefully, prevents duplicates, and keeps our main application responsive.

The Problem with Direct Sending

Our initial implementation was straightforward:

await emailService.sendInvoiceEmail(customer.email, invoiceData);
await messagingService.sendSMS(customer.phone, reminderText);

This works fine for a few hundred messages a day. But as we scaled, we hit several issues:

1. Request Timeouts

Email and SMS providers sometimes take 2-5 seconds to respond. When a user clicks “Send Invoice,” they shouldn’t wait for Twilio or SendGrid to acknowledge the message.

2. No Retry Logic

If SendGrid returns a 503, the email just… doesn’t send. The user might not even know it failed.

3. Duplicate Messages

Network hiccups could cause an upstream caller to repeat a request, sending the same appointment reminder twice. Customers don’t appreciate that.

4. Scaling Bottlenecks

Our API servers were doing the heavy lifting of email rendering and API calls instead of just handling HTTP requests.

The Solution: Decouple with Message Queues

We introduced RabbitMQ as a message broker between our main backend and a dedicated messaging service. Here’s the architecture:

┌─────────────┐     ┌───────────┐     ┌──────────────────┐
│   Backend   │────▶│  RabbitMQ │────▶│ Messaging Service│
│   (API)     │     │  (Queue)  │     │   (Consumer)     │
└─────────────┘     └───────────┘     └──────────────────┘
                          │                    │
                          │                    ▼
                          │           ┌──────────────────┐
                          │           │ Email Providers  │
                          │           │ (SendGrid, etc.) │
                          │           └──────────────────┘
                          │                    │
                          ▼                    ▼
                    ┌───────────┐      ┌──────────────┐
                    │  Events   │◀─────│    Twilio    │
                    │  Queue    │      └──────────────┘
                    └───────────┘

How It Works

1. Publishing Messages

When our backend needs to send an email, it publishes a message to RabbitMQ instead of calling the provider directly:

const { getMessagePublisher } = require('./infrastructure/rabbitmq');

async function sendInvoiceEmail(invoice, customer, company) {
  const publisher = getMessagePublisher();

  const result = await publisher.publishEmail({
    to: customer.email,
    subject: `Invoice #${invoice.number} from ${company.name}`,
    html: renderInvoiceTemplate(invoice),
    context: {
      companyId: invoice.companyId,
      entityType: 'invoice',
      entityId: invoice.id,
      recipientId: customer.id
    }
  });

  // Returns immediately - email is queued
  return { queued: true, messageId: result.messageId };
}

The API responds as soon as the broker accepts the message. The actual email sending happens asynchronously.
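To make the hand-off concrete, here is a minimal sketch of the envelope a publisher like this might build. The consumer shown later reads `payload`, `context`, and `retry` from the message body, so the sketch produces that shape. The helper name `buildEmailEnvelope`, the starting attempt count, and the default of 3 attempts are assumptions for illustration, not the actual publisher.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical helper: wraps an email in the shape the consumer expects.
function buildEmailEnvelope(email, { maxAttempts = 3 } = {}) {
  const { context, ...payload } = email;
  return {
    // Intended for the AMQP messageId property; doubles as the idempotency key.
    messageId: randomUUID(),
    body: {
      payload,                                   // to, subject, html
      context,                                   // companyId, entityType, ...
      retry: { currentAttempt: 0, maxAttempts }  // assumed defaults
    }
  };
}

const envelope = buildEmailEnvelope({
  to: 'customer@example.com',
  subject: 'Invoice #1042',
  html: '<p>Thanks for your business.</p>',
  context: { companyId: 'c1', entityType: 'invoice', entityId: 'inv_1042', recipientId: 'u7' }
});

// With amqplib, the envelope could then be published roughly like this:
// channel.sendToQueue('messaging.email',
//   Buffer.from(JSON.stringify(envelope.body)),
//   { persistent: true, messageId: envelope.messageId });
console.log(envelope.messageId, envelope.body.retry);
```

Generating the ID on the publishing side means the backend can store it immediately and match it against later status events.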

2. Queue Structure

We use three queues for different message types:

Queue	Purpose
messaging.email	Standard transactional emails
messaging.email.priority	Urgent emails (password resets, OTPs)
messaging.sms	All SMS messages

Urgent emails go through their own queue, so time-sensitive messages aren’t stuck behind a large batch of standard emails.
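A small routing function is one way to keep the queue choice in a single place. The queue names below come from the table above; the `channel` and `urgent` fields on the message are hypothetical.

```javascript
// Queue names from the table above.
const QUEUES = Object.freeze({
  email: 'messaging.email',
  priorityEmail: 'messaging.email.priority',
  sms: 'messaging.sms'
});

// Hypothetical router: picks a queue from the message's channel and urgency.
function chooseQueue({ channel, urgent = false }) {
  if (channel === 'sms') return QUEUES.sms; // all SMS share one queue
  if (channel === 'email') return urgent ? QUEUES.priorityEmail : QUEUES.email;
  throw new Error(`Unknown channel: ${channel}`);
}

console.log(chooseQueue({ channel: 'email', urgent: true }));
console.log(chooseQueue({ channel: 'email' }));
console.log(chooseQueue({ channel: 'sms', urgent: true }));
```

A separate queue only speeds things up if it also has its own consumer capacity. How consumers are split across these queues is not covered here.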

3. The Consumer Service

A separate Node.js service consumes messages from these queues:

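channel.consume('messaging.email', async (message) => {
  if (message === null) return; // Consumer was cancelled by the broker

  const body = JSON.parse(message.content.toString());
  const { payload, context, retry } = body;
  // Assumes the publisher sets the AMQP messageId property
  const messageId = message.properties.messageId;

  // Idempotency check - prevent duplicates
  if (await wasAlreadySent(messageId)) {
    channel.ack(message);
    return;
  }

  try {
    const result = await sendViaProvider(payload, context);
    await markAsSent(messageId);
    await publishEvent('message.sent', {
      correlationId: messageId,
      provider: result.provider
    });
    channel.ack(message);
  } catch (error) {
    if (retry.currentAttempt < retry.maxAttempts) {
      // A plain requeue redelivers the same body, so the attempt counter
      // would never advance. Republish with the counter incremented,
      // then ack the original delivery.
      const next = {
        ...body,
        retry: { ...retry, currentAttempt: retry.currentAttempt + 1 }
      };
      channel.sendToQueue('messaging.email', Buffer.from(JSON.stringify(next)), {
        persistent: true,
        messageId
      });
      channel.ack(message);
    } else {
      channel.nack(message, false, false); // Dead letter queue
      await publishEvent('message.failed', {
        correlationId: messageId,
        error: error.message
      });
    }
  }
});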

4. Event Feedback Loop

When the consumer sends a message (or fails permanently), it publishes an event back to RabbitMQ. Our backend listens for these events to update notification status:

consumer.on('message.sent', async (event) => {
  await notificationService.updateStatus(event.correlationId, 'sent');
});

consumer.on('message.failed', async (event) => {
  await notificationService.updateStatus(event.correlationId, 'failed');
});
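The listeners above can be driven by a single dispatch function that parses each raw event and maps its type to a status. This is a sketch under assumptions: the serialized event is taken to carry a `type` field next to `correlationId`, and `handleEvent` is a hypothetical name. The notification service is injected so the logic runs without a broker.

```javascript
// Event type -> notification status, mirroring the two listeners above.
const STATUS_BY_EVENT = new Map([
  ['message.sent', 'sent'],
  ['message.failed', 'failed']
]);

// Hypothetical handler for one raw event taken from the events queue.
async function handleEvent(rawContent, notificationService) {
  const { type, correlationId } = JSON.parse(rawContent.toString());
  const status = STATUS_BY_EVENT.get(type);
  if (!status) return false; // Unknown event type: ignore it
  await notificationService.updateStatus(correlationId, status);
  return true;
}

// In-memory stand-in for the real notification service.
const statuses = new Map();
const fakeService = {
  updateStatus: async (id, status) => { statuses.set(id, status); }
};

const toBuffer = (event) => Buffer.from(JSON.stringify(event));
const handled = Promise.all([
  handleEvent(toBuffer({ type: 'message.sent', correlationId: 'm1' }), fakeService),
  handleEvent(toBuffer({ type: 'message.failed', correlationId: 'm2', error: 'timeout' }), fakeService),
  handleEvent(toBuffer({ type: 'message.opened', correlationId: 'm3' }), fakeService)
]);
handled.then((results) => console.log(results, [...statuses]));
```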

Key Design Decisions

Idempotency with Redis

Every message has a unique ID. Before sending, the consumer checks Redis to see if that ID was already processed. This prevents duplicates even if a message is redelivered.

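async function wasAlreadySent(messageId) {
  // Read-only check. The key is written by markAsSent() only after the
  // provider accepts the message, so a failed attempt can still be retried.
  return (await redis.exists(`sent:${messageId}`)) === 1;
}

async function markAsSent(messageId) {
  await redis.set(`sent:${messageId}`, '1', 'EX', 604800); // 7-day TTL
}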
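The check and the send are separate steps, so two consumers handling the same redelivered message at the same moment could both pass the check. One way to close that window is an atomic claim using Redis `SET ... NX`, released again if the send fails. This is a sketch, not the implementation used here: `claimMessage` and `releaseClaim` are hypothetical names, the client is assumed to be ioredis-style (variadic `set` arguments, as in the snippet above), and the in-memory stand-in exists only so the logic can run without Redis.

```javascript
const SEVEN_DAYS = 7 * 24 * 60 * 60; // 604800 seconds, same TTL as above

// Returns true if this consumer won the claim, false if the key already exists.
async function claimMessage(redis, messageId) {
  // SET key value EX ttl NX only succeeds when the key does not exist yet.
  const reply = await redis.set(`sent:${messageId}`, '1', 'EX', SEVEN_DAYS, 'NX');
  return reply === 'OK';
}

// Call when the provider call fails, so a retry is not mistaken for a duplicate.
async function releaseClaim(redis, messageId) {
  await redis.del(`sent:${messageId}`);
}

// Tiny in-memory stand-in for Redis (ignores the TTL).
function createFakeRedis() {
  const store = new Map();
  return {
    async set(key, value, ...flags) {
      if (flags.includes('NX') && store.has(key)) return null;
      store.set(key, value);
      return 'OK';
    },
    async del(key) { return store.delete(key) ? 1 : 0; }
  };
}

const demo = (async () => {
  const redis = createFakeRedis();
  const first = await claimMessage(redis, 'm1');  // wins the claim
  const second = await claimMessage(redis, 'm1'); // duplicate delivery
  await releaseClaim(redis, 'm1');                // simulate a failed send
  const third = await claimMessage(redis, 'm1');  // retry can claim again
  return { first, second, third };
})();
demo.then(console.log);
```

The trade-off is that a consumer that crashes after claiming but before sending leaves the key in place, so that message would be treated as sent until the TTL expires.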

Graceful Fallback

We kept direct-send code paths intact. If RabbitMQ is down, the system falls back to synchronous sending:

if (publisher.isEnabled()) {
  await publisher.publishEmail(emailData);  // Queue-based
} else {
  await emailService.sendRawEmail(emailData);  // Direct fallback
}
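The check above covers a disabled publisher. A variation that also falls back when the publish call itself throws might look like the sketch below. `sendEmailWithFallback` is a hypothetical wrapper, and the stand-in publisher and email service exist only so the example can run.

```javascript
// Hypothetical wrapper: prefer the queue, send directly if the publisher
// is disabled or the publish call fails.
async function sendEmailWithFallback(publisher, emailService, emailData) {
  if (publisher.isEnabled()) {
    try {
      const result = await publisher.publishEmail(emailData);
      return { via: 'queue', messageId: result.messageId };
    } catch (error) {
      console.warn(`Publish failed, sending directly: ${error.message}`);
    }
  }
  await emailService.sendRawEmail(emailData);
  return { via: 'direct' };
}

// Stand-ins: a publisher whose connection is broken, and a recording email service.
const brokenPublisher = {
  isEnabled: () => true,
  publishEmail: async () => { throw new Error('connection closed'); }
};
const sentDirectly = [];
const fakeEmailService = {
  sendRawEmail: async (data) => { sentDirectly.push(data.to); }
};

const outcome = sendEmailWithFallback(
  brokenPublisher, fakeEmailService, { to: 'customer@example.com' }
);
outcome.then((result) => console.log(result, sentDirectly));
```

Direct sends skip the consumer's idempotency check and retries, so the fallback path gives up those guarantees for as long as it is in use.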

Dead Letter Queue

Messages that fail after 3 retries go to a dead letter queue. A separate process alerts our team and allows manual retry.
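For a rejected message to reach a dead letter queue, the work queue has to be declared with a dead letter exchange. A minimal sketch of that setup with an amqplib-style channel follows. The exchange name `messaging.dlx`, the queue name `messaging.email.dead`, and the routing key are hypothetical, since the actual names are not given here. The recording channel is a stand-in so the example runs without a broker.

```javascript
// Hypothetical names for the dead letter exchange and queue.
const DLX = 'messaging.dlx';
const DEAD_QUEUE = 'messaging.email.dead';

// `channel` is expected to behave like an amqplib channel.
async function declareEmailQueues(channel) {
  await channel.assertExchange(DLX, 'direct', { durable: true });
  await channel.assertQueue(DEAD_QUEUE, { durable: true });
  await channel.bindQueue(DEAD_QUEUE, DLX, 'email');
  // Messages nacked with requeue=false are rerouted to the DLX by the broker.
  await channel.assertQueue('messaging.email', {
    durable: true,
    deadLetterExchange: DLX,
    deadLetterRoutingKey: 'email'
  });
}

// Recording stand-in for a channel: stores each call instead of talking to RabbitMQ.
function createRecordingChannel() {
  const calls = [];
  const record = (name) => async (...args) => { calls.push([name, ...args]); };
  return {
    calls,
    assertExchange: record('assertExchange'),
    assertQueue: record('assertQueue'),
    bindQueue: record('bindQueue')
  };
}

const recorder = createRecordingChannel();
const declared = declareEmailQueues(recorder)
  .then(() => recorder.calls.map(([name, target]) => `${name}:${target}`));
declared.then(console.log);
```

RabbitMQ rejects a redeclaration of an existing queue with different arguments, so adding dead-lettering to a queue that already exists usually means deleting and recreating it, or applying a policy instead.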

Separate Cron Jobs

Time-based messaging runs in the messaging service:

Job	Schedule	Purpose
Appointment Reminders	Every 15 min	Reminders 7 days and 1 day before
Lead Nurture	Every hour	Process drip email campaigns
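As an illustration of the reminder job, the sketch below selects the reminders whose send time falls inside the current 15-minute run window. The function name, the appointment shape (`id`, `startsAt`), and the half-open window are assumptions. The real job may select appointments differently.

```javascript
const MINUTE = 60 * 1000;
const DAY = 24 * 60 * MINUTE;
const RUN_INTERVAL = 15 * MINUTE;   // the job runs every 15 minutes
const OFFSETS = [7 * DAY, 1 * DAY]; // reminders 7 days and 1 day before

// Returns the reminders whose send time falls in [now, now + RUN_INTERVAL).
function dueReminders(appointments, now) {
  const windowStart = now.getTime();
  const windowEnd = windowStart + RUN_INTERVAL;
  const due = [];
  for (const appt of appointments) {
    for (const offset of OFFSETS) {
      const sendAt = appt.startsAt.getTime() - offset;
      if (sendAt >= windowStart && sendAt < windowEnd) {
        due.push({ appointmentId: appt.id, daysBefore: offset / DAY });
      }
    }
  }
  return due;
}

const now = new Date('2024-03-01T09:00:00Z');
const appointments = [
  { id: 'a1', startsAt: new Date('2024-03-08T09:05:00Z') }, // 7 days + 5 min away
  { id: 'a2', startsAt: new Date('2024-03-02T09:14:00Z') }, // 1 day + 14 min away
  { id: 'a3', startsAt: new Date('2024-03-02T09:15:00Z') }, // left for the next run
  { id: 'a4', startsAt: new Date('2024-03-05T12:00:00Z') }  // nothing due
];
console.log(dueReminders(appointments, now));
```

The half-open window means each reminder belongs to exactly one run when the job fires on schedule. A delayed or skipped run would need a wider window, with the idempotency check guarding against the resulting overlap.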

Results

After deploying this architecture:

API response times dropped 40% - No more waiting for email providers
Message delivery rate improved to 99.7% - Retries catch transient failures
Zero duplicate messages - Idempotency keys work
Better visibility - Every message tracked with full context
Easier debugging - Failed messages in DLQ with error details

When You Don’t Need This

This adds complexity. You probably don’t need it if:

You send fewer than 1,000 messages/day
Message delivery isn’t business-critical
You’re a small team without DevOps capacity
Your providers have built-in retry

Start simple. Add queues when you feel the pain.

Tech Stack

Message Broker: RabbitMQ
Backend: Node.js/Express
Consumer: Separate Node.js service
Idempotency Store: Redis
Email Providers: SendGrid, Mailgun, SMTP
SMS Provider: Twilio

Conclusion

Decoupling message sending from your main application feels like overkill—until it isn’t. The first time your app stays responsive during a SendGrid outage, or you catch a duplicate before it annoys a customer, you’ll be glad you made the investment.

Build it incrementally. Start with queue infrastructure, add consumers, then migrate high-volume messages first. Keep fallback paths working until you trust the system.