FlowGlow Lighting
Smart lighting platform with IoT backend
End-to-end platform for connected lighting systems: ESP32 firmware with MQTT comms, a Node consumer as device backend, Supabase as the data layer with real-time broadcast, a Flutter app for end customers and an Angular admin dashboard for support and roll-outs.
01 Problem
FlowGlow wanted to turn a "dumb" smart lamp into a fully connected product — BLE provisioning, live control via app, reliable diagnostics when things go wrong, firmware OTA with rollback and integration into smart-home ecosystems. All on an architecture that handles hundreds of devices concurrently without every state change drowning the backend.
02 Solution
Architecture in four layers: ESP32 firmware with versioned MQTT topics, a Node consumer as device backend, Supabase with Postgres as the data layer (including per-user real-time broadcast), a Flutter app for end customers and an Angular dashboard for support, firmware roll-outs and diagnostics. Deployed via Coolify on a single Hetzner box.
03 Outcome
Hundreds of devices monitored in real time, firmware updates roll out in staged waves and coredumps come in automatically — support has a symbolicated stack trace in hand before a customer ever calls. Smart-home integration with Google Home and Alexa as the next step went live in early 2026.
Architecture in four layers
The platform is built around four cleanly separated layers, and each layer talks to its neighbours over a single, documented protocol:
- Firmware (ESP32 / C++): brings up Wi-Fi after boot, reads credentials from NVS and connects to the Mosquitto broker over TLS. Publishes state changes on
d/<id>/state, OTA events ond/<id>/ota_events, crash summaries ond/<id>/coredump, and accepts commands ond/<id>/cmd. - MQTT consumer (Node.js): subscribes to every device topic, validates payloads, deduplicates, and writes the result transactionally into Postgres. Also publishes commands the other way by polling a
lamp_commandstable. - Supabase / Postgres: the single source of truth. RLS policies govern visibility per customer, a
BEFORE INSERTtrigger ondevice_eventssnapshots every transition, and anAFTER UPDATEtrigger ondevice_statebroadcasts the change to exactly the right user channel. - Clients: a Flutter app for end customers (iOS, Android) and an Angular dashboard for support and operations. Both speak GraphQL for reads and Realtime Broadcast for live updates — no direct MQTT path into the frontend.
The strict separation means each layer ships independently: firmware updates roll out across devices, the consumer is a stateless worker behind Coolify, the dashboard is an SPA, the Flutter app a store build. No “big-deploy” moment on the calendar, no lock-step between frontend and backend.
Approach
Three design choices run through the entire backend and are why the system stays stable with hundreds of devices online at once.
First: nothing blocks. The consumer is purely async, every message its own transaction. If Postgres briefly lags, messages queue at the broker, not inside the code. Second: everything is idempotent. Every handler can receive the same message any number of times without corrupting state. Third: diagnostics come in automatically. When a device crashes, it uploads an ELF coredump on its own — nobody has to call a customer and say “please send me the log”.
// Staged firmware rollout: 1% → 10% → 100% with a health gate between waves.
export async function scheduleRollout(v: string) {
const stages = [0.01, 0.10, 1.00];
for (const share of stages) {
await rolloutTo(v, share);
await waitForHealthGate(v, { minHours: 24 });
}
} Highlights
Topic schema with explicit contracts
Every topic family has a Zod schema that validates incoming messages before they reach the database. Schema violations land in a dead-letter queue with context (device id, raw payload, topic) instead of toppling the backend. This was a deliberate response to the first generation: a single bad firmware build could lock up the whole backend by sending unparsable payloads.
import { z } from 'zod';
export const StateV1 = z.object({
device_id: z.string().uuid(),
fw_version: z.string(),
brightness: z.number().int().min(0).max(100),
color_kelvin: z.number().int().min(2200).max(6500),
on: z.boolean(),
seq: z.number().int(), // monotonic, for dedup
ts: z.iso.datetime(),
});
export type StateV1 = z.infer<typeof StateV1>; Atomic state update as a Postgres RPC
State change and audit event run in a single transaction through a SECURITY DEFINER function. The BEFORE INSERT trigger on device_events snapshots the current device_state as previous_state — and is guaranteed to see the state before the upsert, never a half-written mix.
CREATE OR REPLACE FUNCTION public.process_device_state_update(
p_device_id uuid, p_event_data jsonb, p_state_data jsonb
) RETURNS void LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
-- 1. Event first — trigger snapshots previous_state from device_state.
INSERT INTO device_events (device_id, event_type, event_data, source, source_id)
VALUES (p_device_id, 'state_change', p_event_data, 'device', p_device_id::text);
-- 2. Upsert device_state, only fields present in payload.
INSERT INTO device_state (device_id, power, brightness, rgb_color, online, last_seen)
VALUES (p_device_id,
(p_state_data->>'power')::boolean,
(p_state_data->>'brightness')::int,
(p_state_data->>'rgb_color')::int,
true, NOW())
ON CONFLICT (device_id) DO UPDATE SET
power = COALESCE((p_state_data->>'power')::boolean, device_state.power),
brightness = COALESCE((p_state_data->>'brightness')::int, device_state.brightness),
rgb_color = COALESCE((p_state_data->>'rgb_color')::int, device_state.rgb_color),
online = true,
last_seen = NOW();
END $$; Automatic coredump capture with deduplication
When a device crashes, it first publishes a summary on d/<id>/coredump (stack pointer, reset reason, ELF SHA). The consumer dedupes on (elf_sha256, pc, cause). On a new crash signature it sends a get_coredump command back; the device then streams the ELF in chunks via coredump_raw plus a pre-crash log tail via log_tail. Re-occurrences just bump a counter — that avoids redundant uploads when a single bug hits 200 devices at once.
async function handleCoredumpSummary(deviceId: string, summary: CoredumpSummary) {
if (!summary?.elf_sha256 || !summary?.pc || summary?.cause === undefined) {
return; // drop incomplete summary
}
const result = await db.upsertCrashSummary(deviceId, summary);
if (!result) return;
// Auto-trigger raw upload only on first occurrence of this signature.
if (result.isNew && result.rawElfStatus === 'pending') {
const customerId = await db.getDeviceCustomerId(deviceId);
const commandId = await db.queueGetCoredump(deviceId, customerId);
await db.markCrashUploadRequested(result.crashId, commandId);
assembler.registerExpectedUpload(commandId, deviceId, customerId);
}
} Per-user broadcast instead of postgres_changes
The naive path in Supabase Realtime — postgres_changes — streams every WAL change through the Realtime server, which evaluates each session’s filter clauses and RLS policies before pushing the row to authorised clients. That keeps the data correct (RLS filters server-side, no leaks), but with hundreds of devices and hundreds of sessions, it becomes hundreds of filter evaluations per WAL entry and the Realtime server is the bottleneck. Instead, a trigger broadcasts each device_state change directly to a single per-user channel user:<uid> (private: true) — addressing happens once in the trigger, and the Realtime server doesn’t have to ask every session. On subscribe, the Flutter app pulls a snapshot via GraphQL; deltas take it from there.
@override
Future<void> subscribe() async {
await unsubscribe();
final userId = _supabase.auth.currentUser?.id;
if (userId == null) return;
// 1. Snapshot via GraphQL — the app starts with an accurate baseline.
await _fetchInitialStates(userId);
// 2. Per-user broadcast channel for deltas.
_channel = _supabase
.channel('user:$userId', opts: const RealtimeChannelConfig(private: true))
.onBroadcast(event: 'device_state_change', callback: _onBroadcast)
.subscribe();
} Device-local timers with reconnect re-arming
Automations like “switch off in 30 minutes” run locally on the device, not in the backend — much more resilient to WAN flaps. But if Wi-Fi dies mid-timer and the device reboots, the timer is gone on the hardware. The consumer notices the offline → online transition, computes remaining seconds from the DB column expires_at, and re-arms every non-expired timer with the same execution_id. The device overwrites its slot in place — idempotent regardless of how often the re-arm logic fires.
private async rearmTimers(deviceId: string): Promise<void> {
const rows = await this.db.fetchActiveTimersForDevice(deviceId);
if (rows.length === 0) return;
for (const row of rows) {
const remainingMs = new Date(row.expires_at).getTime() - Date.now();
const seconds = Math.max(1, Math.floor(remainingMs / 1000));
await this.db.queueTimerArm(deviceId, row.automation_id, {
execution_id: row.execution_id, // same slot → idempotent
seconds,
inner_steps: row.inner_steps,
});
}
} Throughput math, in 118 bytes of WebAssembly
The whole architectural story above — per-user broadcast, dedup at the consumer, idempotent state — exists because the naive answer (every device, every state change, fanned out everywhere) does not scale. Move the sliders to see why:
// 118 bytes of WebAssembly · runs in your browser, no JS fallback
msgs / s
—
data rate
—
after dedup
—
// computed in WebAssembly
What mattered in production
Three things that didn’t show up in load tests but turned out to be decisive in production.
ACK topology isn’t symmetric. Devices ACK on d/<id>/ack, but the ID form differs by source: UUIDs for command_id, locally generated string ids for timer fires. The consumer routes by which field is present (command_id vs timer_id), not by separate topics — a consolidation that cut a lot of complexity later.
OTA rollouts rarely fail at the flash, more often at re-verify. After a successful flash, the device boots, verifies the new partition and sends ota_success. If it hangs between boot and verify, we treat it as ota_failed after 10 minutes and roll back. Without that watchdog, at least one bad build would have bricked half the fleet.
Ops scripts go through the same RLS layer. Internal scripts (bulk updates, diagnostic exports) authenticate via Supabase service-role and write into device_events with source = 'admin'. That turned out to be gold during audit: every change to customer data has a traceable origin, even when manually triggered.
05 Key decisions
Architecture choices that paid off
MQTT topics versioned from day one
Every topic carries a version suffix (`d/<id>/state`, later `state/v2`). When the payload changes, a new version runs in parallel — existing devices keep speaking v1, new firmware emits v2. No big-bang migration window needed and no awkward "update every device first, then deploy the backend".
Staged firmware rollouts, not all-or-nothing
Updates ship in three waves: 1% of the fleet, then 10%, then 100%. Between waves, an automatic health gate halts the rollout if anomalies appear (crash rate, heartbeat drop, failed OTA events). A broken build hits 1% of devices, not 100%.
Idempotent message handlers in the consumer
Every handler can process the same message multiple times without side effects. Crash summaries dedupe on `(elf_sha256, pc, cause)`, timer ACKs only finalise an execution row if it is still `running`. MQTT redeliveries (QoS 1) and reboot loops cannot create duplicate bookings or duplicate state changes.
Atomic state update instead of two round trips
A `process_device_state_update` RPC writes the event and upserts the `device_state` in one transaction. The `BEFORE INSERT` trigger on `device_events` snapshots the previous state cleanly — no race between "event inserted" and "state already changed". Small in theory; the difference between useful and useless audit logs in practice.
Per-user broadcast over postgres_changes fan-out
Instead of `postgres_changes`, where the Realtime server evaluates each session's filter clauses and RLS policies on every WAL entry, a trigger publishes every state change directly to `user:<uid>` via `realtime.broadcast`. The Flutter app subscribes to that one channel. Scales linearly with messages instead of sessions, and makes authorization trivial: one channel per user, one auth check at subscribe time.
Coolify on owned hardware instead of AWS IoT Core
Instead of AWS IoT Core or Azure IoT Hub: one Hetzner server, Coolify for orchestration, Mosquitto as the broker, Postgres for data. GDPR-compliant by default, monthly cost under €50, full control over retention and backups — important in a market where data protection is a sales argument.
07 Lessons learned
- 01 Versioning MQTT topics from day one saves you from big-bang migrations later.
- 02 Idempotency in the consumer handler is not optional — it is a precondition.
- 03 Staged rollouts with health gates catch problems on 1% of the fleet, not 100%.
- 04 Supabase Realtime Broadcast scales much further than the naive `postgres_changes` path.
- 05 In the field, coredumps with ELF symbols are the difference between "the app crashes sometimes" and an actual bug ticket.
- 06 One shared broker code path in C++ and TypeScript is more tractable than two "cloud-native" stacks once devices are actually in customers' hands.