AppSync subscriptions: waiting for start_ack can still result in missing events
It seems that a `start_ack` message from AppSync in response to a subscription `start` does not necessarily mean that all future events will be delivered.
Subscriptions are the mechanism for delivering real-time events from AppSync. They are based on WebSockets, and the protocol is documented [here](https://docs.aws.amazon.com/appsync/latest/devguide/real-time-websocket-client.html).
In the protocol, a client needs to send a `start` message with the GraphQL query to start receiving updates. Then AppSync responds with a `start_ack` if everything is OK and then sends `data` events whenever an update happens.
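To make the message flow concrete, here is a minimal sketch of the client side of that exchange. The message shapes follow my reading of the protocol documentation; the helper names (`buildStart`, `handleMessage`, `SubState`) are made up for this example, and the contents of the `authorization` object depend on the API's auth mode, so treat them as placeholders.

```typescript
// Message shapes as I understand them from the AppSync real-time protocol docs.
// Helper names and the SubState type are hypothetical, for illustration only.
type StartMessage = {
  id: string;
  type: "start";
  payload: {
    data: string; // JSON-encoded { query, variables }
    extensions: { authorization: Record<string, string> };
  };
};

type ServerMessage =
  | { type: "start_ack"; id: string }
  | { type: "data"; id: string; payload: { data: unknown } }
  | { type: "error"; id?: string; payload?: unknown }
  | { type: "complete"; id: string }
  | { type: "ka" };

// Builds the `start` message the client sends to open a subscription.
function buildStart(
  id: string,
  query: string,
  variables: Record<string, unknown>,
  authorization: Record<string, string>, // placeholder: depends on the auth mode
): StartMessage {
  return {
    id,
    type: "start",
    payload: {
      data: JSON.stringify({ query, variables }),
      extensions: { authorization },
    },
  };
}

// What the client believes about one subscription.
type SubState = { acked: boolean; events: unknown[] };

// Folds one raw server message into the state of the subscription `id`.
function handleMessage(state: SubState, id: string, raw: string): SubState {
  const msg = JSON.parse(raw) as ServerMessage;
  if (msg.type === "start_ack" && msg.id === id) {
    return { ...state, acked: true };
  }
  if (msg.type === "data" && msg.id === id) {
    return { ...state, events: [...state.events, msg.payload.data] };
  }
  return state; // keep-alives and messages for other subscriptions are ignored here
}
```

In this sketch, `acked` flipping to `true` is the point where I expected the subscription to be live.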
Reading the documentation, my impression was that `start_ack` marks the moment when the subscription is live and all future events will be delivered. But what I'm seeing is that **it's not the case**. Even when the event is triggered strictly after the `start_ack` is received, sometimes it is not delivered to the client.
Why is it a problem?
A common pattern for APIs with real-time updates is to subscribe to updates first, then query the current state. This way there is no "temporal dead zone" when updates are lost. But that requires a definitive *point in time* when the subscription is live. Without that, it's only best-effort and messages will be lost every now and then, especially when the subscription is made just before the event, which is common in tests and some async workflows.
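A rough sketch of the subscribe-then-query pattern, to show why the "live" moment matters. It assumes every item carries an increasing `version` field so that stale data can be recognized; that field, the `Item` shape, and the `LiveView` class are all hypothetical and not part of AppSync.

```typescript
// Hypothetical item shape: `version` increases with every update.
type Item = { id: string; version: number; value: string };

class LiveView {
  private items = new Map<string, Item>();
  private buffer: Item[] = [];
  private hasSnapshot = false;

  // Call for every `data` event coming from the subscription.
  onEvent(item: Item): void {
    if (!this.hasSnapshot) {
      // The query has not returned yet: keep the event for later.
      this.buffer.push(item);
      return;
    }
    this.apply(item);
  }

  // Call once with the query result, fetched AFTER the subscription is live.
  onSnapshot(snapshot: Item[]): void {
    for (const item of snapshot) this.apply(item);
    this.hasSnapshot = true;
    // Replay events that arrived while the query was in flight.
    for (const item of this.buffer) this.apply(item);
    this.buffer = [];
  }

  // Keeps the newest version of each item, whichever source it came from.
  private apply(item: Item): void {
    const current = this.items.get(item.id);
    if (!current || current.version < item.version) {
      this.items.set(item.id, item);
    }
  }

  get(id: string): Item | undefined {
    return this.items.get(id);
  }
}
```

The merge only closes the gap if every update after the subscription's start reaches `onEvent`. If the subscription is not actually live when the query runs, an update can land after the snapshot was read and never show up as an event.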
Real-time updates, especially in AppSync, are a complex topic that is easy to get wrong. I've [written about it before](https://advancedweb.hu/shorts/apollos-subscribetomore-is-the-wrong-abstraction/), it has a [separate section in my book](https://www.graphql-on-aws-appsync-book.com/client-side/implementing-subscriptions/), and I even [made a library](https://github.com/sashee/appsync-subscription-observable) because I wasn't particularly happy with the AWS-provided one.
I've noticed tests using subscriptions timing out for a long time now, but I wrote it off as "something complex is happening" and added some retries to handle it. A message is published to IoT Core that triggers a Lambda, which writes to DynamoDB, which then triggers the subscription. A lot can go wrong, so it's realistic that the 10-ish-second timeout sometimes passes.
But then I started working on a simpler setup and still noticed that some events seemingly never arrive. This time I could pinpoint the issue: if a parallel subscription is opened earlier, the event is delivered there. So the problem must be that the subscription is not live even though AppSync says it is.
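The comparison behind that conclusion can be sketched roughly like this. It is not the actual test code: the `Triggered` type, the `findDropped` function, and the idea of tagging every event with an id and a trigger timestamp are assumptions made for the illustration.

```typescript
// Hypothetical record of an event the test triggered itself.
type Triggered = { eventId: string; triggeredAt: number };

// Returns events that were triggered after the late subscription's start_ack,
// arrived on the earlier "control" subscription, but never on the late one.
function findDropped(
  triggered: Triggered[],
  lateAckAt: number, // when the late subscription received start_ack
  controlReceived: Set<string>, // event ids seen on the earlier subscription
  lateReceived: Set<string>, // event ids seen on the late subscription
): Triggered[] {
  return triggered.filter(
    (e) =>
      e.triggeredAt > lateAckAt &&
      controlReceived.has(e.eventId) &&
      !lateReceived.has(e.eventId),
  );
}
```

A non-empty result means AppSync did publish the event, since the control subscription got it, and the late subscription missed it even though it had already been acknowledged.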
Hopefully, it will get fixed soon.
Bug report opened [in the AppSync repo](https://github.com/aws/aws-appsync-community/issues/405).
#aws #appsync
Originally published [on my blog](https://advancedweb.hu/shorts/appsync-subscriptions-waiting-for-start_ack-can-still-result-in-missing-events/)