AppKit Email Login Currently Unavailable
Incident Report for Reown
Postmortem

TL;DR

On September 16, 2024, from 5am to 1:37pm Singapore time, AppKit Embedded Wallet email login and cloud.walletconnect.com email delivery were broken due to an outage at Postmark, an email delivery service.

We don’t know the exact number of customers affected but assume at least dozens.

The issue was reported:

  • AppKit Email: reported by an internal AppKit user at ~9am Singapore
  • Cloud: reported via Twitter at 11:09am Singapore

Summary

The issue started at 5am Singapore. An internal user reported it at 9:46am Singapore. An operator started investigating at 11:07am Singapore and reproduced the issue.

The operator suspected that Magic, the key management service/authentication layer backing the AppKit Wallet, was at fault. The operator paged Magic in Slack, providing evidence that the problem did not appear to be with Postmark.

At 11:32am Magic provided evidence that the issue appeared to be constrained to Postmark.

The operator created an account with Sendgrid, an alternative mailing provider, but was blocked by their fraud detection for unknown reasons and was unable to proceed.

At 1:38pm the operator noticed that they could disable the custom SMTP provider and rely on Magic’s email provider, which fails over to Sendgrid.

Around the same time another operator switched Cloud over to Supabase mailing instead of Postmark.

The other operator created a Sendgrid account as well and switched Cloud to Sendgrid as Cloud was getting rate limited by Supabase.

At ~4:30pm the second Sendgrid account also got blocked.

At 6:40pm Singapore the Magic configuration was switched back to Postmark so that the sender of emails would appear as @walletconnect.com again.

Root Cause

The root cause was Postmark’s SSL certificate expiring at 5am Singapore.
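Certificate expiry of this kind can in principle be watched from the outside. The following is a minimal sketch, not something we ran during the incident: it computes how many days remain before a certificate's notAfter timestamp and flags certificates that are close to expiring. The function names, the 14-day threshold, and the idea of probing a provider endpoint over implicit TLS are all assumptions made for illustration.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a certificate's notAfter time (negative if expired).

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def is_expiring_soon(not_after: str, threshold_days: float = 14.0,
                     now: Optional[float] = None) -> bool:
    """True if the certificate expires within `threshold_days` (hypothetical threshold)."""
    return days_until_expiry(not_after, now=now) <= threshold_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read notAfter from the certificate a server presents over implicit TLS.

    Assumption: the endpoint speaks TLS directly on `port`. A STARTTLS-only
    SMTP port would need smtplib instead. If the certificate has already
    expired, the handshake itself raises ssl.SSLCertVerificationError,
    which is also a useful alert signal.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call fetch_not_after for each third-party endpoint we depend on and alert when is_expiring_soon returns True or the handshake fails.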

5 Whys

  1. Why did the AppKit Email / Cloud Signup not work?

Because emails were not delivered.

  2. Why were the emails not being delivered?

Because Postmark, the outgoing email service we use for both platforms, had an outage.

  3. Why was the outage not discovered faster?

We don’t execute email login on either Cloud or AppKit as a Canary flow. The Canary flows we have don’t exercise sign up (Cloud) or email login (AppKit).

  4. Why did the remediation take ~2h after the initial report?

The operator was not aware that disabling the custom SMTP provider setting was an option.

  5. Why was the operator not aware of this option?

The operator should have asked Magic, who were helping to remediate, whether they had ideas for how to resolve this more quickly.

What could we have done better?

  1. Discovery: we could have automatically detected both Cloud Login and AppKit Email being down through the use of Canaries
  2. Remediation: we could have failed over to non-custom-SMTP quicker
  3. Previous outage follow-up: we could have already had a Sendgrid account in place after Postmark’s end-of-July outage, after which they had not won back our trust.

How can we prevent this from happening again?

Have a Sendgrid account ready for redundancy or even investigate automatic failover.
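As a rough illustration of what automatic failover could look like, the sketch below tries an ordered list of email providers and falls back to the next one when a send raises an error. This is an assumption about one possible design, not a description of how Magic, Cloud, or any provider SDK works; the provider names and the `Sender` callable signature are hypothetical placeholders for real client wrappers.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: a sender takes (recipient, subject, body) and
# raises an exception if delivery to the provider fails.
Sender = Callable[[str, str, str], None]


class AllProvidersFailed(Exception):
    """Raised when every configured provider rejected the message."""


def send_with_failover(providers: Sequence[Tuple[str, Sender]],
                       recipient: str, subject: str, body: str) -> str:
    """Try each provider in order and return the name of the one that succeeded."""
    errors: List[str] = []
    for name, send in providers:
        try:
            send(recipient, subject, body)
            return name
        except Exception as exc:  # any provider error triggers failover
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def broken_primary(recipient: str, subject: str, body: str) -> None:
        # Simulates the primary provider being unreachable.
        raise ConnectionError("TLS handshake failed")

    def working_backup(recipient: str, subject: str, body: str) -> None:
        # Simulates a successful send through the backup provider.
        pass

    used = send_with_failover(
        [("primary", broken_primary), ("backup", working_backup)],
        "user@example.com", "Your login code", "123456",
    )
    print(f"sent via {used}")
```

Keeping the provider list ordered means the preferred sender domain is used whenever the primary is healthy, and the backup is only exercised during an outage. The backup account would still need to be created and warmed up in advance, since both Sendgrid accounts opened during the incident were blocked.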

Action items

  1. Short-term: set up Sendgrid account for backup @Derek Rein
  2. Mid-term: contemplate covering email flows in Canaries

    1. Cloud: @Cali Armut
    2. AppKit: @Tomas Rocchi

Links

https://status.postmarkapp.com/notices/5jmmv4cyfqboak2v-service-issue-outbound-smtp-sending-issues

Posted Sep 16, 2024 - 10:50 UTC

Resolved
This incident has been resolved.
Posted Sep 16, 2024 - 10:50 UTC
Monitoring
We've temporarily switched SMTP providers away from Postmark. We are monitoring the situation with Postmark in order to switch back, but all systems are operational again.
Posted Sep 16, 2024 - 06:59 UTC
Update
Cloud App is also affected by Postmark outage. We are unable to send signup/password reset emails.
Posted Sep 16, 2024 - 05:32 UTC
Identified
The Email Login functionality of AppKit is currently down due to a downstream service being down.

All other WalletConnect systems including the Relay are not affected

AppKit Social login still works

We will update here
Posted Sep 16, 2024 - 03:38 UTC
This incident affected: Cloud App and AppKit.