Deployments failing / Asset availability on the Swiss region
Incident Report for Divio Status Page
Postmortem

On Monday, 1 July 2024, several clients hosting their applications in the Exoscale Switzerland regions contacted support to report issues connecting to their S3 storage. We started investigating right away and discovered that their access keys were invalid (InvalidAccessKey exceptions).

We promptly checked all the S3 storages hosted on Exoscale, confirmed that the API keys of 50% of them had been deleted, and reached out to Exoscale. They pointed out that the deleted API keys were all using a legacy mode and had been scheduled for deletion after a deprecation period. Exoscale had warned about the upcoming deletions via email, but due to a misconfiguration in the email notification settings of our Exoscale organizations, we never received the warning.

With the root cause identified, the next step was to provision new, non-legacy API keys for the affected buckets. At the same time, Exoscale had deprecated part of the APIs used by our orchestrator, so we spent the day migrating our codebase to the new API endpoints. On Tuesday, we were able to recreate all the API keys and update the application configurations. Redeploying each application restored its S3 storage connection.

On Wednesday, after checking the nightly backups, we noticed that some of the backups on the fixed storages were failing: some objects were not accessible with the new API keys (head_object: 403 Forbidden). This was only the case on storages that had first been provisioned years ago.
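A check of this kind can be sketched as follows. This is a minimal illustration and not the tooling we actually ran: it assumes `client` behaves like a boto3 S3 client configured for the storage endpoint, and that failed requests raise an error carrying a botocore-style `response` dictionary. The helper names `http_status` and `find_forbidden_keys` are hypothetical.

```python
def http_status(exc):
    """Return the HTTP status carried by a botocore-style error, or None."""
    response = getattr(exc, "response", None) or {}
    return response.get("ResponseMetadata", {}).get("HTTPStatusCode")


def find_forbidden_keys(client, bucket, keys):
    """Return the keys whose HEAD request is rejected with 403 Forbidden.

    `client` is assumed to expose head_object(Bucket=..., Key=...) the way
    a boto3 S3 client does. Any other failure is re-raised unchanged.
    """
    forbidden = []
    for key in keys:
        try:
            client.head_object(Bucket=bucket, Key=key)
        except Exception as exc:  # boto3 raises botocore's ClientError here
            if http_status(exc) == 403:
                forbidden.append(key)
            else:
                raise
    return forbidden
```

Running such a probe with the new API key over the keys of a bucket would yield the list of objects that the key cannot reach.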

This specific issue was related to how the storages were managed back then, when Exoscale features were less advanced. For those storages, the legacy API keys belonged to a sub-organization created specifically for each storage (usr-). As the new API keys belonged to the parent organization, objects lacking ACLs that granted full control to the parent organization could be neither read nor written.

We first tried to set bucket ownership controls on the problematic buckets to force all objects to belong to the parent organization, but this action disables the ACLs set at the object level, effectively making all objects private. Since most applications only store publicly readable assets in their storage, this was not acceptable. We thus reverted the ownership controls and looked for a better solution.

The solution was sketched out and thoroughly tested on Thursday, and applied to all storages on Friday at noon, effectively ending the outage. It consisted of the following steps, to be carried out on each storage:

  1. Find all the files missing the ACLs granting full control to the organization the new API key belongs to,
  2. Create an API key on the sub-organization,
  3. Use the API key created in 2 to patch the ACLs of the objects listed in 1.
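Steps 1 and 3 can be sketched roughly as below. This is an illustrative outline under stated assumptions, not the script we ran: `client` is assumed to behave like a boto3 S3 client authenticated with a key that is allowed to read and write object ACLs (for example the sub-organization key from step 2), `canonical_id` stands for the identifier of the parent organization as it appears in ACL grants, and the function names are hypothetical. Step 2 is provider-specific and is not shown.

```python
def has_full_control(acl, canonical_id):
    """True if the ACL document grants FULL_CONTROL to canonical_id."""
    return any(
        grant.get("Permission") == "FULL_CONTROL"
        and grant.get("Grantee", {}).get("ID") == canonical_id
        for grant in acl.get("Grants", [])
    )


def find_keys_missing_grant(client, bucket, canonical_id):
    """Step 1: list the objects whose ACL lacks the full-control grant."""
    missing = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            acl = client.get_object_acl(Bucket=bucket, Key=obj["Key"])
            if not has_full_control(acl, canonical_id):
                missing.append(obj["Key"])
    return missing


def grant_full_control(client, bucket, key, canonical_id):
    """Step 3: add the grant while keeping the grants already present."""
    acl = client.get_object_acl(Bucket=bucket, Key=key)
    grants = list(acl.get("Grants", []))
    grants.append(
        {
            "Grantee": {"Type": "CanonicalUser", "ID": canonical_id},
            "Permission": "FULL_CONTROL",
        }
    )
    client.put_object_acl(
        Bucket=bucket,
        Key=key,
        AccessControlPolicy={"Owner": acl["Owner"], "Grants": grants},
    )
```

Appending to the existing grants, rather than replacing them, is what keeps publicly readable assets public, which is the property the bucket ownership controls approach lost.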

We have taken measures to prevent this from happening again, and made sure the email notifications are now properly configured. On the positive side, the solution to the outage allowed us to strengthen the security of the storages and improve the overall management of permissions on Exoscale buckets.

We are sorry for the inconvenience and would like to thank all the Exoscale customers for their patience and their help in the handling of this incident.

Posted Jul 25, 2024 - 08:48 UTC

Resolved
This incident has been resolved.
Posted Jul 09, 2024 - 09:08 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 08, 2024 - 07:57 UTC
Investigating
We are currently investigating issues with our infrastructure provider Exoscale which relate to general management and bucket management.
Posted Jul 04, 2024 - 15:20 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 03, 2024 - 12:28 UTC
Update
Some users might experience problems accessing objects in a bucket in this region. We continue to work on a full resolution of the issue.
Posted Jul 02, 2024 - 14:19 UTC
Update
A fix has been implemented and deployments and backups are working again.

We are still experiencing an issue when creating new buckets in the region and continue to investigate.
Posted Jul 01, 2024 - 15:03 UTC
Identified
We are currently investigating an issue making deployments on the Swiss region impossible.
We have found the root cause and are actively working on it. Existing applications are not impacted.
Posted Jul 01, 2024 - 14:07 UTC
This incident affected: Customer Sites (CH) (Deployment).