
A one-line Kubernetes fix that saved 600 hours a year

When we investigated why our Atlantis instance took 30 minutes to restart, we discovered a bottleneck in how Kubernetes handles volume permissions. By adjusting the fsGroupChangePolicy, we reduced restart times to 30 seconds.

7 April 2026 at 08:57 am

Every time we restarted Atlantis, the tool we use to plan and apply Terraform changes, we’d be stuck for 30 minutes waiting for it to come back up. No plans, no applies, no infrastructure changes for any repository managed by Atlantis. With roughly 100 restarts a month for credential rotations and onboarding, that added up to about 50 hours of blocked engineering time every month, and each restart paged the on-call engineer. This was ultimately caused by a safe default in Kubernetes that had silently become a bottleneck as the persistent volume used by Atlantis grew to millions of files. Here’s how we tracked it down and fixed it with a one-line change.
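The arithmetic behind the headline figure is easy to check. The sketch below uses only numbers quoted in this post (roughly 100 restarts a month, 30 minutes per restart before the fix, 30 seconds after); the helper name is hypothetical and chosen for illustration.

```python
def blocked_hours_per_year(restarts_per_month: int, seconds_per_restart: int) -> float:
    """Hours per year during which Atlantis is unavailable because of restarts."""
    return restarts_per_month * 12 * seconds_per_restart / 3600


before = blocked_hours_per_year(100, 30 * 60)  # 30-minute restarts
after = blocked_hours_per_year(100, 30)        # 30-second restarts
print(f"before: {before:.0f} h/year, after: {after:.0f} h/year")
# prints "before: 600 h/year, after: 10 h/year"
```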

Mysteriously slow restarts

We manage dozens of Terraform projects with GitLab merge requests (MRs) using Atlantis, which handles planning and applying. It enforces locking to ensure that only one MR can modify a project at a time. It runs on Kubernetes as a singleton StatefulSet and relies on a Kubernetes PersistentVolume (PV) to keep track of repository state on disk. Whenever a Terraform project needs to be onboarded or offboarded, or credentials used by Terraform are updated, we have to restart Atlantis to pick up those changes, a process that could take 30 minutes.

The slow restart became apparent when we recently ran out of inodes on the persistent storage used by Atlantis, forcing us to restart it to resize the volume. Each file and directory on disk consumes an inode, and the number available to a filesystem is determined by parameters passed when creating it. The Ceph persistent storage implementation provided by our Kubernetes platform does not expose a way to pass flags to mkfs, so we’re at the mercy of default values: growing the filesystem is the only way to grow available inodes, and resizing a PV requires a pod restart.
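Inode exhaustion can be spotted before it forces a resize. As a minimal sketch, assuming a Unix-like host, Python's standard `os.statvfs` call reports the same inode totals that `df -i` shows. The function name is hypothetical, and `"/"` is only a placeholder for wherever the volume is actually mounted.

```python
import os


def inode_usage(path: str):
    """Return (total, free, used_fraction) inode counts for the filesystem holding path."""
    st = os.statvfs(path)
    total, free = st.f_files, st.f_ffree
    # Some filesystems allocate inodes dynamically and report a total of 0.
    used_fraction = None if total == 0 else (total - free) / total
    return total, free, used_fraction


# "/" is a placeholder; point this at the mount path of the volume being watched.
total, free, used = inode_usage("/")
print(f"inodes: {total} total, {free} free, used fraction: {used}")
```

A used fraction close to 1.0 on a volume with plenty of free bytes is the situation described above: many small files, no inodes left.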

We talked about extending the alert window, but that would just mask the problem and delay our response to actual issues. Instead, we decided to investigate exactly why it was taking so long to restart Atlantis.

Tracking down the bottleneck

We started by profiling the restart process. We noticed that the majority of the time was spent in the kubelet, the Kubernetes node agent, which runs as a systemd service and is responsible for mounting volumes into pods. This led us to investigate how Kubernetes handles persistent volumes and their filesystems.

Upon further research, we discovered that Kubernetes, by default, applies a safety measure when a volume is mounted into a pod that sets an fsGroup: it recursively changes the ownership and permissions of every file in the volume so that the containers running in the pod can access them. However, this process can be time-consuming if the volume contains a large number of files, as was the case with Atlantis' persistent volume.
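To see why this scales with the number of files, here is a simplified illustration of a recursive ownership change. It is not the kubelet's actual code; the function name and the exact permission bits are assumptions made for the sketch. The point is that every entry costs at least one `lchown` and one `chmod` call, so millions of files mean millions of system calls on every mount.

```python
import os
import stat


def apply_fsgroup_recursively(root: str, gid: int) -> int:
    """Give every entry under root to group gid; return how many entries were touched."""
    touched = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        # Visit the directory itself, then each file directly inside it.
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            os.lchown(path, -1, gid)  # -1 leaves the owning user unchanged
            touched += 1
            if os.path.islink(path):
                continue  # chmod would follow the link, so leave symlinks alone
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            extra = stat.S_IRGRP | stat.S_IWGRP
            if os.path.isdir(path):
                # Directories also get group execute and setgid so new files inherit the group.
                extra |= stat.S_IXGRP | stat.S_ISGID
            os.chmod(path, mode | extra)
    return touched
```

Calling this on a directory tree returns the number of entries it had to touch, which is the quantity that grows with the volume.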

To verify this, we checked the Kubernetes documentation and found that the fsGroupChangePolicy configuration option controls how Kubernetes handles file ownership and permissions when a volume is mounted. By default, this policy is set to Always, meaning that Kubernetes recursively changes the ownership and permissions of every file in the volume to match the pod's fsGroup each time the volume is mounted.
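The Kubernetes documentation lists two values for this field, `Always` and `OnRootMismatch`. With the second, the recursive change is skipped when the root directory of the volume already has the expected group and permissions. A rough sketch of that decision follows, under the assumption that only the root directory is inspected; the function name and the exact bits checked are illustrative, not taken from the Kubernetes source.

```python
import os
import stat


def needs_recursive_change(root: str, gid: int, policy: str = "Always") -> bool:
    """Decide whether the whole tree must be walked for the given policy."""
    if policy != "OnRootMismatch":
        return True  # "Always": walk the tree on every mount
    st = os.stat(root)
    wanted = stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP
    has_perms = (st.st_mode & wanted) == wanted
    has_setgid = bool(st.st_mode & stat.S_ISGID)
    # A single stat of the root replaces a walk over every file in the volume.
    return not (st.st_gid == gid and has_perms and has_setgid)
```

The check costs one `stat` call regardless of how many files the volume holds, which is why the policy matters so much for large volumes.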

The one-line fix

We hypothesized that changing the fsGroupChangePolicy to OnRootMismatch, the only alternative to Always that the field accepts, would make Kubernetes skip the recursive change whenever the root directory of the volume already has the expected ownership and permissions, thus eliminating the bottleneck. To test this, we added a single line to the pod securityContext in the YAML configuration for the Atlantis StatefulSet:

```
fsGroupChangePolicy: OnRootMismatch
```

With this setting, Kubernetes checks only the root of the volume on mount and walks the full tree only if that check fails. We then restarted Atlantis and observed the results.

The restart time dropped dramatically, from 30 minutes to just 30 seconds. This resolved the long restart times and made the restarts we still need, such as resizing the volume when it runs out of inodes, far less disruptive.

Implications and considerations

While the fsGroupChangePolicy change resolved our specific issue, it's important to note that this configuration has implications for security and data access. With the recursive change skipped, we are trusting that the files in the volume already have the correct ownership and permissions for the pod. Any data placed in the volume must be set up correctly beforehand, and anything added later (such as new files) must also carry the ownership and permissions the pod expects.

In our case, since Atlantis' persistent volume is initialized with the correct permissions and only contains data generated by the pod itself, changing the fsGroupChangePolicy was a safe choice. However, in other scenarios, this configuration might not be appropriate, and additional security measures may be necessary.
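That trust can be checked occasionally rather than on every mount. The following is a minimal sketch of such an audit, assuming the expected group ID is known; the function name and the `limit` parameter are hypothetical. It walks the whole tree, so on a volume with millions of files it is as slow as the operation being avoided and is better suited to a one-off check than to the restart path.

```python
import os


def find_group_mismatches(root: str, gid: int, limit: int = 20) -> list:
    """Return up to `limit` paths under root whose owning group is not gid."""
    mismatches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            if os.lstat(path).st_gid != gid:
                mismatches.append(path)
                if len(mismatches) >= limit:
                    return mismatches  # stop early; a few examples are enough
    return mismatches
```

An empty result suggests the volume contents already match what the pod expects.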

Conclusion

Tracking down the Atlantis restart bottleneck was a lesson in how much it pays to understand the systems and defaults underneath our infrastructure. A single configuration change saved hundreds of hours of engineering time each year. It is also a reminder that a default that is safe at small scale can quietly become a bottleneck as data grows, so it is worth revisiting these settings as our infrastructure evolves.

Source: The Cloudflare Blog