
Clerk Changelog

Infra Changelog - Oct 9, 2025

Category
Infra
Published

Database migration, high-update to append-only, direct job queuing, and more

This post is part of a regular series where Clerk shares details about ongoing infrastructure improvements. These updates happen continually behind the scenes, and will not impact end-users beyond improving reliability and performance.

For previous updates, see Infra Changelog – Sep 25, 2025.

Changes since last update

In-network database migration to minimize latency

GCP completed a migration of our Cloud SQL database within its network to minimize latency between Cloud Run and Cloud SQL. The migration reversed a prior one that had increased network latency.

Although the first migration was unnoticed except by our latency monitoring, the reversal was coincident with a “connectivity issue” that lasted 11 minutes. Thankfully, our session failover mechanism succeeded and prevented active users from being signed out. However, new sign ups, sign ins, and other account changes failed during this time.

We are awaiting a full root cause analysis from GCP to determine if the two events were related. We will share the analysis when it becomes available.

Once connectivity was restored, we verified that latency improved as expected.

Convert high-update workload to append-only

We had one particularly high-update table that put undue stress on the database during periods of high activity.

We’ve replaced that high-update workload with an append-only workload. The change was carried out over the course of a week, where we incrementally enabled dual writes until we reached 100% of traffic, then leveraged dual reads over several days to ensure consistent behavior. Finally, we removed dual writes and now only leverage the append-only workload.
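The rollout above follows a dual-write/dual-read pattern. Here is a minimal sketch of that pattern in Python, with in-memory stand-ins for the two workloads; the class and field names are illustrative, not Clerk's actual implementation:

```python
import hashlib

class DualWriteMigrator:
    """Sketch of a dual-write/dual-read migration from an update-in-place
    store to an append-only store. `legacy` mutates a row per key, while
    `events` appends an immutable record for every change."""

    def __init__(self, dual_write_pct=0):
        self.dual_write_pct = dual_write_pct  # ramped 0 -> 100 over the rollout
        self.legacy = {}   # key -> latest value (high-update workload)
        self.events = []   # (key, value) tuples (append-only workload)

    def _in_cohort(self, key):
        # Deterministic bucketing so a given key is consistently dual-written.
        bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 100
        return bucket < self.dual_write_pct

    def write(self, key, value):
        self.legacy[key] = value              # old path: UPDATE in place
        if self._in_cohort(key):
            self.events.append((key, value))  # new path: INSERT only

    def read(self, key):
        # Dual read: derive the latest value from the append-only log and
        # compare it against the legacy row to verify consistent behavior.
        legacy_value = self.legacy.get(key)
        appended = [v for k, v in self.events if k == key]
        new_value = appended[-1] if appended else None
        if self._in_cohort(key) and new_value != legacy_value:
            raise RuntimeError(f"dual-read mismatch for {key!r}")
        return legacy_value
```

Once the dual reads agree at 100% of traffic, the legacy write path can be dropped, leaving only the append-only workload.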

Direct queuing of background jobs

We leverage PubSub for background jobs. Many jobs require transactional guarantees, which we achieve by writing the job details to the database within a transaction. Others don’t, like non-essential jobs for analytics and telemetry.

Previously, all jobs were written to the database, regardless of whether they required a transactional guarantee. Going forward, jobs that don't require transactional guarantees will be queued directly to reduce database load.
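The split described above resembles a transactional-outbox pattern for guaranteed jobs and a direct publish for everything else. A minimal sketch, with in-memory stand-ins for the database and queue (names are illustrative):

```python
class JobDispatcher:
    """Routes jobs as described above: jobs needing transactional guarantees
    go through a database-backed outbox; others are published straight to
    the message queue."""

    def __init__(self):
        self.outbox = []  # rows committed alongside the business transaction
        self.queue = []   # direct Pub/Sub-style publishes

    def enqueue(self, job, *, transactional):
        if transactional:
            # Written inside the same DB transaction as the business change,
            # so the job exists iff the transaction committed. A relay later
            # drains the outbox into the queue.
            self.outbox.append(job)
        else:
            # Non-essential work (analytics, telemetry) skips the database
            # entirely, reducing write load on the primary.
            self.queue.append(job)
```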

Database tuning

We’ve continued efforts to tune autovacuum and fillfactor settings. Though this is an ongoing effort that is never truly “done,” we feel we’ve made the major adjustments necessary to take advantage of our recently upgraded database specs.
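For readers unfamiliar with these knobs, fillfactor and autovacuum are tuned per table in Postgres. The sketch below generates the kind of DDL involved; the table name and values are illustrative, not Clerk's actual settings:

```python
def tuning_statements(table, fillfactor=85, vacuum_scale=0.02):
    """Generate per-table Postgres tuning DDL of the kind described above."""
    return [
        # Leave free space in each heap page so updates can stay on-page
        # (HOT updates), reducing index churn on high-update tables.
        f"ALTER TABLE {table} SET (fillfactor = {fillfactor});",
        # Vacuum more aggressively than the 20% default so dead tuples on
        # hot tables are reclaimed before bloat accumulates.
        f"ALTER TABLE {table} SET (autovacuum_vacuum_scale_factor = {vacuum_scale});",
    ]
```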

Query and index tuning

This is a regular activity that will likely be in every update, but it is still included in the interest of completeness. This time, the biggest improvement came from replacing a DISTINCT with an alternate strategy.
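The post doesn't say which alternate strategy replaced the DISTINCT, but a common one is rewriting a join-then-deduplicate query as a semi-join (EXISTS-style), which never produces duplicates in the first place. A toy Python illustration of the difference, using dicts as stand-in relations:

```python
def users_with_sessions_distinct(users, sessions):
    """DISTINCT-style plan: join first, then deduplicate the output."""
    joined = [u["id"] for u in users for s in sessions if s["user_id"] == u["id"]]
    return sorted(set(joined))  # dedupe cost grows with the join output size

def users_with_sessions_exists(users, sessions):
    """Semi-join (EXISTS-style) plan: test for a matching session per user,
    so no duplicate rows are produced and no dedupe step is needed."""
    session_users = {s["user_id"] for s in sessions}
    return sorted(u["id"] for u in users if u["id"] in session_users)
```

Both return the same rows; the second avoids materializing and sorting the duplicated join output.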

In progress

Blue/green database upgrade automation

We’re building an automated mechanism for blue/green database upgrades, which we’ll leverage for version upgrades and settings changes that typically require minutes of downtime.

With automation, we expect these upgrades to complete in at most a few seconds, and we are exploring solutions to minimize request failures during that window.

Application-to-database “chattiness” optimizations

We've begun reducing the overall number of queries per request, either by changing our application logic or by leveraging database functions to perform multiple queries at once.

These changes will make us more resilient to future database VM migrations within the network, which are typically done administratively by GCP and out of our control. Our intention is to support up to 2ms of network latency between our compute and the database without a noticeable impact on user experience.
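The reasoning is that each sequential query pays one network round trip, so per-request latency scales with query count times RTT. A simple illustrative cost model (the numbers below are hypothetical, apart from the 2 ms budget mentioned above):

```python
def request_latency_ms(queries, network_rtt_ms, query_exec_ms=0.1):
    """Approximate request latency when queries run sequentially: each
    query pays one network round trip plus its execution time."""
    return queries * (network_rtt_ms + query_exec_ms)

# At a 2 ms RTT, collapsing 20 sequential queries into 2 batched calls
# cuts the database portion of the request from ~42 ms to ~4 ms.
before = request_latency_ms(20, 2.0)
after = request_latency_ms(2, 2.0)
```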

Reduce session API requests

(Promoted to in progress)

We’ve discovered a bug in our session refresh logic that causes individual devices to send more refreshes than necessary. We believe resolving this bug can significantly reduce our request volume.
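The fix implies a simple invariant: a device should not re-request a token that is still valid. A minimal sketch of that deduplication (field names and the 60-second lifetime are illustrative, not Clerk's actual refresh logic):

```python
class SessionRefresher:
    """Only hit the network when the cached token has actually expired."""

    def __init__(self, token_lifetime_s=60):
        self.token_lifetime_s = token_lifetime_s
        self.expires_at = 0.0
        self.network_refreshes = 0

    def get_token(self, now):
        if now < self.expires_at:
            return "cached-token"       # still valid: no network call
        self.network_refreshes += 1     # refresh only on actual expiry
        self.expires_at = now + self.token_lifetime_s
        return "fresh-token"
```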

Improved monitoring

We’re re-analyzing all of our recent degradation and outage incidents and improving our monitoring suite. So far, we’re optimistic that we can achieve earlier notice of potential issues by tuning our existing monitors and adding new ones.

Planned

Database connection pooler and major version upgrade

(Unchanged since last update)

Clerk has historically only used a client-side database pooler inside our Cloud Run containers. Though we've known this is non-optimal, we did this because Google did not offer a managed database connection pooler, and we were skeptical of putting a self-hosted service between two managed services (Cloud Run and Cloud SQL).

In March, Google released a managed connection pooler in Preview, and it reached General Availability this week. However, using the connection pooler will require a major database version upgrade, and in our particular case, a network infrastructure upgrade. We are collaborating with Google to determine how we can achieve both upgrades safely and without downtime.

Simultaneously, we are investigating other database and connection pooler solutions that can run within our VPC.

We plan to leverage our blue/green automation for these upgrades.

Further session API isolation

(Unchanged since last update)

Currently, the session API must failover to a replica during primary database downtime, which is not ideal since the primary database is still impacted by other workloads. We are pursuing solutions that would lead to a session-specific database.

Additional service isolation

(Unchanged since last update)

While working to isolate our Session API, we’ve already developed a handful of techniques that can be re-used to isolate other services. When done, we’d like isolated workloads for each of our product areas:

  • Session
  • Sign up
  • Sign in
  • User account management (profile fields)
  • Organizations
  • Billing
  • Fraud

Additional staged rollout mechanisms

Today, our staged rollout mechanisms usually target all traffic in increasing percentages. We intend to build more targeted rollout mechanisms, for example by customer cohort or products used.
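A targeted rollout check of the kind described above combines explicit cohort membership with deterministic percentage bucketing. A sketch, with hypothetical names throughout:

```python
import hashlib

def rollout_enabled(feature, customer_id, pct=0, cohorts=(), customer_cohort=None):
    """Enable a feature for an explicit cohort (e.g. customers using a given
    product) or for a deterministic percentage of all customers."""
    if customer_cohort in cohorts:
        return True
    # Hash feature + customer so each customer lands in a stable bucket
    # per feature, making the percentage ramp sticky across requests.
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < pct
```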

Contributor
Colin Sidoti


Clerk Leap Integration

Category
Integrations
Published

Introducing Clerk's integration with the AI developer agent, Leap.

Clerk is now available as an authentication provider for Leap, an AI developer agent.

To get started, builders can prompt Leap to create a new application with Clerk for authentication. Behind the scenes, Leap will provision a Clerk application on your behalf so you can start building immediately. When you're ready to configure your application, you can claim the generated application via the Clerk dashboard. To learn more about Leap's Clerk integration, visit the Leap documentation.

Clerk + AI builders: better together

Clerk is an excellent companion to generative AI tools, like Leap. With drop-in authentication and billing primitives, you can build and ship real-world applications faster, not just prototypes. Focus on iterating and building out your application with battle-tested primitives from Clerk.

If you're building AI tools and are interested in integrating with Clerk, reach out to us to learn more.

Contributors
Bryce Kalow
Kevin Wang
Nate Watkin


Organization slugs disabled by default

Category
Organizations
Published

Organization slugs are now disabled by default for new applications.

Starting today, when you enable our Organizations feature set on a new application, organization slugs will be disabled by default.

Previously, you needed to pass a hideSlug prop to organization components to hide the slug field, requiring manual configuration. Now, when disabled, the slug field won't be displayed in organization components by default.

Opt-in to Organization Slugs

If you'd still like slugs to exist alongside Organizations in your application, toggle "Enable organization slugs" in Organization settings.

Contributor
Laura Beatris


Infra Changelog - Sep 25, 2025

Category
Infra
Published

Isolated compute for session API, database tuning, and more

Starting with this post, we will regularly share details about ongoing infrastructure improvements. These updates happen continually behind the scenes, and will not impact end-users beyond improving reliability and performance.

Released this week

Isolated compute for our Session API

Following the incident on June 26, we began isolating our session API infrastructure from the rest of our frontend-facing API, which includes workloads like sign ups, sign ins, and user profiles.

To keep our session API running during an incident, we require access to session-relevant storage and compute during that downtime.

We started with storage, which was the more challenging of the two workstreams. Thankfully, this was released ahead of the incident last week, and helped reduce the blast radius.

When queries against our primary database failed, our new “outage resiliency” mechanism kicked in and started working off a read replica to serve session tokens instead. Here’s a chart of the tokens served:

Outage resilience spike

Unfortunately, this chart does not reflect the full volume of session tokens that were requested during the incident, since many failed because our compute was exhausted. Here’s what happened:

  • Clerk uses Google Cloud Run with auto-scaling for compute. It serves all requests for sessions, sign ups, sign ins, organizations, and billing.
  • When the database failed, session requests continued to succeed because of our failover mechanism, but non-session requests got stuck behind query timeouts, and overall request latency increased.
  • Increased request latency triggered auto-scaling of Cloud Run until the configured maximum container count was hit. During this autoscaling, session requests continued to succeed.
  • Once the maximum was hit, throughput for our frontend API Cloud Run service dropped sharply, and session requests were no longer reliably being served.

The solution is to serve session requests from a separate Cloud Run service from the rest of our frontend API. That way, session requests can retain high throughput against the read replica during a primary database incident, while the rest of the frontend API waits for the primary database to recover.
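The failover path described above can be sketched as a simple fallback read: try the primary, and if it fails, serve the session from the replica. The stand-ins below are illustrative, not Clerk's actual clients:

```python
class SessionTokenService:
    """Serve session tokens from the primary, falling back to a read
    replica when the primary is unreachable. `primary` and `replica`
    are callables standing in for real database clients."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def serve_token(self, session_id):
        try:
            return self.primary(session_id)
        except ConnectionError:
            # Primary is down: serve the session from the replica so active
            # users stay signed in, accepting slightly stale reads.
            return self.replica(session_id)
```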

We’ve built this so our session API compute is always on and handling all session requests, so we do not need to wait for a failover of Cloud Run. Here are charts showing our request volume move to the new Cloud Run service for our session API after release:

Frontend API requests - now with session requests removed:

Frontend API decrease

Session API requests running independently:

Session API start

Database tuning

Following the auto-upgrade last week, it was expected that we’d need to retune our database in response to its improved overall performance. Below is the improvement to average query latency after a process of reindexing and adjusting our auto-vacuum settings:

September 18: First full stable day

  • P50: 40μs
  • P95: 115μs
  • P99: 703μs

September 24: Yesterday

  • P50: 40μs (no change)
  • P95: 96μs (16% improvement)
  • P99: 584μs (17% improvement)

In progress

Reducing latency between our compute and our database

Working with GCP support, we learned that there is an opportunity to reduce the network latency between Cloud Run and Cloud SQL, to improve our overall request latency. We expect this to be completed within the next week.

Continued database tuning

We have more database tuning ahead. We expect modest additional improvements as we continue to monitor auto-vacuum settings, and begin adjusting fillfactor settings.

Planned

Where possible, convert high-write workloads from update to append-only

Our original architecture depended on frequent updates, a pattern that has become burdensome on our database as we've scaled.

Where possible, we plan to reduce our use of this pattern, and instead rely on append-only tables. In the process, we may opt to move these workloads to a time-series database like ClickHouse.

Reduce session API requests

We’ve discovered a bug in our session refresh logic that causes individual devices to send more refreshes than necessary. We believe resolving this bug can significantly reduce our request volume.

Further session API isolation

Currently, the session API must failover to a replica during primary database downtime, which is not ideal since the primary database is still impacted by other workloads. We are pursuing solutions that would lead to a session-specific database.

Additional service isolation

While working to isolate our Session API, we’ve already developed a handful of techniques that can be re-used to isolate other services. When done, we’d like isolated workloads for each of our product areas:

  • Session
  • Sign up
  • Sign in
  • User account management (profile fields)
  • Organizations
  • Billing
  • Fraud

Database restart resilience

Over the last few years, one of the benefits of Cloud SQL has been that it can achieve most database upgrades with only a few seconds of downtime. Clerk has application logic to ensure that requests in these few seconds are retried and unnoticeable to users.

But now, Clerk is rapidly approaching the point where we need to execute operations that require longer primary database downtime. We require additional application logic to handle writes during this downtime without impacting users.
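Retry logic of the kind described above typically wraps the operation with bounded retries and exponential backoff, so a few-second restart is absorbed rather than surfaced to users. A sketch (parameters are illustrative):

```python
import time

def with_restart_retry(op, attempts=5, base_delay_s=0.2):
    """Retry `op` through transient failures such as a brief database
    restart, backing off exponentially between attempts."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # downtime exceeded the retry budget
            time.sleep(base_delay_s * 2 ** attempt)  # 0.2s, 0.4s, 0.8s, ...
```

Writes during longer downtime need more than retries (e.g. buffering or queuing), which is the additional application logic mentioned above.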

Database connection pooler and major version upgrade

Clerk has historically only used a client-side database pooler inside our Cloud Run containers. Though we've known this is non-optimal, we did this because Google did not offer a managed database connection pooler, and we were skeptical of putting a self-hosted service between two managed services (Cloud Run and Cloud SQL).

In March, Google released a managed connection pooler in Preview, and it reached General Availability this week. However, using the connection pooler will require a major database version upgrade, and in our particular case, a network infrastructure upgrade. We are collaborating with Google to determine how we can achieve both upgrades safely and without downtime.

Simultaneously, we are investigating other database and connection pooler solutions that can run within our VPC.

Contributor
Colin Sidoti


SAML ForceAuthn

Category
SAML
Published

Clerk now supports configuring the ForceAuthn parameter on SAML authentication requests.

For users with SAML integrations, the Clerk dashboard now supports configuring the ForceAuthn parameter on a per-connection basis.

This is especially important on shared or multi-user devices where a previous user may still have an active SSO session at the Identity Provider (IdP). When ForceAuthn is enabled, Clerk includes the ForceAuthn=true parameter on the SAML AuthnRequest so the IdP will ignore any existing SSO session and require the user to re‑authenticate (password, MFA, etc.). This prevents the next person on the same machine from silently inheriting access due to someone else’s logged-in IdP session.
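Concretely, ForceAuthn is an attribute on the SAML AuthnRequest element. A minimal sketch of where it lands, using Python's standard XML library (only the attributes relevant here are included; a real request carries more):

```python
import xml.etree.ElementTree as ET

SAMLP = "urn:oasis:names:tc:SAML:2.0:protocol"

def build_authn_request(request_id, force_authn):
    """Build a stripped-down samlp:AuthnRequest, setting ForceAuthn
    when re-authentication at the IdP should be required."""
    ET.register_namespace("samlp", SAMLP)
    req = ET.Element(f"{{{SAMLP}}}AuthnRequest", {
        "ID": request_id,
        "Version": "2.0",
    })
    if force_authn:
        # Instructs the IdP to ignore any existing SSO session and
        # re-authenticate the user (password, MFA, etc.).
        req.set("ForceAuthn", "true")
    return ET.tostring(req, encoding="unicode")
```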

Expectations

Existing SAML connections are unchanged—ForceAuthn remains off by default to preserve current sign‑in behavior. If you enable it, users will be prompted to re‑authenticate at the IdP on every SSO sign‑in for that connection.

How to enable

In the Clerk Dashboard, navigate to the SSO Connections page, then:

  1. Select your SAML connection
  2. Select the Advanced tab
  3. Enable Force authentication
  4. Save

Contributor
Kevin Wang


Last-used sign-in method badge

Category
Product
Published

Users can now easily identify their last-used sign-in method with a visual badge indicator.

The sign-in experience now includes a helpful badge that displays on the last-used sign-in method, making it easier for users to quickly identify and select their previously used authentication option.

The badge appears automatically based on the user's sign-in history and requires no additional configuration from developers on new applications.

Existing applications can opt-in to this feature for their instances via the Clerk Dashboard.

Contributors
Tom Milewski
George Vanjek
