Fortnite Crushes PUBG With New Concurrent Player Record, Servers Come Crashing Down

Fortnite continues to dominate and smash through expectations as a force to be reckoned with in the age of PUBG. Since Epic Games implemented the free-to-play Battle Royale mode, the title has continued to soar in popularity, but it looks like that popularity came at a pretty heavy price as it surpassed PlayerUnknown's Battlegrounds' record of 3.2 million concurrent players.

Fortnite has officially beaten out the chicken dinner connoisseur with an astonishing 3.4 million concurrent players! The downside: the servers didn't know what to do with the surge and promptly shut down. The Fortnite crew took to their website to share the good, and the bad, news:

"Fortnite hit a new peak of 3.4 million concurrent players last Sunday… and that didn't come without issues! This blog post aims to share technical details about the challenges of rapidly scaling a game and its online services far beyond our wildest growth expectations.

Also, Epic Games needs YOU! If you have domain expertise to solve problems like these, and you'd like to contribute to Fortnite and other efforts, join Epic in Seattle, North Carolina, Salt Lake City, San Francisco, UK, Stockholm, Seoul, or elsewhere!"

If you fit the bill, you can apply right here!

The team went into incredible detail about exactly what happened during the huge outage, in the name of player transparency, while also tackling how to avoid it in the future! You can see all of the graphs and the full breakdown here, but below is what the team is actively doing to fix the issue and stop it from happening again:

Our top focus right now is to ensure service availability. Our next steps are below:

  • Identify and resolve the root cause of our DB performance issues. We've flown Mongo experts on-site to analyze our DB and usage, as well as provide real-time support during heavy load on weekends.
  • Optimize, reduce, and eliminate all unnecessary calls to the backend from the client or servers. Some examples:
    • Periodically verifying user entitlements when this is already happening implicitly with each game service call.
    • Registering and unregistering individual players on a gameplay session when these calls can be done more efficiently in bulk.
    • Deferring XMPP connections to avoid thrashing during login/logout scenarios.
    • Social features recovering quickly from ELB or other connectivity issues.
    When 3.4 million clients are connected at the same time, these inefficiencies add up quickly.
  • Optimize how we store the matchmaking session data in our DB. Even without a root cause for the current write queue issue we can improve performance by changing how we store this ephemeral data. We're prototyping in-memory database solutions that may be more suited to this use case, and looking at how we can restructure our current data in order to make it properly shardable.
  • Improve our internal operational excellence focus in our production and development process. This includes building new tools to compare API call patterns between builds, setting up focused weekly reviews of performance, expanding our monitoring and alerting systems, and continually improving our post-mortem processes.
  • Improve our alerting and monitoring of known cloud provider limits, and subnet IP utilization.
  • Reducing blast radius during incidents. A number of our core services impact all players globally. While we operate game servers all over the world, expanding to additional cloud providers and supporting core services in multiple geographical locations will help reduce player impact when services fail. Expanding our footprint also increases our operational overhead and complexity. If you have experience running large, worldwide, multi-cloud services and/or infrastructure, we would love to hear from you.
  • Rearchitecting our core messaging stack. Our stack wasn't architected to handle this scale and we need to look at larger changes in our architecture to support our growth.
  • Digging deeper into our data and DB storage. We hit new and interesting limits as our services grow and our data sets and usage patterns grow larger and larger every day. We're looking for experienced DBAs to join our team and help us solve some of the scaling bottlenecks we run into as our games grow.
  • Scaling our internal infrastructure. When our game services grow in size, so do our internal monitoring, metrics, and logging, along with other internal needs. As our footprint expands, our needs for more advanced deployment and configuration tooling and infrastructure also increase. If you have experience scaling and improving internal systems and are interested in what is going on here at Epic, let's have a chat.
  • Performance at scale. Along with a number of things mentioned, even small performance changes over N nodes collectively make large impacts for our services and player experience. If you have experience with large scale performance tuning and want to come make improvements that directly impact players please reach out to us.
  • MCP Re-architecture
    • Move specific functionality out of MCP to microservices
    • Event sourcing data models for user data
    • Actor based modeling of user sessions
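
One item on Epic's list above is replacing per-player register/unregister calls with bulk calls. As a rough illustration of that idea only, here is a minimal client-side batching sketch in Python; the class, the callback, and the batch limits are all hypothetical and not Epic's actual API:

```python
import threading
import time

class RegistrationBatcher:
    """Collects individual player register/unregister requests and flushes
    them to the backend as a single bulk call, instead of one request per
    player. All names and limits here are illustrative, not Epic's API."""

    def __init__(self, send_bulk, max_batch=500, flush_interval=0.25):
        self._send_bulk = send_bulk          # callable that performs the bulk backend call
        self._max_batch = max_batch
        self._flush_interval = flush_interval
        self._pending = []                   # queued (action, player_id, session_id) tuples
        self._lock = threading.Lock()
        self._last_flush = time.monotonic()

    def register(self, player_id, session_id):
        self._enqueue(("register", player_id, session_id))

    def unregister(self, player_id, session_id):
        self._enqueue(("unregister", player_id, session_id))

    def _enqueue(self, item):
        with self._lock:
            self._pending.append(item)
            full = len(self._pending) >= self._max_batch
            stale = time.monotonic() - self._last_flush >= self._flush_interval
        if full or stale:
            self.flush()

    def flush(self):
        with self._lock:
            batch, self._pending = self._pending, []
            self._last_flush = time.monotonic()
        if batch:
            self._send_bulk(batch)           # one backend call for the whole batch


if __name__ == "__main__":
    # Stand-in for the real backend call: just report how many items it received.
    batcher = RegistrationBatcher(lambda batch: print(f"bulk call with {len(batch)} items"),
                                  max_batch=100)
    for i in range(250):
        batcher.register(player_id=i, session_id="match-42")
    batcher.flush()                          # drain whatever is left
```

The same batching could just as well happen server-side; the principle is identical either way: one backend call for many players instead of one call per player.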
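Another item mentions prototyping in-memory database solutions for the ephemeral matchmaking session data. A real prototype would more likely sit on an existing in-memory store, but the toy class below sketches the property Epic is after: short-lived data that expires on its own instead of piling up in the primary database's write queue. Everything here is illustrative.

```python
import time

class EphemeralSessionStore:
    """Tiny in-memory key/value store with per-entry TTL. Matchmaking
    session data only needs to live for the length of a match, so it can
    expire on its own and never touch durable storage."""

    def __init__(self, default_ttl=120.0):
        self._default_ttl = default_ttl
        self._data = {}                      # session_id -> (expires_at, payload)

    def put(self, session_id, payload, ttl=None):
        expires_at = time.monotonic() + (ttl if ttl is not None else self._default_ttl)
        self._data[session_id] = (expires_at, payload)

    def get(self, session_id):
        entry = self._data.get(session_id)
        if entry is None:
            return None
        expires_at, payload = entry
        if time.monotonic() >= expires_at:   # lazily expire stale sessions on read
            del self._data[session_id]
            return None
        return payload

    def sweep(self):
        """Drop every expired session; would run periodically in practice."""
        now = time.monotonic()
        expired = [sid for sid, (exp, _) in self._data.items() if now >= exp]
        for sid in expired:
            del self._data[sid]
        return len(expired)


if __name__ == "__main__":
    store = EphemeralSessionStore(default_ttl=1.0)
    store.put("match-42", {"players": 100, "region": "eu"})
    print(store.get("match-42"))             # payload is still live
    time.sleep(1.1)
    print(store.get("match-42"))             # None: the session expired on its own
```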
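Under the MCP re-architecture bullet, Epic lists event sourcing data models for user data. The sketch below shows that pattern in miniature: current state is never stored directly, only an append-only log of events, and a user's profile is rebuilt by replaying that log. The event types and fields are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str
    data: dict

@dataclass
class UserProfile:
    """Current state derived purely from the event log, never stored directly."""
    vbucks: int = 0
    items: list = field(default_factory=list)

    def apply(self, event: Event) -> None:
        # Each event type has a deterministic effect on the derived state.
        if event.kind == "CurrencyGranted":
            self.vbucks += event.data["amount"]
        elif event.kind == "CurrencySpent":
            self.vbucks -= event.data["amount"]
        elif event.kind == "ItemGranted":
            self.items.append(event.data["item"])

class EventStore:
    """Append-only log per user; state is reconstructed by replaying events."""

    def __init__(self):
        self._streams = {}                   # user_id -> list of events

    def append(self, user_id: str, event: Event) -> None:
        self._streams.setdefault(user_id, []).append(event)

    def load_profile(self, user_id: str) -> UserProfile:
        profile = UserProfile()
        for event in self._streams.get(user_id, []):
            profile.apply(event)             # replay history to get current state
        return profile


if __name__ == "__main__":
    store = EventStore()
    store.append("player-1", Event("CurrencyGranted", {"amount": 1000}))
    store.append("player-1", Event("ItemGranted", {"item": "glider_basic"}))
    store.append("player-1", Event("CurrencySpent", {"amount": 400}))
    print(store.load_profile("player-1"))    # UserProfile(vbucks=600, items=['glider_basic'])
```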
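The same bullet also mentions actor-based modeling of user sessions. In broad strokes, that means each session is owned by a single actor that processes messages from its own mailbox one at a time, so session state never needs shared locks. A toy version of the pattern, with invented message names:

```python
import queue
import threading

class SessionActor:
    """One actor per session: it owns its state and handles messages
    sequentially from a private mailbox, so no other thread ever touches
    the state directly. Message shapes here are purely illustrative."""

    def __init__(self, session_id):
        self.session_id = session_id
        self._mailbox = queue.Queue()
        self._state = {"players": set()}
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def tell(self, message):
        """Asynchronous, fire-and-forget send into the actor's mailbox."""
        self._mailbox.put(message)

    def stop(self):
        self._mailbox.put(("stop",))
        self._worker.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message[0] == "stop":
                break
            self._handle(message)

    def _handle(self, message):
        kind = message[0]
        if kind == "join":
            self._state["players"].add(message[1])
        elif kind == "leave":
            self._state["players"].discard(message[1])
        print(f"[{self.session_id}] {kind}: {len(self._state['players'])} players")


if __name__ == "__main__":
    actor = SessionActor("match-42")
    actor.tell(("join", "player-1"))
    actor.tell(("join", "player-2"))
    actor.tell(("leave", "player-1"))
    actor.stop()
```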

Problems that affect service availability are our primary focus above all else right now. We want you all to know we take these outages very seriously, conducting in-depth post-mortems on each incident to identify the root cause and decide on the best plan of action. The online team has been working diligently over the past month to keep up with the demand created by the rapid week-over-week growth of our user base.

While we cannot promise there won't be future outages as our services reach new peaks, we hope to live by this great quote from Futurama, "When you do things right, people won't be sure you've done anything at all."
