The Ys X: Nordics PC Optimization Journey

Even more so than for the turn-based JRPGs we frequently work on, solid performance is essential for a fast-paced action RPG like Ys X. In this post, I'll briefly outline the various optimizations we made to the PC version of the game over the development period of our port.

[h2]The Overall Timeline[/h2]

Let's not bury the lede and instead jump right to the results:

[img]{STEAM_CLAN_IMAGE}/44880176/f318a0b154e91a5580dd6d413579c9109b2fd299.png[/img]

The image above shows performance at a fixed, very challenging benchmarking spot (see below for details). To give you an idea of the time scales here, 0.1 was released into beta at the end of July, 0.2 followed just a few days later, and we reached version 0.8 in the middle of September.

[expand type=details expanded=false]This is a spot on Balta Island which we found to be the absolute worst case in terms of CPU-bound performance. All settings are at their maximum, except for resolution and anti-aliasing, to make sure that the game is CPU-bound. We load into a fixed save and then report the FPS after waiting for 30 seconds in order for everything to settle. The system is my development workstation, which is a Core i9-12900K with an RTX 4090 -- but since we are interested in the relative comparison between versions here, that's not very relevant.[/expand]

I'll now go into some detail on the more interesting optimizations we carried out.

[h2]Before 0.1: Fun with Drivers[/h2]

The version 0.1 you see in the chart above is the first version we released into beta, but of course not the first one we built for PC in general. As performance (and, sadly, sometimes even correctness) can vary wildly across different hardware vendors, we have Nvidia, AMD and Intel hardware internally for testing. Luckily, our main developer on Ys X PC during its early phase was working on AMD hardware, so he immediately noticed that something was off.
[img]{STEAM_CLAN_IMAGE}/44880176/997b26e14368e4bf955612a9d04b1b4c9359a78f.png[/img]

During the intro scene pictured above (which looked a lot worse at that time, especially in terms of aliasing, but that's a different story), his framerate dropped [b]below 5 FPS[/b] on a rather high-end system. He immediately asked others to test on different hardware, and we found that there was no issue on Nvidia GPUs.

Something like this could well turn into a multi-week deep dive into API intricacies and driver behaviour, but fortunately we had already seen something similar in Trails through Daybreak -- though not nearly to the same extent. To make a long story short, back then we found that particular uses of memory in DX11 led to catastrophic performance loss on AMD, and we solved this issue by changing the buffering implementation of GPU uploads. Applying the same procedure to Ys X also solved this performance drop.

As a side note: since AMD might change the driver implementation in the future, we made this into a setting. It's not in the menu, but if you are an AMD user and curious about this, you can set the [code]extra_host_write_buffering[/code] setting in the "General" category to "Always" or "Never" rather than the default "Auto", and watch your performance plummet.

[h2]From 0.1 to 0.8: Low-hanging Fruit and Gradual Refinement[/h2]

There's not too much to say about this part: despite the overall impact being very high, it's the result of many small tweaks and improvements that aren't particularly interesting on their own. Things like dealing with static geometry more effectively, skipping unnecessary recomputation and so on.

The general idea for CPU performance optimization is that you look at profiler output and try to optimize anything which seems to take an inordinate amount of time.
That's where the "low-hanging fruit" idea comes from: at first, when you start optimizing, you can work on things that have a large overall impact, and potentially achieve huge improvements very quickly. The more optimized the software becomes, the harder it is to make further headway. This is pretty visible in the overall comparison chart, especially if you consider that it was only a few days from 0.1 to 0.2 and several weeks from that to 0.8.

[h2]Additional Parallelization[/h2]

This is by far the most exciting step, and also the one that is really a rather questionable idea. When we started to work on the port, the game had a main update thread, a graphics thread, and several background threads for audio and asynchronous tasks. Of those, only the graphics and main update threads do any real CPU work, and only the main thread was limiting performance.

Looking into that main update thread further, we discovered that a large chunk of time was spent updating all the actors in the scene individually -- that includes characters, monsters, pickups, and basically everything else you can see, as well as some things you can't see, such as event triggers.

So the basic idea is rather obvious: perform these updates in parallel. Of course, in reality it's not quite that simple, since each of these update steps can and will interact arbitrarily with some global system or state that was developed under the assumption that everything is sequential. The surprising thing is really that we got it to work, but a lot of development time went into debugging problems that arose due to this parallelization.

Ultimately, due to the level of synchronization required, the actor updates only really scale up to 3 threads -- but the improvement is still substantial, especially since it is most pronounced in the hardest-to-run areas (with more individual actors).
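
To make the trade-off above concrete, here is a minimal Python sketch of the general pattern: actor-local work runs in parallel, while every access to shared state is guarded by a lock. This is an illustration only and not the game's actual code. The names [code]Actor[/code], [code]World[/code], [code]total_distance[/code] and [code]update_actors_parallel[/code] are hypothetical, and the default of 3 workers simply mirrors the scaling limit mentioned above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Actor:
    """Hypothetical actor: anything in the scene that is updated once per frame."""
    name: str
    x: float = 0.0
    speed: float = 1.0

    def update(self, dt: float, world: "World") -> None:
        # Actor-local work touches only this actor, so it needs no locking.
        step = self.speed * dt
        self.x += step
        # Shared state was written for sequential code, so it is guarded here.
        # The more time is spent inside such locks, the worse the scaling.
        with world.lock:
            world.total_distance += step


class World:
    """Hypothetical stand-in for global systems that actors interact with."""

    def __init__(self, actors):
        self.actors = list(actors)
        self.lock = threading.Lock()
        self.total_distance = 0.0


def update_actors_parallel(world, dt, pool, workers=3):
    # One strided chunk of actors per worker.
    chunks = [world.actors[i::workers] for i in range(workers)]

    def run(chunk):
        for actor in chunk:
            actor.update(dt, world)

    # list() waits for all chunks and re-raises any exception from a worker.
    list(pool.map(run, chunks))
```

A frame loop would create the pool once and call [code]update_actors_parallel(world, dt, pool)[/code] every frame. In a real engine the hard part is finding every piece of shared state that needs such a guard, which is where the debugging time described above goes.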
[img]{STEAM_CLAN_IMAGE}/44880176/8b7030dff9bfbb0a23f490db4400058332cf2aa3.png[/img]

[h2]GPU Query & Input Optimization[/h2]

The final step on the optimization journey was more about improving frametime stability at high framerates than about pure performance. The image above shows a frametime chart comparison, and you can see that not only is the release version much faster, it is also a lot more stable. It is important to note here that this is [b]without a frame limiter or V-sync[/b] -- because you usually only see such a flat frametime chart with one of those engaged.

What we did here is purposefully introduce a one-frame-off synchronization point between GPU and CPU progress. My initial thought was that this would improve stability but reduce framerate, and so it would have to be a setting, but in actual testing across several configurations it turns out that it improves [i]both[/i]. I'm not 100% sure why, but my theory -- after investigations with [url=https://github.com/wolfpld/tracy]Tracy[/url] -- is that it has to do with thread scheduling decisions from the OS being improved in this case.

We also greatly improved performance in the presence of high-polling-rate mice, and if you really want to get into the weeds of that, [url=https://ph3at.github.io/posts/Windows-Input/]here is an entire article about just this issue[/url].

[h2]Conclusion & the Big Lie[/h2]

Before we close out I have something to confess: [b]I lied to you[/b]. In the performance comparison image at the start of this post, it says that there is "identical quality across all versions". This is not really true. When implementing parallelization, we actually got rid of a system that optimized performance by only updating the animations of distant monsters and characters once every several frames. So the parallel version actually does substantially [i]more[/i] work, more quickly.
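
For readers unfamiliar with this kind of system, the following Python sketch shows roughly what "updating distant animations only once every several frames" can look like. It is a guess at the general technique and not the removed code. The distance threshold, the interval of 4 frames, and all names ([code]AnimatedActor[/code], [code]update_interval[/code], [code]tick_animation[/code]) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AnimatedActor:
    """Hypothetical actor with an animation clock."""
    distance: float          # distance to the camera, arbitrary units
    anim_time: float = 0.0   # animation time that has actually been applied
    pending_dt: float = 0.0  # time accumulated since the last applied update


def update_interval(distance, near=20.0, far_interval=4):
    # Assumed policy: nearby actors animate every frame, distant ones
    # only every `far_interval` frames.
    return 1 if distance <= near else far_interval


def tick_animation(actor, frame_index, dt):
    """Advance the animation, skipping frames for distant actors.

    Returns True if the (expensive) animation update ran this frame.
    """
    actor.pending_dt += dt
    if frame_index % update_interval(actor.distance) == 0:
        # Apply all accumulated time at once so the animation keeps its speed
        # and only loses smoothness.
        actor.anim_time += actor.pending_dt
        actor.pending_dt = 0.0
        return True
    return False
```

The saving comes from the skipped frames, and the cost is choppier motion in the distance. Dropping such a system means every actor animates every frame, which is the extra work referred to above.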
So, overall, I believe that from a CPU performance perspective Ys X is the most optimized game we've ever released. It of course requires a bit more performance than something like the Crossbell games, but almost any imaginable system should be GPU-limited at appropriate settings -- and we'll talk more about that aspect -- settings and features -- in a future article.

Cheers,
Peter "Durante" Thoman, CTO, PH3