Just how useful are PC benchmark modes really?
Optimising performance needs to be easier - and that means we need better tools.
Have you ever loaded up a new PC title, run the in-game benchmark and tweaked settings for optimal performance, only to discover that actual gameplay throws up much lower frame-rates, intrusive stutter or worse? It's a particular frustration for us here at Digital Foundry, and it leads to a couple of very obvious questions: firstly, if benchmark modes are not indicative of real-life performance, what use are they? And secondly, if their use is limited, how representative of real-life gaming are the graphics card reviews that use them, including ours?
Before we go on, it's fair to point out that not every benchmark mode out there is useless beyond redemption. In fact, there are a range of great examples that set you up reasonably well for tweaking for optimal performance. And then there are others that stress system resources more heavily than the game itself - which, we'd argue, is of more use than a benchmark that inflates its performance figures.
However, there are some particularly striking examples we have to highlight simply because the delta between benchmark mode and real-world performance is absolutely massive. Perhaps the most notorious example we can muster is Tomb Raider 2013. It tests just one scene - the shipwreck from the beginning of the game - with the camera panning around the Lara Croft character model. It's a scene that's easy to replicate in-game, where we find that the same hardware running the same scene at the same settings produces anything up to a 21 per cent performance deficit.
The follow-up - Rise of the Tomb Raider - is an improvement of sorts, but still has issues. Three scenes are rendered: the snow peak from the beginning of the game, a beautiful run through the opening of The Prophet's Tomb and finally a fly-through of the Geothermal Valley. It's a benchmark that should be immensely useful because the game has a very specific problem - performance is great through all the early levels until you hit hub areas like the Geothermal Valley, at which point it drops significantly. However, the benchmark's rendition of this area plays out with no issues at all - in actual gameplay, performance dips by anything up to 35 per cent compared to the benchmark mode.
Possibly the worst example of a genuinely useless benchmark mode comes courtesy of Batman: Arkham Knight. The content chosen for benchmarking showcases some environments and effects work, but all it's actually doing is organising some pretty cinematic shots for the user to look at, with some fairly meaningless numbers coughed up at the end. The game itself is notorious for its very poor background streaming and generally bad open-world performance, particularly evident during fast traversal - such as driving the Batmobile. Stutter is present to some degree on every PC configuration we've ever tested this title with, yet there is none of it on display in the benchmark.
As things stand, in-game benchmark modes are often not reflective of actual performance realities or variability, and they tend to treat the GPU as the sole limiting factor in game performance - as Rise of the Tomb Raider and Tomb Raider demonstrate. This is problematic because not every PC owner is playing on a powerful i7, and the reality is that game performance is not entirely defined by rendering - though to be fair, that is typically the first bottleneck you'll encounter.
But that intrusive stuttering that often represents gaming performance at its worst? CPU and storage are the main culprits there, with simulation work, animation, AI and the generation of draw calls for the GPU often causing big dips in in-game performance. There's also the load incurred by streaming and decompressing new data, which can have profound implications - as we saw in Arkham Knight. And even CPU-limited performance can bottleneck in different ways: some systems are heavily parallelised - graphics dispatch, physics and animation - while AI and gameplay logic are often more single-threaded in nature.
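To make that concrete, here's a minimal toy model - a sketch in Python with purely illustrative numbers, not measurements from any game we've tested. Render work can be spread across cores, but single-threaded gameplay and AI sit on the critical path, and an occasional streaming or decompression hitch lands on top of it, so the frame-rate you actually feel is set by whichever side is slowest - something a GPU-only flythrough never exposes.

```python
# Toy frame-time model: a frame is gated by the slower of the GPU and the CPU
# critical path (serial work plus parallel work divided across cores).
# All numbers here are illustrative assumptions, not measurements.

def frame_time_ms(gpu_ms, cpu_serial_ms, cpu_parallel_ms, cores, hitch_ms=0.0):
    # CPU and GPU work overlap across the frame, so the slower side sets the pace.
    cpu_ms = cpu_serial_ms + cpu_parallel_ms / cores + hitch_ms
    return max(gpu_ms, cpu_ms)

for cores in (2, 4, 8):
    steady = frame_time_ms(gpu_ms=10.0, cpu_serial_ms=6.0, cpu_parallel_ms=12.0, cores=cores)
    hitch = frame_time_ms(gpu_ms=10.0, cpu_serial_ms=6.0, cpu_parallel_ms=12.0, cores=cores,
                          hitch_ms=25.0)  # e.g. a streaming/decompression spike
    print(f"{cores} cores: steady {1000 / steady:5.1f}fps, hitch frame {1000 / hitch:5.1f}fps")
```

Note how adding cores barely moves the hitch frame: the spike sits on the serial path, which is exactly why a benchmark that only exercises the GPU tells you nothing about it.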
So, essentially, we need things to change. We need in-game benchmarks that run the gamut of the actual experience. That means a number of things, but specifically we'd like to see the inclusion of near-camera explosions, AI in the numbers you'd see in-game, along with draw distances and effects work that are again representative of the title under load. A good example of this is the Grand Theft Auto 5 benchmark. After a series of pretty time-lapse shots, it gets down to the nitty gritty of punishing your hardware, taking you through the world at breakneck speed via a jet flyby before shifting to the perspective of a player character in a world filled with AI and a decent amount of effects work.
Earlier, I mentioned benchmarks that push things to the extreme, over-emphasising potential performance issues, and we have a great example of that in the form of Metro Last Light, where the in-game tool is an absolute worst case scenario, with lots of AI and animation and close-to-view alpha effects. Some might say it's a sobering and harsh impression of actual in-game performance, but it's more useful for tuning than a benchmark that massively over-inflates frame-rates, or doesn't stress test the CPU side of the engine at all.
There's also a strong case for including an after-action report showing how performance held up and where the issues were. The standard metrics of lowest, average and maximum frame-rates don't really help that much, and certainly don't inform you of how the 'feel' of a game can be impacted by elements like v-sync-hidden CPU drops, frame-pacing issues or periodic stutter - all of which are smoothed over by average frame-rates. The perfect implementation would be frame-time graphs covering the demo, so you can see where and how those drops occur. Gears of War does this beautifully, making it one of the very best - and actually useful - benchmark tools we've seen. In short, we'd like to see the kind of data offered up by our new benchmark widget, but with a wider scope in test situations.
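As a rough illustration of why averages mislead - a hedged sketch, assuming you can log per-frame times in milliseconds from a capture tool or engine hook, with a placeholder file name and arbitrary thresholds - the snippet below sets the average frame-rate against a percentile-based '1% low' figure and a simple hitch count, which is far closer to how a game actually feels.

```python
# Sketch: summarise a logged frame-time trace (one value in milliseconds per line).
# Average fps can look healthy while percentile figures and hitch counts reveal
# the stutter a player actually feels. File name and thresholds are placeholders.

import statistics

def summarise(frame_times_ms, hitch_threshold_ms=50.0):
    total_seconds = sum(frame_times_ms) / 1000.0
    avg_fps = len(frame_times_ms) / total_seconds
    slowest = sorted(frame_times_ms, reverse=True)
    one_percent = slowest[:max(1, len(slowest) // 100)]   # the slowest 1% of frames
    one_percent_low_fps = 1000.0 / statistics.mean(one_percent)
    hitches = sum(1 for t in frame_times_ms if t > hitch_threshold_ms)
    return avg_fps, one_percent_low_fps, hitches

if __name__ == "__main__":
    # 'frametimes.csv' is a hypothetical per-frame log captured during a benchmark run.
    with open("frametimes.csv") as f:
        times = [float(line) for line in f if line.strip()]
    avg, low, hitches = summarise(times)
    print(f"avg {avg:.1f}fps | 1% low {low:.1f}fps | frames over 50ms: {hitches}")
```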
Far Cry 5 GTX 1060 vs RX 580: Ultra Settings
So, for example, we'd like to see multiple benchmark types - say, GPU, CPU and game-based test runs. That way, PC hardware sites get specific tests for specific components: a GPU benchmark isolates the rendering workload, a CPU benchmark shows how the game scales across different types of processor in processor-heavy tasks, while a game benchmark combines the two to give the consumer a better understanding of what the actual gameplay experience looks like on their hardware.
And finally, there's another basic improvement that could make all the difference - improved, more communicative settings screens. Users are often dazzled by a vast array of different tweakables with no idea of what they actually do. Ubisoft's recent Far Cry 5 is a great example of how easy this is to get right - adjusting each setting brings up an image on-screen showing what difference the change actually makes. Better yet, allow the game to keep running as you make your tweaks, so you can see the impact of adjustments in real time: ARMA 3 and Final Fantasy 15 are two pretty good examples of this.
But returning to the actual usefulness of benchmarks as things stand, the question has to be raised about the impact of less useful examples on the accuracy of hardware reviews. Well, for graphics cards at least, even if the test runs aren't representative of the gameplay experience, they do accurately portray relative performance between various GPUs running the same graphics workloads. And by running a good-sized sample of them, conclusions can be drawn about the relative power of the graphics hardware being tested.
However, what is missing is a good workout of the GPU driver layer, and how a particular graphics card may run on a less capable CPU. This can be crucial to actual gameplay - a case in point is Call of Duty: Infinite Warfare. Couple a Pentium G4560 with mainstream GPUs like the GTX 1060 and the RX 580 and the Nvidia hardware pulls ahead, while AMD stutters hard and drops to the mid-40s. Swap out that Pentium for an i5 or better and the 580 is the clear winner on this title - by a process of elimination it's AMD's DX11 driver overhead that's the issue here, something that isn't measured in a world of GPU reviews powered by ultra-fast i7s.
It's food for thought when it comes to how useful GPU reviews are, especially for those holding onto older CPUs like the perennial Core i5 2500K - but the fact is that COD: IW doesn't have any kind of benchmark mode, let alone a CPU-based variant - we uncovered this through gameplay testing alone. Adding a '2500K' set of benches (or some other arbitrary lower-power processor) to a GPU review sounds like a great idea to show how a new graphics card runs within an older PC - but we're going to need decent benchmark modes from game-makers to measure this accurately.
And this returns us to the basic reality that, in the here and now, only a select minority of PC games are equipped with the benchmarking tools needed to accurately test performance across all aspects of the system, and so users tend to tweak in-game instead - which defeats the purpose of having them in the first place. In-game benchmarks need to evolve, and if they do, the user gets a better experience and a fuller understanding of how their system works, while PC hardware reviews can better inform users about the best kit to buy.