Do you remember when the major technological innovations in games debuted on PC? The rise of cross-platform development and the arrival of PC technology inside the current generation of consoles has brought profound change. Today more than ever, PlayStation and Xbox technologies define the baseline of the visual experience, and PC upgrade vectors are somewhat limited, often confined to resolution and frame-rate. However, the arrival of real-time ray tracing on PC is changing the game, and Metro Exodus from 4A Games is one of the most exciting and innovative titles we've seen in a long time. It is a game that excels on consoles, but delivers a truly revolutionary visual experience on the latest PC hardware.
The game is fascinating on many levels. First of all, as we approach the end of this console generation, it is actually the first title built from the ground up for current-generation hardware by 4A Games, true pioneers of graphics technology. It also marks 4A's transition from a traditional linear course structure to a more open play style, even though the narrative element remains much more defined and missions can be approached in a manner much closer to Crysis. Think of it as a kind of "wide" level design, as opposed to a Ubisoft-style sandbox. Whatever the case may be, this transition required an in-depth rethink of how the Metro world is rendered and streamed, while maintaining the extreme detail seen in previous Metro titles. And remember, all of this must run not only on the latest, most capable PCs and the enhanced consoles, but also on base Xbox and PlayStation hardware.
And then there are the game's more forward-looking, next-generation features. Real-time ray tracing is now possible on PCs equipped with Nvidia RTX graphics cards, and while what we saw at Gamescom was very impressive, that was 4A Games' very first ray tracing implementation, running at 1080p and 60 frames per second on the top-end RTX 2080 Ti. Which raises an obvious question: how will the lesser cards cope? The answer comes down to 4A revising its RT implementation, reworking the technology to deliver results equivalent to its superb ray-traced global illumination solution, while allowing the whole RTX family of GPUs to perform well.
All of which is to say that while we were waiting for the Metro Exodus review code, Digital Foundry had a lot of questions about the directions 4A has taken with its latest project, about how its engine has been improved and upgraded since we last saw it in the Metro Redux titles, and of course about how it delivered and optimised one of the most beautiful ray tracing implementations we've seen. Ben Archard, rendering programmer at 4A, and the developer's chief technical officer, Oles Shishkovstov, answer our questions in depth.
What are the most important changes to the 4A engine between the Metro Redux releases and Metro Exodus? Just looking at Metro Exodus, many modern features of this generation appear in a very refined form, along with effects the 4A engine pioneered: physically based materials, global volumetrics, per-object motion blur on consoles, extensive use of parallax mapping/tessellation, lots of GPU particles, and so on.
Ben Archard: A host of new features, and a conceptual shift in the way we approach them. Stochastic algorithms and denoising are now a major axis of our rendering. We'll start with the stochastic algorithms, because they are used across many features and the term covers a family of techniques.
Suppose you are trying to model and analyse a large, complex system with a very large number of individual elements (too much information to track exhaustively). You can either literally count every data point and draw your statistical conclusions by brute force, or you can randomly select a subset of information representative of the whole. Think of a random survey of people on the street, or a random medical test across a few thousand patients. You use a much smaller set of values, and although it doesn't give you the exact data you would get by checking everyone, you still get a very close approximation when you analyse your results. The trick, in these examples, is to make sure your samples are well distributed, so that each one is truly representative of a wide range of people. You basically get the same result, but with far less effort spent on data collection. That's the Monte Carlo method in a nutshell.
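The survey analogy above is the Monte Carlo method in miniature. A minimal sketch (using an invented toy dataset, not anything from the engine) might look like this:

```python
import random

def monte_carlo_mean(data, num_samples, rng):
    """Estimate the mean of a huge dataset from a small random sample."""
    picks = [rng.choice(data) for _ in range(num_samples)]
    return sum(picks) / num_samples

# A 'population' far too large to want to visit exhaustively.
population = [x % 100 for x in range(1_000_000)]  # exact mean is 49.5
estimate = monte_carlo_mean(population, 1_000, random.Random(42))
```

With 1,000 well-distributed samples the estimate typically lands within about one unit of the true mean, at a thousandth of the cost of visiting every element.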
Linked to this, the other main part of stochastic analysis is some randomisation. Of course, we don't do anything truly at random, and we wouldn't want to. A better way of putting it is the generation of sample noise, or jitter. The reason noise is important is that it breaks up regular sampling patterns, which your eyes are very good at detecting in images. In the worst case, if you sample something that varies at a frequency similar to the one you're sampling at (which is low because of Monte Carlo), you can pick results that are too homogeneous and miss the detail in between. You might select only the bright spots on a surface, for example, or only the metal bars in a fence. So the noise breaks up the aliasing artifacts.
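To illustrate why jitter matters, here is a toy example (the "stripe" pattern is invented purely for demonstration): sampling a repeating pattern at its own frequency locks onto one feature, while jittered samples recover the true average.

```python
import random

def stripe(x):
    """A high-frequency pattern: bright bands of width 1 repeating every 2 units.
    Its true average brightness is 0.5."""
    return 1.0 if (x % 2.0) < 1.0 else 0.0

def sample_regular(n):
    # Stride matches the pattern's period: every sample lands on a bright band.
    return sum(stripe(i * 2.0 + 0.5) for i in range(n)) / n

def sample_jittered(n, rng):
    # Randomising each sample's position within its stride turns the
    # systematic error (aliasing) into noise that averages out.
    return sum(stripe(i * 2.0 + rng.uniform(0.0, 2.0)) for i in range(n)) / n

regular = sample_regular(1_000)                       # locks onto the bands
jittered = sample_jittered(1_000, random.Random(1))   # close to the true 0.5
```

The regular version reports a uniformly bright surface; the jittered one is noisy per sample, but its average is correct, which is exactly the trade the engine makes.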
The problem is that when you cut your sample count right down, sometimes to one or fewer samples per pixel, you can really see the noise. That's why we have a denoising TAA. Any individual frame will look very noisy, but when you accumulate information over a few frames and denoise as you go, you can build up the coverage you need. I'd point to your recent RE2 demo analysis video, where you capture a frame immediately after a cutscene, leaving only a single frame of noisy data to work with. You'll also see it in a lot of games where you step out from behind a corner and suddenly a lot of new scene information is revealed, and you have to start building it up from scratch. The point I'm trying to make here is why we (and everyone else) have generally chosen to do things this way, and what the trade-off is. You end up with a noisier image that you have to work hard to filter, but the benefits are an image with less aliasing and the ability to compute more complex algorithms less often.
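The per-frame accumulation can be sketched as a simple exponential moving average per pixel. This is only a rough stand-in for what a denoising TAA does (it ignores reprojection, clamping and all the hard parts), but it shows both the convergence and why the first frame after a cut looks noisy:

```python
import random

def temporal_accumulate(history, noisy_frame, alpha=0.1):
    """Blend the new noisy frame into the accumulated history.
    A low alpha means more smoothing, but a slower response when the
    history is invalidated (hence the noisy frame right after a cut)."""
    return [(1.0 - alpha) * h + alpha * n for h, n in zip(history, noisy_frame)]

rng = random.Random(3)
truth = [0.2, 0.5, 0.8]                              # the signal being estimated
frame = [t + rng.uniform(-0.3, 0.3) for t in truth]  # 1 sample/pixel: very noisy
for _ in range(200):                                 # accumulate over many frames
    new = [t + rng.uniform(-0.3, 0.3) for t in truth]
    frame = temporal_accumulate(frame, new)
```

After a couple of hundred frames each "pixel" has converged close to its true value, even though every individual frame carried heavy noise.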
So that's kind of the story behind a lot of these modern features. They are very complicated to compute and involve a lot of data, so we try to minimise the number of times we actually compute them, and then filter afterwards. Of course, computer graphics is full of situations where you want to get very close to some quantity, but with as few actual calculations as possible. Ray tracing is an obvious example, because there are many more photons of light than the actual number of rays we cast.
We also use it for hair, where there are more fine strands than you would want to spend geometry on, all of them too small for individual pixels. It is used in many image sampling techniques, such as shadow filtering, to generate penumbrae across multiple frames. Also, screen-space reflections are effectively a kind of 2D ray tracing. We use depth jitter in volumetric lighting: with our atmospheric simulation, we integrate over regular depth values to generate a volume texture. Each voxel, as you go deeper into the texture, builds on the ones before it, so you get an effective fog density for any given distance. But of course a volume texture only 64 voxels deep covering a great distance is fairly low fidelity, so you can end up with the appearance of visible depth planes. Adding some depth jitter helps resolve this.
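That front-to-back build-up through the volume can be illustrated with a simplified sketch (the densities and step length here are invented; this is Beer-Lambert accumulation, not 4A's actual froxel code):

```python
import math

def accumulate_fog(slice_densities, step_length):
    """Front-to-back integration through a volume texture: each voxel's
    transmittance builds on all the slices in front of it."""
    transmittance = 1.0
    per_slice = []
    for density in slice_densities:
        transmittance *= math.exp(-density * step_length)  # Beer-Lambert per slice
        per_slice.append(transmittance)
    return per_slice  # how much light survives to each depth

# 64 slices of uniform fog covering a long distance.
result = accumulate_fog([0.02] * 64, step_length=1.0)
```

With only 64 slices, each entry covers a large depth range, which is why the slice boundaries can become visible as planes unless the sampling depth is jittered.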
Regular screen-space ambient occlusion is another technique that involves gathering a lot of samples from the surrounding depth buffer to estimate how much light is blocked from a given pixel. The number of pixels you need to sample for good data increases with the square of the distance over which you want a pixel to be affected. So cutting the sample count is very important here, and again the noisy AO can be filtered from one frame to the next. Incidentally, this is one of the reasons (though not the only one) why AO will have to move over to ray tracing in the future. The sheer range at which objects can directly affect occlusion becomes so large with RT that it eventually becomes impractical to accurately sample enough pixels across that radius. And that's before you get into the amount of information lost when pixels are occluded during rasterisation or are off-screen.
So really, one of the main directions of the rendering engine has shifted: be more selective about when we perform the really large, complex computations, and then spend a good chunk of frame time filtering, denoising and anti-aliasing the final image. And that has the advantage of allowing those calculations (which we do less frequently) to be much more sophisticated.
There's an old paper (from 1986) by Robert Cook worth linking here. It's written in fairly plain English and it's a very good read. It shows where a lot of this thinking comes from; it was cutting-edge research for offline rendering 30 years ago. As you read it, you'll be struck by just how closely it matches what we are currently working on in real time. Much of it remains very relevant and, as the author said at the time, denoising was an active area of research. That's still the case, and it's where most of the RTX work has gone. Cook was assuming 16 rays per pixel, which we still can't afford, but hopefully the technology will get its own Moore's law. That said, I doubt they had 4K TVs to support. Even so, it's the improvements in denoising that allow us to do this with fewer than one ray per pixel.
Another big improvement is that we have really upgraded the lighting model, both in the actual calculation of light from each source and in the way we store and integrate those samples into the image. We have upgraded to a fully custom GGX solution for every light source, most of them attenuated by stochastically filtered shadow maps, for more and nicer shadows than in previous games. We also use a light clustering system, which stores lights in a screen-aligned voxel grid (dimensions 24x16x24). In each cluster we store a reference to the lights that can affect anything inside it. Then, when we shade the image, we can take each output pixel's view-space position, determine which cluster it lies in, and apply only the lights that affect that region of the screen.
Now, we have always had a deferred pipeline for opaque objects, which builds a g-buffer onto which the lights are then accumulated. But we also had a forward pass for blended effects that didn't have access to all the lighting data. Storing all the lights this way means the forward renderer is now fully compatible with every light, so particles, hair, water and so on can all be lit properly. These clusters also contain all the information about each type of light, including shadowed/unshadowed, spot, omnidirectional shadows and the new light probes. We just do dynamic branching in the shader based on the light flags stored in the cluster buffer.
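A sketch of the clustering idea described above (the 24x16x24 grid dimensions come from the interview; the light representation, screen bounds and cell maths here are illustrative assumptions, nothing like the production code):

```python
GRID = (24, 16, 24)  # screen-aligned voxel grid: x, y, depth slices

def build_clusters(lights, screen_w, screen_h, max_depth):
    """lights: list of (x, y, depth, radius) in a simplified screen-aligned space.
    Each light is registered in every cluster its sphere of influence may touch."""
    cell = (screen_w / GRID[0], screen_h / GRID[1], max_depth / GRID[2])
    clusters = {}
    for index, (lx, ly, lz, radius) in enumerate(lights):
        lo = [max(int((p - radius) / c), 0) for p, c in zip((lx, ly, lz), cell)]
        hi = [min(int((p + radius) / c), g - 1)
              for p, c, g in zip((lx, ly, lz), cell, GRID)]
        for cx in range(lo[0], hi[0] + 1):
            for cy in range(lo[1], hi[1] + 1):
                for cz in range(lo[2], hi[2] + 1):
                    clusters.setdefault((cx, cy, cz), []).append(index)
    return clusters

def lights_for_pixel(clusters, px, py, depth, screen_w, screen_h, max_depth):
    """Shading only consults the lights registered in the pixel's cluster."""
    cx = min(int(px / screen_w * GRID[0]), GRID[0] - 1)
    cy = min(int(py / screen_h * GRID[1]), GRID[1] - 1)
    cz = min(int(depth / max_depth * GRID[2]), GRID[2] - 1)
    return clusters.get((cx, cy, cz), [])

# Two lights far apart: only the nearby one is fetched when shading a pixel.
lights = [(100.0, 100.0, 10.0, 20.0), (1800.0, 900.0, 400.0, 20.0)]
clusters = build_clusters(lights, 1920.0, 1080.0, 500.0)
```

The payoff is that a forward-rendered particle or hair strand can loop over a handful of cluster-local lights rather than every light in the scene.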
We also have a high-precision (FP16) rendering option for forward-rendered objects. And another option lets forward-rendered effects write to the screen-space velocity buffer, for more accurate motion blur on alpha-blended objects. In addition, our forward pass is now performed at half resolution but with 4x MSAA (where supported). This gives the same number of samples, so you lose less information when you scale back up, but rasterisation and interpolation are shared across the four samples of each pixel.
The latest console versions of Metro targeted, and impressively held, a very stable 60fps. Metro Exodus targets 30fps on consoles this time. Beyond the GPU-side rendering features, where are the additional CPU cycles afforded by that 30fps target being spent on console?
Ben Archard: Open-world maps are completely different from the enclosed tunnel maps of the other games. Environments are larger and contain far more objects, visible out to much greater distances. It is therefore much harder to cull objects from both update and rendering. Objects much further away still need to update and animate. In the tunnels, you could mostly cull an object in the next room so that only its AI was active, and then start updating animations and effects once it became visible, but the open world makes that much trickier.
Lights in the distance need to cast shadows. Higher-quality scenes with dynamic weather systems mean a greater abundance of particle effects. Procedural foliage has to be generated on the fly as you move around. Terrain has to be LODded dynamically. Even where distant objects can be collapsed down to impostors, there are far more distant objects to worry about.
So we spend a good deal of that extra time updating more AIs, particles and physics objects, but we also spend a good deal of it feeding the GPU with the extra things it is going to render. We parallelise it where we can. The engine is built around a multithreaded task system. Entities such as AIs or vehicles update in their own tasks. Each shadowed light, for example, performs its own culled gather, in a separate task, of the objects it needs to render. This gather is very similar to the main camera's culling process, just repeated many times throughout the scene, once for each light. All of that has to be completed before the respective deferred and shadow map passes can begin (at the start of the frame).
So, basically, a lot of the extra work comes from properly updating the things in an open world that you can no longer just hide behind a corner in the dark. And a lot of it is simply that there are more things that can be in view.
With the release of DXR GI on PC, we have to recall our discussion of a few years ago about real-time global illumination (a rough voxelisation of the game scene was mentioned back then as a possible solution for GI). What kind of GI does Metro Exodus currently use on consoles? Does DXR GI influence where the 4A engine could go for next-generation consoles?
Ben Archard: We use a spherical harmonics grid around the camera, smoothly updated from the latest RSM data each frame, plus a collection of light probes. It's a relatively inexpensive solution and pretty good in many cases, but it can leak light, and it's too coarse to produce anything that even resembles indirect shadows. If the next-generation consoles can trace rays, we would be all in.
Yes, consoles and PC both use this GI method as standard at the moment. The technique is strongly influenced by Radiance Hints (G. Papaioannou). The general process involves taking a 32x16x32 voxel grid (or rather three of them, for RGB) around the camera and, for each voxel, storing a spherical harmonic that encodes certain colour and directional properties. We populate the grid from a collection of light probes and from the reflective shadow map (RSM) generated alongside the second sun shadow cascade. Effectively, we render the scene from the sun's point of view as with a normal shadow map, but this time we also keep the albedos (reflected light) and normals (to calculate the direction of the bounce). It's pretty much the same thing we do when generating a g-buffer.
At GI construction time, we can take a number of samples from this RSM for each voxel, to get an idea of what light reaches that voxel and from which directions. We average those samples to give us something like an average light colour with a dominant direction as it passes through the voxel. Sampling within the voxel then gives us (broadly speaking) a kind of small directional light source. We keep history data (the voxel grids of previous frames) for four frames, so we can accumulate data smoothly over time. And yes, we also jitter the way we sample the voxel grid later, when it is used for light accumulation.
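Collapsing a voxel's RSM samples into one small directional light can be sketched as follows. This is a heavily simplified stand-in: real radiance hints store spherical harmonics per voxel, not a single averaged colour and direction.

```python
import math

def voxel_light(samples):
    """samples: list of ((r, g, b), (dx, dy, dz)) pairs taken from the RSM.
    Returns an average colour and a normalised dominant direction, i.e.
    one small directional light standing in for the whole voxel."""
    n = len(samples)
    colour = [sum(s[0][i] for s in samples) / n for i in range(3)]
    direction = [sum(s[1][i] for s in samples) for i in range(3)]
    length = math.sqrt(sum(d * d for d in direction)) or 1.0
    return colour, [d / length for d in direction]

colour, direction = voxel_light([
    ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),  # red light arriving along +y
    ((0.0, 0.0, 1.0), (0.0, 1.0, 0.0)),  # blue light, same direction
])
```

Two opposing-colour samples from the same direction average into a purple light with that shared dominant direction, which is exactly the kind of low-frequency summary the grid is meant to hold.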
It's a relatively inexpensive and effective solution, but the first thing to note is that a 32x16x32 texture across the screen isn't a lot of information, so the technique is quite low fidelity. If you imagine the amount of information you could store in a shadow map of that size (or even smaller), it's obvious this is too coarse to give anything that even resembles indirect shadows at a distance. It can also suffer from light leaking. Of course, it has already become the obsolete stop-gap, because we want to do this with RT now, and if the next-generation consoles can support RT, we would be all in.
Let's talk about ray tracing on next-generation console hardware. How viable do you think it is, and what would the alternatives be if it's not the RTX cards we see on PC? Could we envisage a future in which consoles use something like a voxel GI solution while the PC keeps its DXR path?
Ben Archard: It doesn't really matter whether it's dedicated hardware or just enough compute power to do it in shader units; I believe it would be viable. For the current generation, yes, multiple solutions are the way forward.
It's also a question of how long you keep supporting a parallel pipeline for legacy PC hardware. A GeForce GTX 1080 is not an outdated card for someone who bought one last year. It will take a few years for those cards to phase out and for RT to become mainstream enough that you can simply assume it. And obviously, on current-generation consoles we need the voxel GI solution in the engine alongside the new RT solution. RT is the future of gaming, so the focus is on RT one way or another.
As for the viability of RT on next-generation consoles, the hardware doesn't have to consist specifically of RTX cores. Those cores aren't the only thing that matters for ray tracing. They are fixed-function hardware that accelerates the calculations specifically related to BVH intersection tests. Those calculations can be done in standard compute if the compute cores are numerous and fast enough (and we think they will be on next-generation consoles). In fact, any GPU running DX12 will be able to "run" DXR, since DXR is just an extension of DX12.
Other things that really affect how quickly you can do ray tracing are a very fast BVH generation algorithm, which will be handled by the core APIs, and very fast memory. The nasty thing about ray tracing, as opposed to something like SSAO, is its random memory access. SSAO will grab a load of texel data from a local area in texture space and, because of the way those textures are stored, there is a good chance those texels will be quite close to each other (or adjacent) in memory. Also, the SSAO for the next pixel over will work with pretty much the same set of samples. So you have to load far less from memory, because you can cache an awful lot of the data.
Working on data that is in the cache speeds things up ridiculously. Unfortunately, rays don't really have that same level of coherence. They can randomly access just about any part of the entire geometry set, and the ray for each of the following pixels can be grabbing data from an equally random location. So as much as specialised hardware to accelerate ray intersection calculations is important, fast compute cores and memory that lets you get at the bounding volume data quickly are also a viable path to doing real-time RT.
In our last interview, we talked about DirectX 12 in its early days on Xbox One and PC, and even about Mantle, which has since been succeeded by Vulkan. Now the PC version of Metro Exodus supports DX12. How do low-level APIs fit into the 4A engine today? What are their advantages for the 4A engine, especially on PC?
Ben Archard: Actually, thanks to the DX12.X API, the Xbox family of consoles got a great boost, on both the GPU and the CPU. I believe it's common/public knowledge, but the GPU microcode on Xbox consumes the API as-is, such that SetPSO is just a few DWORDs in the command buffer. As for PC: you know, all the newest things and accessible features go into DX12, and DX11 is kind of forgotten. As we are frequently on the bleeding edge, we have no choice!
Since our last interview, both Microsoft and Sony have released their enthusiast consoles, which offer better GPUs and upclocks of the original CPUs, among other performance tweaks (Xbox One X and PS4 Pro). What are the differences in resolution and graphics settings for Metro Exodus compared to the respective base consoles, and does the 4A engine take advantage of some of the updated feature sets of these newer GPUs (rapid packed math, for example, on PS4 Pro)?
Ben Archard: We use everything we can find in the GPU's API. As for FP16 math, it is only used in one compute shader, I believe, and mostly for VGPR savings. We have native 4K on the high-end Xbox One X and PS4 Pro, as with other titles.
There are different quality settings for ray tracing in the final game. What do the DXR settings actually do?
Oles Shishkovstov: Ray tracing has two quality settings: high and ultra. The ultra setting traces up to one ray per pixel, with all the denoising and accumulation running on top. The high setting traces up to 0.5 rays per pixel, essentially in a checkerboard pattern, and one of the denoising passes runs as a checkerboard as well. We recommend high as the best balance between image quality and performance, but please note that we are still experimenting a lot, so this information is only valid at the time of writing.
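The 0.5 ray-per-pixel budget can be pictured as a checkerboard mask that alternates each frame. This is a guess at the pattern based on the description above, not the shipping code:

```python
def traces_ray(x, y, frame):
    """Half the pixels trace a ray each frame; the other half are filled in
    by the checkerboard denoising pass and by the next frame's rays."""
    return (x + y + frame) % 2 == 0

# Over any two consecutive frames, every pixel gets exactly one ray.
frame0 = {(x, y) for x in range(8) for y in range(8) if traces_ray(x, y, 0)}
frame1 = {(x, y) for x in range(8) for y in range(8) if traces_ray(x, y, 1)}
```

Alternating the parity per frame means temporal accumulation still sees fresh data at every pixel, just at half the rate of the ultra setting.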
At Gamescom it was mentioned that ray tracing for global illumination was done at three rays per pixel, so has there been a big change?
Oles Shishkovstov: What we showed at Gamescom was real-time ray tracing in its infancy. We were deep in a learning process with a brand-new technological innovation. Ray-traced GI is a hard problem; that's why it's generally called "the holy grail"!
The reason it's a hard problem is that a key element of any global illumination algorithm is the need to cosine-integrate values over the visible hemisphere. We are trying to generate a value for all of the light hitting a point, from every possible direction that could reach it (that is, any direction in a hemisphere surrounding that point). Think of it this way: what we are doing, conceptually, is like rendering a cubemap at every pixel and then cosine-summing it (adding up all the values of all the pixels in that cubemap, with some weighting for direction and angle of incidence). What is inside that imaginary "cubemap", we only know once the render is complete. That would be the ideal, brute-force way to do it. In fact, reflection maps work the same way, except that we pre-generate the cubemap offline, share it between millions of pixels, and the integration part is done when we generate the LODs. We want a similar effect to what they were designed for, but at a much more precise, per-pixel level.
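The "cubemap summed with cosine weighting" picture is exactly the irradiance integral, and the brute-force Monte Carlo version of it is short. A constant-radiance sky is assumed here purely so the exact answer (pi) is known:

```python
import math, random

def uniform_hemisphere(rng):
    """Uniform random direction on the hemisphere above the z axis."""
    z = rng.random()                       # cos(theta), uniform in [0, 1)
    phi = 2.0 * math.pi * rng.random()
    r = math.sqrt(max(0.0, 1.0 - z * z))
    return (r * math.cos(phi), r * math.sin(phi), z)

def irradiance(radiance, num_rays, rng):
    """Cosine-weighted Monte Carlo integral of incoming light over the
    hemisphere. Uniform sampling has pdf 1/(2*pi), hence the 2*pi factor."""
    total = 0.0
    for _ in range(num_rays):
        d = uniform_hemisphere(rng)
        total += radiance(d) * d[2]        # cos(theta) is just d.z here
    return total * 2.0 * math.pi / num_rays

# A uniform white 'cubemap': the exact irradiance is pi.
estimate = irradiance(lambda d: 1.0, 100_000, random.Random(5))
```

With 100,000 rays the estimate converges tightly around pi; the whole difficulty of real-time RT GI is that the budget is one ray (or half a ray) per pixel, not 100,000.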
Unfortunately, even a low-resolution cubemap would have thousands of pixels to sum, and we have one ray (one sample) per pixel to work with. To continue the analogy, imagine summing the values of a cubemap whose pixels are mostly black (where we had no information) with one bright pixel. That approach falls apart at this point, so we have to find other solutions. The saving grace of GI is that you are more interested in the low-frequency data than the high-frequency data (as you would be for reflections). This is where the stochastic approach saves us. We store our one ray's value and treat that sample as representative of many samples. We weight its importance based on how representative we think it will turn out to be. We then run a denoising pass (two, in fact) over this raw ray data, where we use the importance data, history data and surrounding pixel data to fill in the blanks. That is just to prepare the ray data for light accumulation. We also do a final (third) denoise at the end of the frame, together with TAA, to clean up the final image.
So, for Gamescom, we had three rays. After Gamescom we rebuilt everything, focusing on high-quality denoising and temporal accumulation of ray data across multiple frames. We have a TAA specifically designed for "denoising" at the end of the pipeline, because stochastic techniques are noisy by nature.
What optimisations were implemented for ray tracing? Battlefield 5's ray-traced reflections use a number of tricks, such as combining raymarching with ray tracing, as well as a variable-rate ray tracing system to limit and maximise rays where objects are most reflective, while maintaining an upper bound on rays cast. Are there similar optimisations for ray-traced GI in Metro Exodus? Or is exploiting screen-space information, or limiting rays cast based on some metric, just not as feasible for something as total and all-encompassing as global illumination?
Oles Shishkovstov: Real-time ray tracing is an exciting new frontier. We are pioneering ray-traced GI in games, so we are learning as we go and finding better ways to apply the technology. As you say, these aren't reflections, this is GI, and in our case the "rough" pixels are just as important (if not more so) than the "smooth" ones. So we can't really limit the number of rays, or make that number "adaptive", because we always need a minimum to have something to work with for every pixel. With one sample you can assign an importance value and start estimating how much light is available. If you sample nothing, you have no chance. We can be (and are) adaptive at the denoising level, though.
As for screen space: sure, we do a cheap "pre-trace" running asynchronously alongside the BLAS/TLAS (BVH) update, and if the intersection can be found from the current depth buffer, we use it without spawning an actual ray. We also raymarch our terrain (which is essentially a heightmap) inside the ray-generation shaders; that turns out to be almost free, thanks to the way latency hiding works on GPUs.
Another issue for us: our rays are incoherent by the definition of the problem, which doesn't help performance. We mitigate that somewhat by tiling a very small precomputed blue-noise texture across the screen (changed every frame), which is used as the random seed for the cosine-weighted distribution, so even though the rays are incoherent for nearby pixels, as they should be, they are somewhat coherent across the larger window. That speeds up ray tracing itself by around 10 per cent. Not a big deal, but still something.
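The blue-noise texel serves as the 2D seed for a cosine-weighted ray direction. A standard mapping for that is the Malley-style construction shown below; it is assumed here for illustration rather than taken from 4A's shaders:

```python
import math, random

def cosine_weighted_direction(u1, u2):
    """Map a 2D seed in [0,1)^2 (e.g. a blue-noise texel) to a direction on
    the hemisphere with probability proportional to cos(theta): sample a
    disk uniformly, then project up onto the hemisphere (Malley's method)."""
    r = math.sqrt(u1)
    phi = 2.0 * math.pi * u2
    return (r * math.cos(phi), r * math.sin(phi), math.sqrt(max(0.0, 1.0 - u1)))

rng = random.Random(9)
dirs = [cosine_weighted_direction(rng.random(), rng.random()) for _ in range(50_000)]
mean_cos = sum(d[2] for d in dirs) / len(dirs)  # should approach E[cos(theta)] = 2/3
```

Because the pdf already contains the cosine term, the cos(theta) weight cancels out of the estimator, so each ray contributes its radiance directly; the seed texture only controls how well-distributed (and how screen-coherent) the directions are.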
Going by Remedy's presentation on its ray tracing in Northlight, and given that Battlefield 5 sends at most 40 per cent of screen resolution in rays at a 1:1 ratio for its reflections, it would seem that the most costly part of ray tracing on the GPU is not the ray/triangle intersection work handled mostly in the RT core, but rather the associated shading. How does that performance balance (ray intersection, shading, denoising, etc.) look in Metro Exodus, and which part of RT is heaviest on the GPU?
Oles Shishkovstov: Our ray tracing shaders (aside from the terrain raymarching) only search for the closest hit, then store it in a UAV; there is no shading inside. That way we do a "deferred shading" of the rays, or more precisely of the hit positions. It happens to be the right balance between shading and ray tracing work for current hardware. The "deferred shading" itself is cheap and not worth mentioning. What is genuinely expensive is the denoising. The fewer rays we send per pixel, the more expensive the denoising becomes, as it scales essentially quadratically. A lot of work, ideas and tricks went into making it run in real time. It was an effort involving several people, and even several companies, with Nvidia's cooperation.
At its core, it's a stochastic two-pass denoiser with recurrent accumulation. It's highly adaptive to variance, visibility, hit distances and so on. Again, it doesn't produce a "clean" image in every case, but its output noise level is low enough to be "eaten" by the denoising TAA at the end of the pipe. As for the performance split: ray tracing and denoising cost about the same in most scenes. What other people rarely talk about is another performance-critical item: the BVH (BLAS) updates needed for vertex-animated things, plus the BVH (TLAS) rebuilds needed to keep the instance tree compact and tight. We throttle it as much as we can. Without all of that, it would cost about as much as a 0.5 RPP trace, if not more.
What were the challenges of optimising RT, and what future optimisation strategies would you like to investigate?
Oles Shishkovstov: This isn't specific to ray tracing, it's more of a common computing problem: profiling tools are the biggest issue. To optimise something, we should first find the bottleneck. Thank goodness (and the hardware vendors), the tools are slowly improving. In general, real-time ray tracing is new, and we need a lot more research across the industry. We will share our knowledge and findings at GDC 2019, and I think others will share theirs; the graphics research community loves to share!
A general follow-up question: are there any parts of the RT implementation you are proud of, or excited about? We would love to hear.
Oles Shishkovstov: Ray-traced lighting turned out really nice in the game. It feels very immersive for players. Also, the way we store, accumulate and filter irradiance, and the space in which we do it, is directional. Not only does that give us sharp response to normal map detail, it improves contact detail and indirect shadows. Best of all, it allows us to reconstruct a fairly decent approximation of indirect specular.
4A Games, many thanks for your time.