Saturday, July 26, 2014

Depth buffer emulation

In the "Frame buffer emulation. Intro" article I described what depth buffer emulation is and why it is necessary. The main goal is to have the depth buffer in RDRAM (N64 memory) filled with correct values. There are two approaches:

  • render the depth buffer directly in software;
  • prepare the necessary data on the video card side and copy it from video memory to RDRAM.
Both approaches have pros and cons. The software rendering approach was successfully implemented in Glide64. It was the only option for the target hardware of that plugin, and that is its merit: software rendering works with any video hardware. The obvious drawback is higher CPU load, but for any CPU since the Pentium 3 it's not a problem. There are less obvious shortcomings, which need to be explained:
  • Low-res result. The depth buffer has the same size as the main color buffer, which is usually 320x240. It’s perfectly enough to emulate flame coronas, but when the depth buffer is used as a fog texture, the result is bad. Run Beetle Adventure Racing with Glide64 to see what I’m talking about.
  • Incorrect result. Some pixels of a polygon can be discarded by alpha compare and thus leave the color and depth buffers unchanged. The software rendering implemented in Glide64 renders only the polygon’s depth without any regard to its alpha. Thus, pixels which would normally be discarded by alpha compare are stored in the depth buffer. Again, it’s not a problem for corona emulation, but it is very noticeable with depth buffer based fog. Run Beetle Adventure Racing with Glide64: fog around trees is missing, because transparent parts of the tree polygons were not discarded during software rendering. A software renderer has to do color rendering too to get alpha compare information. That is, fully functional software rendering is necessary to get a correct depth buffer, and N64 software plugins/emulators don’t run at full speed even on many modern CPUs.
  • N64 depth compare is a more complex process than the one usually used on PC. It uses not only “less” and “less-or-equal” comparisons, but also more complex ones. The N64 uses not only the pixel’s ‘Z’ value but also ‘Delta Z’: DeltaZpix = |dZdx| + |dZdy|. That is, ‘Delta Z’ shows how the polygon’s depth changes along the X and Y directions. An N64 depth compare mode may compare the incoming Z with the interval [stored Z - DeltaZ, stored Z + DeltaZ]. A software depth buffer obviously can’t help to emulate such a mode, because the depth compare is done on the video card side. Standard OpenGL depth compare functions do not work that way either.
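To make the Delta Z definition concrete, here is a small Python sketch (not plugin code) that estimates dZdx and dZdy the way a GPU typically does inside a 2x2 pixel quad, by differencing neighbouring pixels, and then sums their absolute values. The quad layout, the function names and the example plane are assumptions for illustration only.

```python
def delta_z_quad(z00, z10, z01, z11):
    """Approximate DeltaZpix = |dZdx| + |dZdy| for a 2x2 pixel quad.

    zXY is the interpolated depth at pixel (x, y) inside the quad.
    Differences are taken along the top row and the left column, similar to
    the coarse derivatives GPUs commonly use for dFdx/dFdy (an assumption;
    implementations may differ). z11 is unused by this coarse variant.
    """
    dzdx = z10 - z00  # change of Z one pixel to the right
    dzdy = z01 - z00  # change of Z one pixel down
    return abs(dzdx) + abs(dzdy)


def plane_z(x, y, z0=3.0, a=0.5, b=-0.25):
    """Depth of a hypothetical planar polygon: z = z0 + a*x + b*y."""
    return z0 + a * x + b * y


# For a plane the per-pixel differences are constant, so DeltaZpix = |a| + |b|.
quad = [plane_z(x, y) for (x, y) in [(0, 0), (1, 0), (0, 1), (1, 1)]]
print(delta_z_quad(*quad))  # 0.75
```

A large Delta Z means a steep polygon, so its depth is known less precisely at any given pixel; the N64 compare modes described later widen the comparison interval accordingly.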
    Taking into account all of the above, I decided to choose another way and emulate the N64 depth buffer on the video card side. The main idea is to use the OpenGL Shading Language to emulate the N64 depth compare process. The final result should be a texture with depth information, which can be copied into RDRAM. The obvious weak point is the need to copy from video memory into PC memory. However, for modern cards and modern GL, copying a 320x240 array of data is painless. My implementation of copying the color buffer from video memory to RDRAM proved it.

    So, I began to implement the depth compare shader. Z values of N64 primitives are floats, but the depth buffer format is 16-bit unsigned integer. The N64 uses a rather complex non-linear conversion of float Z into the integer depth buffer value. Video plugins usually use a pre-calculated lookup table for fast Z-to-depth-buffer conversion. I decided to continue that tradition and put the lookup table into a 512x512 texture with one 16-bit component. The next question was: how to make the resulting depth buffer texture? The depth buffer is a read-write object: the current value of the buffer is compared with the incoming Z, and if the new value passes the depth test, it replaces the old one. The first obvious solution was to use a Frame Buffer Object (FBO) with a texture as the render target. The idea was: create a texture with one 16-bit integer component, attach it to the FBO as the color render target and pass it as a texture input into the depth shader. I quickly wrote a shader which reads a value from the texture, compares it with the pixel’s Z and writes the result. The code was pretty simple, with no room for mistakes. Then I spent many days fighting the weirdest glitches. The result was absolutely unpredictable. I decided that I had run into the famous “ATI sucks” situation and that the AMD GL drivers were mocking me. I found a PC with an NVidia card to test my work. This time I got just a black screen. OK, if nothing helps, read the documentation. A quick dig into the docs revealed the sad truth: I had lost my time on a wrong idea. Sampling and Rendering to the Same Texture is prohibited; the documentation says that the result is undefined in that case. AMD and NVidia gave me different results, but both results were bad. So, I had to find another way.
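The post does not spell out the conversion itself. As a hedged sketch, the code below builds a table using the 18-bit Z to 16-bit depth compression found in N64 software renderers as I understand them (an assumption): a 3-bit exponent that counts leading one bits, an 11-bit mantissa, and the two lowest bits left free for Delta Z. Note that 2^18 entries is exactly 512 x 512, which matches the texture size mentioned above; how float Z is scaled into the 18-bit index is not shown, and the row-major texel layout is hypothetical.

```python
TABLE_BITS = 18   # assumed width of the fixed-point Z index
TEX_SIZE = 512    # 512 * 512 == 2**18 texels
# Bits dropped below the mantissa for each exponent value 0..7.
MANTISSA_SHIFT = [6, 5, 4, 3, 2, 1, 0, 0]


def compress_z(z):
    """Map an 18-bit Z to a 16-bit depth word (assumed N64-style format).

    The exponent counts leading 1 bits (at most 7); the next 11 significant
    bits form the mantissa. The two lowest bits stay free for Delta Z.
    """
    exponent = 0
    while exponent < 7 and z & (1 << (17 - exponent)):
        exponent += 1
    mantissa = (z >> MANTISSA_SHIFT[exponent]) & 0x7FF
    return ((exponent << 11) | mantissa) << 2


def build_table():
    """Pre-calculate the lookup table for every possible 18-bit Z."""
    return [compress_z(z) for z in range(1 << TABLE_BITS)]


def texel_of(z):
    """Row-major (x, y) position of Z's entry in a 512x512 lookup texture."""
    return z % TEX_SIZE, z // TEX_SIZE


table = build_table()
print(len(table), hex(compress_z(0x3FFFF)))  # 262144 0xfffc
```

The conversion is non-linear but monotonic, so comparing converted values preserves the depth order; it simply spends more precision on Z values with fewer leading ones.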

    Fortunately, one of the readers of this blog, neobrain, gave me a working piece of advice. The idea is to use the ‘Image variables’ mechanism. An image variable is bound to a texture in video memory and gives the pixel shader read-write access to that texture. Exactly what I needed. This time I read the documentation first, so as not to step on the same rake. The documentation contains several vague points, and I had a feeling that this way would not be easy either.

    I rewrote my code to use image variables. Task number one was to get a depth texture suitable for copying to RDRAM. As you remember, the N64 depth image format is 16-bit unsigned integer. GL supports textures with 16-bit unsigned integer components, so the natural solution was to use such a texture for the depth buffer data. The shader takes the pixel’s Z, converts it to N64 format using the lookup table texture, compares that value with the one from the buffer texture and writes the result if the new value is less than the current one. I implemented copying of that texture into RDRAM and ran the Zeldas to check how it works. The result was negative: no coronas in either Zelda game. I wrote a shader program for depth buffer based fog in Beetle Adventure Racing in the hope of visualizing my texture and seeing what was wrong. I got no fog. I wrote a test which just shows my depth texture on screen. The picture correlated well with the rendered scene. That is, my depth shader worked, but the result was not close enough to what it should be. I decided to compare my result with the reference: the depth buffer rendered by Glide64. I dumped depth buffers in both plugins. Values in my hardware-rendered texture were ~10% less than in the software-rendered reference. I checked my code once again and did not find any mistake. As a last resort I decided to change the texture format from 16-bit int to 32-bit float. Bingo! Coronas work, and fog works too. The code stayed the same; only the texture format changed. I still don’t understand why the integer format does not work: the depth shader takes the value for the depth texture from the lookup table in both cases, and there is no place where that value could lose precision. Enigma.
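The per-pixel logic described above (convert, load, compare, store) can be sketched on the CPU. This is a hedged Python emulation, not the GLSL shader: a 2D list stands in for the image-bound depth texture, fragments are processed strictly one at a time, and the tiny dictionary used as the lookup table is hypothetical.

```python
def depth_pass(depth, fragments, to_depth):
    """Emulate the shader's load-compare-store, one fragment at a time.

    depth     -- 2D list standing in for the depth buffer texture
    fragments -- iterable of (x, y, z), z being the fragment's Z index
    to_depth  -- Z-to-depth-buffer conversion (the lookup table)
    Returns the number of fragments that passed the test.
    """
    passed = 0
    for x, y, z in fragments:
        new_value = to_depth(z)        # lookup-table conversion
        if new_value < depth[y][x]:    # "less" test against the stored value
            depth[y][x] = new_value    # store only when the test passes
            passed += 1
    return passed


WIDTH, HEIGHT = 4, 2
depth = [[0xFFFF] * WIDTH for _ in range(HEIGHT)]  # cleared to "far"
table = {10: 40, 20: 80}  # hypothetical stand-in for the real lookup table
n = depth_pass(depth, [(1, 0, 20), (1, 0, 10), (1, 0, 20)], table.__getitem__)
print(n, depth[0][1])  # 2 40
```

Because this emulation is strictly sequential, it hides any ordering issues that concurrent fragment shader invocations may have on real hardware.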

    So, I got a correct hardware-rendered N64 depth buffer. That made emulation of depth buffer based effects possible. The next step is to emulate the N64 depth compare on the video card side. Why? First, it is an interesting task by itself. Second, it should fix various depth problems caused by incomplete depth compare emulation, namely:
    • Decal Surfaces. The N64 has a “special mode to allow the rendering of 'decal' polygons (usually with a texture on them, like a flag or logo) over a previously rendered opaque surface. Unlike normal rendering, here we only want to render the decal if it is coplanar with the existing surface.” It is emulated pretty well with glPolygonOffset, but visual glitches sometimes appear because the plugin uses static parameters for that function. The N64 uses the already mentioned DeltaZpix to test whether two surfaces are coplanar.
    • Nearer vs InFront Compare. The N64 has several depth compare modes. Some of them have a direct analog in OpenGL, some do not. Graphics plugins usually use one depth compare function: either “less” or “less-or-equal”. The N64 depth compare mode “InFront” is the same as OpenGL “less”: InFront = PixZ < MemZ. The “Nearer” mode uses DeltaZ: Nearer = (PixZ - DeltaZmax) <= MemZ, where DeltaZmax = MAX(DeltaZpix, DeltaZmem). Sometimes it is well approximated by “less-or-equal”, sometimes it is not.
    • Farther and Nearer. The “Farther” mode is similar to Nearer: Farther = (PixZ + DeltaZmax) >= MemZ. The N64 can use these modes together. In that case the pixel’s Z must lie in the interval [MemZ - DeltaZmax, MemZ + DeltaZmax] to pass the test. That mode has no analog in OpenGL. I know only one game which uses it: 'Extreme-G'. Currently it is emulated only by software graphics plugins.
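For reference, the compare equations above translate directly into code. This is a minimal Python sketch of what the depth shader has to evaluate, not the plugin's GLSL; the enum and function names are mine.

```python
from enum import Enum


class ZMode(Enum):
    IN_FRONT = "in_front"                      # PixZ < MemZ, same as GL "less"
    NEARER = "nearer"                          # (PixZ - DeltaZmax) <= MemZ
    FARTHER = "farther"                        # (PixZ + DeltaZmax) >= MemZ
    NEARER_AND_FARTHER = "nearer_and_farther"  # PixZ in [MemZ - D, MemZ + D]


def depth_test(mode, pix_z, mem_z, dz_pix, dz_mem):
    """Evaluate the N64 depth compare equations for one pixel."""
    dz_max = max(dz_pix, dz_mem)               # DeltaZmax
    nearer = (pix_z - dz_max) <= mem_z
    farther = (pix_z + dz_max) >= mem_z
    if mode is ZMode.IN_FRONT:
        return pix_z < mem_z
    if mode is ZMode.NEARER:
        return nearer
    if mode is ZMode.FARTHER:
        return farther
    return nearer and farther                  # both modes enabled together


# A pixel slightly behind the stored depth but within DeltaZmax of it.
for mode in ZMode:
    print(mode.name, depth_test(mode, 100.5, 100.0, 1.0, 0.25))
```

With DeltaZmax = 0, Nearer reduces to PixZ <= MemZ, i.e. OpenGL “less-or-equal”; this is why the approximation works for flat polygons and breaks down for steep ones.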
    So, the depth compare shader must calculate not only the N64 Z: DeltaZ must also be calculated per pixel and stored in the depth buffer. DeltaZ calculation in shaders is simple thanks to the dFdx and dFdy functions. The code to load and store values in the depth texture was already written. The depth compare equations are very simple. So, the shader program for N64 depth compare turned out short and simple.
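The point of storing DeltaZ next to Z is that later fragments need DeltaZmem to compute DeltaZmax. A hedged Python sketch of one fragment's pass follows, assuming the Nearer mode and a texel record holding both values; the record type and function name are hypothetical, and a 2D list stands in for the image variable.

```python
from dataclasses import dataclass


@dataclass
class DepthTexel:
    z: float   # stored Z (MemZ)
    dz: float  # stored Delta Z (DeltaZmem)


def shade_depth(texels, x, y, pix_z, dz_pix):
    """Load, compare (Nearer mode) and store one fragment.

    Returns True if the fragment passes; False means the main pixel
    shader should discard it. On success both Z and Delta Z are written.
    """
    texel = texels[y][x]                           # load
    dz_max = max(dz_pix, texel.dz)                 # DeltaZmax
    if (pix_z - dz_max) <= texel.z:                # Nearer test
        texels[y][x] = DepthTexel(pix_z, dz_pix)   # store Z and Delta Z
        return True
    return False


buf = [[DepthTexel(float("inf"), 0.0)]]      # one cleared texel
print(shade_depth(buf, 0, 0, 10.0, 0.5))     # True: surface drawn
print(shade_depth(buf, 0, 0, 10.3, 0.5))     # True: coplanar decal passes
print(shade_depth(buf, 0, 0, 12.0, 0.5))     # False: clearly behind
```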

    I disabled the OpenGL depth compare to test my new depth compare shader. The main pixel shader discards a pixel if its Z does not pass the depth shader test. The first result was very disappointing. Many pixels which should have been discarded by the depth shader poked through covering polygons. I guess that the precision loss during conversion of float Z to the integer depth buffer value makes the depth buffer too coarse for use at higher resolutions. I decided to store the original pixel Z beside the N64 Z in my depth texture and use the original Z for the depth test. Since the pixel’s Z is float and the depth texture components are floats, there should not be any precision loss, so I expected that this time my texture based depth buffer would work as well as the standard OpenGL one. Alas, it does not. The result is much better than with a depth compare based on the original N64 depth buffer values, but still not perfect. Again, I don’t understand why. Probably I did not take into account some details related to synchronization of load-store operations on image variables. Nevertheless, some good results have been achieved.
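One candidate for that synchronization problem, raised by neobrain in the comments, is that image load/store gives no ordering guarantee between fragments that hit the same pixel. The toy simulation below (plain Python with a deliberately chosen interleaving, not a model of any particular GPU) shows how an unsynchronized read-compare-write can then keep the farther fragment.

```python
def run(schedule, fragments, stored):
    """Execute load/store steps of several fragments in the given order.

    schedule  -- list of (fragment_index, step), step being "load" or "store"
    fragments -- Z values competing for one texel
    stored    -- initial texel value
    Each fragment compares against the value *it* loaded, as an
    unsynchronized shader invocation would.
    """
    texel = stored
    loaded = {}
    for i, step in schedule:
        if step == "load":
            loaded[i] = texel
        elif step == "store" and fragments[i] < loaded[i]:
            texel = fragments[i]  # "less" test made on possibly stale data
    return texel


zs = [0.2, 0.7]  # fragment 0 is nearer than fragment 1
in_order = [(0, "load"), (0, "store"), (1, "load"), (1, "store")]
overlapped = [(0, "load"), (1, "load"), (0, "store"), (1, "store")]
print(run(in_order, zs, 1.0))    # 0.2: the nearest fragment wins
print(run(overlapped, zs, 1.0))  # 0.7: the farther fragment overwrote it
```

If this is the cause, the glitches would depend on how the hardware schedules fragments, which would fit the imperfect results with a float depth texture.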

    Summary:
    • N64 depth buffer emulation implemented on video card side
    • Coronas work perfectly
    • Depth buffer based fog works very well, but glitches sometimes appear
    • N64 depth compare works, but additional polishing is required.

    Now it’s time for some illustrations.

    Coronas emulation

    A short video, which shows why depth buffer rendering is important. Notice when the coronas appear and disappear:

    Software vs. hardware depth buffer rendering.



    Software rendering. The black outline on the building’s roof is caused by the low resolution of the fog texture.
    The silhouettes of the trees on the left are obviously wrong.



    Hardware rendering. Perfect.



    Software rendering. Some black pixels again. Fog around the sign is missing because the alpha test is ignored.


    Hardware rendering. Fog around the sign is correct, but it is wrong on some polygons on the left. I have not yet found out why that happens.


    This video allows you to compare both methods side by side:
    Note: the GLideN64 video is missing some frames at the start. This problem is caused by video capturing; the gameplay itself is smooth.

    Correct emulation of N64 depth compare modes

    With a constant depth compare function, shadows are drawn above the characters.


    With the calculated depth compare function, shadows are correct.


    Also a short video, which illustrates that with shaders even the hardest depth compare modes can be emulated:




    4 comments:

    1. Nice to see my original suggestion being implemented after all ;)

      However, I have to point out a potentially major flaw: Writes from a pixel shader to the same texel in a texture via image_load_store do not happen in a guaranteed order. Without further consideration this makes it completely useless for depth buffer emulation (this is obvious if you use it to implement depth tests, but even simple read operations might lead to dramatic bugs). One way to fix this is to rely on the OpenGL extension https://www.opengl.org/registry/specs/INTEL/fragment_shader_ordering.txt , which AFAIK is supported by both Intel and AMD (the latter only in recent-ish driver releases).

      The "clean" approach to implement this properly is quite the crazy idea, which would require implementing pixel shaders via compute shaders and is explained here: http://blog.icare3d.org/2010/07/opengl-40-abuffer-v20-linked-lists-of.html
      Basically, the idea is that whenever you need to perform target texture writes to a single pixel in a primitive-ordered way, a first shader outputs a linked list of outputs for each primitive. A second shader takes this list as input, sorts it by primitive ID and then performs the required writes to the target texture. However, I'm not really sure either whether this method is useful when reads are used in combination with writes, so things are getting more complicated here :)

      In any case, I just wanted to make you aware that the approach might not actually work all that well in some cases or on some hardware.

      1. "Writes from a pixel shader to the same texel in a texture via image_load_store do not happen in a guaranteed order." - could you give me a link to an article which explains it? I saw several examples on StackOverflow with reads/writes to the same texel via image_load_store. Plus, proper MemoryBarrier usage may help to ensure the read/write order.

    2. Another thing I want to comment on is your "criticism" to software renderers:
      - "Low-res result": It's the same result as on the hardware. While that might not be pretty, saying "the result is bad" gives away the impression it in some way is an approach unsuitable to emulation. Also, there's no real reason why a software renderer wouldn't support rendering at a higher than native resolution (other than performance)
      - "Incorrect result": You seem to be taking the case for software rendering as implemented in Glide64 and making it out to be a general disadvantage of software renderers (at least it reads like that), even though it's clear that it's possible to implement this perfectly accurately with software rendering.
      - "N64 depth compare is more complex process than the one which is usually used on PC: [...] Software depth buffer obviously can’t help to emulate such a mode, because depth compare is done on video card side. ": This one couldn't be more explicit and is just completely false. Whatever can be implemented on a GPU can also be implemented in Software, and in particular any block-based software renderer has no problems calculating the "Delta Z" value you've been talking about. I don't know much about Glide64 - maybe it does some weird mixmatch between software rendering and depth compare on the GPU, but that's again a restriction of that particular plugin and not of software rendering by itself.

      Maybe you can clarify some of these points, given that they give away a very wrong picture of software renderers and in a few cases even spreads false information?

      1. > -"Low-res result": It's the same result as on the hardware. While that might not be pretty, saying "the result is bad" gives away the impression it in some way is an approach unsuitable to emulation. Also, there's no real reason why a software renderer wouldn't support rendering at a higher than native resolution (other than performance)

        I didn’t say that a low-res result is unsuitable for emulation. Quote: “It’s perfectly enough to emulate flame coronas”. I said that “when (low-res) depth buffer is used as fog texture the result is bad”.
        Of course, a software renderer can use high resolution at the price of performance.

        > - "Incorrect result": You seem to be taking the case for software rendering as implemented in Glide64 and making it out to be a general disadvantage of software renderers (at least it reads like that), even though it's clear that it's possible to implement this perfectly accurately with software rendering.

        I tried to explain that a mere depth buffer render (as it is implemented in Glide64) is not enough to get a correct depth buffer. Quote: “Software render have to do color rendering too to get alpha compare information. That is fully functional software rendering is necessary to get correct depth buffer.” And that is very costly even at native resolution.


        >- "N64 depth compare is more complex process than the one which is usually used on PC: [...] Software depth buffer obviously can’t help to emulate such a mode, because depth compare is done on video card side. ": This one couldn't be more explicit and is just completely false. Whatever can be implemented on a GPU can also be implemented in Software, and in particular any block-based software renderer has no problems calculating the "Delta Z" value you've been talking about. I don't know much about Glide64 - maybe it does some weird mixmatch between software rendering and depth compare on the GPU, but that's again a restriction of that particular plugin and not of software rendering by itself.

        I don’t get your point here. Yes, a software renderer can do "Delta Z" calculations; angrylion's software plugin does it perfectly. But how can that software depth buffer be used by the GPU to do depth buffer tests?
