20170408 - Advanced Desktop Display

A blog post on being the best you can be, when targeting a local desktop display.

240 Hz Examples
Using Vulkan API terms throughout. Working with a 240 Hz frame rate in some examples in this post to ground the examples in the extreme case of real e-sports displays. Note this post applies to less extreme frame rates as well.

ASCII diagrams going with 6 characters per millisecond (each character is 1/6 of a millisecond or roughly 0.1667 ms),

->(____)<--- 1 ms

Single 240 Hz frame is 25 characters,

->(________________________)<--- 4 1/6 ms

Limit of GPU Workload
Vulkan provides tools like events to enable maintaining pipelined execution on the GPU in a given queue. However the OS-level command buffer execution is serialized by the OS driver model for a given queue. Execution will begin with a fill period before hitting steady-state execution, and end in a drain period.

 ^                        ^
 |                        |
fill                    drain

The fill and drain periods represent some amount of lost GPU capacity. Expect any semaphore-level synchronization to have this lost capacity. Also expect any {API interop or process-level sharing of the graphics queue} to have this lost capacity, along with possibly (likely) losing compression on shared images. Keep everything in-app and in-same-API for optimal performance.

Within a given submit, there is a possibility to have loss of GPU capacity due to API-level command buffer boundaries. For example does the driver bundle up all command buffers in the submit to one OS-level command buffer? Or does the hardware have some front-end-level cost to chaining the command buffers? Etc.

  __  ____  _______    _ 

While newer APIs support multi-threaded command buffer recording to enable the application to maximize specialization at the draw-level or dispatch-level, I personally avoid all that and instead stick to a single command buffer per frame, with command buffer replay (no per-frame recording), and only data-level specialization to minimize loss of GPU capacity. When designing for 4.1667 ms/frame, each 0.1 ms lost is 2.4% waste.
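That waste arithmetic is easy to sanity check; a minimal C sketch (function names are mine, not API names):

```c
#include <assert.h>
#include <math.h>

// Frame period in milliseconds for a given refresh rate in Hz.
static double frame_period_ms(double hz) {
    return 1000.0 / hz;
}

// Fraction of the frame budget wasted by a fixed per-frame loss.
static double waste_fraction(double loss_ms, double hz) {
    return loss_ms / frame_period_ms(hz);
}
```

At 240 Hz the period is 1000/240 = 4.1667 ms, so a fixed 0.1 ms loss is 0.1/4.1667 = 2.4% of the budget.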

Getting Rid of The Compositor Tax
Next step to optimal performance is ensuring any external-process compositor (like the Desktop Window Manager [DWM]) is bypassed, and getting to a direct in-app and in-same-API flip of the swap-chain.

 _____________________  _
                       | | 
                     ->| |<--- external-process compositor perf tax (cost varies)

For multi-display situations, where the game outputs to only one display, it is best for the user to turn off or disconnect the other displays. A second option is for the app to grab ownership of all the displays when fullscreen (with user override to disable), so that no compositor runs on them. Multi-display external-process compositors with a single GPU driving all displays can be exceptionally bad in cases of mixed frame rates: the extra-display compositor runs effectively at a random time, possibly disrupting only a subset of frames, and at variable timing with respect to the game's v-sync point.

Avoid the 64-bit/Pixel Swap-Chain Tax Man
This example is working from an RX480 hitting a 4K display at 60 Hz.

3840x2160 * 4-bytes/pixel = under 32 MiB/frame, estimate 64 MiB for last write + scan-out
256 GB/s * 0.786 efficiency = 192 MiB/ms
64 MiB / 192 MiB/ms = 1/3rd of a millisecond (2% of a 60 Hz frame)
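The arithmetic above can be reproduced with a few lines of C (the 0.786 efficiency and the 2x factor for last write + scan-out are this post's own estimates; the function name is mine):

```c
#include <assert.h>
#include <math.h>

// Estimated milliseconds of memory bandwidth consumed per frame by
// swap-chain traffic (last write + scan-out) at a given bytes/pixel.
static double scanout_cost_ms(double width, double height,
                              double bytes_per_pixel,
                              double bandwidth_gbps, double efficiency) {
    double frame_bytes = width * height * bytes_per_pixel * 2.0; // write + scan-out
    double bytes_per_ms = bandwidth_gbps * 1.0e9 * efficiency / 1000.0;
    return frame_bytes / bytes_per_ms;
}
```

scanout_cost_ms(3840, 2160, 4.0, 256.0, 0.786) lands on roughly 0.33 ms, about 2% of a 16.67 ms frame at 60 Hz; doubling to 8 bytes/pixel adds another ~0.33 ms, the roughly 2% tax.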

Can estimate getting bumped to 64-bit/pixel on exclusive full-screen is roughly a 2% tax. For high contrast displays it is better to run a 10:10:10:2 32-bit format for scanout with temporal dither than to move to an RGBA16F format. By the same estimate, in the case of windowed full-screen the 64-bit/pixel format is at least a 6% tax compared to full-screen exclusive flip 32-bit/pixel.
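For completeness, a sketch of what temporal dither means here (the 4-frame offset sequence is hypothetical, not from this post): add a sub-quantization-step offset that varies per frame before truncating to 10 bits, so the time-average of the output approaches the full-precision value.

```c
#include <assert.h>
#include <stdint.h>

// Quantize a [0,1] channel value to 10 bits, with a per-frame dither
// offset in [0,1) of one quantization step applied before truncation.
static uint32_t quantize10(double value, double dither) {
    double q = value * 1023.0 + dither;
    if (q < 0.0) q = 0.0;
    if (q > 1023.0) q = 1023.0;
    return (uint32_t)q; // truncate
}
```

Cycling dither through {0, 0.25, 0.5, 0.75} over 4 frames, a value sitting at 100.4/1023 quantizes to 100, 100, 100, 101, averaging 100.25 instead of a constant 100. In practice the offset would also vary per pixel (e.g. screen-space noise); the fixed sequence here just shows the averaging effect.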

Animation and Motion Quality
Hitting v-sync via FIFO or FIFO_RELAXED is the ground truth in animation and motion quality. The timing of display output directly matches the rendered timeline, offset by some amount of latency.

__________|_________________________|_________________________|_________________________| <--- v-sync timeline
|_________________________|_________________________|_________________________|__________ <--- render timeline

Variable refresh rate displays attempt to solve the problem of engines with highly variable GPU execution time for rendering a given frame, by supporting IMMEDIATE presentation without tearing or waiting. However due to the variability inherent in the pipeline, it is not possible to have jitter-free animation. Likewise these variable refresh rate scan-and-hold LCDs will have a variable amount of visible blur in motion due to jitter.

                                                                     ->|      |<--- variable amount of jitter
                                                                       |      |
  |<-------------- frame rendered for this point in time ------------->|      |
  |                                                                           |
->|                   |<--- variable CPU execution time                       |
  |___________________|                                                       |
  (________CPU________)                             ->| |<--- variable start scanout delay 
                      |   ____________________________| |                     |
                      |  /_____________GFX____________\ |                     |
                      |  |                            | |_____________________|
                      |  |<--- variable GPU time ---->| (_______SCANOUT_______)
                      |  |                              |                     |
                      |  |                            ->|                     |<--- fixed scanout time
                    ->|  |<--- variable submit delay

Variable refresh rate displays with IMMEDIATE presentation are expected to have less jitter than fixed refresh displays with MAILBOX presentation. However only fixed refresh with FIFO can ever have perfect animation quality.

Input Latency
Using FIFO for the possibility of jitter-free animation, first showing how not to do it.

->|    |<--- input jitter (worst case with 1000 Hz mouse)
  |    |
  |    |               ->|       |<--- submit early before graphics ready
  |    |_________________|       |
  |    (_______CPU_______).......|                    ->|  |<--- graphics ends early (fps variance tolerance) 
  |    |                         |______________________|  |
  |    |                         /__________GFX_________\..| 
  |    |                                                   |                          ________________________
  |    |<--- view figured here (written to constants)      |.........................(________SCANOUT_________)
  |                                                        |                        |                         |
  |              running 3 deep FIFO instead of 2 deep --->|                        |<-                       |
  |                                                                                                           |
  |<----------------------- total input latency (excluding latency added by display) ------------------------>|

And now the best case without breaking current API rules,

                    |<---- better total input latency ---->| 
                    |                                      |
                  ->|    |<--- input jitter                |
                         |                                 |
       __________________|_____                            |
      /__________GFX_____|_____\                           |
                         |     |   ________________________|
                 late-latch    |  (________SCANOUT_________)
                         |     |  |                        
view-dependent render -->|     |< |
                               |  |
                             ->|  |<--- worst case present delay
                        semaphore signal

Running with a 2 deep swap-chain. Object space shading with late-frame view-dependent render. Input is late-latched on the GPU and view computed as late as possible. CPU command buffer generation and submit is no longer a concern for input latency.

The object-space work is decoupled from resolution and frame-rate, and is indirectly dispatched on the GPU and timed. Timing divided by dispatch size enables real-time feedback, and better estimation of how much work is safe to launch before needing to render. In order to make this work, one really wants an API to provide a GPU-side timestamp of scanout start. But before that happens it is possible to auto-calibrate by moving an "estimated scanout start" until the game drops frames, then backing off.

  |                  |          ________________________
  |                  |         (________SCANOUT_________) 
  |                  |         |
  |                ->|         |<--- safe distance to start rendering before scanout
  |                  |
->|                  |<-- dispatch indirect for object-space work
read time-stamp, then setup indirect amount based on time to safe distance before scanout
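The auto-calibration loop can be sketched as follows, with hypothetical field names and step sizes (none of this is from a real driver): shave the safety margin while frames keep making scanout, and back off by a larger step on any dropped frame.

```c
#include <assert.h>

// Hypothetical auto-calibration of the "estimated scanout start":
// margin_ms is how long before estimated scanout rendering must start.
typedef struct {
    double margin_ms;   // current safety margin before scanout
    double creep_ms;    // per-good-frame reduction of the margin
    double backoff_ms;  // penalty added after a dropped frame
} ScanoutCalib;

static void calib_update(ScanoutCalib *c, int frame_dropped) {
    if (frame_dropped)
        c->margin_ms += c->backoff_ms;   // missed scanout: add safety
    else if (c->margin_ms > c->creep_ms)
        c->margin_ms -= c->creep_ms;     // made it: shave the margin
}
```

Start generous (say 1.0 ms), creep by 0.01 ms per good frame, back off by 0.2 ms on a drop; the margin converges to hover just above the point where frames start dropping.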

Desired API Constructs
First a way to get 2-deep swap-chain with FIFO with guaranteed round-robin acquire ordering. This enables submit to be done before acquire next image (no serial dependency), and enables lowest latency presentation, etc. Example pipeline below, and note filled GPU execution (other than drain/fill on OS-level command buffer boundaries).

________________       ___________________       ___________________       ______________
____CPU_B_______)     (_______CPU_A_______)     (_______CPU_B_______)     (_______CPU_A__
___________________  ________________________  ________________________  ________________
____________________  ________________________  ________________________  _______________

CPU timeline is showing the window of opportunity for command buffer recording, the rest of the timeline can be filled with any other work. CPU command buffer record workload is bounded in time by worst-case delay on the fence and later on the submit. In my case I just replay the same {A,B} command buffers every 2 frames, so this window is mostly volatility tolerance towards random CPU scheduling bubbles. Looking at the details of API synchronization constructs just for frame A,

                  fence wait            submit
                      |                   |
         delay --->|  |<-               ->|  |<--- submit delay
                   |  |___________________|  |                   present semaphore signal
                   |  (_______CPU_A_______)  |                        |
___________________|                         |________________________|
___GFX_A___________\                         /________GFX_A_______|___\
                   |  ________________________                    |   |   _______________
                   | (_______SCANOUT_A________)                   |   |  (_______SCANOUT_
                   |                          |                   |   |  |
             fence signal                     |                   | ->|  |<--- present delay
                                              |                   |      |
                                              |                   |     present semaphore wait
                            safety bubble --->|                   |<-
                                                   start writing into swap-chain A

Note there is no semaphore making GFX_A depend on SCANOUT_A finishing, because there is enough of a safety bubble where GFX_A won't be rendering into the swap-chain. Therefore GFX_A starts before SCANOUT_A ends. The semaphores and fences are all {A,B} double buffered, and it is ultimately the fences which throttle the CPU load. GPU load will auto-fill due to the indirect dispatch mechanism adapting to timestamps.
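The throttling property can be illustrated with a toy counting model (no Vulkan calls, just the fence logic): with N fence slots used round-robin, CPU submission can never get more than N frames ahead of GPU completion.

```c
#include <assert.h>

// Toy model of fence throttling: before reusing slot (frame % slots),
// the CPU waits for the GPU to complete that slot's previous frame.
// Returns the worst-case count of frames submitted but not completed.
static int max_frames_ahead(int slots, int total_frames) {
    int gpu_done = 0; // frames the GPU has completed (in order)
    int worst = 0;
    for (int cpu = 0; cpu < total_frames; cpu++) {
        // Fence wait: frame (cpu - slots) must be complete before reuse.
        while (cpu >= slots && gpu_done < cpu - slots + 1)
            gpu_done++; // model the wait as forcing GPU completion
        int ahead = (cpu + 1) - gpu_done;
        if (ahead > worst) worst = ahead;
    }
    return worst;
}
```

max_frames_ahead(2, 100) is 2 for the {A,B} scheme; a 3-deep scheme allows 3, which is the extra frame of latency called out earlier.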

Doing Better
Review of current timeline,

                    |<---- better total input latency ---->| 
                    |                                      |
                  ->|    |<--- input jitter                |
                         |                                 |
       __________________|_____                            |
      /__________GFX_____|_____\                           |
                         |     |   ________________________|
                 late-latch    |  (________SCANOUT_________)
                         |     |  |                        
view-dependent render -->|     |< |
                               |  |
                             ->|  |<--- worst case present delay
                        semaphore signal

On the hardware it is certainly physically possible to overlap late-frame render and scanout,

                    |<---- even better latency ---->| 
                    |                               |
                  ->|    |<--- input jitter         |
                         |                          |
       __________________|_____                     |
      /__________GFX_____|_____\                    |
                         |  ________________________|
                late-latch (________SCANOUT_________)
                         | |
                       ->| |<--- scanout can start when enough of the frame is rendered

The latency advantage of this increases the larger the view-dependent part of the frame is. For many modern engines, the majority of the frame is view-dependent. Note the render would need to be chunked in scanout order.

This is different than "racing the beam", as the desire is to start just before scanout, then finish as fast as possible during scanout, instead of continuously running just before scanout and periodically interrupting some other task. This method still requires double-buffering, but mixes in a little front-buffer rendering. In order to implement, the pipeline must change a little: the view-dependent rendering gets pushed into the next frame.

________________       ___________________       ___________________       ______________
____CPU_B_______)     (_______CPU_A_______)     (_______CPU_B_______)     (_______CPU_A__
___________________  ________________________  ________________________  ________________
_____________________  ________________________  ________________________  ______________

Breaking down a frame in detail,

                  fence wait            submit
                      |                   |                      
         fence signal |                   |                       
                   |  |___________________|                      present semaphore signal
                   |  (_______CPU_A_______)                           |
___________________| ____                          ___________________| ____
                    |  ________________________                       |    _______________
                    | (_______SCANOUT_A________)                      |   (_______SCANOUT_
                    | |  |                                            |   |
                  ->| |  |<--- view-dependent rendering frame A       |   |
                      |  |                                            |   |
                    ->|  |<--- overlap                                |   |
                                                                      |   |
        command buffer ensures to end early enough before scanout --->|   |<-
                                                               present semaphore wait

The GFX_A command buffer end latches the next image for presentation, but ends early enough for the beginning of GFX_B to actually render frame A ahead of scanout. Part is rendered back-buffered; the majority is rendered front-buffered.

There are no legal ways to do this in Vulkan currently. Here is the laundry list of spec violations,

(1.) skipping image transitions for the swap-chain
(2.) requires guarantee of 2-deep swap chain
(3.) requires guarantee of round-robin acquire ordering
(4.) requires guarantee of flip
(5.) requires defined ability to store into front-buffer

However it is possible to prototype once a flip bug is fixed in the AMD driver, by depending on driver-specific behavior which is not safe to ship with.

Lowest latency with perfect jitter-free animation is just an extension away from being possible.