NVIDIA will present around 20 research papers at SIGGRAPH, the year’s most important computer graphics conference



“These are just the highlights — read more about all the NVIDIA papers at SIGGRAPH. NVIDIA will also present six courses, four talks and two Emerging Technology demos at the conference, with topics including path tracing, telepresence and diffusion models for generative AI.
NVIDIA Research has hundreds of scientists and engineers worldwide, with teams focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics.”

Source: https://blogs.nvidia.com/blog/2023/05/02/graphics-research-advances-generative-ai-next-frontier/
And she starts...



New compression method for 4K textures, delivering better fidelity for the compressed size, and it works on all GPUs.
OK that is sick.

This would technically work on AMD GPUs, but compression there would be wildly impractical.
Similar to the approach used by Müller et al. for training autodecoders
[47], we achieve practical compression speeds by using half-precision
tensor core operations in a custom optimization program
written in CUDA. We fuse all of the network layers in a single kernel,
together with feature grids sampling, loss computations, and the
entire backward pass. This allows us to store all network activations
in registers, thus eliminating writes to shared or off-chip memory
for intermediate data.
As far as I know, AMD does not currently accelerate half-precision (16-bit) or lower-precision matrix math natively the way tensor cores do, and that is where the tensor cores shine. By my estimate, performing this on current AMD hardware would be between 30 and 100 times slower than the nearest equivalent Nvidia card.
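For anyone unsure what "half precision" means here: it is the 16-bit IEEE 754 float format, half the size of a regular 32-bit float. A rough Python sketch (my own illustration, not code from the paper; `to_half` is just a helper name I made up) shows the size and the rounding you accept in exchange:

```python
import struct

# Storage size of one value in each format, in bytes.
HALF_BYTES = struct.calcsize("<e")   # IEEE 754 binary16 (half precision)
FLOAT_BYTES = struct.calcsize("<f")  # IEEE 754 binary32 (single precision)

def to_half(x: float) -> float:
    """Round a Python float to half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

if __name__ == "__main__":
    print("half:", HALF_BYTES * 8, "bits, float:", FLOAT_BYTES * 8, "bits")
    # 0.1 is not exactly representable; half precision keeps about 3 decimal digits.
    print("0.1 as half:", to_half(0.1))
    # Above 2048 the spacing between representable halves exceeds 1.
    print("4097 as half:", to_half(4097.0))
```

So the win is half the register and bandwidth cost per value, paid for with coarser rounding, which small networks like these tolerate.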

Fortunately, though, decompression is a different story:
Inlining the network with the material shader presents a few challenges
as matrix-multiplication hardware such as tensor cores operate
in a SIMD-cooperative manner, where the matrix storage is
interleaved across the SIMD lanes [54, 86]. Typically, network inputs
are copied into a matrix by writing them to group-shared memory
and then loading them into registers using specialized matrix load
intrinsics. However, access to shared memory is not available inside
ray tracing shaders. Therefore, we interleave the network inputs
in-registers using SIMD-wide shuffle intrinsics.
We used the Slang shading language [25] to implement our fused
shader along with a modified Direct3D [44] compiler to generate
NVVM [52] calls for matrix operations and shuffle intrinsics, which
are currently not supported by Direct3D. These intrinsics are instead
directly processed by the GPU driver. Although our implementation
is based on Direct3D, it can be reproduced in Vulkan [23]
without any compiler modifications, where accelerated matrix operations
and SIMD-wide shuffles are supported through public vendor
extensions. The NV_cooperative_matrix extension [22] provides
access to matrix elements assigned to each SIMD lane. The mapping
of these per-lane elements to the rows and columns of a matrix
for NVIDIA tensor cores is described in the PTX ISA [54]. The
KHR_shader_subgroup extension [21] enables shuffling of values
across SIMD lanes in order to assign user variables to the rows
and columns of the matrix and vice versa. These extensions are not
restricted to any shader types, including ray tracing shaders.
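To make the shuffle trick above a bit more concrete, here is a small Python simulation (my own sketch, not the paper's shader). Each "lane" starts with its own network inputs, and the lanes trade values through a simulated SIMD-wide shuffle until every lane holds its slice of the matrix, with no shared memory involved. The lane-to-element layout is a made-up one (flat row-major index = register * lane count + lane); the real tensor core mapping is the one in the PTX ISA. The names `subgroup_shuffle` and `interleave_inputs` are hypothetical.

```python
SUBGROUP_SIZE = 32  # assumed lane count per subgroup

def subgroup_shuffle(values, src_lanes):
    """Simulated shuffle: lane i receives the value published by lane src_lanes[i]."""
    return [values[src_lanes[lane]] for lane in range(len(values))]

def interleave_inputs(inputs, k):
    """Redistribute per-lane inputs into per-lane matrix fragments.

    inputs[lane] holds that lane's k network inputs, i.e. row `lane` of an
    n x k matrix. In the hypothetical layout used here, register j of lane l
    ends up holding the matrix element with flat row-major index j * n + l.
    """
    n = len(inputs)
    frags = [[None] * k for _ in range(n)]
    for j in range(k):
        # Which matrix element each lane wants in register j.
        rows = [(j * n + lane) // k for lane in range(n)]
        cols = [(j * n + lane) % k for lane in range(n)]
        for c in range(k):
            # Every lane publishes its c-th input; every lane reads from its source row.
            published = [inputs[lane][c] for lane in range(n)]
            received = subgroup_shuffle(published, rows)
            for lane in range(n):
                if cols[lane] == c:  # keep the value only if this was the column wanted
                    frags[lane][j] = received[lane]
    return frags

if __name__ == "__main__":
    k = 4
    inputs = [[lane * 100 + c for c in range(k)] for lane in range(SUBGROUP_SIZE)]
    frags = interleave_inputs(inputs, k)
    print(frags[5])  # lane 5's registers under the hypothetical layout
```

The inner loop is the part that matters: a lane can only receive what another lane publishes, so it takes one shuffle per input component to get the right column to the right place.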
Decompression relies on a more common set of hardware features, with the driver handling the matrix and shuffle operations issued through Direct3D or Vulkan, so it should decompress well on most modern hardware.