Figure 1. Content-adaptive image representation. Leveraging a tailored differentiable renderer, Image-GS adaptively distributes and progressively optimizes a set of anisotropic 2D Gaussians to fit a target image. Image-GS is a flexible image representation that has high memory & computation efficiency, supports fast random pixel access, and offers a natural level of detail. (a) The learned Gaussian position distribution (green dots); 20% of Gaussians are plotted for better visibility. (b) Image-GS's content-adaptive nature enables it to wisely allocate resources based on the local signal complexity and preserve fine image details with higher fidelity than alternative methods. The insets visualize the corresponding error images, with brighter colors indicating higher errors.

Image-GS: Content-Adaptive Image Representation via 2D Gaussians

Yunxiang Zhang (ORCID 0000-0003-0189-1776), yunxiang.zhang@nyu.edu, New York University, USA
Alexandr Kuznetsov (ORCID 0009-0001-7084-3391), alexandr.kuznetsov@intel.com, Intel Corporation, USA
Akshay Jindal (ORCID 0000-0003-0557-0726), akshay.jindal@intel.com, Intel Corporation, USA
Kenneth Chen (ORCID 0000-0002-8095-4407), kennychen@nyu.edu, New York University, USA
Anton Sochenov (ORCID 0009-0002-7496-8586), anton.sochenov@intel.com, Intel Corporation, USA
Anton Kaplanyan (ORCID 0000-0002-8376-6719), anton.kaplanyan@intel.com, Intel Corporation, USA
Qi Sun (ORCID 0000-0002-3094-5844), qisun@nyu.edu, New York University, USA

Abstract.

Neural image representations have recently emerged as a promising technique for storing, streaming, and rendering visual data. Coupled with learning-based workflows, these novel representations have demonstrated remarkable visual fidelity and memory efficiency. However, existing neural image representations often rely on explicit uniform data structures without content adaptivity or computation-intensive implicit models, limiting their adoption in real-time graphics applications.

Inspired by recent advances in radiance field rendering, we propose Image-GS, a content-adaptive image representation. Using anisotropic 2D Gaussians as the basis, Image-GS shows high memory efficiency, supports fast random access, and offers a natural level of detail stack. Leveraging a tailored differentiable renderer, Image-GS fits a target image by adaptively allocating and progressively optimizing a set of 2D Gaussians. The generalizable efficiency and fidelity of Image-GS are validated against several recent neural image representations and industry-standard texture compressors on a diverse set of images. Notably, its memory and computation requirements solely depend on and linearly scale with the number of 2D Gaussians, providing flexible controls over the trade-off between visual fidelity and run-time efficiency. We hope this research offers insights for developing new applications that require adaptive quality and resource control, such as machine perception, asset streaming, and content generation.

Keywords: Image Representation
CCS Concepts: Computing methodologies → Computer graphics

1. Introduction

For a long time, images have been digitized as uniform pixel grids in both hardware and software. Such discrete representations do not align with the physical world, where visual content is continuous and non-uniform. As a result, these representations suffer from limited efficiency, especially for domain-specific tasks such as machine vision and content streaming [Li et al., 2020; Chen et al., 2021].

Neural representations have recently emerged to encode, process, and render images using neural networks [Chen et al., 2021; Karnewar et al., 2022; Dosovitskiy et al., 2020; Martel et al., 2021]. Incorporated into learning-based frameworks, these representations have demonstrated significant advantages over traditional image formats in terms of visual fidelity, memory efficiency, and machine vision task performance. However, neural image representations commonly rely on explicit, uniform data structures that do not adapt to the image content or computationally heavy implicit networks for image decoding. Such characteristics become a major barrier in real-time graphics applications that require fast memory access or resource-dependent quality adaptation.

To this end, we propose Image-GS, a flexible, compact, and content-adaptive image representation based on anisotropic, colored 2D Gaussians. Specifically, each 2D Gaussian is characterized by its mean, covariance, and color. Given a target image, a group of 2D Gaussians is adaptively spawned based on the magnitude of local image gradients, with more Gaussians allocated to regions with high-frequency details. The Gaussian parameters are then optimized via a tailored differentiable renderer to reconstruct the target image, and additional Gaussians are progressively added to image regions exhibiting high reconstruction errors. To accelerate Image-GS’s inference speed, we introduce a hierarchical grid structure based on binary space partitioning. The content-adaptive nature of Image-GS enables it to wisely allocate resources based on the local signal complexity and preserve fine image details with high fidelity.

Through a series of comparative experiments with several recent neural image representations and industry-standard texture compression algorithms, we validate Image-GS’s generalizable performance in terms of visual quality and memory & computation efficiency. Additionally, Image-GS supports hardware-friendly fast random access, continuous level-of-detail adaptation, and real-time inference performance. We hope this research provides insights for developing novel image representations that have the advantages of both explicit (direct memory access & adaptive level-of-details) and implicit (efficient storage & learning-compatible) encoding, and therefore, supporting specific hardware and application needs.

In summary, this research contributes:

  • a flexible, compact, and content-adaptive image representation based on anisotropic colored 2D Gaussians;

  • a tailored differentiable renderer that efficiently aggregates
    anisotropic colored 2D Gaussians into images;

  • a hierarchical spatial partitioning for accelerated inference.

Figure 2. Optimization pipeline of our Gaussian-based adaptive image representation. At initialization, a group of 2D Gaussian primitives is adaptively spawned based on the magnitude of local image gradients, with more Gaussians allocated to regions with fine details (Section 3.3). During training, the parameters associated with the Gaussian primitives (Section 3.1) are optimized via a tailored differentiable renderer (Section 3.2) to reconstruct the target image, and additional Gaussians are progressively added to image regions exhibiting high reconstruction errors (Section 3.3). Note that we plot Gaussians as colored elliptical discs with red frames based on their mean and covariance, and overlay them with rendered images for better visibility of the training progress.

2. Related Work

2.1. Image and Texture Compression

Traditional image compression methods have prioritized efficient storage and transmission over real-time graphics. Lossless methods optimize pixel permutation and use entropy encoding [Welch, 1985], while lossy methods transform image blocks into frequency domains using wavelet [Antonini et al., 1992] or cosine transforms [Wallace, 1992] followed by quantization and entropy encoding. Advanced lossy methods also consider human color sensitivity, higher bit depths, wide color gamuts, user statistics [Alakuijala et al., 2019], and employ content-adaptive block sizes and looped filtering to reduce artifacts [Chen et al., 2018]. Despite high compression ratios, these methods are complex to decode, slow, and unsuitable for non-color data like normal maps in real-time graphics. In contrast, texture compression methods aim to reduce GPU bandwidth and support non-color data, random access, and fast decompression. They operate on small 4x4 pixel blocks, maintaining local statistics (such as mean and variance) while reducing bits-per-pixel (bpp) within each block [Delp and Mitchell, 1979]. Each block is compressed independently, storing per-pixel color values [Campbell et al., 1986], base color with adjustments [Ström and Akenine-Möller, 2005; Ström and Pettersson, 2007], or color endpoints with interpolation indices [BC, 2024]. Advanced methods allow dynamic block sizes, HDR content, and content-adaptive compression strategy per block [Nystad et al., 2012]. These methods offer fast random access but are limited to an 8:1 compression ratio. Recently, Vaidyanathan et al. [2023] proposed compressing multiple material textures and mipmap chains with a small multilayer perceptron (MLP), achieving high compression and real-time random access. Our Image-GS representation also achieves a high compression ratio, good visual fidelity, and real-time inference, with additional content-adaptive optimization, at-will level-of-detail queries, and variable bit rate.

2.2. Neural Image Representation

Neural image representation is an emerging field that diverges from traditional pixel-based methods, using deep features or implicit neural functions to encode images. Ballé et al. [2018] use variational autoencoders (VAEs) with deep hyperprior to create compact latent space image representation. Chen et al. [2021] transform images into continuous signals via a 2D feature map and a shared decoder. These methods, trained on fixed image sets, may struggle to generalize to new images. In contrast, per-image encoder/decoder approaches employ MLPs to approximate 2D image signals, enhanced with activation functions like sinusoids [Sitzmann et al., 2020] and Gabor wavelets [Saragadam et al., 2023], or positional encoding [Tancik et al., 2020] to capture fine details. These methods can also be extended to discontinuous signals through hybrid neural-mesh representations [Belhe et al., 2023]. Karnewar et al. [2022] showed that even traditional grid-based representations can be improved by adding a fixed non-linearity to interpolated values. Martel et al. [2021] proposed a hybrid method that uses multiscale block-coordinate decomposition to adaptively allocate resources based on local signal complexity. These methods, though capable of representing complex images with high quality and low memory, often have long training and inference times and struggle with high-frequency variations due to single-scale representation. Müller et al. [2022] addressed these issues with fully fused MLPs and multi-resolution hash grids, enabling quick, memory-efficient learning and inference of gigapixel images. In Section 4, we demonstrate that our Image-GS representation achieves superior visual quality at low bit rate compared to these neural methods while supporting real-time inference.

Gaussian mixtures in graphics

In parallel to neural representation, Gaussian mixture representations are also seeing an emerging trend in computer graphics. Our work is inspired by the recent 3D Gaussian Splatting method [Kerbl et al., 2023] that uses explicit 3D Gaussian functions for real-time high-quality novel-view synthesis. Many follow-up works have extended this method to dynamic scenes [Luiten et al., 2023], on-the-fly training and streaming [Sun et al., 2024], 2D Gaussians for surface modeling [Huang et al., 2024], and a wide variety of other applications [Chen and Wang, 2024]. Although Gaussian mixture models have been used for image generation [Gepperth and Pfülb, 2021], compression [Sun et al., 2019, 2021], and stylization [Cheng, 2024], their application as 2D image representation remains largely unexplored in real-time graphics.

3. Method

We first introduce a flexible and compact image representation based on anisotropic, colored 2D Gaussians (Section 3.1), then present a tailored differentiable renderer to aggregate Gaussians into pixel values (Section 3.2). The Gaussian parameters are optimized to reconstruct the image content, while additional 2D Gaussians are progressively added to error-prone regions (Section 3.3). The resulting representation, named Image-GS, adaptively allocates 2D Gaussians based on the complexity of local image regions, achieving a favorable trade-off between visual fidelity and memory/computation efficiency. To enable low-cost random access and fast rendering speed, we build an image-specific hierarchical grid to spatially partition optimized Gaussians and reduce run-time computations (Section 3.4).

3.1. Images as Anisotropic 2D Gaussians

Due to the adoption of memory-consuming feature grids [Lombardi et al., 2019; Sitzmann et al., 2019a] and computation-intensive implicit models [Sitzmann et al., 2019b; Niemeyer et al., 2020; Martel et al., 2021], existing neural representations exhibit limited scalability for complex visual data in real-time graphics applications. To address this issue, we draw inspiration from the recent success of 3D Gaussian splatting [Kerbl et al., 2023], an explicit scene representation supporting high-quality and real-time rendering, and propose to use anisotropic 2D Gaussians as the representation basis for images.

Similar to the Gaussian primitives in 3D, the geometry and orientation of an anisotropic 2D Gaussian are characterized by a covariance matrix $\bm{\Sigma}\in\mathbb{R}^{2\times 2}$, and the Gaussian is centered at image coordinates $\bm{\mu}=(u,v)$. Its density value evaluated at an arbitrary pixel location $\textbf{x}\in\mathbb{R}^{2}$ is:

(1) $G(\textbf{x})=\exp\left(-\frac{1}{2}(\textbf{x}-\bm{\mu})^{T}\bm{\Sigma}^{-1}(\textbf{x}-\bm{\mu})\right).$

To ensure that the covariance matrix $\bm{\Sigma}$ remains physically feasible, i.e., positive semi-definite, during the numerical optimization process detailed in Section 3.3, we factorize it into a rotation matrix $\textbf{R}\in\mathbb{R}^{2\times 2}$ and a diagonal scaling matrix $\textbf{S}\in\mathbb{R}^{2\times 2}$:

(2) $\bm{\Sigma}=\textbf{R}\,\textbf{S}\,\textbf{S}^{T}\,\textbf{R}^{T}.$

Specifically, we create and maintain a rotation angle $\theta\in[0,\pi]$ and a scaling vector $\textbf{s}\in\mathbb{R}_{+}^{2}$ for each 2D Gaussian. These parameters are optimized via stochastic gradient descent and clipped to their allowed value range during the training process. The rotation matrix $\textbf{R}$ and the scaling matrix $\textbf{S}$ are constructed on the fly using $\theta$ and $\textbf{s}$ at both the training and inference stages.

Unlike Gaussian splatting in 3D that adopts spherical harmonics to model view-dependent color effects [Kerbl et al., 2023], we only utilize a 3-dimensional vector $\textbf{c}\in\mathbb{R}_{+}^{3}$ for each 2D Gaussian to store its RGB values, as an image essentially captures a single view of a scene. In addition, 3D Gaussian splatting relies on a per-Gaussian trainable opacity parameter for depth-based occlusion computation and $\alpha$-blending during the rendering process. By contrast, depth information is not necessary for accurate rendering in the 2D space, and 2D Gaussian primitives can be effectively aggregated regardless of their relative order, as we explain in Section 3.2. Therefore, our 2D Gaussian primitives do not have an opacity property.

Based on the discussion above, an arbitrary 2D Gaussian primitive in our image representation, $G_{i}$ with $1\leq i\leq N_{\text{g}}$, is fully characterized by a vector containing 8 trainable parameters $\textbf{p}_{i}\in\mathbb{R}^{8}$:

(3) $\textbf{p}_{i}\coloneqq\textbf{p}_{i}(\bm{\mu}_{i},\theta_{i},\textbf{s}_{i},\textbf{c}_{i}).$
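As an illustration of Equations 1 to 3, the following minimal NumPy sketch builds the covariance from a rotation angle and a scaling vector and evaluates the density at a pixel. The function names `covariance` and `density`, the counter-clockwise rotation convention, and the example parameter values are assumptions made for this sketch, not details taken from the authors' implementation.

```python
import numpy as np

def covariance(theta, s):
    """Build Sigma = R S S^T R^T (Equation 2) from an angle and two scales."""
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    # Assumed convention: counter-clockwise rotation by theta.
    R = np.array([[cos_t, -sin_t], [sin_t, cos_t]])
    S = np.diag(s)
    return R @ S @ S.T @ R.T

def density(x, mu, theta, s):
    """Evaluate the unnormalized Gaussian density G(x) of Equation 1."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    sigma = covariance(theta, np.asarray(s, dtype=float))
    # Solve Sigma z = d instead of forming the explicit inverse.
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma, d)))

# One Gaussian holds 8 parameters: mu (2), theta (1), s (2), c (3).
# The values below are hypothetical and only serve the example.
p = {"mu": (0.5, 0.5), "theta": 0.0, "s": (0.1, 0.05), "c": (1.0, 0.5, 0.2)}
g_at_mean = density(p["mu"], p["mu"], p["theta"], p["s"])
```

Because the covariance is rebuilt from $\theta$ and $\textbf{s}$ on every call, positive scales yield a valid matrix regardless of the angle, which is the purpose of the factorization.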

3.2. Aggregating 2D Gaussians into Images

While Gaussian splatting in 3D necessitates per-view depth sorting and per-Gaussian opacity to handle occlusions and enforce multi-view consistency, especially when there exist objects that are visible in some views but not in others, we argue that depth sorting and occlusion modeling can be safely omitted in the 2D case without negatively impacting the rendering quality. Since an image only captures a single view of an underlying 3D scene, there is no need to account for any potential multi-view consistency issues in other viewing directions due to inconsistent depth information. As a result, an image represented by a set of 2D Gaussians can be rendered by applying the Gaussians in arbitrary order, as long as the final rendering result is pixel-wise accurate for that particular image.

With this insight in mind, we simplify the standard point-based $\alpha$-blending approach from the literature [Yifan et al., 2019; Kopanas et al., 2021, 2022] by treating our Gaussian primitives as an unordered set of semi-transparent anisotropic points, and aggregate their color contributions to form an image pixel $\textbf{x}\in\mathbb{R}^{2}$:

(4) $\textbf{c}_{\text{r}}(\textbf{x})=\sum_{i=1}^{N_{\text{g}}}G_{i}(\textbf{x})\cdot\textbf{c}_{i}.$

While this naive formulation effectively renders 2D Gaussians in an order-agnostic manner, it involves all Gaussians to compute the color of each pixel. Such global pixel-Gaussian correlation breaks the data locality that is necessary for efficient random pixel access on GPUs and largely limits the overall rendering speed [Vaidyanathan et al., 2023]. Moreover, the fact that each Gaussian receives gradients through pixel errors across the whole image domain makes the optimization less consistent across iterations and slower to converge, as training updates in irrelevant, faraway image regions can influence a 2D Gaussian in unpredictable and undesirable ways.

For this reason, we limit the number of Gaussians that contribute to a given pixel based on their density. Specifically, we first evaluate and rank the density values of all Gaussians at $\textbf{x}\in\mathbb{R}^{2}$, then only keep the top-K and use their density values as weights for rendering. Besides, we also normalize these weights before the aggregation to avoid situations where certain pixels are not adjacent enough to their top-K Gaussians to receive sufficient color contributions. Overall, our rendering algorithm for 2D Gaussians is formulated as:

(5) $\textbf{c}_{\text{r}}(\textbf{x})=\frac{1}{\sum_{j\in\mathcal{N}_{\text{K}}(\textbf{x})}G_{j}(\textbf{x})}\sum_{j\in\mathcal{N}_{\text{K}}(\textbf{x})}G_{j}(\textbf{x})\cdot\textbf{c}_{j},$

where $\mathcal{N}_{\text{K}}(\textbf{x})$ represents the set of top-K Gaussians for $\textbf{x}$, defined as $\mathcal{N}_{\text{K}}(\textbf{x})=\text{top-K}(\{G_{i}(\textbf{x})\}_{i=1}^{N_{\text{g}}})$. Throughout our experiments, K is set to 10 unless otherwise specified.
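The normalized top-K aggregation of Equation 5 can be sketched for a single pixel as follows. This is a brute-force NumPy illustration, not the tailored differentiable renderer described in the paper; the function name `render_pixel`, the use of precomputed inverse covariances, and the small floor on the normalizer are assumptions of this sketch.

```python
import numpy as np

def render_pixel(x, mus, inv_sigmas, colors, k=10):
    """Color of pixel x from the K Gaussians with the highest density (Equation 5).

    mus: (N, 2) means, inv_sigmas: (N, 2, 2) inverse covariances,
    colors: (N, 3) RGB values.
    """
    d = np.asarray(x, dtype=float)[None, :] - mus            # (N, 2) offsets
    maha = np.einsum("ni,nij,nj->n", d, inv_sigmas, d)       # squared Mahalanobis distances
    g = np.exp(-0.5 * maha)                                  # densities G_i(x)
    k = min(k, g.shape[0])
    top = np.argpartition(-g, k - 1)[:k]                     # indices of the K largest densities
    w = g[top]
    # Assumed safeguard: floor the normalizer to avoid division by zero
    # when every density underflows.
    return (w[:, None] * colors[top]).sum(axis=0) / max(w.sum(), 1e-12)
```

Since the weights are normalized, the output is a convex combination of the K selected colors, so a pixel that is far from all of its top-K Gaussians still receives a full-intensity color.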

Figure 3. Accelerated inference via hierarchical spatial partitioning. After training, the target image is adaptively subdivided into $N_{\text{b}}$ disjoint blocks (solid orange) such that each block covers a subset of optimized Gaussians (red dots show their locations) with similar cardinality ($\approx N_{\text{g}}/N_{\text{b}}$). Instead of querying all $N_{\text{g}}$ Gaussians, rendering a pixel in block $B_{k}$ only involves the Gaussians within its corresponding outer shell $S_{k}$ (dotted yellow). The size difference between each block-shell pair is designed to mitigate boundary artifacts. Only 10% of Gaussians are plotted for better visibility.

3.3. Content-Adaptive Initialization and Optimization

Unlike scene geometries which are commonly sparse in the 3D space, images (except the ones with an alpha channel) typically contain dense color information everywhere in the 2D image domain. A good initialization of our 2D Gaussian primitives, therefore, should output a spatial distribution that covers the entire image domain while emphasizing regions with high-frequency, fine details.

To this end, we propose a content-adaptive sampling strategy that combines local image gradient guidance with uniform sampling. Specifically, we only sample pixel locations to initialize the position of Gaussians. During the position sampling for each Gaussian, the probability of a given pixel x being sampled is a weighted sum of the relative magnitude of its local image gradient and a constant shared across all pixel locations, as formulated below. In particular, the left term advocates image-content adaptivity, while the right term ensures appropriate image-domain coverage.

(6) $\mathds{P}_{\text{init}}(\textbf{x})=\frac{(1-\lambda_{\text{init}})\cdot\lVert\nabla I(\textbf{x})\rVert_{2}}{\sum_{h=1}^{H}\sum_{w=1}^{W}\lVert\nabla I(\textbf{x}_{h,w})\rVert_{2}}+\frac{\lambda_{\text{init}}}{H\cdot W},$

where $H$ and $W$ give the height and width of the image, and $\nabla I(\cdot)$ denotes the image gradient operator. $\lambda_{\text{init}}\in[0,1]$ balances local content adaptivity and uniform coverage, and is set to $0.3$ in our experiments. In addition to position initialization, all Gaussians are assigned the target pixel color at their initialized location, $\textbf{c}_{\text{t}}(\textbf{x})$.
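A minimal NumPy sketch of the sampling distribution in Equation 6 is given below. The choice of gradient operator (central finite differences on the channel mean via `np.gradient`), the fallback to a uniform distribution for constant images, sampling with replacement, and the names `init_probability` and `sample_positions` are all assumptions of this sketch; the paper does not specify them.

```python
import numpy as np

def init_probability(image, lam_init=0.3):
    """Per-pixel sampling probability of Equation 6 for an (H, W) or (H, W, C) image."""
    gray = image.mean(axis=2) if image.ndim == 3 else image
    gy, gx = np.gradient(gray.astype(float))        # finite-difference image gradient
    mag = np.sqrt(gx ** 2 + gy ** 2)                 # L2 norm of the gradient
    h, w = gray.shape
    total = mag.sum()
    # Assumed fallback: a constant image has no gradient, so sample uniformly.
    grad_term = mag / total if total > 0 else np.full((h, w), 1.0 / (h * w))
    return (1.0 - lam_init) * grad_term + lam_init / (h * w)

def sample_positions(prob, n, rng):
    """Draw n pixel coordinates (row, col) according to the probability map."""
    h, w = prob.shape
    flat = prob.ravel()
    idx = rng.choice(h * w, size=n, replace=True, p=flat / flat.sum())
    rows, cols = np.unravel_index(idx, (h, w))
    return np.stack([rows, cols], axis=1)
```

The uniform term guarantees that every pixel keeps a probability of at least $\lambda_{\text{init}}/(H\cdot W)$, so flat regions are still covered by some Gaussians.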

To optimize the Gaussian parameters toward reconstructing the target image, during each training iteration, we sample a large set of pixel locations and compute the $L_{1}$ loss between rendered (using the differentiable renderer in Section 3.2) and ground-truth pixel values at sampled locations. Similar to Equation 6, the sampling distribution again has an image-gradient term to emphasize complex local image content and a constant term to enforce broad coverage.

(7) $\mathds{P}_{\text{opt}}(\textbf{x})=\frac{(1-\lambda_{\text{opt}})\cdot\lVert\nabla I(\textbf{x})\rVert_{2}}{\sum_{h=1}^{H}\sum_{w=1}^{W}\lVert\nabla I(\textbf{x}_{h,w})\rVert_{2}}+\frac{\lambda_{\text{opt}}}{H\cdot W}.$

The trade-off parameter $\lambda_{\text{opt}}\in[0,1]$ is set to $0.8$ in our experiments.

Besides the initially created Gaussians, we also periodically add new Gaussian primitives to image regions having high reconstruction errors as the training progresses. This is achieved by sampling pixel locations based on their relative error magnitude.

(8) $\mathds{P}_{\text{add}}(\textbf{x})=\frac{\left|\textbf{c}_{\text{r}}(\textbf{x})-\textbf{c}_{\text{t}}(\textbf{x})\right|}{\sum_{h=1}^{H}\sum_{w=1}^{W}\left|\textbf{c}_{\text{r}}(\textbf{x}_{h,w})-\textbf{c}_{\text{t}}(\textbf{x}_{h,w})\right|}.$

Figure 2 illustrates the optimization pipeline of Image-GS.
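The error-guided distribution of Equation 8 can be sketched in a few lines. Here $|\cdot|$ on a color vector is interpreted as the sum of absolute per-channel differences, and a uniform fallback is used when the reconstruction is already exact; both choices, as well as the name `add_probability`, are assumptions of this sketch.

```python
import numpy as np

def add_probability(rendered, target):
    """Probability of spawning a new Gaussian at each pixel (Equation 8).

    rendered, target: (H, W, 3) arrays holding c_r and c_t.
    """
    # Assumed interpretation: |.| is the sum of absolute channel differences.
    err = np.abs(rendered - target).sum(axis=-1)
    total = err.sum()
    h, w = err.shape
    # Assumed fallback: with zero error everywhere, sample uniformly.
    return err / total if total > 0 else np.full((h, w), 1.0 / (h * w))
```

Pixel locations for the new Gaussians can then be drawn from this map with any categorical sampler, for example `numpy.random.Generator.choice` on the flattened probabilities.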

3.4. Hierarchical Spatial Partitioning for Efficient Inference

For real-time graphics applications with demanding performance requirements, evaluating and ranking the density values of all Gaussian primitives to render a single pixel is computationally infeasible. Fortunately, the constraint on the number of Gaussians making contributions to each pixel (Equation 5) enforces pixel-Gaussian locality during the optimization and ensures that the top-K Gaussians of each pixel are located within a small local neighborhood after the training.

Following this intuition, we tailor a hierarchical grid structure to each Image-GS-represented image to spatially partition the set of all $N_{\text{g}}$ optimized 2D Gaussians into $N_{\text{b}}$ smaller subsets of similar cardinality ($\approx N_{\text{g}}/N_{\text{b}}$). This is achieved via a variant of binary space partitioning (BSP), where only horizontal and vertical splitting lines are employed. Given the maximum number of Gaussians allowed in each subset $N_{\text{max}}$, a BSP tree is constructed to iteratively and adaptively subdivide the image domain into smaller blocks until the termination criterion is met. Specifically, each node in the BSP tree corresponds to a block in the image space. The ensemble of leaf nodes forms a set of disjoint blocks that together cover the entire image domain. Each leaf block undergoes a sequence of alternating horizontal and vertical splitting. Each splitting line is computed such that it separates a leaf block into two smaller blocks with an equal number of Gaussians. This process continues until no leaf block contains more than $N_{\text{max}}$ Gaussians. The resulting leaf blocks give the desired subsets of Gaussians.
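The axis-aligned BSP construction described above can be sketched recursively. The sketch below splits at the median Gaussian position and alternates the split axis at each level; placing the splitting line halfway between the two middle Gaussians, starting with a split along the first coordinate, and the name `bsp_partition` are assumptions of this illustration rather than details confirmed by the paper.

```python
import numpy as np

def bsp_partition(points, n_max, bounds=(0.0, 0.0, 1.0, 1.0), axis=0, idx=None):
    """Split the block `bounds` = (x0, y0, x1, y1) until each leaf holds <= n_max points.

    points: (N, 2) Gaussian positions. Returns a list of (bounds, indices) leaves.
    """
    if n_max < 1:
        raise ValueError("n_max must be at least 1")
    if idx is None:
        idx = np.arange(len(points))
    if len(idx) <= n_max:
        return [(bounds, idx)]
    # Sort the block's Gaussians along the current axis and cut at the median.
    order = idx[np.argsort(points[idx, axis], kind="stable")]
    half = len(order) // 2
    lo, hi = order[:half], order[half:]
    # Assumed placement: the line sits halfway between the two middle Gaussians.
    cut = 0.5 * (points[lo[-1], axis] + points[hi[0], axis])
    x0, y0, x1, y1 = bounds
    if axis == 0:
        b_lo, b_hi = (x0, y0, cut, y1), (cut, y0, x1, y1)
    else:
        b_lo, b_hi = (x0, y0, x1, cut), (x0, cut, x1, y1)
    nxt = 1 - axis  # alternate the splitting direction at the next level
    return (bsp_partition(points, n_max, b_lo, nxt, lo)
            + bsp_partition(points, n_max, b_hi, nxt, hi))
```

Because every split halves the Gaussian count rather than the block area, blocks become small where Gaussians are dense and stay large in smooth regions, which keeps the per-block workload balanced.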

However, rendering pixels in a leaf block $B_{k}$ using only the Gaussians in $B_{k}$ can result in severe boundary artifacts, where neighboring pixels in two adjacent leaf blocks exhibit unnatural color changes. To address this issue, we extend the boundaries of each block $B_{k}$ by $1/4$ to obtain its corresponding encompassing shell $S_{k}$ and make all Gaussians within $S_{k}$ available for rendering pixels in $B_{k}$. Figure 3 illustrates the resulting hierarchical spatial partitioning. Instead of querying all $N_{\text{g}}$ Gaussians, rendering a pixel only involves $\approx 9N_{\text{g}}/(4N_{\text{b}})$ Gaussians now. Figure 4 shows the inference acceleration performance for a varying number of blocks.
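The factor of $9/4$ can be recovered with a short calculation, under two assumptions that are not stated explicitly in the text: each of the four sides of a block of width $w$ and height $h$ is pushed outward by a quarter of the block's extent along that direction, and the Gaussian density around the block is roughly the same as inside it.

```latex
\begin{align*}
\text{shell width}  &= w + 2\cdot\tfrac{1}{4}w = \tfrac{3}{2}w, \qquad
\text{shell height}  = h + 2\cdot\tfrac{1}{4}h = \tfrac{3}{2}h, \\
\frac{\operatorname{area}(S_{k})}{\operatorname{area}(B_{k})}
  &= \frac{\tfrac{3}{2}w\cdot\tfrac{3}{2}h}{w\cdot h} = \frac{9}{4}, \\
\#\{\text{Gaussians in } S_{k}\}
  &\approx \frac{9}{4}\cdot\frac{N_{\text{g}}}{N_{\text{b}}}
   = \frac{9N_{\text{g}}}{4N_{\text{b}}}.
\end{align*}
```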

4. Evaluation

4.1. Experimental Setup

Evaluation dataset

To comprehensively understand and validate Image-GS’s image representation performance, we prepared an evaluation dataset of 30 RGB images (2K×2K resolution) with diverse characteristics, including 10 photographs, 4 vector-style images, 8 texture maps, 4 anime posters, and 4 watercolor/oil paintings.

Evaluation metrics

We adopt 4 image quality metrics, PSNR, SSIM [Wang et al., 2004], LPIPS [Zhang et al., 2018], and FLIP [Andersson et al., 2020], to evaluate the visual fidelity of Image-GS and establish quantitative comparisons with baseline methods. We also report the parameter size of each representation and the corresponding bit rate in bpp (bits per pixel) to evaluate their memory efficiency.
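As a reminder of how the simplest of these metrics is defined, the following sketch computes PSNR for two flattened images with values in $[0, 1]$. The function name `psnr` and the flat-list input format are illustrative assumptions; SSIM, LPIPS, and FLIP require their reference implementations and are not reproduced here.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float],
         peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized value lists."""
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel/channel values.
    mse = sum((r - t) ** 2 for r, t in zip(rendered, reference)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

For example, a uniform error of 0.1 on a unit-range image gives a PSNR of about 20 dB.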

Implementation

The only trainable parameters in Image-GS are the Gaussian parameters in Equation 3. By mapping image domains to $[0,1]^2$ grids, Image-GS works for target images of any aspect ratio and resolution. At initialization, the Gaussian positions $\bm{\mu}$ and colors $\mathbf{c}$ are populated by sampling pixel coordinates based on Equation 6. Their scaling vectors $\mathbf{s}$ and rotation angles $\theta$ are set to $2/\max(H,W)$ and $0$, respectively. We adopt the Adam optimizer [Kingma and Ba, 2015] to iteratively update these parameters for 50K iterations. The learning rates for $(\bm{\mu}, \mathbf{c}, \mathbf{s}, \theta)$ start at (2e-4, 2e-3, 1e-3, 1e-3) and decay by a factor of 10 (only once) if no improvement has been made for 3 consecutive measurements, where PSNR and SSIM are computed every 1K iterations. During each iteration, 10K pixel locations $\mathcal{P}$ are sampled based on Equation 7 to evaluate the loss function:

$$L = \frac{1}{\left|\mathcal{P}\right|} \sum_{\mathbf{x} \in \mathcal{P}} \left| \mathbf{c}_{\text{r}}(\mathbf{x}) - \mathbf{c}_{\text{t}}(\mathbf{x}) \right|. \tag{9}$$
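A plain-Python reading of this loss is sketched below. It assumes that the two color terms are the rendered and target RGB colors at a sampled pixel and that the absolute value denotes the sum of per-channel absolute differences; both readings, as well as the function name `l1_loss`, are assumptions made for illustration.

```python
from typing import Sequence, Tuple

Color = Tuple[float, float, float]


def l1_loss(rendered: Sequence[Color], target: Sequence[Color]) -> float:
    """Average, over the sampled pixels, of the L1 distance between two colors."""
    if len(rendered) != len(target) or len(rendered) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for c_r, c_t in zip(rendered, target):
        # Assumption: |.| sums the absolute differences of the RGB channels.
        total += sum(abs(a - b) for a, b in zip(c_r, c_t))
    return total / len(rendered)
```

In a PyTorch training loop the same quantity would typically be computed on tensors so that gradients flow back to the Gaussian parameters; the list-based version here only illustrates the arithmetic.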

During training, we progressively allocate additional Gaussians to image regions with high fitting errors based on Equation 8. For a budget of $N_{\text{g}}$ Gaussians, we initialize training with $N_{\text{g}}/2$ Gaussians and optimize for 10K iterations. $N_{\text{g}}/8$ additional Gaussians are added every 5K iterations until the budget runs out. We implement Image-GS in PyTorch [Paszke et al., 2019] and use half-precision floating-point numbers (float16) for all parameters.
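The allocation schedule can be written as a small helper that returns the number of active Gaussians at a given iteration. The sketch assumes that the first batch of additional Gaussians arrives right after the initial 10K iterations and that the budget is divisible by 8; the text leaves both details open, and the name `gaussian_count` is hypothetical.

```python
def gaussian_count(iteration: int, budget: int,
                   warmup: int = 10_000, interval: int = 5_000) -> int:
    """Number of Gaussians being optimized at `iteration` (0-based)."""
    initial = budget // 2
    if iteration < warmup:
        return initial
    # Assumption: one batch of budget/8 Gaussians is added at the end of the
    # warm-up and another one every `interval` iterations after that.
    batches = (iteration - warmup) // interval + 1
    return min(budget, initial + batches * (budget // 8))
```

With a budget of 8,000 Gaussians this yields 4,000 Gaussians up to iteration 9,999, then 5,000, 6,000, and 7,000, and the full 8,000 from iteration 25,000 onward, leaving the second half of the 50K-iteration run to refine the complete set.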

Figure 4. Performance analysis with hierarchical grid partitioning. We measure the time of rendering 10K pixels with Image-GS (8K Gaussians) for a varying number of BSP blocks (Section 3.4). The measurement is performed on an NVIDIA GeForce RTX 3080 GPU. Each data point is averaged over 100 trials. Note that our proof-of-concept implementation uses pure PyTorch code only. Further acceleration can be achieved with customized CUDA kernels.

4.2. Acceleration via Hierarchical Spatial Partitioning

Using the acceleration technique introduced in Section 3.4, we analyze Image-GS’s system performance under a varying number of blocks. With the BSP tree constructed, the image domain $[0,1]^2$ is hierarchically partitioned into $N_{\text{b}}$ disjoint blocks. Each block $B_k$ is defined by the coordinates of its top-left and bottom-right corners $(\mathbf{x}_1, \mathbf{x}_2)$. At run-time, the boundary locations of its corresponding shell $S_k$ are derived from $(\mathbf{x}_1, \mathbf{x}_2)$ on the fly and used to filter the Gaussians within $S_k$ for rendering the pixels within $B_k$.
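The run-time shell derivation and Gaussian filtering can be illustrated as follows. This is a sketch, not the paper's PyTorch code: the helper names `shell_of` and `gaussians_in_shell` are hypothetical, the margin is taken as one quarter of the block extent on every side, the shell is clamped to the image domain, and a Gaussian is selected by its center position alone. All four choices are assumptions.

```python
from typing import List, Tuple

Block = Tuple[float, float, float, float]  # (x1, y1, x2, y2): top-left, bottom-right
Point = Tuple[float, float]


def shell_of(block: Block, margin: float = 0.25) -> Block:
    """Grow a block by `margin` times its extent on every side, clamped to [0, 1]^2."""
    x1, y1, x2, y2 = block
    dx, dy = margin * (x2 - x1), margin * (y2 - y1)
    return (max(0.0, x1 - dx), max(0.0, y1 - dy),
            min(1.0, x2 + dx), min(1.0, y2 + dy))


def gaussians_in_shell(centers: List[Point], block: Block) -> List[int]:
    """Indices of the Gaussians whose centers fall inside the shell of `block`."""
    sx1, sy1, sx2, sy2 = shell_of(block)
    return [i for i, (x, y) in enumerate(centers)
            if sx1 <= x <= sx2 and sy1 <= y <= sy2]
```

Only the four block coordinates need to be stored; the shell is recomputed whenever a block is rendered, which keeps the storage overhead at four values per block.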

Results and discussion

Figure 4 illustrates the run-time efficiency of Image-GS (8K Gaussians) for a varying number of spatial partitioning blocks. The measurement is performed on an NVIDIA GeForce RTX 3080 GPU. The inference time required for rendering 10K pixels decreases from 4.59 ms for 16 blocks to 0.70 ms for 192 blocks, a 6.56× acceleration. Further engineering efforts, such as customized CUDA kernels, could enable substantial acceleration over our proof-of-concept implementation in pure PyTorch code but are beyond the scope of this work. Notably, storing these block coordinates for accelerated rendering only requires $4N_{\text{b}}$ float16 parameters and adds $8N_{\text{b}}$ bytes to our representation, which is practically negligible compared to the memory consumed by the Gaussian parameters. For instance, 150 blocks (1.20 KB) account for only 0.93% of the memory consumption of an Image-GS-represented image with 8K Gaussians. These results demonstrate the scalability of our acceleration approach via hierarchical spatial partitioning.
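The overhead and speed-up figures quoted above can be checked with a few lines of arithmetic. The sketch assumes 8 float16 parameters per Gaussian (2 for position, 3 for color, 2 for scale, 1 for rotation), which is inferred from the 16 KB reported for 1K Gaussians in Table 1, and it reads "8K" as 8,000; the helper names are hypothetical.

```python
BYTES_PER_FLOAT16 = 2
PARAMS_PER_GAUSSIAN = 8   # assumption: position (2) + color (3) + scale (2) + rotation (1)
PARAMS_PER_BLOCK = 4      # two corner points, two coordinates each


def block_overhead_bytes(num_blocks: int) -> int:
    """Storage for the block corner coordinates."""
    return PARAMS_PER_BLOCK * BYTES_PER_FLOAT16 * num_blocks


def overhead_fraction(num_blocks: int, num_gaussians: int) -> float:
    """Share of the total representation size taken by the block coordinates."""
    blocks = block_overhead_bytes(num_blocks)
    gaussians = PARAMS_PER_GAUSSIAN * BYTES_PER_FLOAT16 * num_gaussians
    return blocks / (blocks + gaussians)


speedup = 4.59 / 0.70  # 16 blocks vs. 192 blocks, timings taken from the text
```

Under these assumptions, 150 blocks occupy 1,200 bytes next to 128,000 bytes of Gaussian parameters, a share of about 0.93%, and the timing ratio rounds to 6.56.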

4.3. Visual Fidelity vs Memory Efficiency

Table 1. Trade-off between visual fidelity and memory efficiency. We adjust the size of Image-GS by varying the number of 2D Gaussians therein (“nK”). ↑/↓ symbols indicate that higher/lower values are better.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FLIP ↓ | Size ↓ |
| --- | --- | --- | --- | --- | --- |
| Ours (1K) | 24.48 | 0.7946 | 0.2734 | 0.1748 | 16 KB |
| Ours (2K) | 26.80 | 0.8316 | 0.2076 | 0.1392 | 32 KB |
| Ours (4K) | 29.43 | 0.8706 | 0.1573 | 0.1074 | 64 KB |
| Ours (6K) | 30.85 | 0.8828 | 0.1310 | 0.0967 | 96 KB |
| Ours (8K) | 32.19 | 0.8912 | 0.1176 | 0.0866 | 128 KB |

We optimize Image-GS to the 22 non-texture images in our evaluation dataset under varying memory consumption. By adjusting the number of 2D Gaussians, we obtain 5 different bit-rate levels (in bpp) for Image-GS: 0.244, 0.183, 0.122, 0.061, 0.031.
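These bit rates follow directly from the Gaussian counts. The sketch below assumes 16 bytes per Gaussian (8 float16 parameters, consistent with the sizes in Table 1), reads "nK" as n × 1,000 Gaussians, and takes 2K×2K to mean 2048×2048 pixels; the function name `bits_per_pixel` is hypothetical.

```python
def bits_per_pixel(num_gaussians: int, height: int = 2048, width: int = 2048,
                   bytes_per_gaussian: int = 16) -> float:
    """Bit rate of an Image-GS representation relative to the target resolution."""
    return 8 * bytes_per_gaussian * num_gaussians / (height * width)


# Bit rates for the five Gaussian budgets, from largest to smallest.
levels = [round(bits_per_pixel(n), 3) for n in (8000, 6000, 4000, 2000, 1000)]
```

Under these assumptions `levels` reproduces the five values listed above, from 0.244 bpp for 8K Gaussians down to 0.031 bpp for 1K Gaussians.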

Results and discussion

For all 5 bit-rate levels, we render Image-GS-represented images at 2K×2K resolution and evaluate their visual fidelity against the reference images. As summarized in Table 1, Image-GS achieves an average performance of 32.19 (PSNR), 0.89 (SSIM), 0.12 (LPIPS), and 0.09 (FLIP) with a memory size of 128 KB. Even at an ultra-low bit rate of 0.031 bpp (16 KB), Image-GS is able to fit the target images at 24.48 (PSNR), 0.79 (SSIM), 0.27 (LPIPS), and 0.17 (FLIP) on average. Notably, our progressive optimization strategy (Section 3.3) automatically generates a sequence of Image-GS-represented images at varying bit rates during training. This forms a natural level of detail (LoD) stack for the target image. Figure 6 shows several samples rendered at 2K×2K resolution.

4.4. Comparison with Neural Image Representations

Table 2. Quantitative comparison with previous neural image representations. We measure the representation efficiency of Image-GS in terms of visual quality and memory consumption against four baseline methods. All metrics are computed on images rendered at 2K×2K resolution.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FLIP ↓ | Size ↓ |
| --- | --- | --- | --- | --- | --- |
| ReLU | 23.25 | 0.7148 | 0.4288 | 0.2012 | 132 KB |
| SIREN | 27.48 | 0.7760 | 0.3662 | 0.1613 | 135 KB |
| WIRE | 26.53 | 0.6996 | 0.4062 | 0.2117 | 134 KB |
| I-NGP | 27.12 | 0.7815 | 0.2641 | 0.1786 | 141 KB |
| Ours | 32.19 | 0.8912 | 0.1176 | 0.0866 | 128 KB |

Baseline methods

We establish comparisons with 4 recent neural image representations: ReLU fields [Karnewar et al., 2022], SIREN [Sitzmann et al., 2020], WIRE [Saragadam et al., 2023], and Instant NGP [Müller et al., 2022]. For fair comparisons under similar bit rates, we modify these baseline models to match our bit rates by decreasing the resolution of feature grids (ReLU, Instant NGP) and/or reducing the number of hidden layers/features (Instant NGP, WIRE, SIREN). The 22 non-texture images in our evaluation dataset are employed.

Results and discussion

As summarized in Table 2, our Gaussian-based representation Image-GS achieves an average performance of 32.19 (PSNR), 0.89 (SSIM), 0.12 (LPIPS), and 0.09 (FLIP), significantly outperforming all baseline methods despite a smaller memory footprint. Figure 7 shows several visual examples. At an ultra-low bit rate of 0.244 bpp, all baselines show varying degrees of image distortion and artifacts. For instance, decreasing the resolution of grid-based methods (ReLU, Instant NGP) leads to block artifacts because feature vectors are interpolated at sparser locations. Implicit functions (SIREN, WIRE) exhibit artifacts such as ringing or blurring, which become more pronounced after reducing the base network parameters. By contrast, Image-GS exhibits far fewer visible artifacts.

Figure 5. Qualitative comparison with GPU texture compressors. For comparisons under similar compression rates, we use the mipmap level 2 for both BC1 and BC7 to match their bit rates to ours. The bottom-right insets visualize the corresponding error images, with brighter colors indicating higher errors.

4.5. Comparison with Texture Compression Methods

Baseline methods

We compare to two BCx variants designed for RGB texture compression, BC1 (highest settings) and BC7 (quality 0.25), using their AMD Compressonator [AMD, 2024] implementations. Since these settings compress no further than 4 bpp, we use the compressed images at mipmap level 2 for both algorithms and upsample them back to the original resolution using bilinear interpolation to match our 0.244 bpp for fair comparisons. The 8 texture maps in our evaluation dataset are employed.

Results and discussion

As summarized in Table 3, our Gaussian-based representation Image-GS achieves an average performance of 33.03 (PSNR), 0.89 (SSIM), 0.16 (LPIPS), and 0.08 (FLIP), consistently outperforming the two baseline methods despite a smaller memory footprint. Figure 5 presents visual comparisons of Image-GS versus BC1 and BC7. The error maps indicate that Image-GS shows visibly fewer distortions relative to the uncompressed reference images. More samples can be found in the supplementary texture figures in the appendix.

Table 3. Quantitative comparison with GPU texture compressors. All metrics are computed on textures rendered at 2K×2K resolution. We use the mipmap level 2 for BC1/BC7 to match their bit rates to ours.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FLIP ↓ | Bit Rate ↓ |
| --- | --- | --- | --- | --- | --- |
| BC1 | 28.86 | 0.8714 | 0.2369 | 0.1012 | 0.250 bpp |
| BC7 | 27.31 | 0.8425 | 0.3006 | 0.1054 | 0.253 bpp |
| Ours | 33.03 | 0.8922 | 0.1591 | 0.0757 | 0.244 bpp |

5. Potential Applications

5.1. Adaptive Image Representation for Machine Vision

Deep-learning techniques have found tremendous success across a wide range of machine vision tasks, such as object tracking, facial recognition, and depth estimation. These achievements typically require high-resolution images as inputs for large neural network models to extract relevant information. While images represent visual information using a uniform grid structure, the task-related information often resides in only a few scattered local image regions exhibiting distinct features. Such a spatially uniform representation of visual inputs results in an inefficient usage of processing time and energy. By contrast, Image-GS adaptively distributes representation resources across the image domain, with more bits allocated to image regions showing fine details. Compared to standard images, our Gaussian-based representation offers a more flexible and efficient encoding of visual inputs for machine vision models to operate on. Directly running machine vision models on Image-GS-represented images may improve their computation efficiency.

5.2. Resource-Adaptive Image Representation

The only trainable parameters in Image-GS are the ones associated with the 2D Gaussians. Consequently, its memory and computation requirements solely depend on and linearly scale with the number of 2D Gaussians. This provides a flexible way to trade off between visual fidelity and run-time efficiency. Moreover, our progressive optimization strategy with error-guided Gaussian addition produces a sequence of Image-GS-represented images at varying bit rates during training, which forms a natural level of detail stack for the target image. These properties of Image-GS are useful in situations where computational resources or network bandwidths are limited, such as mobile computing or web album streaming. While conventional compression algorithms may uniformly degrade the visual quality across the entire image domain, Image-GS is capable of maximizing the usage of available resources for optimized visual quality.

5.3. Image Restoration and Enhancement

The inherent low-frequency nature of Gaussian functions makes our Image-GS representation robust to various high-frequency image distortions and artifacts caused by JPEG compression (blockiness, ringing, and contouring), quantization errors (color banding and aliasing), and transmission over noisy channels and low-light imagery (salt and pepper noise). We empirically observe in our experiments that, when Image-GS is used to represent images containing such high-frequency artifacts, it effectively eliminates them during optimization and produces artifact-free renderings. These observations suggest that Image-GS has the potential to accomplish certain image restoration and enhancement tasks.

6. Limitations and Future Work

Hierarchical spatial guidance during optimization

While Image-GS is designed to be content-adaptive, numerical optimization algorithms, such as stochastic gradient descent, sometimes have trouble shifting the spatial distribution of Gaussians toward the global optimum. Following the acceleration technique in Section 3.4, we plan to introduce a dynamic BSP tree that guides the spatial allocation of Gaussians. During training, the tree structure and Gaussian parameters can be jointly optimized by formulating and solving an integer programming problem similar to [Martel et al., 2021].

Dynamic visual content

In this work, we demonstrate that images can be adaptively and efficiently represented with an explicit basis of anisotropic 2D Gaussians. Motivated by recent research that incorporates dynamics into Gaussian-based 3D scene representations [Luiten et al., 2023; Stavros et al., 2024], we plan to apply our method to represent videos by additionally modeling the movements of 2D Gaussians within the image plane. We envision that this extension will benefit graphics applications such as video streaming.

7. Conclusion

In this paper, we proposed Image-GS, a flexible, compact, and content-adaptive image representation based on anisotropic, colored 2D Gaussians. Image-GS has high memory & computation efficiency, supports fast random access, and offers a natural level of detail through progressive optimization. The content-adaptive nature of Image-GS enables it to wisely allocate resources based on the signal complexity of local image regions and preserve fine image details with higher fidelity than alternative methods. Through a series of quantitative comparisons with recent neural image representations and industry-standard texture compression algorithms, we validated the visual fidelity and memory efficiency of Image-GS. We hope this research will establish new grounds for developing new representations for visual data.

References

  • AMD [2024] 2024. AMD Compressonator. https://gpuopen.com/compressonator/.
  • BC [2024] 2024. Texture Block Compression in Direct3D 11. https://learn.microsoft.com/en-us/windows/win32/direct3d11/texture-block-compression-in-direct3d-11.
  • Alakuijala et al. [2019] Jyrki Alakuijala, Ruud Van Asseldonk, Sami Boukortt, Martin Bruse, Iulia-Maria Comșa, Moritz Firsching, Thomas Fischbacher, Evgenii Kliuchnikov, Sebastian Gomez, Robert Obryk, et al. 2019. JPEG XL next-generation image compression architecture and coding tools. In Applications of digital image processing XLII, Vol. 11137. SPIE, 112–124.
  • Andersson et al. [2020] Pontus Andersson, Jim Nilsson, Tomas Akenine-Möller, Magnus Oskarsson, Kalle Åström, and Mark D Fairchild. 2020. FLIP: A Difference Evaluator for Alternating Images. Proc. ACM Comput. Graph. Interact. Tech. 3, 2 (2020), 15–1.
  • Antonini et al. [1992] Marc Antonini, Michel Barlaud, Pierre Mathieu, and Ingrid Daubechies. 1992. Image coding using wavelet transform. IEEE Trans. Image Processing 1 (1992), 20–5.
  • Ballé et al. [2018] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. 2018. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436 (2018).
  • Belhe et al. [2023] Yash Belhe, Michaël Gharbi, Matthew Fisher, Iliyan Georgiev, Ravi Ramamoorthi, and Tzu-Mao Li. 2023. Discontinuity-Aware 2D Neural Fields. ACM Transactions on Graphics (TOG) 42, 6 (2023), 1–11.
  • Campbell et al. [1986] Graham Campbell, Thomas A DeFanti, Jeff Frederiksen, Stephen A Joyce, Lawrence A Leske, John A Lindberg, and Daniel J Sandin. 1986. Two bit/pixel full color encoding. ACM SIGGRAPH Computer Graphics 20, 4 (1986), 215–223.
  • Chen and Wang [2024] Guikun Chen and Wenguan Wang. 2024. A survey on 3d gaussian splatting. arXiv preprint arXiv:2401.03890 (2024).
  • Chen et al. [2021] Yinbo Chen, Sifei Liu, and Xiaolong Wang. 2021. Learning continuous image representation with local implicit image function. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8628–8638.
  • Chen et al. [2018] Yue Chen, Debargha Mukherjee, Jingning Han, Adrian Grange, Yaowu Xu, Zoe Liu, Sarah Parker, Cheng Chen, Hui Su, Urvang Joshi, et al. 2018. An overview of core coding tools in the AV1 video codec. In 2018 picture coding symposium (PCS). IEEE, 41–45.
  • Cheng [2024] Chang-Chieh Cheng. 2024. Image representation and reconstruction by compositing Gaussian ellipses. IET Image Processing 18, 2 (2024), 493–506.
  • Delp and Mitchell [1979] Edward Delp and O Mitchell. 1979. Image compression using block truncation coding. IEEE transactions on Communications 27, 9 (1979), 1335–1342.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
  • Gepperth and Pfülb [2021] Alexander Gepperth and Benedikt Pfülb. 2021. Image modeling with deep convolutional gaussian mixture models. In 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–9.
  • Huang et al. [2024] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2024. 2D Gaussian Splatting for Geometrically Accurate Radiance Fields. arXiv preprint arXiv:2403.17888 (2024).
  • Karnewar et al. [2022] Animesh Karnewar, Tobias Ritschel, Oliver Wang, and Niloy Mitra. 2022. Relu fields: The little non-linearity that could. In ACM SIGGRAPH 2022 Conference Proceedings. 1–9.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics (TOG) 42, 4 (2023), 1–14.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Kopanas et al. [2022] Georgios Kopanas, Thomas Leimkühler, Gilles Rainer, Clément Jambon, and George Drettakis. 2022. Neural point catacaustics for novel-view synthesis of reflections. ACM Transactions on Graphics (TOG) 41, 6 (2022), 1–15.
  • Kopanas et al. [2021] Georgios Kopanas, Julien Philip, Thomas Leimkühler, and George Drettakis. 2021. Point-Based Neural Rendering with Per-View Optimization. In Computer Graphics Forum, Vol. 40. Wiley Online Library, 29–43.
  • Li et al. [2020] Mengtian Li, Yu-Xiong Wang, and Deva Ramanan. 2020. Towards streaming perception. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, 473–488.
  • Lombardi et al. [2019] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. 2019. Neural volumes: learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–14.
  • Luiten et al. [2023] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. 2023. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713 (2023).
  • Martel et al. [2021] Julien NP Martel, David B Lindell, Connor Z Lin, Eric R Chan, Marco Monteiro, and Gordon Wetzstein. 2021. Acorn: adaptive coordinate networks for neural scene representation. ACM Transactions on Graphics (TOG) 40, 4 (2021), 1–13.
  • Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. 2022. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG) 41, 4 (2022), 1–15.
  • Niemeyer et al. [2020] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. 2020. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 3504–3515.
  • Nystad et al. [2012] Jörn Nystad, Anders Lassen, Andy Pomianowski, Sean Ellis, and Tom Olson. 2012. Adaptive scalable texture compression. In Proceedings of the Fourth ACM SIGGRAPH/Eurographics Conference on High-Performance Graphics. 105–114.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019), 8026–8037.
  • Saragadam et al. [2023] Vishwanath Saragadam, Daniel LeJeune, Jasper Tan, Guha Balakrishnan, Ashok Veeraraghavan, and Richard G Baraniuk. 2023. Wire: Wavelet implicit neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18507–18516.
  • Sitzmann et al. [2020] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. 2020. Implicit neural representations with periodic activation functions. Advances in neural information processing systems 33 (2020), 7462–7473.
  • Sitzmann et al. [2019a] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. 2019a. Deepvoxels: Learning persistent 3d feature embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2437–2446.
  • Sitzmann et al. [2019b] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. 2019b. Scene representation networks: Continuous 3d-structure-aware neural scene representations. Advances in Neural Information Processing Systems 32 (2019).
  • Stavros et al. [2024] Stavros Stavros, Tobias Zirr, Alexandr Kuznetsov, Georgios Kopanas, and Anton Kaplanyan. 2024. N-Dimensional Gaussians for Fitting of High Dimensional Functions. In ACM SIGGRAPH 2024 Conference Proceedings. 1–9.
  • Ström and Akenine-Möller [2005] Jacob Ström and Tomas Akenine-Möller. 2005. i PACKMAN: High-quality, low-complexity texture compression for mobile phones. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware. 63–70.
  • Ström and Pettersson [2007] Jacob Ström and Martin Pettersson. 2007. ETC 2: texture compression using invalid combinations. In Graphics Hardware, Vol. 7. 49–54.
  • Sun et al. [2024] Jiakai Sun, Han Jiao, Guangyuan Li, Zhanjie Zhang, Lei Zhao, and Wei Xing. 2024. 3dgstream: On-the-fly training of 3d gaussians for efficient streaming of photo-realistic free-viewpoint videos. arXiv preprint arXiv:2403.01444 (2024).
  • Sun et al. [2019] Jianjun Sun, Yan Zhao, and Shigang Wang. 2019. Image compression using GMM model optimization. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1797–1801.
  • Sun et al. [2021] Jianjun Sun, Yan Zhao, Shigang Wang, and Jian Wei. 2021. Image compression based on Gaussian mixture model constrained using Markov random field. Signal Processing 183 (2021), 107990.
  • Tancik et al. [2020] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. 2020. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in neural information processing systems 33 (2020), 7537–7547.
  • Vaidyanathan et al. [2023] Karthik Vaidyanathan, Marco Salvi, Bartlomiej Wronski, Tomas Akenine-Moller, Pontus Ebelin, and Aaron Lefohn. 2023. Random-Access Neural Compression of Material Textures. ACM Transactions on Graphics (TOG) 42, 4 (2023), 1–25.
  • Wallace [1992] Gregory K Wallace. 1992. The JPEG still picture compression standard. IEEE transactions on consumer electronics 38, 1 (1992), xviii–xxxiv.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 4 (2004), 600–612.
  • Welch [1985] Terry A Welch. 1985. High speed data compression and decompression apparatus and method. US Patent 4,558,302.
  • Yifan et al. [2019] Wang Yifan, Felice Serena, Shihao Wu, Cengiz Öztireli, and Olga Sorkine-Hornung. 2019. Differentiable surface splatting for point-based geometry processing. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition. 586–595.
Figure 6. Illustration of the trade-off between visual quality and memory/computation efficiency with Image-GS. The bit rate of Image-GS is adjusted by controlling the maximum number of 2D Gaussians that are allowed. Notably, our progressive optimization with error-guided Gaussian addition generates a sequence of Image-GS-represented images at varying bit rates along the way, which forms a natural level of detail for the target image.
Figure 7. Qualitative comparison with previous neural image representations. Our evaluation dataset covers various image types, including photographs, watercolor/oil paintings, anime posters, and vector-style images. Notably, the content-adaptive nature of Image-GS enables it to wisely allocate resources based on the complexity of local image regions and better preserve fine image details than the baseline methods under similar memory consumption. The model sizes (in KB) for ReLU fields, SIREN, WIRE, Instant NGP, and our Image-GS are 132, 135, 134, 141, and 128, respectively.