Minimalist and High-Quality Panoramic Imaging with PSF-aware Transformers

Qi Jiang1, Shaohua Gao1, Yao Gao, Kailun Yang2, Zhonghua Yi, Hao Shi, Lei Sun, and Kaiwei Wang2

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No. 12174341 and in part by Hangzhou SurImage Technology Company Ltd. Q. Jiang, S. Gao, Y. Gao, Z. Yi, H. Shi, L. Sun, and K. Wang are with the State Key Laboratory of Modern Optical Instrumentation and the National Engineering Research Center of Optical Instrumentation, Zhejiang University, Hangzhou 310027, China. K. Yang is with the School of Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University, Changsha 410082, China. 1Equal contribution. 2Corresponding authors: Kaiwei Wang and Kailun Yang. (E-mail: wangkaiwei@zju.edu.cn, kailun.yang@hnu.edu.cn.)

Abstract

High-quality panoramic images with a Field of View (FoV) of 360° are essential for contemporary panoramic computer vision tasks. However, conventional imaging systems come with sophisticated lens designs and heavy optical components. This disqualifies their usage in many mobile and wearable applications where thin and portable, minimalist imaging systems are desired. In this paper, we propose a Panoramic Computational Imaging Engine (PCIE) to achieve minimalist and high-quality panoramic imaging. With fewer than three spherical lenses, a Minimalist Panoramic Imaging Prototype (MPIP) is constructed based on the design of the Panoramic Annular Lens (PAL), but with low-quality imaging results due to aberrations and small image plane size. We propose two pipelines, i.e., Aberration Correction (AC) and Super-Resolution and Aberration Correction (SR $\&$ AC), to solve the image quality problems of MPIP, with imaging sensors of small and large pixel size, respectively. To leverage the prior information of the optical system, we propose a Point Spread Function (PSF) representation method to produce a PSF map as an additional modality. A PSF-aware Aberration-image Recovery Transformer (PART) is designed as a universal network for the two pipelines, in which the self-attention calculation and feature extraction are guided by the PSF map. We train PART on synthetic image pairs from simulation and put forward the PALHQ dataset to fill the gap of real-world high-quality PAL images for low-level vision. A comprehensive variety of experiments on synthetic and real-world benchmarks demonstrates the impressive imaging results of PCIE and the effectiveness of the PSF representation. We further deliver heuristic experimental findings for minimalist and high-quality panoramic imaging, in terms of the choices of prototype and pipeline, network architecture, training strategies, and dataset construction. Our dataset and code will be available at https://github.com/zju-jiangqi/PCIE-PART.

Index Terms:

Panoramic imaging, minimalist optical systems, computational imaging, vision transformer, point spread function.

I Introduction

Figure 1: Illustration of the proposed MPIP and its key issue of low image quality, which is addressed by the PSF-aware transformer PART. (a) Minimalist Panoramic Imaging Prototype (MPIP); (b) Comparison between real products of conventional panoramic imaging systems and PAL-based MPIP; (c) Low-quality image captured by MPIP; (d) High-quality image recovered by PART. In this way, we realize minimalist and high-quality panoramic imaging with PSF-aware transformers.

Image processing of panoramic images with an ultra-wide Field of View (FoV) of 360° is growing popular for achieving a holistic understanding of the entire surrounding scene [1, 2, 3, 4]. While the 360° panoramas suffer from inherent defects of low angular resolutions and severe geometric image distortions, a variety of low-level vision works is conducted in terms of image super-resolution [5, 6, 7] and image rectification [8, 9, 10], to produce high-quality panoramas for photography and down-stream tasks. However, the image blur caused by optical aberrations of the applied lens is seldom explored.

Most contemporary works on panoramic images assume that the optical system is aberration-free, so that the imaging result is clear and sharp. Conventional panoramic optical systems, although widely applied, come with notoriously sophisticated lens designs composed of multiple sets of lenses with complex surface types [11, 12, 13] to reach high imaging quality. However, this assumption often fails as the demand for thin, portable imaging systems, i.e., Minimalist Optical Systems (MOS), grows stronger in mobile and wearable applications [14]. Without sufficient lens groups for aberration correction, aberration-induced image blur is inevitable for MOS. In this case, the imaging quality drops significantly, often catastrophically, and the unsatisfactory imaging performance disqualifies MOS from upper-level applications. This raises the question of whether a fine balance can be struck between high-quality panoramic imaging and minimalist panoramic optical systems.

With the rapid development of digital image processing, Computational Imaging (CI) methods for MOS [15, 16, 17] appear as a preferred solution to this issue. These methods often propose optical designs with few necessary optical components to meet the basic demands of specific applications, e.g., the FoV, depth of field, and focal length, followed by an image post-processing model to recover the aberration-image. Recent research works [18, 19, 20] further design end-to-end deep learning frameworks for joint optimization of optical systems and image recovery networks. In this paper, based on the idea of computational imaging, we propose Panoramic Computational Imaging Engine (PCIE), a framework for minimalist and high-quality panoramic imaging, to solve the trade-off between high-quality panoramic imaging and minimalist panoramic optical systems as a whole, without sitting on only one of its sides.

Motivated by modern panoramic lens designs [12, 21, 22], PCIE builds on a Minimalist Panoramic Imaging Prototype (MPIP) shown in Fig. 1(a), which is composed of an essential panoramic head for 360° panoramic imaging and a relay lens group for aberration correction. Specifically, we select the structure of the Panoramic Annular Lens (PAL) [23, 24], a more compact solution to 360° panoramas, as an example for MPIP, where a catadioptric PAL head replaces the complex lens group [11] in the conventional fisheye lens [25, 26]. To achieve a minimalist design, the proposed MPIP is composed of $1$ spherical lens for the PAL head and $1$ or $2$ simple spherical lenses for the relay lens group, which can image over 360° FoV with only $40\%$ of the number of lenses and $60\%$ of the volume of conventional panoramic imaging systems, as shown in Fig. 1(b). However, as illustrated in Fig. 1(c), the uncorrected optical aberrations and the limited image plane size lead to image corruptions, i.e., aberration-induced spatially-variant image blur and low imaging resolution.

To address the issues of MPIP, we exploit the Point Spread Function (PSF) information available from optical design and propose the PSF-aware Aberration-image Recovery Transformer (PART): a transformer-based low-quality image recovery paradigm for MPIP. Different from previous transformer baselines, e.g., SwinIR [27], PART exploits the PSF, the forward function characterizing the aberration-induced blur, to attain better results. A PSF representation method is delivered to represent PSF kernels in the form of a feature map, which serves as an additional modality for the network. Based on the representation, we design two PSF-aware mechanisms inspired by the physical meanings of the aberration-induced blur.

Specifically, the PSF-aware Feature Modulator (PFM) builds on the idea of modeling the inverse process of degradation convolution of PSFs, where pixel-adaptive convolution kernels are learned from the PSF representation to modulate the feature map gradually during recovery. PFM is a plug-and-play PSF-aware mechanism that can be inserted into other recovery models. In addition, PSF-aware Mix-Attention Block (PMAB) is proposed as the basic unit of PART, which comprises: (1) Vanilla window attention of SwinIR [27] for capturing long-range dependency; (2) PSF-aware Varied-Size Attention (P-VSA), where diverse windows of varied sizes and locations are learned from the PSF representation to provide dynamic receptive fields, motivated by the varied PSF sizes in different FoVs; (3) PFM of small kernel size for enhancing the feature extraction of local details. With PART, the low-quality image captured by MPIP can be smoothly recovered (see Fig. 1(d)) for minimalist and high-quality panoramic imaging.

To facilitate the training of PART, wave-based imaging simulation with random perturbation [28] is utilized for generating clear-blur image pairs. To fill the gap of ground-truth images of PAL, we record a high-quality PAL images dataset named PALHQ through a well-designed PAL in varied scenes. Based on PALHQ, we set up two tasks to formalize the key issue of low-quality MPIP images: (1) Aberration Correction (AC) of high-resolution images taken by sensors with small pixel size, and (2) Super-Resolution and Aberration Correction (SR $\&$ AC) of low-resolution images from sensors with large pixel size. Then, representative models of image super-resolution (SR) [27, 29, 30, 31, 32, 33, 34], image deblurring (Deblur) [35, 36, 37, 38], and image restoration with PSF-aware mechanisms (PSF-aware) [28] are evaluated, where PCIE enables all models to produce impressive panoramic imaging results.

Furthermore, we manufacture an MPIP sample with better image quality and capture the real-world dataset RealMPAL to benchmark models on real-world scenes. Experimental results reveal that PFM enhances the performance of the baselines (see Fig. 2) and PART sets the state of the art on both synthetic and real-world benchmarks, where the PSF representation plays a significant role to enable effective PSF-aware mechanisms. We also conduct extensive experiments to investigate the potential of GAN-based training strategies and the effectiveness of PALHQ in PCIE. The generative model appears to be competitive for generating more realistic details if the artifacts can be well suppressed. Additionally, PALHQ serves as the cornerstone of PCIE for training a robust model for annular images. At a glance, we deliver the following contributions:

• We propose the Panoramic Computational Imaging Engine (PCIE), a novel framework for minimalist and high-quality panoramic imaging, as shown in Fig. 3, where a Minimalist Panoramic Imaging Prototype (MPIP) is designed for 360° panoramic imaging with an essential panoramic head and simple relay lens group.

• We raise two pipelines to process low-quality MPIP images: Aberration Correction (AC) and Super-Resolution and Aberration Correction (SR $\&$ AC). The real-world panoramic image datasets PALHQ and RealMPAL of high-quality and low-quality are recorded respectively for benchmarking the two pipelines, which are the first real-world PAL datasets for low-level vision.

• We design a PSF representation method to represent the intensity and size distributions of PSF kernels in the form of a feature map, i.e., PSF map, which serves as an additional modality for the pipelines.

• We further introduce the PSF-aware Aberration-image Recovery Transformer (PART) to process the low-quality images of MPIP, where the PSF-aware mechanisms guided by the PSF map are explored to enhance the recovery performance.

The experimental exploration of PCIE provides heuristic findings in terms of optical design, network architecture, training strategies, and dataset construction. We hope that PCIE can bring inspiration in both hardware system and algorithm aspects, for minimalist and high-quality panoramic imaging.

II Related Work

II-A Image Processing of Panoramic Images

Recent research interest in panoramic images is booming for immersive visual experiences [11, 39]. Semantic segmentation [1, 40], depth estimation [2, 3], and visual Simultaneous Localization and Mapping (SLAM) [41, 42] are widely explored on panoramic images for a holistic understanding of the surrounding scene. To this intent, high-quality panoramic images are urgently required for robust performance. A considerable amount of work is conducted to improve the image quality of panoramic images, such as super-resolution [5, 6, 7] and distortion correction [8, 9, 10].

However, the above image processing of panoramic images is based on the aberration-free images captured by the conventional panoramic lens, where multiple sets of lenses with complex surface types [11, 12, 13] are applied for high-quality imaging. This work focuses on capturing panoramic images with a Minimalist Optical System (MOS) composed of much fewer lenses for volume-limited applications, where we process the aberration images via computational methods.

II-B Computational Imaging for Minimalist Optical System

The aberration-induced image blur is inevitable for MOS due to insufficient lens groups for aberration correction. Recently, computational imaging methods for MOS appear as a preferred solution to this issue, where optical designs with few necessary optical components are equipped with image recovery pipelines for both minimalist and aberration-free imaging [16, 17, 43]. Some research works [18, 44, 45] even design end-to-end deep learning frameworks for joint optimization of MOS and post-processing networks to achieve the best match between MOS and recovery models to further improve the final imaging performance.

However, computational imaging for minimalist panoramic systems is scarcely explored. In a preliminary study, Jiang et al. [28] propose an Annular Computational Imaging (ACI) framework to break the optical limit of minimalist Panoramic Annular Lens (PAL), where the image processing is conducted on unfolded PAL images. To further develop a general framework for minimalist and high-quality panoramic imaging, this work fills the gap in high-quality PAL image datasets and designs the Panoramic Computational Imaging Engine (PCIE) directly on annular PAL images, in terms of both optical design and aberration-images recovery.

II-C Image Recovery of Aberration Images

The aberration-induced blur is always spatially-variant, i.e., Linear Shift Variant (LSV), due to the uneven thicknesses of the lenses. Several efforts have been made for the LSV system, spanning patch-wise restoration [46, 47], experimental PSF calibration and non-blind deconvolution [43, 48], and low-rank decomposition [49, 50], based on the degradation model of aberration-images [51, 52, 53].

Recent works tend to adopt the data-driven learning-based image restoration networks [54, 38]. These methods typically use a U-shaped network with an encoder-decoder structure [16, 55, 56] to achieve more efficient and robust recovery results, which can also be easily inserted into an end-to-end framework for joint optimization. To break the bottleneck of data-driven methods under scarce data, the PSF information is explored to design physical-informed networks, where model-based methods are characterized by Convolutional Neural Networks (CNNs) for learning ill-posed terms [14, 57]. Explorations have also been made in [27, 35, 36, 58] to apply transformers for solving the inverse problem, leveraging its strong long-range modeling capabilities.

Differently, we make a pioneering effort and investigate the potential of transformer-based SR models in aberration correction rather than conventional Deblur models. Then, the PSFs are transformed into an additional modality of the aberration image, based on which we design PSF-aware mechanisms for achieving better results. The proposed PSF-aware Aberration-image Recovery Transformer (PART) is a successful attempt to engage PSF information in the representation learning stage of SR models for recovering aberration images.

The overview of PCIE is shown in Fig. 3. It provides a powerful framework for minimalist and high-quality panoramic imaging, where optical design (detailed in Sec. III) and learning-based model (presented in Sec. IV) are intertwined to achieve impressive imaging results.

III Minimalist Panoramic Imaging Prototype

In this section, we set up a universal prototype for minimalist panoramic imaging systems based on modern panoramic lens designs (Sec. III-A). To address the issues induced by the reduced lens numbers and limited image plane size, two settings of tasks and benchmarks are defined in Sec. III-B and Sec. III-C, respectively. In Sec. III-D, we describe the constructed imaging simulation model to generate synthetic image pairs for training learning-based methods.

III-A Optical Design

To boost scene understanding with larger FoV, panoramic optical systems are emerging, including fisheye optical systems, refractive panoramic systems, panoramic annular optical systems, etc. [11]. In most modern designs of panoramic lenses [12, 21, 22], a panoramic head is applied for collecting incident rays of 360° FoV, while a set of relay lenses is designed to bend the rays and correct the aberrations. Based on the structure, we propose the Minimalist Panoramic Imaging Prototype (MPIP), including an essential panoramic head and a simple relay lens group, as shown in Fig. 1(a).

Specifically, we adopt a more compact and efficient solution, i.e. Panoramic Annular Lens (PAL) [23, 24], in MPIP samples, where a catadioptric PAL head is equipped for 360° annular imaging. For minimalist design and convenient manufacture, spherical lenses are applied in the relay lens group and the number is reduced to fewer than $3$ . To illustrate the imaging results of different relay lens groups, we define MPIP-P1 and MPIP-P2 in Fig. 4(a), whose relay lens group is composed of two lenses and a single lens, respectively.

The lack of enough lenses for aberration correction makes the imaging point spread from an ideal point, inducing spatially-variant PSFs with large kernel sizes, as shown in Fig. 4(b). The average geometric spot radius of MPIP-P1 is $13.78{\mu}m$ , whereas that of MPIP-P2 is $46.26{\mu}m$ . As a result, the captured images of MPIP suffer from spatially-variant aberration-induced blur, especially for MPIP-P2 with fewer lenses, as shown in Fig. 4(c).

III-B Definition of Tasks

In addition to the uncorrected optical aberrations, the limited image plane size due to the small aperture of MPIP presents the issue of image resolution. To fit the small image plane of the MPIP, an image sensor with a smaller pixel size can be applied to maintain high resolution, but it makes the system more sensitive to aberration-induced blur. As shown in Fig. 5(a), a diffused optical spot of fixed physical size affects more pixels on a sensor with smaller pixel size and higher resolution. The opposite solution with large pixel sizes is less sensitive to the diffused spot, but the reduced image resolution also degrades the images, which is especially harmful to panoramic images with large FoV [6].
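The sensitivity argument above can be made concrete with a little arithmetic. The sketch below converts the average geometric spot radii reported in Sec. III-A into a footprint measured in pixels. The two pixel pitches (1.5 μm and 6.0 μm) are hypothetical values chosen purely for illustration and are not the sensors used in this work; the function name is likewise illustrative.

```python
def spot_diameter_in_pixels(spot_radius_um: float, pixel_pitch_um: float) -> float:
    """Diameter of a blur spot expressed in pixels for a given pixel pitch."""
    return 2.0 * spot_radius_um / pixel_pitch_um


# Average geometric spot radii reported for the two prototypes (Sec. III-A).
SPOT_RADIUS_UM = {"MPIP-P1": 13.78, "MPIP-P2": 46.26}

# Hypothetical pixel pitches, for illustration only (assumption).
PIXEL_PITCH_UM = {"small pixel": 1.5, "large pixel": 6.0}

for name, radius in SPOT_RADIUS_UM.items():
    for label, pitch in PIXEL_PITCH_UM.items():
        d = spot_diameter_in_pixels(radius, pitch)
        print(f"{name}, {label} ({pitch} um): spot spans about {d:.1f} px")
```

For a fixed physical spot, the footprint in pixels scales inversely with the pixel pitch, which is why the small-pixel sensor preserves resolution but sees a much wider blur in pixel units.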

To address this dilemma, we propose two pipelines for solving the contradictory problems, as shown in Fig. 5(b), where a learning-based model is applied to process different image recovery tasks. For image sensors with smaller pixel sizes, we define the Aberration Correction (AC) task, where the goal is to recover a clear image ${x}_{hq}\in\mathbb{R}^{H{\times}W{\times}3}$ from a high-resolution input aberration-image ${x}_{ab}\in\mathbb{R}^{H{\times}W{\times}3}$ . For image sensors with larger pixel sizes, the Super-Resolution and Aberration Correction (SR $\&$ AC) task is raised to recover a high-resolution aberration-free image ${x}_{hq}\in\mathbb{R}^{H{\times}W{\times}3}$ from a low-resolution input aberration-image ${x}_{{lq}}{\in}\mathbb{R}^{\frac{{H}}{{s}}\times\frac{{W}}{{s}}\times 3}$ , where $s$ is the scale factor of SR.

III-C PALHQ: Established Dataset of High-Quality PAL Images

The lack of high-quality image datasets for PAL comes as a bottleneck to the above tasks. A previous work on CI of PAL, ACI [28], unfolds the annular PAL images into perspective ones to utilize publicly available datasets, e.g., DIV2K [59]. However, the asymmetrical interpolation during unfolding induces extra image degradation, which further complicates the image degradation factors for MPIP. In addition, the annular image is more appealing for the simulation of aberrations in the original image plane and necessary for some vision tasks like PAL-based SLAM [41, 42]. In the case of benchmarks for processing panoramic images, e.g., the ODI-SR dataset [60] and the SUN360 panorama dataset [61], which are taken via fisheye cameras, the imaging process is also quite different from that of PAL [11]. These concerns raise an urgent request for high-quality panoramic annular image datasets.

To this intent, we propose PALHQ, a dataset of high-quality PAL images, to facilitate network training and evaluation of PAL-based low-level vision tasks. A well-designed PAL of $11$ lenses and a Sony $\alpha$ 6600 camera are applied to capture high-resolution PAL images with negligible primary aberrations. PALHQ contains $550$ clear PAL images with a resolution of $3152\times 3152$ , covering rich and varied scenes of indoor, natural, urban, campus, and scenic spots. We divide PALHQ into $500$ images for the training set and $50$ images for the validation set (refer to the appendix for sample images of PALHQ). In PCIE, we benchmark both AC and SR $\&$ AC on PALHQ, where the corresponding aberration images are generated by the imaging simulation model depicted below. Furthermore, PALHQ can also be transformed into unfolded panoramas via equirectangular projection (ERP), which can support various panoramic image processing applications.
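To illustrate what unfolding an annular image involves, the following minimal sketch maps a pixel of an unfolded panorama back to coordinates in the annular image. It assumes an idealised geometry in which the radius varies linearly with the panorama row and the outer ring corresponds to the top row; the real mapping depends on the projection of the specific PAL. The function name and the parameters `cx`, `cy`, `r_in`, and `r_out` are hypothetical.

```python
import math


def unfold_coords(row, col, out_h, out_w, cx, cy, r_in, r_out):
    """Map a pixel (row, col) of the unfolded panorama to annular-image (x, y).

    Assumption: radius is linear in the row index, outer ring at row 0.
    """
    phi = 2.0 * math.pi * col / out_w                 # azimuth angle
    r = r_out - (r_out - r_in) * row / (out_h - 1)    # radius in the annulus
    return cx + r * math.cos(phi), cy + r * math.sin(phi)


# Top-left panorama pixel lands on the outer ring at azimuth 0.
print(unfold_coords(0, 0, 10, 360, 100.0, 100.0, 20.0, 80.0))
```

Because the returned coordinates are generally non-integer, an interpolation step is needed to read the annular image, which is the source of the extra degradation mentioned above.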

III-D Imaging Simulation Model

To quantitatively benchmark the raised two tasks and enable supervised training of learning-based models, paired aberration images and clear images are required. Following previous super-resolution works [62, 63] and CI works [28, 56], we construct an imaging simulation model to generate synthetic aberration-images in batches.

The wave-based simulation pipeline with random perturbation in [28] is adopted to generate multiple aberration distributions directly on clear annular PAL images. Specifically, the clear raw image $R$ is modulated by an optical system and then processed by ISP $\Gamma(\cdot)$ to produce the final imaging result $A$ :

A_{\theta}(x,y)=\Gamma\left[\left(\int{r_{\lambda}}R_{\theta}(x,y)\otimes K_{\theta}(x,y,\lambda)\,d\lambda\right)\downarrow+N\right],

(1)

where ${r_{\lambda}}$ is the wave response of the sensor. The noise $N$ and the sampling process $(\cdot)\downarrow$ of the image sensor are also included in the model. We divide the image into patches for patch-wise convolution with PSFs $K_{\theta}(x,y,\lambda)$ under different FoV $\theta$ . Different from [28], the division of FoV is centrosymmetric for annular images as is shown in Fig. 4(b). Through scalar diffraction integral [64], $K_{\theta}(x,y,\lambda)$ is calculated based on the wavefront $\Phi_{\theta}(x^{\prime},y^{\prime},\lambda)$ on exit pupil plane, which is described by Zernike polynomials [65] mathematically:

\Phi_{\theta}(x^{\prime},y^{\prime},\lambda)=\sum_{n,m}{C^{m}_{n}}(\theta,\lambda){Z^{m}_{n}}(x^{\prime},y^{\prime}),

(2)

where $C^{m}_{n}(\theta,\lambda)$ denotes the Zernike coefficients under FoV $\theta$ and wavelength $\lambda$ , and $Z^{m}_{n}$ refers to the Zernike polynomials of the coordinate $(x^{\prime},y^{\prime})$ on the exit pupil. The combination of different $m$ and $n$ represents different orders. Finally, we apply the random disturbance strategy in [28] to perturb the ideal $C^{m}_{n}(\theta,\lambda)$ obtained from the Zemax software, generating synthetic aberration images with diverse aberration distributions.
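A heavily simplified, single-patch sketch of Eq. (1) is given below. It assumes a grayscale image and a single wavelength (so the spectral integral collapses to one term), average pooling as the sampling operator, additive Gaussian noise, and clipping to $[0,1]$ as a stand-in for the ISP. All function names are illustrative and are not taken from the released code.

```python
import random


def convolve2d_same(img, kernel):
    """Zero-padded 'same' 2D convolution of a grayscale image with a PSF kernel."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    oy, ox = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    # True convolution: kernel is flipped relative to correlation.
                    yy, xx = y - (j - oy), x - (i - ox)
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += img[yy][xx] * kernel[j][i]
            out[y][x] = acc
    return out


def downsample(img, s):
    """Average-pool by factor s (assumed stand-in for the sensor sampling)."""
    h, w = len(img) // s, len(img[0]) // s
    return [[sum(img[y * s + j][x * s + i] for j in range(s) for i in range(s)) / (s * s)
             for x in range(w)] for y in range(h)]


def simulate_patch(clear, psf, s=1, noise_sigma=0.0, seed=0):
    """Blur one patch with its PSF, sample, add noise, then clip (toy ISP)."""
    rng = random.Random(seed)
    blurred = convolve2d_same(clear, psf)
    sampled = downsample(blurred, s)
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, noise_sigma))) for v in row]
            for row in sampled]


# A point source blurred by a 3x3 box PSF spreads its energy over 9 pixels.
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
box_psf = [[1.0 / 9.0] * 3 for _ in range(3)]
degraded = simulate_patch(impulse, box_psf)
```

In the full pipeline this step would be repeated per patch with the PSF of the corresponding FoV, per wavelength, and with randomly perturbed Zernike coefficients.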

IV Low-Quality MPIP Images Recovery

In this section, we describe the proposed learning-based model to recover low-quality MPIP images, as shown in Fig. 6. The PSF information, characterizing the image degradation process, is represented as the PSF map, detailed in Sec. IV-A, serving as one additional modality for our model. With the PSF map, we design the PSF-aware Feature Modulator (PFM) and the PSF-aware Mix-Attention Block (PMAB), elaborated in Sec. IV-B and Sec. IV-C, respectively. Then, the PSF-aware Aberration-image Recovery Transformer (PART) is established and introduced in Sec. IV-D as a transformer-based paradigm for the raised two tasks.

IV-A The Representation of PSF Information

For non-blind optimization-based recovery methods in aberration correction, e.g. Wiener filter [51], PSFs $K$ of the system are exploited to predict the clear image $x_{hq}$ from the aberration-image $x_{ab}$ by deconvolution. However, this method often fails when the PSFs deviate from the design stage during manufacture and require time-consuming strategies [43, 48] for processing complex spatially-variant blur. Data-driven learning-based models [27, 36], which can be plugged directly into existing end-to-end frameworks of lens design [18, 19, 20], have demonstrated more powerful abilities in image recovery, but may hit a bottleneck when the training data is scarce.

This motivates us to break the bottleneck by utilizing PSFs of the optical system in a learning-based model. The PSFs of $n$ sampled FoVs of the applied MPIP can be calculated based on the wavefront as depicted in Eq. (2):

{K_{i}}(x,y)=\mathcal{S}({\Phi_{i}}(x^{\prime},y^{\prime})),i=1,2,3,\cdots,n.

(3)

The $\mathcal{S}(\cdot)$ denotes scalar diffraction integral (refer to [64] for more details). Previous methods tend to use the kernel size of $K_{i}$ to guide the network [28] or refine the ill-posed term in deconvolution through a learning-based model [57]. Although these attempts can improve the recovery benefiting from the applications of $K_{i}$ , the spatial intensity distribution of PSFs is not fully exploited to guide the deep feature extraction of the general image recovery paradigm.

To this intent, we propose a PSF representation method to produce a PSF map, containing both intensity and size distributions of PSF kernels, which is aligned with the image feature map. We first map the spatial PSFs $K_{i}\in\mathbb{R}^{k_{i}{\times}{k_{i}}{\times}3}$ into the image feature shape ${H{\times}{W}{\times}{C^{\prime}}}$ , where $k_{i}$ is the kernel size of the PSF under the $i$-th FoV and $C^{\prime}$ denotes the channels of mapped PSFs, serving as an additional modality aligned with the aberration-image. As previously shown in [66], the spatial-to-channel arrangement helps transform spatially-variant kernels into a feature map. Similarly, to produce a PSF feature map, the spatial PSFs $K_{i}$ are arranged into the channel dimension. Concretely, for a pixel at $(x,y)$ of the image, we first calculate the vector $\overrightarrow{p}$ from the image center $(0,0)$ to $(x,y)$ , and define the vertical unit vector $\overrightarrow{a}(1,0)$ . The PSF of the corresponding FoV is located by $\left|\overrightarrow{p}\right|/\max\left(\left|{\overrightarrow{p}}\right|\right)$ , and rotated by the angle $\arccos\left(\frac{\vec{p}\cdot\vec{a}}{\left|\vec{p}\right|\cdot\left|\vec{a}\right|}\right)$ , producing the PSF $K_{x,y}$ of the pixel. For memory-friendly computation, we pad all $K_{x,y}$ to a unified size of $\max\limits_{i}k_{i}$ and compress them into ${k^{\prime}\times{k^{\prime}}\times 1}$ via adaptive average pooling:

\hat{K_{x,y}}={\rm{AveragePool}}({\rm{padding}}{(K_{x,y})}),

(4)

where the choice of compressed size $k^{\prime}$ is ablated in Sec. V-F. The $\hat{K_{x,y}}$ is then reshaped into ${1\times 1\times(k^{\prime 2})}$ and inserted into each pixel to produce the PSF feature map $x_{int}\in\mathbb{R}^{H\times{W}\times(k^{\prime 2})}$ . In addition, considering the lost PSF size information during compressing, we also generate the size distribution map $x_{s}\in\mathbb{R}^{H\times{W}\times 3}$ of RGB channels, where the value of each pixel represents the kernel size. Finally, the PSF map $x_{psf}$ is produced via Eq. (5):

x_{psf}={\rm{Concat}}{(x_{int},x_{s})},

(5)

and the visualized PSF map is shown in the appendix. PSF map is an aligned modality of the aberration image characterizing the image degradation over FoVs, based on which we design a PSF-aware transformer, as described in the next subsection.
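The per-pixel construction described in this subsection can be sketched as follows. This is a minimal illustration under several assumptions: PSFs are grayscale (so the size channel has one entry instead of three), rotation uses nearest-neighbour sampling, the signed angle is obtained with `atan2` (the text specifies only the arccos of the normalised dot product), and the maximum radius `r_max` is passed in explicitly. All names are hypothetical.

```python
import math


def pad_to(kernel, size):
    """Zero-pad a square kernel to size x size, keeping it centred."""
    k = len(kernel)
    off = (size - k) // 2
    out = [[0.0] * size for _ in range(size)]
    for j in range(k):
        for i in range(k):
            out[j + off][i + off] = kernel[j][i]
    return out


def adaptive_avg_pool(kernel, out_size):
    """Adaptive average pooling of a square kernel to out_size x out_size."""
    n = len(kernel)
    edges = [((a * n) // out_size, math.ceil((a + 1) * n / out_size))
             for a in range(out_size)]
    out = []
    for r0, r1 in edges:
        row = []
        for c0, c1 in edges:
            vals = [kernel[j][i] for j in range(r0, r1) for i in range(c0, c1)]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def rotate_nearest(kernel, angle):
    """Rotate a square kernel about its centre by `angle` radians."""
    n = len(kernel)
    c = (n - 1) / 2.0
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            u, v = i - c, j - c  # inverse-map output sample into the source
            si = round(cos_a * u + sin_a * v + c)
            sj = round(-sin_a * u + cos_a * v + c)
            if 0 <= si < n and 0 <= sj < n:
                out[j][i] = kernel[sj][si]
    return out


def psf_map_pixel(x, y, cx, cy, r_max, psfs, k_prime):
    """PSF-map vector at pixel (x, y): k'^2 intensity values plus kernel size."""
    px, py = x - cx, y - cy
    r = math.hypot(px, py)
    # Locate the sampled FoV from the normalised radius |p| / max|p|.
    idx = min(len(psfs) - 1, int(r / r_max * len(psfs)))
    angle = math.atan2(py, px)  # signed angle w.r.t. a = (1, 0) (assumption)
    k_max = max(len(k) for k in psfs)
    kern = pad_to(rotate_nearest(psfs[idx], angle), k_max)
    pooled = adaptive_avg_pool(kern, k_prime)
    x_int = [v for row in pooled for v in row]   # reshape to 1 x 1 x k'^2
    x_s = [float(len(psfs[idx]))]                # kernel-size channel
    return x_int + x_s                           # Concat, as in Eq. (5)


# Two toy FoV samples: a sharp central PSF and a wider 3x3 box at the edge.
toy_psfs = [[[1.0]], [[1.0 / 9.0] * 3 for _ in range(3)]]
centre_vec = psf_map_pixel(4, 4, 4.0, 4.0, 4.0, toy_psfs, 3)
edge_vec = psf_map_pixel(8, 4, 4.0, 4.0, 4.0, toy_psfs, 3)
```

Evaluating `psf_map_pixel` at every pixel yields a map of shape $H\times W\times(k^{\prime 2}+1)$ in this grayscale toy setting; in practice the kernels of one FoV ring would be cached rather than recomputed per pixel.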

IV-B PSF-aware Feature Modulator

CNN layers have shown impressive abilities of local feature extraction, but are restricted to the fixed spatially-invariant kernels. However, the mathematical imaging model in Eq. (1) reveals that the aberration-induced blur is only generated by convolution with spatially-variant PSF kernels, whose inverse solution cannot be modeled by the fixed convolution kernels [55, 56].

To extract adaptive image features with the guidance of spatially-variant PSF kernels, we propose the PSF-aware Feature Modulator (PFM), as shown in the lower left of Fig. 6. PFM builds on the idea of filter adaptive convolution [66, 67], where a kernel map of ${H{\times}{W}{\times}(Ck^{2})}$ is predicted from a feature map of ${H{\times}{W}{\times}{C}}$ . Differently, in PFM, the kernel map $x_{kernel}$ is predicted from the features of the PSF map $x_{psf}$ , which has been compressed into a similar form as $x_{kernel}$ . We first apply $E_{psf}$ , a $3\times{3}$ convolution layer, to extract features of the PSF map $x_{psf}$ , as depicted in Eq. (6):

x^{\prime}_{psf}=E_{psf}(x_{psf}),

(6)

where $x^{\prime}_{psf}\in\mathbb{R}^{H\times{W}\times{C}}$ is the extracted PSF feature map. Then, a lightweight kernel predictor composed of several convolution layers is proposed to output the kernel map $x_{kernel}\in\mathbb{R}^{H{\times}{W}{\times}(Ck^{2})}$ based on $x^{\prime}_{psf}$ , as in Eq. (7):

x_{kernel}=P(x^{\prime}_{psf}).

(7)

To reduce the memory cost and inference latency of kernel prediction, the predictor $P$ computes the kernel map on the downsampled features (by $4{\times}4$ average pooling). Because the PSF map shares a similar form with the kernel map, we further simplify $P$ so that only one max pooling layer and one residual block of $1{\times}1$ convolution layers are applied, predicting the kernel map $x^{\prime}_{kernel}\in\mathbb{R}^{\frac{H}{8}{\times}\frac{W}{8}{\times}(Ck^{2})}$ at a smaller resolution. The final kernel map $x_{kernel}$ is then obtained by ${\times}8$ upsampling via bilinear interpolation. Finally, we reshape $x_{kernel}$ into a list of per-pixel kernels of size $k{\times}{k}{\times}{C}$ and apply them to the corresponding pixels of the image feature $x^{\prime}_{img}$ . PFM attempts to model the inverse process of the aberration-induced blur, i.e., deconvolution, which promotes dynamic feature extraction from the aberration-image.
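The final step of PFM, applying a distinct kernel at every pixel, can be sketched for a single channel as below. This is an illustrative simplification: the kernels are supplied directly rather than predicted from the PSF features by $P$, borders are zero-padded, and the function name is hypothetical.

```python
def pixel_adaptive_conv(feat, kernels):
    """Apply a distinct k x k kernel at every pixel of a single-channel map.

    feat:    H x W nested list of floats.
    kernels: H x W nested list; each entry is a flat list of k*k weights,
             i.e. one reshaped slice of the kernel map for that pixel.
    """
    h, w = len(feat), len(feat[0])
    k = int(round(len(kernels[0][0]) ** 0.5))
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ker = kernels[y][x]
            acc = 0.0
            for j in range(k):
                for i in range(k):
                    yy, xx = y + j - r, x + i - r
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding outside
                        acc += feat[yy][xx] * ker[j * k + i]
            out[y][x] = acc
    return out


# Identity kernels everywhere except the centre pixel, which gets a box filter.
feat = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
kernels = [[list(identity) for _ in range(3)] for _ in range(3)]
kernels[1][1] = [1.0 / 9.0] * 9
modulated = pixel_adaptive_conv(feat, kernels)
```

Unlike an ordinary convolution layer, the weights here change with position, which is what allows the operation to follow a spatially-variant blur.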

IV-C PSF-aware Mix-Attention Block

We put forward the PSF-aware Mix-Attention Block (PMAB) as the basic unit of our PSF-aware transformer, to process aberration images assisted with the PSF map, as shown at the middle bottom of Fig. 6. The Window-based Multi-head Self-Attention (W-MSA) of the Swin-T block [27] is first adopted to be the baseline attention mechanism for modeling spatially-variant convolution and long-range dependency, which is also important for stable training of the network.

To address the drawback of fixed window size in vanilla W-MSA, we further propose the PSF-aware Varied-Size Attention (P-VSA), shown on the lower right of Fig. 6. The vanilla varied-size attention [68] in high-level tasks predicts the sizes and locations of the windows from input features for computing self-attention on dynamic windows. Meanwhile, the kernel sizes of PSFs in different FoV regions reveal the severity of aberration-induced blur, which is relevant to the calculation of window-based self-attention. To adaptively modulate the windows according to the PSF kernels, we make use of the PSF map features $x^{\prime}_{psf}$ to generate PSF-aware varied-size windows. Concretely, the scale $S$ and offset $O$ of the varied-size windows are predicted from $x^{\prime}_{psf}$ by the Window Transform block, which is composed of a $1\times{1}$ convolution layer. Then, we sample the projected key and value tokens $K,V$ of image features $x^{\prime}_{img}$ based on the transformed window to obtain $K_{vs},V_{vs}$ . The cross-attention is computed between query $Q$ of the default window and $K_{vs},V_{vs}$ . The operation of P-VSA can be expressed as:

Q,K,V={\rm{Linear}}{({\rm{WinPar}}(x^{\prime}_{img}))},

(8)

S,O={\rm{WinTrans}}(x^{\prime}_{psf}),

(9)

K_{vs},V_{vs}={\rm{Sample}}(K,S,O),{\rm{Sample}}(V,S,O),

(10)

{\rm Attn}(Q,K_{vs},V_{vs})={\rm Softmax}(\frac{QK_{vs}^{\top}}{\sqrt{d}})V_{vs},

(11)

where $\rm{WinPar}$ denotes the window partition operation of Swin-T and $d$ is the dimension of tokens.
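Eqs. (10) and (11) can be illustrated with a small plain-Python sketch on a one-dimensional token sequence. This is a simplified illustration, not the P-VSA implementation: the linear projections are omitted, the window is one-dimensional, sampling uses nearest-index rounding with clamping, and the names `sample_window` and `attention` as well as the scale and offset values are hypothetical (in P-VSA they would come from the Window Transform block).

```python
import math


def sample_window(tokens, start, size, scale, offset):
    """Pick `size` tokens from a window that is scaled about its centre and shifted.

    The default window covers indices [start, start + size). Sample positions
    are rounded to the nearest index and clamped to the sequence bounds.
    """
    centre = start + (size - 1) / 2.0
    picked = []
    for i in range(size):
        pos = centre + offset + (start + i - centre) * scale
        idx = min(max(int(round(pos)), 0), len(tokens) - 1)
        picked.append(tokens[idx])
    return picked


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k_vs, v_vs):
    """Eq. (11): Softmax(Q K_vs^T / sqrt(d)) V_vs on lists of token vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k_vs]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v_vs))
                    for c in range(len(v_vs[0]))])
    return out


# Toy example: 8 positions with 2-channel tokens; default window = indices 2..5.
keys = [[float(i), 1.0] for i in range(8)]
values = [[float(i), 0.0] for i in range(8)]
queries = keys[2:6]
k_vs = sample_window(keys, start=2, size=4, scale=1.0, offset=1.0)
v_vs = sample_window(values, start=2, size=4, scale=1.0, offset=1.0)
out = attention(queries, k_vs, v_vs)
```

The queries stay on the default window while the keys and values come from the transformed one, so the number of tokens per window is unchanged even when the attended region grows or shifts.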

Additionally, some works [30, 32] apply channel-attention-based convolution blocks in parallel with the self-attention to enhance the representation ability of the network. We insert the proposed PFM into PMAB in the same parallel way, where the filter adaptive convolution mechanism can better model the spatially-variant blur compared to channel-attention-based convolution.

Finally, PMAB mixes W-MSA and P-VSA with a parallel $1{\times}1$ PFM. For the self-attention module, the image feature map $x^{\prime}_{img}$ is equally split along the channel dimension, processed by parallel W-MSA and P-VSA, and then concatenated along the channel dimension again. The feature map modulated by the parallel PFM is multiplied by a constant $\alpha$ and added to the result of self-attention and the original feature map, a common practice for stable training [32]. The whole process of PMAB is computed as:

{x^{\prime}}^{(1)}_{img},{x^{\prime}}^{(2)}_{img}={\rm{Split}}({x^{\prime}}_{img}),

(12)

x_{attn}={\rm{Concat}}({\rm{W\text{-}MSA}}({x^{\prime}}^{(1)}_{img}),{\rm{P\text{-}VSA}}({x^{\prime}}^{(2)}_{img},x^{\prime}_{psf})),

(13)

x_{mix}=x_{attn}+\alpha{\rm{PFM}}({x^{\prime}}_{img},{x^{\prime}}_{psf})+{x^{\prime}}_{img},

(14)

y=x_{mix}+{\rm{FFN}}(x_{mix}),

(15)

where ${\rm{FFN}}$ is a common Feed Forward Network composed of a LayerNorm and a Multi-Layer Perceptron (MLP) layer.
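The data flow of Eqs. (12)-(15) can be sketched for a single token represented as a list of channels. The branch functions passed in below are hypothetical stand-ins (identities and a zero FFN), and the value of `alpha` is arbitrary; only the split, concatenation, scaling, and residual structure is meant to mirror the equations.

```python
def pmab(x_img, x_psf, w_msa, p_vsa, pfm, ffn, alpha):
    """Data flow of Eqs. (12)-(15) for one token given as a list of channels."""
    half = len(x_img) // 2
    x1, x2 = x_img[:half], x_img[half:]              # Eq. (12): channel split
    x_attn = w_msa(x1) + p_vsa(x2, x_psf)            # Eq. (13): list concatenation
    x_pfm = pfm(x_img, x_psf)                        # parallel PFM branch
    x_mix = [a + alpha * p + x                       # Eq. (14): scaled sum + residual
             for a, p, x in zip(x_attn, x_pfm, x_img)]
    return [m + f for m, f in zip(x_mix, ffn(x_mix))]  # Eq. (15): FFN residual


# Stand-in branches (hypothetical): identities and an FFN that outputs zeros.
y = pmab(
    x_img=[1.0, 2.0, 3.0, 4.0],
    x_psf=[0.0, 0.0, 0.0, 0.0],
    w_msa=lambda h: list(h),
    p_vsa=lambda h, psf: list(h),
    pfm=lambda x, psf: list(x),
    ffn=lambda x: [0.0] * len(x),
    alpha=0.5,
)
```

With these stand-ins every channel is scaled by $2+\alpha$, which makes it easy to verify that the three terms of Eq. (14) are summed as intended.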

IV-D PSF-aware Aberration-image Recovery Transformer

Most previous networks [16, 55] for aberration correction utilize the architecture of image deblurring methods, i.e. U-Net. However, for MPIP images with a high resolution (e.g., $3K$ ), the U-Net methods incur unacceptable computational costs due to the large image sizes at shallow layers. In contrast, we look into the tasks from the perspective of image super-resolution, which processes image features at low resolution and reconstructs the high-quality image via an upsampling module. The aberration-induced blur brings aliasing between pixels and losses of image details, which can also be interpreted as “low resolution”. Therefore, as shown in Fig. 6, the PSF-aware Aberration-image Recovery Transformer (PART) is set up based on the structure of SwinIR [27] and our proposed PSF-aware mechanisms.

A Task-Processing module is first applied to transform the input image and PSF map to a small spatial size, where pixel-unshuffle [63] is leveraged for AC and no operation is entailed for SR $\&$ AC. The PSF map is also concatenated with the aberration image as the input of the network. More precisely, PART contains three parts. (1) A feature extraction layer converts the input to image feature maps via a $3{\times}3$ convolution. (2) The representation learning stage applies stacks of transformer-based blocks ending with a convolution layer to enrich the learned degradation information of aberration-induced blur progressively. We design the PSF-aware Residual Transformer Block (PRTB) with several PMAB layers and a convolution layer. The PFM is inserted into each PRTB to modulate the learned features and model the inverse process of the aberration-induced blur. We also implement PFM at the beginning and end of the representation learning stage, for adaptive feature extraction and feature fusion based on PSF information. (3) The image reconstruction module further fuses the extracted deep features and recovers a high-quality image with higher resolution. With PART, we can recover a high-quality aberration-free image $x_{hq}$ from either a high-resolution aberration image $x_{ab}$ or a low-resolution one $x_{lq}$ , providing a general solution to AC and SR $\&$ AC:

{x^{AC}_{hq}}={\rm{PART}}(x_{ab}),{x^{SR\&AC}_{hq}}={\rm{PART}}(x_{lq}).

(16)
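For the AC pipeline, the Task-Processing module relies on pixel-unshuffle, which trades spatial resolution for channels. A minimal plain-Python sketch on a single-channel map is given below; the function name `pixel_unshuffle`, the list-based layout, and the channel ordering are assumptions of this illustration, and the factor `r` is left as a parameter.

```python
def pixel_unshuffle(image, r):
    """Rearrange an H x W map into an (H/r) x (W/r) map with r*r channels."""
    h, w = len(image), len(image[0])
    if h % r or w % r:
        raise ValueError("height and width must be divisible by r")
    return [
        [
            # Gather the r x r block at (y, x) into one channel vector.
            [image[y * r + dy][x * r + dx] for dy in range(r) for dx in range(r)]
            for x in range(w // r)
        ]
        for y in range(h // r)
    ]


small = pixel_unshuffle([[1, 2], [3, 4]], r=2)
```

No pixel is discarded by this rearrangement, so the representation learning stage can run at a small spatial size while the reconstruction module still has access to the full-resolution information.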

V Experiments and Results

We conduct a comprehensive set of experiments to evaluate the proposed PCIE for minimalist and high-quality panoramic imaging. We first describe the implementation details of our work in Sec. V-A. The PCIE under different recovery models is then evaluated on both synthetic (Sec. V-B) and real (Sec. V-C) datasets. We further investigate the GAN-based training strategies for PCIE in Sec. V-D. At last, in Sec. V-E and Sec. V-F, ablation studies on training datasets and the architecture of PART are conducted.

V-A Implementation Details

TABLE I: Quantitative evaluation of PCIE with the AC pipeline on synthetic benchmarks with MPIP-P1 and MPIP-P2. We highlight the best and second results. The “*” for NAFNet and Restormer denotes that the cropping testing strategy is applied.

		PALHQ-SynMPIP-P1				PALHQ-SynMPIP-P2			
Type	Method	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$
SR	RRDB [29]	32.716	0.9265	0.0469	3.704	26.996	0.8503	0.0896	17.413
SR	RCAN [34]	32.496	0.9257	0.0459	4.327	26.456	0.8443	0.0956	22.619
SR	EDSR [33]	32.868	0.9282	0.0449	3.770	26.951	0.8500	0.0889	17.871
SR	SwinIR [27]	32.913	0.9291	0.0446	3.670	26.935	0.8509	0.0884	17.630
SR	EDT [31]	32.929	0.9288	0.0450	3.658	27.055	0.8518	0.0888	16.976
SR	HAT [32]	32.925	0.9288	0.0447	3.748	26.827	0.8500	0.0889	18.498
SR	GRL [30]	32.369	0.9256	0.0457	4.234	26.268	0.8424	0.0943	22.463
Deblur	HINet [37]	32.238	0.9234	0.0476	4.159	26.401	0.8428	0.0933	22.341
Deblur	NAFNet* [38]	32.837	0.9274	0.0441	3.845	27.045	0.8514	0.0856	17.504
Deblur	Restormer* [35]	32.971	0.9287	0.0445	3.763	27.001	0.8510	0.0870	17.023
Deblur	Uformer [36]	32.999	0.9290	0.0442	3.672	27.133	0.8525	0.0866	16.693
PSF-aware	PI²RNet [28]	32.682	0.9268	0.0448	3.638	26.656	0.8471	0.0874	18.544
PSF-aware	RRDB+	32.816	0.9271	0.0456	3.746	27.050	0.8505	0.0895	17.103
PSF-aware	GRL+	32.847	0.9292	0.0454	3.627	27.020	0.8528	0.0864	16.281
PSF-aware	PART (Ours)	33.143	0.9304	0.0435	3.571	27.198	0.8540	0.0855	16.436

TABLE II: Quantitative evaluation of PCIE with the SR $\&$ AC pipeline on the synthetic benchmark.

	PALHQ-SynMPIP-P1			
Method	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$
RRDB [29]	28.856	0.8758	0.0733	9.957
RCAN [34]	28.238	0.8686	0.0787	12.689
EDSR [33]	28.817	0.8759	0.0715	10.670
SwinIR [27]	28.985	0.8781	0.0714	9.938
EDT [31]	29.008	0.8777	0.0726	10.750
HAT [32]	28.921	0.8771	0.0727	10.141
GRL [30]	28.695	0.8753	0.0714	9.829
RRDB+	29.044	0.8774	0.0724	10.068
GRL+	28.757	0.8768	0.0716	9.709
PART (Ours)	29.310	0.8819	0.0681	9.648

Synthetic Datasets. We apply the collected PALHQ dataset for training and evaluation. Based on PALHQ, the aberration images of two prototypes, i.e. PALHQ-SynMPIP-P1 and PALHQ-SynMPIP-P2, are generated by the simulation model of Eq. (1). Following [28], we set the random range of disturbance as $25\%$ and generate $10$ virtual MPIP samples for the training set ( $500$ images) and $4$ for the validation set ( $50$ images) to simulate the synthetic-to-real gap. For image sensors, the MV-SUA1600C camera with a pixel size of $1.34{\mu}m$ and the MV-SUA133GC camera with a pixel size of $4{\mu}m$ are applied for the AC and SR $\&$ AC pipelines, respectively, where the ISP and wave response of them are simulated in the data generation. In addition, we use $\times 3$ bicubic downsampling to produce low-resolution aberration images for SR $\&$ AC, considering the sensors’ pixel sizes.

Real-world Datasets. As shown in Fig. 4, with only one more simple lens, the MPIP-P1 reveals much better image quality, which relieves the burden of the post-image processing pipelines. We manufacture MPIP-P1 and use it to record the RealMPIP3K-AC ( $58$ images with a resolution of $2912\times 2912$ ) and RealMPIP1K-SR $\&$ AC ( $64$ images with a resolution of $992\times 992$ ) with two cameras respectively, to provide real-world MPIP aberration-images for evaluating two pipelines of PCIE. We test models trained on PALHQ-SynMPIP-P1 (AC) and PALHQ-SynMPIP-P1 (SR $\&$ AC) with RealMPIP3K-AC and RealMPIP1K-SR $\&$ AC respectively.

Evaluation Metrics. For synthetic datasets with ground truth, PSNR and SSIM [69] are employed to evaluate the fidelity of the recovery results, whereas LPIPS [70] and FID [71] are employed to evaluate the perceptual quality.
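For reference, the fidelity metric PSNR follows the standard definition $10\log_{10}(MAX^{2}/MSE)$. The sketch below evaluates it on flat pixel lists; the function name and the default peak value of $1.0$ (images normalized to $[0,1]$) are assumptions of this illustration, not settings reported for the benchmarks.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


value = psnr([0.5, 0.5], [1.0, 1.0])  # MSE = 0.25
```

Since PSNR is logarithmic in the mean squared error, a gain of a few tenths of a dB corresponds to a reduction of the MSE by several percent.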

For real datasets without clear reference images, we employ no-reference metrics, i.e. NIQE [72] and BRISQUE [73], to evaluate the naturalness of the MPIP images. The qualitative visual results are also provided for an intuitive evaluation. However, NIQE and BRISQUE are built on the statistics of perspective natural images, which makes it challenging to assess MPIP images with an annular distribution of image content. Considering the specific task of correcting optical aberrations, we define the Optical-based Image Quality Evaluator (OIQE) for credible evaluation, based on the Modulation Transfer Function (MTF) of the imaging system calculated from a set of testing checkerboard images.

To be specific, we follow Spatial Frequency Response (SFR) [56] testing to calculate MTFs on “knife-edge” image patches at different FoVs from different testing images. $MTF50$ and $MTFarea$ are used to characterize the MTF curves, where the former is the frequency at which the MTF drops to $50\%$ and the latter is the area under the MTF curve. We further define $OIQE50$ and $OIQEarea$ as the ratios of the average $MTF50$ and $MTFarea$ of the tested imaging pipeline to those of a well-designed panoramic imaging system. Accordingly, $OIQE$ is defined as:

OIQE=\frac{OIQE50+OIQEarea}{2},

(17)

which measures the gap between the results of PCIE and conventional panoramic lenses in terms of MTF. OIQE is only applied in the AC pipeline due to its specific design for evaluating the ability of the model to remove aberration-induced blur.
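A minimal sketch of how $MTF50$, $MTFarea$, and Eq. (17) could be evaluated from a sampled MTF curve is given below. The linear interpolation at the $50\%$ crossing, the trapezoidal integration, the toy curve, and all function names are assumptions of this illustration rather than the exact SFR procedure of [56]; in practice the inputs to `oiqe` would be averages over patches and FoVs.

```python
def mtf50(freqs, mtf):
    """Frequency at which the MTF first falls to 0.5 (linear interpolation)."""
    for i in range(1, len(mtf)):
        if mtf[i - 1] >= 0.5 > mtf[i]:
            t = (mtf[i - 1] - 0.5) / (mtf[i - 1] - mtf[i])
            return freqs[i - 1] + t * (freqs[i] - freqs[i - 1])
    raise ValueError("MTF never crosses 0.5")


def mtf_area(freqs, mtf):
    """Area under the MTF curve by the trapezoidal rule."""
    return sum(
        0.5 * (mtf[i] + mtf[i - 1]) * (freqs[i] - freqs[i - 1])
        for i in range(1, len(mtf))
    )


def oiqe(test_mtf50, test_area, ref_mtf50, ref_area):
    """Eq. (17): mean of the MTF50 ratio and the MTFarea ratio."""
    oiqe50 = test_mtf50 / ref_mtf50
    oiqe_area = test_area / ref_area
    return 0.5 * (oiqe50 + oiqe_area)


# Toy reference curve (hypothetical values, frequency in cycles/pixel).
freqs = [0.0, 0.25, 0.5]
reference = [1.0, 0.5, 0.0]
ref50 = mtf50(freqs, reference)
ref_area = mtf_area(freqs, reference)
score = oiqe(0.2, 0.2, ref50, ref_area)
```

An OIQE of $100\%$ would mean that the recovered images match the reference lens on both MTF statistics, so the score reads as a fraction of the reference optical quality.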

In addition, with the testing checkerboard images of OIQE, we generate the ground-truth images through edge extraction and re-coloring following [55], so that the PSNR and SSIM can be applied as metrics in this setting.

Finally, we conduct a user study as a subjective evaluation method. The results of the User Study (U.S.) will be presented as the percentage of times that each method’s results were chosen as the best.

The implementation details of the ground-truth generation pipeline and user study are depicted in the Appendix. Based on the above evaluation pipelines and metrics, a comprehensive evaluation of competitive recovery models on real-world datasets will be presented in Section V-C.

Compared Methods. For the AC pipeline, as shown in Table I, we compare PART with representative state-of-the-art SR models (RRDB [29], RCAN [34], EDSR [33], SwinIR [27], EDT [31], HAT [32], and GRL [30]), along with Deblur methods (HINet [37], NAFNet [38], Restormer [35], and Uformer [36]). Image restoration models with PSF-aware mechanisms, i.e. RRDB+, GRL+, and PI²RNet [28], are also included in the comparison. Here, “+” means that the designed PFM is inserted into the method, where we select RRDB and GRL as the classical CNN-based and the state-of-the-art transformer-based SR model, respectively, to investigate the adaptability of PSF-aware mechanisms to different types of models. For the PSF-aware methods in the SR $\&$ AC pipeline, only RRDB+ and GRL+ are selected due to the specific task requirement for super-resolution, as shown in Table II.

All the models are retrained on PALHQ-SynMPIP-P1 and PALHQ-SynMPIP-P2 with their original optimizers, learning rates, and schedulers, where the number of training iterations and the batch size are set the same as PART for a fair comparison. Additionally, we apply task-processing for all the SR models the same as PART.

Training Details. The compressed kernel size $k^{\prime}$ of the PSF map is set to $5$ in our experiments, where an ablation study is conducted in Sec. V-F. In addition, we set the kernel size $k$ of PFM to $3$ considering the computational efficiency. Following SwinIR [27], the PRTB number, PMAB number, channel number, attention head number, and window size are generally set to $6$ , $6$ , $180$ , $6$ , and $8$ , respectively.

PART is trained with the L1 loss, while other loss functions are explored in Sec. V-D. We train the models with the Adam optimizer with an initial learning rate of $2e{-}4$ and a batch size of $8$ on a single A800 GPU. For data augmentation, random crop, flip, and rotation are applied, where the ground-truth crop size is $256{\times}256$ for AC and $192{\times}192$ for SR $\&$ AC to keep an image size of $64$ in the representation learning stage. The number of training iterations is set to $200k$ and the learning rate is halved at $100k$ , $160k$ , $180k$ , and $190k$ .
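The step schedule described above can be written as a small function. This is a sketch under the assumption that the halving takes effect from the milestone iteration onward; the function name is illustrative and the default values are the ones quoted in the text.

```python
def learning_rate(iteration, base_lr=2e-4,
                  milestones=(100_000, 160_000, 180_000, 190_000)):
    """Learning rate after halving at every milestone already reached."""
    halvings = sum(1 for m in milestones if iteration >= m)
    return base_lr * 0.5 ** halvings


start_lr = learning_rate(0)
final_lr = learning_rate(199_999)  # after all four halvings
```

Over the $200k$ iterations the learning rate therefore decays by a factor of $16$, from $2e{-}4$ to $1.25e{-}5$.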

V-B Experiments on Synthetic Datasets

AC Pipeline. Table I shows numerical results of PCIE under different image recovery models on synthetic benchmarks of AC. Considering that the performance of NAFNet and Restormer is sensitive to input resolution [74, 75], the cropping testing strategy is applied for the two models (i.e., NAFNet* and Restormer*), which is described in the Appendix. We also present visual results of representative methods in Fig. 7. PCIE with most models achieves PSNR over $32dB$ on PALHQ-SynMPIP-P1 and over $26dB$ on PALHQ-SynMPIP-P2, producing impressive panoramic imaging results via a minimalist optical system. Compared to Deblur methods, SR methods overall deliver better results, illustrating the effectiveness of the SR framework in aberration correction. PSF-aware methods further outperform their baselines. Precisely, PI²RNet exceeds HINet, PART surpasses SwinIR, and RRDB+ and GRL+ outstrip their corresponding baselines by clear margins. We find that the models based on the window-attention mechanism (SwinIR, EDT, HAT, Uformer, GRL, and our proposed PART) realize more competitive results than CNN-based models, as the window-based self-attention can better model spatially-variant blur. Yet, the state-of-the-art SR model GRL performs poorly on the benchmarks, which is attributed to the stripe-based attention being difficult to adapt to MPIP images with annular distributions.

Overall, PART brings better results for PCIE, yielding state-of-the-art performance on the two benchmarks, in terms of both fidelity-based metrics (PSNR and SSIM) and perceptual-based metrics (LPIPS and FID). While all the methods produce aberration-free visual results with some lost textures and artifacts, the images recovered by PART show more visually pleasant details in all FoVs, as shown in Fig. 7(a).

Further, with only one more spherical lens, the PCIE results of MPIP-P1 outperform those of MPIP-P2 by a large margin. For example, the PSNR drops by $5.720dB{\sim}6.101dB$ when MPIP-P2 is used instead. As shown in Fig. 7(b), PCIE with MPIP-P2 delivers moderately clear aberration-free images. Yet, suffering from severe aberrations, its detailed textures are heavily corrupted, especially for large FoVs. In this sense, MPIP-P1 is a superior choice for PCIE to achieve minimalist and high-quality panoramic imaging.

SR $\&$ AC Pipeline. The quantitative evaluation of PCIE with the SR $\&$ AC pipeline is shown in Table II. Consistent with the observations in AC, the methods with window-based attention and PSF-aware mechanisms lead to better performance. PART sets the state of the art in the SR $\&$ AC task, achieving improvements over the second best on every metric, e.g. $0.266dB$ in PSNR, $0.0038$ in SSIM, LPIPS from $0.0714$ to $0.0681$ (about $5\%$ ), and FID from $9.709$ to $9.648$ (about $0.6\%$ ).
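The relative reductions in LPIPS and FID can be recomputed from the Table II entries quoted above. The helper name `relative_reduction` is illustrative.

```python
def relative_reduction(old, new):
    """Reduction from `old` to `new`, as a percentage of `old`."""
    return (old - new) / old * 100.0


lpips_gain = relative_reduction(0.0714, 0.0681)  # second best -> PART
fid_gain = relative_reduction(9.709, 9.648)      # second best -> PART
```

The LPIPS reduction evaluates to roughly $4.6\%$ and the FID reduction to roughly $0.63\%$.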

Comparing SR $\&$ AC (Table II) with AC (Table I), we observe that the loss of spatial resolution in aberration images causes significant deterioration in the imaging quality of PCIE, e.g., by ${-}4.258dB{\sim}{-}3.674dB$ in PSNR. The visual quality comparison between the two pipelines is provided in Fig. 8, where the imaging results of AC reveal richer and more realistic details. In this case, AC is a more competitive pipeline for reconstructing high-resolution aberration-free images, where the real sampled pixels of the sensor offer more convincing imaging features than super-resolved ones despite more aberration-induced blur.

TABLE III: Quantitative evaluation of PCIE on real-world benchmarks RealMPIP. The OIQE and PSNR/SSIM of original aberration images are $55.22\%$ and $16.215dB/0.7995$, respectively. The PSNR and SSIM are calculated on the generated checkerboard image pairs. U.S. denotes the result of the user study. We also list the ranks on each metric in “()” and the Average Rank (A.R.) of each method for an intuitive evaluation.

	RealMPIP3K-Checkerboard			RealMPIP3K-AC			RealMPIP1K-SR&AC			
Method	OIQE $\uparrow$	PSNR $\uparrow$	SSIM $\uparrow$	U.S. $\uparrow$	NIQE $\downarrow$	BRISQUE $\downarrow$	U.S. $\uparrow$	NIQE $\downarrow$	BRISQUE $\downarrow$	A.R. $\downarrow$
RRDB [29]	66.94%(4)	19.587(3)	0.8872(4)	51.91%(3)	4.930(8)	45.692(2)	37.84%(5)	4.848(4)	50.729(5)	4.2
SwinIR [27]	67.51%(3)	18.991(5)	0.8863(5)	49.05%(4)	4.816(6)	46.380(5)	38.92%(4)	4.833(2)	50.555(4)	4.2
GRL [30]	64.63%(7)	19.348(4)	0.8885(3)	36.43%(7)	4.665(1)	46.587(6)	8.65%(6)	4.824(1)	51.057(6)	4.6
UFormer [36]	68.55%(2)	19.841(1)	0.8897(2)	61.43%(2)	4.914(7)	45.427(1)	n.a.	n.a.	n.a.	2.5
PI²RNet [28]	64.86%(6)	18.443(7)	0.8734(7)	42.14%(6)	4.710(3)	47.148(8)	n.a.	n.a.	n.a.	6.2
RRDB+	58.28%(8)	18.458(6)	0.8758(6)	32.14%(8)	4.783(5)	46.675(7)	62.16%(3)	4.857(5)	50.383(2)	5.6
GRL+	65.31%(5)	18.193(8)	0.8690(8)	48.33%(5)	4.724(4)	46.138(4)	68.92%(2)	4.842(3)	50.061(1)	4.4
PART	77.87%(1)	19.606(2)	0.8943(1)	78.57%(1)	4.707(2)	45.968(3)	83.51%(1)	4.933(6)	50.422(3)	2.2

V-C Experiments on Real-World Datasets

As shown in Table III, PCIE with representative models makes significant contributions to the removal of the aberration-induced blur of real-world MPIP images. To be specific, OIQE improves from $55.22\%$ to $58.28\%{\sim}77.87\%$ , and PSNR/SSIM improves from $16.215dB/0.7995$ to $18.193dB/0.8690{\sim}19.841dB/0.8943$ . The results on NIQE and BRISQUE reveal a large variance, which is attributed to the fact that these metrics are designed for perspective natural images rather than annular MPIP images. For a comprehensive and intuitive evaluation, we rank each method on each metric and provide the average rank (A.R.). In the real-world case, PART outperforms other models, achieving the best OIQE ( $77.87\%$ ) and the best A.R. (2.2). The subjective evaluation of the User Study (U.S.) also illustrates that PART delivers more visually pleasant panoramic images, with far higher selection rates. The visual results of PCIE on real-world scenes are provided in Fig. 9. PCIE enables most methods to deliver high-quality panoramic images with few aberrations and high resolution, where PART sets the state of the art in terms of higher contrast, sharper edges, and fewer artifacts. Additionally, consistent with experiments on synthetic data, the recovered images of the SR $\&$ AC pipeline reveal perceptually unpleasant artifacts.
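The A.R. column of Table III can be reproduced by averaging the per-metric ranks. The sketch below assumes, consistently with the UFormer and PI²RNet rows, that “n.a.” entries are skipped and that the mean is rounded to one decimal; the function name is illustrative.

```python
def average_rank(ranks):
    """Mean of the available per-metric ranks; None marks an 'n.a.' entry."""
    valid = [r for r in ranks if r is not None]
    return round(sum(valid) / len(valid), 1)


# Ranks read from Table III, in column order.
part = average_rank([1, 2, 1, 1, 2, 3, 1, 6, 3])
uformer = average_rank([2, 1, 2, 2, 7, 1, None, None, None])
rrdb = average_rank([4, 3, 4, 3, 8, 2, 5, 4, 5])
```

Note that methods without SR $\&$ AC results are averaged over six metrics instead of nine, so their A.R. is not strictly comparable with that of the other rows.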

V-D Investigation on GAN-based Training Strategies

To generate richer details for recovered images, we investigate GAN-based training strategies on classical models RRDB, SwinIR, and our PART. Following [76], the GAN-based loss functions in ESRGAN [29] and Local Discriminative Learning (LDL) [76] are adopted, where the former is a classical GAN-based framework and the latter is an improved strategy to remove artifacts. We take models trained with L1Loss, i.e. PSNR-oriented models, as pre-training generators, then apply GAN and LDL loss functions to enable these networks to generate more textures respectively. As shown in Table IV, on synthetic data, both GAN and LDL lead to a decrease in recovery accuracy (PSNR and SSIM), while bringing great gains under the perceptual quality metrics. LDL is a more competitive strategy that outperforms GAN with higher fidelity and fewer visual artifacts, especially with PART.

Regarding real-world data, we present the OIQE and qualitative results in Fig. 10. GAN-based training further contributes to the removal of the aberration-induced blur, achieving better OIQE with higher image contrast. Aside from this, GAN-based models deliver more realistic imaging results with richer textures, which also bring some perceptually unpleasant artifacts and fake details despite being trained with LDL. We point out that the GAN-based strategies offer the potential for learning a more realistic high-quality MPIP image. Still, the local statistics in LDL of perspective images may need to be adapted to annular images for better suppression of artifacts.

We have further explored other potential generative models, e.g., the diffusion model [77, 78]. Please refer to the Appendix for more results.

TABLE IV: Quantitative evaluation of GAN-based training on our benchmarks of AC and SR $\&$ AC.

			PALHQ-SynMPIP-P1			
Task	Method	Training Strategy	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$
AC	RRDB [29]	PSNR-oriented	32.716	0.9265	0.0469	3.704
AC	RRDB [29]	+GAN [29]	28.929	0.8840	0.0392	4.919
AC	RRDB [29]	+LDL [76]	31.864	0.9166	0.0338	4.559
AC	SwinIR [27]	PSNR-oriented	32.913	0.9291	0.0446	3.670
AC	SwinIR [27]	+GAN [29]	29.916	0.8920	0.0449	4.254
AC	SwinIR [27]	+LDL [76]	31.770	0.9130	0.0297	3.444
AC	PART	PSNR-oriented	33.143	0.9304	0.0435	3.571
AC	PART	+GAN [29]	30.965	0.9045	0.0410	4.402
AC	PART	+LDL [76]	31.854	0.9148	0.0264	3.541
SR&AC	RRDB [29]	PSNR-oriented	28.856	0.8758	0.0733	9.957
SR&AC	RRDB [29]	+GAN [29]	25.842	0.8276	0.0638	12.564
SR&AC	RRDB [29]	+LDL [76]	28.112	0.8633	0.0561	9.746
SR&AC	SwinIR [27]	PSNR-oriented	28.985	0.8781	0.0714	9.938
SR&AC	SwinIR [27]	+GAN [29]	26.596	0.8385	0.0634	12.155
SR&AC	SwinIR [27]	+LDL [76]	27.875	0.8575	0.0686	9.423
SR&AC	PART	PSNR-oriented	29.310	0.8819	0.0681	9.648
SR&AC	PART	+GAN [29]	28.382	0.8688	0.0682	8.897
SR&AC	PART	+LDL [76]	28.608	0.8720	0.0508	8.715

V-E Effectiveness of PALHQ

The collected PALHQ demonstrates an impressive ability to train the model for recovering both synthetic and real-world MPIP images in previous experiments. In this section, we explore whether PALHQ is necessary for PCIE. As an alternative to PALHQ, we simulate the aberrations of MPIP-P1 directly on the publicly available perspective image dataset, i.e. Flickr2K [79], creating PanoFlickr2K for training.

We compare representative models trained on PanoFlickr2K and PALHQ on both synthetic and real-world benchmarks in Table V and Fig. 11. It becomes clear that PALHQ contributes significantly to high-quality panoramic imaging: the numerical results in all metrics are improved by a large margin, and the visual results are more perceptually pleasant with sharper edges, fewer artifacts, and less noise.

TABLE V: Quantitative comparison between the effectiveness of PALHQ and available HQ dataset in PCIE.

			PALHQ-SynMPIP-P1			
Task	Method	Training Dataset	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$
AC	Uformer [36]	PanoFlickr2K	32.340	0.8253	0.0484	4.558
AC	Uformer [36]	PALHQ	32.999	0.9290	0.0442	3.672
AC	RRDB [29]	PanoFlickr2K	32.128	0.9235	0.0508	4.717
AC	RRDB [29]	PALHQ	32.716	0.9265	0.0469	3.704
AC	SwinIR [27]	PanoFlickr2K	32.292	0.9255	0.0485	4.521
AC	SwinIR [27]	PALHQ	32.913	0.9291	0.0446	3.670
AC	PART	PanoFlickr2K	32.498	0.9259	0.0480	4.348
AC	PART	PALHQ	33.143	0.9304	0.0435	3.571
SR&AC	RRDB [29]	PanoFlickr2K	27.943	0.8688	0.0906	12.276
SR&AC	RRDB [29]	PALHQ	28.856	0.8758	0.0733	9.957
SR&AC	SwinIR [27]	PanoFlickr2K	28.129	0.8705	0.0880	12.064
SR&AC	SwinIR [27]	PALHQ	28.985	0.8781	0.0714	9.938
SR&AC	PART	PanoFlickr2K	28.558	0.8748	0.0854	12.078
SR&AC	PART	PALHQ	29.310	0.8819	0.0681	9.648

V-F Ablation Study

We conduct ablation studies to investigate how PSF-aware mechanisms contribute to high-quality MPIP image reconstruction. In all cases, the experiments are implemented with the AC pipeline on PALHQ-SynMPIP-P1, evaluated by PSNR and SSIM, and set up on the baseline model SwinIR.

Physical Information. As reported in Table VI, the different types of physical information are each concatenated with the input image for an intuitive evaluation. The PSF map contains rich information characterizing aberration-induced blur, providing better results compared to the FoV map. We then set $k^{\prime}$ to $5$ : a larger $k^{\prime}$ tends to improve the model’s scores, but the performance saturates when $k^{\prime}$ is too large, as the representation becomes redundant and sparse.

PSF-aware Mechanisms. Table VII shows that all the designed PSF-aware mechanisms contribute to reaching better scores, i.e. $0.230dB$ and $0.0013$ improvements in PSNR and SSIM. PFM attains the highest gains of $0.201dB$ in PSNR and $0.0012$ in SSIM. In addition, the performances of RRDB+ and GRL+ in Table I and Table II verify the consistent effectiveness of the plug-and-play PFM in other models, bringing improvements of $0.100dB{\sim}0.478dB$ in PSNR and $0.0006{\sim}0.0036$ in SSIM on SynMPIP-P1. Regarding the attention block, PMAB with $1{\times}1$ PFM and P-VSA enables adaptive self-attention guided by PSF information, outperforming the vanilla window-based self-attention.

Position of PFM. We further investigate the optimal position to insert PFM. As shown in Table VIII, we apply PFM after the feature extraction, at the end of each PRTB, and before the image reconstruction, for ablations. The PFM on shallow features reveals more competitive results, while increasing the number of PFMs during the representation learning stage also leads to significant improvements. Using PFM in all ablated positions helps to reach the best performance, which is also corroborated by the observation in omnidirectional image super-resolution [6].

Effectiveness of PSF Representation. Table IX reports several possible PSF-aware mechanisms along with their vanilla versions without the guidance of the PSF map. The deformable (DConv and DeformSwin), FAC, and VSA mechanisms all deliver even worse performance compared to the baseline ( $-0.957dB{\sim}-0.202dB$ in PSNR, $-0.0086{\sim}-0.0021$ in SSIM), which illustrates that the image-only network is unable to implicitly learn the complex spatial distribution of aberrations, leading to the unreliable predictions of offsets, convolution kernels, and varied-size windows. Serving as a key modality, the PSF representation, i.e., the PSF map, which contains information of the intensity and size distribution of the PSF kernels, facilitates several potential PSF-aware mechanisms to achieve superior performance. To be specific, the guidance of PSF representation brings improvements of $0.261dB{\sim}1.158dB$ in PSNR and $0.0027{\sim}0.0098$ in SSIM to the vanilla mechanisms.

TABLE VI: Ablations on Physical Information.

Physical Information	$k^{\prime}$	PSNR	SSIM
w/o	-	32.913	0.9291
FoV map	-	32.992	0.9293
PSF map	1	33.012	0.9300
PSF map	5	33.021	0.9301
PSF map	9	33.021	0.9299

TABLE VII: Ablations on PSF-aware Mechanisms.

PSF-aware Mechanism	Variant	Params	PSNR	SSIM
w/o	-	11.97M	32.913	0.9291
concat	-	12.02M	33.021	0.9301
PFM	-	16.72M	33.114	0.9303
PMAB	1 $\times$ 1 PFM	14.32M	33.069	0.9302
PMAB	P-VSA	12.18M	32.999	0.9297
PMAB	both	14.53M	33.082	0.9303
all	-	19.27M	33.143	0.9304

TABLE VIII: Ablations on the Position of PFM.

Position	Params	PSNR	SSIM
w/o	11.97M	32.913	0.9291
first conv	12.61M	33.032	0.9297
PRTB	15.54M	33.071	0.9299
last conv	12.61M	32.971	0.9298
all	16.72M	33.114	0.9303

TABLE IX: Ablations on the effectiveness of PSF representation. Dconv: Deformable convolution [80], DeformSwin: Deformable Swin transformer [6], “P-”: the offsets are predicted from the PSF feature, FAC: Filter Adaptive Convolution [66, 67].

Method	Params	PSNR	SSIM
baseline	11.97M	32.913	0.9291
w Dconv	14.71M	32.258	0.9237
w P-Dconv	14.71M	33.064	0.9303
w FAC	16.72M	31.956	0.9205
w PFM	16.72M	33.114	0.9303
w DeformSwin	13.21M	32.711	0.9270
w P-DeformSwin	13.21M	32.972	0.9297
w VSA	12.18M	32.496	0.9253
w P-VSA	12.18M	32.999	0.9297

V-G Summary

The extensive experiments illustrate the critical points in the proposed PCIE for achieving minimalist and high-quality panoramic imaging. We summarize the following primary findings of our experiments:

• The proposed PCIE presents impressive high-quality imaging results, where MPIP-P1 and the AC pipeline are superior choices for delivering aberration-free panoramic images with much more realistic details.
• In PCIE, we find that window-attention-based models reveal better results. Furthermore, PSF-aware mechanisms are effective for improving the performance of SR models, where the proposed PSF-aware transformer, i.e. PART, sets the state of the art.
• The PSF representation plays a significant role in PSF-aware mechanisms, facilitating effective learning of the inverse process of the aberration-induced blur.
• Regarding the training strategies, GAN-based methods contribute to more realistic recovered images, but with some visually unpleasant artifacts and fake details. The generative model appears to be more competitive in PCIE if a good balance is struck between generating rich details and suppressing artifacts.
• Compared with the adaptation of perspective images, the collected high-quality panoramic annular image dataset, i.e. PALHQ, brings considerable improvements. PALHQ serves as the cornerstone of our PCIE for training a robust model to process MPIP images.

We hope that the PCIE can bring inspiration from optical design, network architecture, sensor choice, data preparation, and training strategies, for minimalist and high-quality panoramic imaging in mobile and wearable applications.

VI Conclusion and Discussion

VI-A Conclusion

In this paper, we design PCIE to present a general solution to minimalist and high-quality panoramic imaging. Based on the idea of PAL, the MPIP is proposed for 360° panoramic imaging with less than three lenses. Then, learning-based models, which are trained on synthetic aberration images from simulation, are applied to solve the aberration-induced blur and low resolution of MPIP images. A new dataset PALHQ is collected to fill the gap of high-quality PAL images for low-level vision. We explore utilizing PSF information of the optical system to improve the performance of models and design a PSF-aware transformer PART with PSF-aware mechanisms. The plug-and-play mechanism PFM can enhance modern SR models for removing aberration-induced blur, while PART with PMAB delivers state-of-the-art performance on both synthetic and real-world benchmarks. Extensive experiments are conducted to investigate how to improve PCIE, providing heuristic findings for constructing a computational-imaging-based minimalist panoramic system with impressive imaging quality, in terms of optical design, network architecture, sensor selection, training strategies, and data preparation.

VI-B Discussion and Future Work

There are still some limitations in PCIE, which call for further investigation toward extremely high-quality imaging. First, the PSF-aware mechanisms are designed in a straightforward way, which improves the performance but introduces extra parameters and computational overhead. More efficient and effective PSF-aware architectures or training strategies are expected to further enhance the performance. Meanwhile, the improvements on CNN-based models are less pronounced compared to those on transformer models. We are interested in the design of learnable PSF representations, PSF-aware dynamic, deformable, and dilated convolutions, or PSF-aware varied-shape window attention for better exploration of PSF information. Second, although we investigate state-of-the-art GAN-based training strategies, there is still room for further suppressing artifacts. Aside from this, the results of PCIE on real-world data are not as good as on synthetic data, where artifacts and fake details exist in some recovered images. The considerable synthetic-to-real gap needs future research on domain adaptation. The number of images in PALHQ is also limited due to the difficulties of capturing high-quality PAL images under various scenes. We intend to design a hybrid training approach to take advantage of the large data size of the publicly available perspective datasets while improving the training with PALHQ. Finally, future work will focus on an end-to-end framework for joint optimization of the MPIP design and the recovery model, to present a more general engine for minimalist and high-quality panoramic imaging.

References

[1] K. Yang, X. Hu, and R. Stiefelhagen, “Is context-aware CNN ready for the surroundings? Panoramic semantic segmentation in the wild,” TIP, vol. 30, pp. 1866–1881, 2021.
[2] K. Tateno, N. Navab, and F. Tombari, “Distortion-aware convolutional filters for dense prediction in panoramic images,” in ECCV, vol. 11220, 2018, pp. 732–750.
[3] Z. Shen, C. Lin, K. Liao, L. Nie, Z. Zheng, and Y. Zhao, “PanoFormer: Panorama transformer for indoor 360° depth estimation,” in ECCV, vol. 13661, 2022, pp. 195–211.
[4] K. Liao, X. Xu, C. Lin, W. Ren, Y. Wei, and Y. Zhao, “Cylin-Painting: Seamless 360° panoramic image outpainting and beyond,” TIP, vol. 33, pp. 382–394, 2024.
[5] Y. Yoon, I. Chung, L. Wang, and K.-J. Yoon, “SphereSR: 360° image super-resolution with arbitrary projection via continuous spherical image representation,” in CVPR, 2022, pp. 5667–5676.
[6] F. Yu, X. Wang, M. Cao, G. Li, Y. Shan, and C. Dong, “OSRT: Omnidirectional image super-resolution with distortion-aware transformer,” in CVPR, 2023, pp. 13 283–13 292.
[7] X. Sun et al., “OPDN: Omnidirectional position-aware deformable network for omnidirectional image super-resolution,” in CVPRW, 2023, pp. 1293–1301.
[8] S. Yang, C. Lin, K. Liao, and Y. Zhao, “FishFormer: Annulus slicing-based transformer for fisheye rectification with efficacy domain exploration,” arXiv preprint arXiv:2207.01925, 2022.
[9] S. Yang, C. Lin, K. Liao, C. Zhang, and Y. Zhao, “Progressively complementary network for fisheye image rectification using appearance flow,” in CVPR, 2021, pp. 6348–6357.
[10] S. Yang, C. Lin, K. Liao, and Y. Zhao, “Innovating real fisheye image correction with dual diffusion architecture,” in ICCV, 2023, pp. 12 653–12 662.
[11] S. Gao, K. Yang, H. Shi, K. Wang, and J. Bai, “Review on panoramic imaging and its applications in scene understanding,” TIM, vol. 71, pp. 1–34, 2022.
[12] D. Cheng, C. Gong, C. Xu, and Y. Wang, “Design of an ultrawide angle catadioptric lens with an annularly stitched aspherical surface,” OE, vol. 24, no. 3, pp. 2664–2677, 2016.
[13] S. Gao, E. A. Tsyganok, and X. Xu, “Design of a compact dual-channel panoramic annular lens with a large aperture and high resolution,” AO, vol. 60, no. 11, pp. 3094–3102, 2021.
[14] Q. Jiang et al., “Computational optics meet domain adaptation: Transferring semantic segmentation beyond aberrations,” arXiv preprint arXiv:2211.11257, 2022.
[15] Y. Peng, Q. Fu, F. Heide, and W. Heidrich, “The diffractive achromat: Full spectrum computational imaging with diffractive optics,” TOG, vol. 35, no. 4, pp. 1–11, 2016.
[16] Y. Peng, Q. Sun, X. Dun, G. Wetzstein, W. Heidrich, and F. Heide, “Learned large field-of-view imaging with thin-plate optics,” TOG, vol. 38, no. 6, pp. 1–14, 2019.
[17] X. Li, J. Suo, W. Zhang, X. Yuan, and Q. Dai, “Universal and flexible optical aberration correction using deep-prior based deconvolution,” in ICCV, 2021, pp. 2593–2601.
[18] Q. Sun, C. Wang, Q. Fu, X. Dun, and W. Heidrich, “End-to-end complex lens design with differentiate ray tracing,” TOG, vol. 40, no. 4, pp. 1–13, 2021.
[19] C. Wang, N. Chen, and W. Heidrich, “dO: A differentiable engine for deep lens design of computational imaging systems,” TCI, vol. 8, pp. 905–916, 2022.
[20] X. Yang, Q. Fu, and W. Heidrich, “Curriculum learning for ab initio deep learned refractive optics,” arXiv preprint arXiv:2302.01089, 2023.
[21] K. Zhang, X. Zhong, L. Zhang, and T. Zhang, “Design of a panoramic annular lens with ultrawide angle and small blind area,” AO, vol. 59, no. 19, pp. 5737–5744, 2020.
[22] J. Wang, J. Bai, K. Wang, and S. Gao, “Design of stereo imaging system with a panoramic annular lens and a convex mirror,” OE, vol. 30, no. 11, pp. 19 017–19 029, 2022.
[23] P. Greguss, “Panoramic imaging block for three-dimensional space,” Jan. 28 1986, US Patent 4,566,763.
[24] I. Powell, “Panoramic lens,” AO, vol. 33, no. 31, pp. 7356–7361, 1994.
[25] S. Thibault, J. Gauvin, M. Doucet, and M. Wang, “Enhanced optical design by distortion control,” in SPIE, vol. 5962, 2005, pp. 307–314.
[26] D. Geng, H.-t. Yang, C. Mei, and Y.-h. Li, “Optical system design of space fisheye lens and performance analysis,” in SPIE, vol. 10462, 2017, pp. 1276–1282.
[27] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “SwinIR: Image restoration using swin transformer,” in ICCVW, 2021, pp. 1833–1844.
[28] Q. Jiang, H. Shi, L. Sun, S. Gao, K. Yang, and K. Wang, “Annular computational imaging: Capture clear panoramic images through simple lens,” TCI, vol. 8, pp. 1250–1264, 2022.
[29] X. Wang et al., “ESRGAN: Enhanced super-resolution generative adversarial networks,” in ECCVW, vol. 11133, 2018, pp. 63–79.
[30] Y. Li et al., “Efficient and explicit modelling of image hierarchies for image restoration,” in CVPR, 2023, pp. 18 278–18 289.
[31] W. Li, X. Lu, J. Lu, X. Zhang, and J. Jia, “On efficient transformer and image pre-training for low-level vision,” arXiv preprint arXiv:2112.10175, 2021.
[32] X. Chen, X. Wang, J. Zhou, Y. Qiao, and C. Dong, “Activating more pixels in image super-resolution transformer,” in CVPR, 2023, pp. 22 367–22 377.
[33] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in CVPRW, 2017, pp. 1132–1140.
[34] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in ECCV, vol. 11211, 2018, pp. 294–310.
[35] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in CVPR, 2022, pp. 5718–5729.
[36] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, “Uformer: A general U-shaped transformer for image restoration,” in CVPR, 2022, pp. 17 662–17 672.
[37] L. Chen, X. Lu, J. Zhang, X. Chu, and C. Chen, “HINet: Half instance normalization network for image restoration,” in CVPRW, 2021, pp. 182–192.
[38] L. Chen, X. Chu, X. Zhang, and J. Sun, “Simple baselines for image restoration,” in ECCV, vol. 13667, 2022, pp. 17–33.
[39] H. Ai, Z. Cao, J. Zhu, H. Bai, Y. Chen, and L. Wang, “Deep learning for omnidirectional vision: A survey and new perspectives,” arXiv preprint arXiv:2205.10468, 2022.
[40] J. Zhang, K. Yang, C. Ma, S. Reiß, K. Peng, and R. Stiefelhagen, “Bending reality: Distortion-aware transformers for adapting to panoramic semantic segmentation,” in CVPR, 2022, pp. 16 896–16 906.
[41] H. Chen, W. Hu, K. Yang, J. Bai, and K. Wang, “Panoramic annular SLAM with loop closure and global optimization,” AO, vol. 60, no. 21, pp. 6264–6274, 2021.
[42] Z. Wang, K. Yang, H. Shi, P. Li, F. Gao, and K. Wang, “LF-VIO: A visual-inertial-odometry framework for large field-of-view cameras with negative plane,” in IROS, 2022, pp. 4423–4430.
[43] C. J. Schuler, M. Hirsch, S. Harmeling, and B. Schölkopf, “Non-stationary correction of optical aberrations,” in ICCV, 2011, pp. 659–666.
[44] Y. Liu, C. Zhang, T. Kou, Y. Li, and J. Shen, “End-to-end computational optics with a singlet lens for large depth-of-field imaging,” OE, vol. 29, no. 18, pp. 28 530–28 548, 2021.
[45] G. Côté, F. Mannan, S. Thibault, J.-F. Lalonde, and F. Heide, “The differentiable lens: Compound lens search over glass surfaces and materials for object detection,” in CVPR, 2023, pp. 20 803–20 812.
[46] H. Trussell and B. Hunt, “Image restoration of space variant blurs by sectioned methods,” in ICASSP, vol. 3, 1978, pp. 196–198.
[47] J. Kim, A. Tsai, M. Cetin, and A. S. Willsky, “A curve evolution-based variational approach to simultaneous image restoration and segmentation,” in ICIP, vol. 1, 2002, pp. I–I.
[48] E. Kee, S. Paris, S. Chen, and J. Wang, “Modeling and removing spatially-varying optical blur,” in ICCP, 2011, pp. 1–8.
[49] Y. Xue, Q. Yang, G. Hu, K. Guo, and L. Tian, “Deep-learning-augmented computational miniature mesoscope,” Optica, vol. 9, no. 9, pp. 1009–1021, 2022.
[50] L. Denis, E. Thiébaut, F. Soulez, J.-M. Becker, and R. Mourya, “Fast approximations of shift-variant blur,” IJCV, vol. 115, pp. 253–278, 2015.
[51] N. Wiener, Extrapolation, interpolation, and smoothing of stationary time series: with engineering applications. MIT press Cambridge, MA, 1949, vol. 113, no. 21.
[52] W. H. Richardson, “Bayesian-based iterative method of image restoration,” JOSA, vol. 62, no. 1, pp. 55–59, 1972.
[53] L. B. Lucy, “An iterative technique for the rectification of observed distributions,” The Astronomical Journal, vol. 79, p. 745, 1974.
[54] J. Sun, W. Cao, Z. Xu, and J. Ponce, “Learning a convolutional neural network for non-uniform motion blur removal,” in CVPR, 2015, pp. 769–777.
[55] S. Chen, H. Feng, K. Gao, Z. Xu, and Y. Chen, “Extreme-quality computational imaging via degradation framework,” in ICCV, 2021, pp. 2612–2621.
[56] S. Chen, T. Lin, H. Feng, Z. Xu, Q. Li, and Y. Chen, “Computational optics for mobile terminals in mass production,” TPAMI, vol. 45, no. 4, pp. 4245–4259, 2023.
[57] T. Lin, S. Chen, H. Feng, Z. Xu, Q. Li, and Y. Chen, “Non-blind optical degradation correction via frequency self-adaptive and finetune tactics,” OE, vol. 30, no. 13, pp. 23 485–23 498, 2022.
[58] Q. Ma, J. Jiang, X. Liu, and J. Ma, “Learning a 3D-CNN and transformer prior for hyperspectral image super-resolution,” Information Fusion, vol. 100, p. 101907, 2023.
[59] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study,” in CVPRW, 2017, pp. 1122–1131.
[60] X. Deng, H. Wang, M. Xu, Y. Guo, Y. Song, and L. Yang, “LAU-net: Latitude adaptive upscaling network for omnidirectional image super-resolution,” in CVPR, 2021, pp. 9189–9198.
[61] J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba, “Recognizing scene viewpoint using panoramic place representation,” in CVPR, 2012, pp. 2695–2702.
[62] K. Zhang, J. Liang, L. Van Gool, and R. Timofte, “Designing a practical degradation model for deep blind image super-resolution,” in ICCV, 2021, pp. 4771–4780.
[63] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data,” in ICCVW, 2021, pp. 1905–1914.
[64] E. Huggins, “Introduction to fourier optics,” Physics Teacher, vol. 45, no. 6, pp. 364–368, 2007.
[65] V. N. Mahajan, “Zernike circle polynomials and optical aberrations of systems with circular pupils,” AO, vol. 33, no. 34, pp. 8121–8124, 1994.
[66] D. Li et al., “Involution: Inverting the inherence of convolution for visual recognition,” in CVPR, 2021, pp. 12 321–12 330.
[67] Y. Jiang, B. Wronski, B. Mildenhall, J. T. Barron, Z. Wang, and T. Xue, “Fast and high quality image denoising via malleable convolution,” in ECCV, vol. 13678, 2022, pp. 429–446.
[68] Q. Zhang, Y. Xu, J. Zhang, and D. Tao, “VSA: Learning varied-size window attention in vision transformers,” in ECCV, vol. 13685, 2022, pp. 466–483.
[69] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” TIP, vol. 13, no. 4, pp. 600–612, 2004.
[70] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.
[71] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local nash equilibrium,” in NeurIPS, vol. 30, 2017, pp. 6626–6637.
[72] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” SPL, vol. 20, no. 3, pp. 209–212, 2012.
[73] A. Mittal, A. Moorthy, and A. Bovik, “Referenceless image spatial quality evaluation engine,” in ACSSC, vol. 38, 2011, pp. 53–54.
[74] X. Chu, L. Chen, C. Chen, and X. Lu, “Improving image restoration by revisiting global information aggregation,” in ECCV, 2022, pp. 53–71.
[75] L. Beyer et al., “FlexiViT: One model for all patch sizes,” in CVPR, 2023, pp. 14 496–14 506.
[76] J. Liang, H. Zeng, and L. Zhang, “Details or artifacts: A locally discriminative learning approach to realistic image super-resolution,” in CVPR, 2022, pp. 5647–5656.
[77] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in NeurIPS, vol. 33, 2020, pp. 6840–6851.
[78] A. Graikos, N. Malkin, N. Jojic, and D. Samaras, “Diffusion models as plug-and-play priors,” in NeurIPS, vol. 35, 2022, pp. 14 715–14 728.
[79] R. Timofte et al., “NTIRE 2017 challenge on single image super-resolution: Methods and results,” in CVPRW, 2017, pp. 1110–1121.
[80] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable ConvNets V2: More deformable, better results,” in CVPR, 2019, pp. 9300–9308.
[81] C. Kamann and C. Rother, “Benchmarking the robustness of semantic segmentation models,” in CVPR, 2020, pp. 8825–8835.
[82] L. Hoyer, D. Dai, and L. Van Gool, “DAFormer: Improving network architectures and training strategies for domain-adaptive semantic segmentation,” in CVPR, 2022, pp. 9914–9925.
[83] A. Jaiswal, X. Zhang, S. H. Chan, and Z. Wang, “Physics-driven turbulence image restoration with stochastic refinement,” in ICCV, 2023, pp. 12 136–12 147.
[84] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” TPAMI, vol. 45, no. 4, pp. 4713–4726, 2023.
[85] X. Zhang, H. Zeng, S. Guo, and L. Zhang, “Efficient long-range attention network for image super-resolution,” in ECCV, vol. 13677, 2022, pp. 649–667.

Appendix A Sample Images of PALHQ

We show the shooting device and sample images of PALHQ in Fig. A.1. This dataset of high-quality PAL images covers a wide variety of scenes. The PAL provides 360° imaging of the surroundings, but with a blind area in the center of the image due to the reflective surface in the center FoV. As illustrated on the left of Fig. A.1, the PAL is usually pointed toward the sky in applications, so the occlusion of the blind area has little influence on the acquisition of panoramic information. PALHQ serves as the cornerstone of our PCIE for training a robust model to process MPIP images. Additionally, PALHQ images can be transformed into unfolded panoramas via equirectangular projection (ERP) for various other panoramic image processing applications. Other types of minimalist PAL design, e.g., using fewer lenses or applying metasurfaces, would also benefit from PALHQ for training learning-based recovery models to improve imaging quality.

Appendix B Visualization of the PSF Map

The process of producing PSF maps is visualized in Fig. B.1. As described in Sec. IV-A, we locate the corresponding FoV of the target pixel and obtain the R/G/B PSFs. Then, they are rotated according to their locations and compressed into a shape of ${k^{\prime}\times{k^{\prime}}\times 1}$ via adaptive average pooling, where $k^{\prime}$ is set to $5$ as an example. Finally, we reshape the compressed kernel into the channel dimension and insert it at the pixel to produce the PSF map.
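The compression and reshaping steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the helper names, the toy Gaussian `toy_psf` lookup, and the choice to average the R/G/B PSFs into a single channel before pooling are all hypothetical, and the rotation step is omitted for brevity.

```python
import numpy as np

def adaptive_avg_pool2d(x, out_size):
    """Average-pool a 2D array to (out_size, out_size) with adaptive bins."""
    h, w = x.shape
    out = np.empty((out_size, out_size), dtype=np.float64)
    for i in range(out_size):
        r0 = (i * h) // out_size
        r1 = -((-(i + 1) * h) // out_size)  # ceil((i + 1) * h / out_size)
        for j in range(out_size):
            c0 = (j * w) // out_size
            c1 = -((-(j + 1) * w) // out_size)
            out[i, j] = x[r0:r1, c0:c1].mean()
    return out

def psf_to_vector(psf_rgb, k_prime=5):
    """Compress a (k, k, 3) PSF to a vector of length k_prime**2."""
    gray = psf_rgb.mean(axis=2)          # assumption: merge R/G/B into one channel
    pooled = adaptive_avg_pool2d(gray, k_prime)
    return pooled.reshape(-1)            # kernel moved into the channel dimension

def build_psf_map(height, width, psf_lookup, k_prime=5):
    """Fill an (H, W, k_prime**2) map with the compressed PSF of every pixel."""
    psf_map = np.zeros((height, width, k_prime * k_prime))
    for y in range(height):
        for x in range(width):
            psf_map[y, x] = psf_to_vector(psf_lookup(y, x), k_prime)
    return psf_map

def toy_psf(y, x, k=15):
    """Hypothetical stand-in for the simulated R/G/B PSF at pixel (y, x)."""
    sigma = 1.0 + 0.2 * np.hypot(y, x)   # blur grows with distance from the origin
    ax = np.arange(k) - k // 2
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    g /= g.sum()
    return np.stack([g, g, g], axis=2)

psf_map = build_psf_map(4, 4, toy_psf, k_prime=5)
print(psf_map.shape)
```

With $k^{\prime}=5$, each pixel of the map carries $25$ values, so the map can be concatenated with or fed alongside the image as an extra modality.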

Appendix C Pipeline for OIQE Calculation

The detailed pipeline for calculating the defined OIQE is shown in Fig. C.1. We capture $8$ testing images of the checkerboard under different ISP settings with our MPIP-P1 and AC pipeline. For evaluating different models in terms of aberration correction in real-world scenes, the processed testing images are fed into the OIQE pipeline, where sample knife-edges from different FoVs are cropped for SFR testing. In OIQE, the comprehensive results of different shooting settings and FoVs present a more credible evaluation of the ability to correct optical aberrations.
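For readers unfamiliar with knife-edge SFR testing, the generic step applied to each cropped edge can be sketched as follows. This is a simplified illustration only: it assumes a perfectly vertical edge and skips the oversampling used by slanted-edge methods, and it does not reproduce how OIQE aggregates the results over FoVs and ISP settings, which is defined in the main text. The function name `edge_mtf` and the synthetic edges are hypothetical.

```python
import numpy as np

def edge_mtf(patch):
    """Estimate an MTF curve from a grayscale patch containing a vertical edge."""
    esf = patch.mean(axis=0)               # edge spread function (average over rows)
    lsf = np.diff(esf)                     # line spread function (derivative of ESF)
    lsf = lsf * np.hanning(lsf.size)       # window to suppress noise at the borders
    mtf = np.abs(np.fft.rfft(lsf))         # modulus of the Fourier transform
    return mtf / mtf[0]                    # normalize to 1 at zero frequency

# Synthetic test edges: an ideal step and a 9-pixel linear ramp (blurred edge).
cols = np.arange(64)
sharp_profile = (cols >= 32).astype(float)
blurred_profile = np.clip((cols - 27.5) / 9.0, 0.0, 1.0)
sharp_patch = np.tile(sharp_profile, (32, 1))
blurred_patch = np.tile(blurred_profile, (32, 1))

mtf_sharp = edge_mtf(sharp_patch)
mtf_blurred = edge_mtf(blurred_patch)
print(mtf_sharp[5], mtf_blurred[5])
```

A better-corrected edge keeps a higher MTF at mid and high spatial frequencies, which is the property an MTF-based score rewards.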

Appendix D Implementation Details of the Simulation Model

In this section, we describe in detail how to generate aberration images based on the simulation model and the Zemax software. To apply Eq. (1) for simulating aberrations, a set of PSFs under all FoVs of the target optical lens is required. We input the structure of MPIP into Zemax, then calculate the Zernike standard coefficients under different FoVs ($128$ FoVs from the minimum to the maximum FoV) and wavelengths ($31$ wavelengths from $400$ to $700$ nm), where the first $37$ polynomials are kept as a common practice. In this way, a Zernike coefficient matrix with a shape of $31{\times}128{\times}37$ is produced. Then, we plug the coefficients into Eq. (2) to describe the wavefronts under all FoVs and wavelengths. The random disturbance strategy is applied here to perturb the coefficients, yielding multiple virtual aberration distributions. Before calculating PSFs, we also need the spot diagram and illumination distribution of the MPIP from Zemax, where the sizes of the spots determine the kernel sizes of the PSFs (the ratio of the spot size to the pixel size), and the illumination provides the relative amplitude of the PSFs. Finally, the wavefronts are transformed to PSFs $K_{\theta}(x,y,\lambda)$ via Eq. (D.1) to Eq. (D.3):

$$\mathcal{P}_{\theta}(x,y,\lambda)=P(x,y)e^{\mathrm{i}\Phi_{\theta}(x,y,\lambda)}, \tag{D.1}$$

$$E_{\theta}(x,y,\lambda)=\frac{E_{0}}{{\lambda}d}\iint{\mathcal{P}_{\theta}(x^{\prime},y^{\prime},\lambda)e^{-\mathrm{i}\frac{2\pi}{{\lambda}d}(x^{\prime}x+y^{\prime}y)}dx^{\prime}dy^{\prime}}, \tag{D.2}$$

$$K_{\theta}(x,y,\lambda)={|E_{\theta}(x,y,\lambda)|^{2}}, \tag{D.3}$$

where $P(x,y)$ is the circ function and $d$ is the distance from the exit pupil to the image plane. With multiple sets of $K_{\theta}(x,y,\lambda)$ under different random disturbances, we generate aberration images of multiple virtual MPIP samples via Eq. (1), where the high-quality MPIP images are transformed to raw images by an inverse ISP (Gamma Decompression, Inverse Color Correction Matrix, and Inverse White Balance), and the aberrated raw images are further processed by an ISP (Mosaicing, Noise Addition, Demosaicing, White Balance, Color Correction Matrix, and Gamma Compression) to obtain the final results.
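The chain from Eq. (D.1) to Eq. (D.3) maps directly onto a discrete Fourier transform. The sketch below is a minimal NumPy illustration under stated assumptions: the phase $\Phi$ is taken to be already expressed in radians, the constant factor $E_{0}/(\lambda d)$ is dropped because the PSF is normalized afterwards, a single defocus term stands in for the $37$ Zernike polynomials, and the grid size and the function name `psf_from_phase` are hypothetical.

```python
import numpy as np

def psf_from_phase(phase, aperture):
    """Turn a pupil-plane phase map (radians) into an energy-normalized PSF."""
    pupil = aperture * np.exp(1j * phase)                           # Eq. (D.1)
    field = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(pupil)))   # Eq. (D.2), constants dropped
    psf = np.abs(field) ** 2                                        # Eq. (D.3)
    return psf / psf.sum()

# Pupil grid: the unit circle (circ function) occupies the central half of the
# array, so the zero padding around it samples the PSF more finely.
n = 128
ax = (np.arange(n) - n // 2) / (n / 4)
xx, yy = np.meshgrid(ax, ax)
rho = np.hypot(xx, yy)
aperture = (rho <= 1.0).astype(float)

# Zernike defocus term, used here as a stand-in wavefront.
defocus = np.sqrt(3.0) * (2.0 * rho ** 2 - 1.0)

psf_ideal = psf_from_phase(np.zeros((n, n)), aperture)
psf_defocused = psf_from_phase(1.5 * defocus, aperture)
print(psf_ideal[n // 2, n // 2], psf_defocused[n // 2, n // 2])
```

Repeating this for every FoV and wavelength, with perturbed coefficients, would yield one set of $K_{\theta}(x,y,\lambda)$ per virtual lens sample; the resampling to the spot-diagram kernel size and the illumination weighting are not shown.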

Appendix E Implementation Details of Model Testing

During the testing (inference) stage, the input is the full-resolution panoramic images, except for Restormer and NAFNet. For global self-attention-based methods, e.g., Restormer, the performance of the model is sensitive to the image resolution, which requires the same resolution during testing and training to maintain consistent high performance [74, 75]. Moreover, in our tasks, a larger resolution of testing input represents more complex spatially variant degradation (related to more FoVs), which introduces a larger gap with the training data. Consequently, in TABLE I of the paper, the results of Restormer are those under the cropping testing strategy, where the input image is cropped into overlapped patches of $256{\times}256$ . The same is true for NAFNet. To further illustrate this issue, we test the performance of Restormer and NAFNet under different crop sizes of input, as shown in Table E.1. When the training crop size is $256{\times}256$ , the performances of the models drop significantly when the testing crop size increases from $256{\times}256$ to $3152{\times}3152$ (the full-resolution).

TABLE E.1: The impacts of input resolution on Restormer and NAFNet. The models are trained on $256{\times}256$ image patches and tested with different resolutions. We take the results on SynMPIP-P1 as an example. The results in the table are PSNR/SSIM.

Method	Input resolution 3152	Input resolution 1024	Input resolution 256
Restormer	27.424/0.8826	30.333/0.9088	32.971/0.9287
NAFNet	28.494/0.8853	30.481/0.9099	32.837/0.9274
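The cropping testing strategy described above can be sketched as a tiled inference loop. This is a minimal illustration, not the evaluation code used for the tables: the overlap width, the uniform averaging of overlapping regions, and the names `tile_starts` and `tiled_inference` are assumptions, and `model` is any callable that maps a patch to a restored patch of the same size (as in the AC pipeline).

```python
import numpy as np

def tile_starts(length, patch, stride):
    """Start indices of patches that cover [0, length) with the given stride."""
    starts = list(range(0, max(length - patch, 0) + 1, stride))
    if starts[-1] + patch < length:
        starts.append(length - patch)      # last patch is aligned to the border
    return starts

def tiled_inference(image, model, patch=256, overlap=32):
    """Run `model` on overlapped patches and average the overlapping outputs."""
    h, w = image.shape[:2]
    stride = patch - overlap
    out = np.zeros_like(image, dtype=np.float64)
    weight = np.zeros((h, w) + (1,) * (image.ndim - 2))
    for y in tile_starts(h, patch, stride):
        for x in tile_starts(w, patch, stride):
            tile = image[y:y + patch, x:x + patch]
            out[y:y + patch, x:x + patch] += model(tile)
            weight[y:y + patch, x:x + patch] += 1.0
    return out / weight

# Usage with an identity "model": the stitched output must equal the input.
rng = np.random.default_rng(0)
image = rng.random((600, 700, 3))
restored = tiled_inference(image, lambda t: t)
print(np.abs(restored - image).max())
```

Keeping the testing patch size equal to the training crop size is the point of the strategy; the blending rule only affects how seams between patches are handled.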

Appendix F Ground-Truth Generation Pipeline for Checkerboard Images

Capturing real data with Ground Truth (GT) is challenging in the computational imaging field, and no reliable data acquisition pipeline is available in related work. Capturing GT images displayed on a screen with the optical system under test could be a solution [16]. However, a gap remains between a screen-displayed image and a real scene. Moreover, for a special panoramic system such as MPIP, no suitable screen is available for capturing paired panoramic images.

Consequently, we make an early effort to generate GT images based on captured special patterns. For a black-and-white geometric pattern degraded by aberrations, e.g., the checkerboard, we only need to extract its edges and re-color each part according to its original distribution to generate its GT pattern. This method was previously applied in [55] to generate checkerboard pairs for training a degradation network. In our case, we crop patches of checkerboard test images captured by MPIP under different FoVs, and generate the corresponding GTs by the above method, as shown in Fig. F.1. In this way, we only need to crop the patches of the same area from the imaging results of PCIE, and then calculate error metrics, e.g., PSNR and SSIM, against the corresponding GTs. The checkerboard testing set of RealMPAL consists of $7$ paired images, where checkerboard patches under different FoVs and ISP settings are included.
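One simple way to realize the extract-and-re-color idea is sketched below. It is a hypothetical simplification rather than the procedure used to build the testing set: edges are found by a global threshold, each region is re-colored with its median intensity, and the degraded patch is synthesized with a box blur instead of being captured by MPIP. All function names are illustrative.

```python
import numpy as np

def checkerboard_gt(patch):
    """Build a two-level GT from a degraded grayscale checkerboard patch in [0, 1]."""
    lo, hi = np.percentile(patch, [5, 95])
    bright = patch > 0.5 * (lo + hi)        # assumed edge extraction: global threshold
    # Assumed re-coloring rule: each region takes its own median intensity.
    return np.where(bright, np.median(patch[bright]), np.median(patch[~bright]))

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def box_blur(img, radius=1):
    """Toy stand-in for aberration blur: a (2r+1) x (2r+1) mean filter."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

# Synthetic example: a 64 x 64 checkerboard with 16-pixel squares.
yy, xx = np.indices((64, 64))
ideal = np.where(((yy // 16) + (xx // 16)) % 2 == 0, 0.1, 0.9)
degraded = box_blur(ideal, radius=2)
gt = checkerboard_gt(degraded)
print(psnr(degraded, ideal), psnr(gt, ideal))
```

In practice, the GT would be generated from the captured MPIP patch, and `psnr` would then be evaluated between that GT and the same region of the PCIE output.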

However, the pipeline is only a preliminary experiment and still has some weaknesses. For example, the coloring method for GT is worth further investigation, because even a checkerboard captured by a well-designed PAL is not as ideal as the GT. The PSNR and SSIM calculated between recovered images and GT might therefore not fully reflect the aberration correction ability of the model. In comparison, the OIQE, which is defined based on the optical metric MTF, is more credible and suitable for evaluating the aberration correction task.

Appendix G Implementation Details of User Study

To conduct a subjective evaluation of the imaging results of PCIE in real-world scenes, we randomly sample $10$ images from RealMPAL3K and $10$ images from RealMPAL1K for the AC and SR $\&$ AC pipelines, respectively. $42$ volunteers are invited to participate in the survey, where they go through the imaging results of all the methods in Table III and select the half of them with the best image quality. The final statistic is the percentage of times each method is selected, which is reported as U.S. in Table III.
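The statistic can be computed as in the sketch below, which uses randomly generated votes purely for illustration. The number of methods ($6$) is a placeholder, since Table III is not reproduced here, and the function name `selection_rate` is hypothetical.

```python
import numpy as np

def selection_rate(votes):
    """Percentage of (volunteer, image) pairs in which each method was selected.

    votes: boolean array of shape (volunteers, images, methods).
    """
    return 100.0 * votes.mean(axis=(0, 1))

rng = np.random.default_rng(0)
n_volunteers, n_images, n_methods = 42, 10, 6   # 6 methods is a placeholder
votes = np.zeros((n_volunteers, n_images, n_methods), dtype=bool)
for v in range(n_volunteers):
    for i in range(n_images):
        # Each volunteer selects half of the methods for every image.
        chosen = rng.choice(n_methods, size=n_methods // 2, replace=False)
        votes[v, i, chosen] = True

rates = selection_rate(votes)
print(np.round(rates, 2))
```

Because every volunteer picks exactly half of the methods, the rates average to $50\%$ across methods, so a score above $50\%$ indicates a method preferred more often than chance.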

Appendix H Failure Case of PSF-aware Mechanisms

From the quantitative results on synthetic datasets, RRDB+ delivers better results than RRDB on most metrics and tasks. However, compared to the significant improvements that the PSF-aware mechanisms bring to SwinIR and GRL, the improvements to RRDB are limited, and FID even becomes worse in some cases (AC-SynMPIP-P1 and SR $\&$ AC-SynMPIP-P1). The OIQE, which goes from $66.94\%$ to $58.28\%$, also illustrates the limitations of PFM on RRDB. This is a failure case of the PSF-aware mechanisms.

We speculate that this is caused by the unsatisfactory robustness of the CNN-based model to the domain shift of testing data [14, 81, 82]. In our evaluation, for both synthetic and real data, the aberration distribution, i.e., the PSF distribution, is slightly different from the standard distribution used in the PSF representation, which is often the case in real-world scenes due to the manufacturing and assembly errors of the lens. In this case, the model has to infer the actual distribution from the standard PSF distribution to guide the PFM. Consequently, the CNN-based model is seriously affected by the domain gap, leading to unreliable prediction of the dynamic kernels in PFM and thus worse performance.

Appendix I More Results of Generative Models

Besides the GAN-based training strategies, the recently developed diffusion models [77, 78, 83] show a strong ability to generate realistic images with rich details. Consequently, we further explore the potential of the diffusion model in our AC task.

TABLE I.1: The quantitative results and computational overhead of SwinIR and corresponding generative models. The denoising steps of SR3 are set to $10$ to make sure that the model can converge. The parameters and FLOPs of SR3 are multiplied by $10$ considering the steps.

Method	PSNR	SSIM	LPIPS	FID	Params	FLOPs
SwinIR	32.913	0.9291	0.0446	3.670	11.97M	407.76G
SwinIR+GAN	29.916	0.8920	0.0449	4.254	11.97M	407.76G
SwinIR+LDL	31.770	0.9130	0.0297	3.444	11.97M	407.76G
SwinIR+SR3	31.281	0.8412	0.4110	17.956	11.97M+13.85M	407.76G+3481G

Following [83], we refine the PSNR-oriented SwinIR model with the diffusion model SR3 [84], where the recovered images of SwinIR serve as the condition of the diffusion model, considering that PALHQ is too small to train a diffusion model from scratch. The experimental results of the refined model are shown in Fig. I.1 and Table I.1, where the diffusion model (SwinIR+SR3) cannot bring improvements on perceptual metrics as GAN and LDL do, but instead leads to worse performance.

It is known that diffusion models require training on large amounts of data with a large number of denoising steps ($2000$ as common practice) for good performance [77, 84]. When the dataset is small, the number of denoising steps has to be kept small to make sure that the model can converge during training ($10$ steps in our case). However, few steps mean a weak denoising ability, leading to the poor performance of the diffusion model (residual noise and color deviation). Moreover, for aberration correction of high-resolution panoramic images, the computational overhead is already considerable, so the additional denoiser steps are unacceptable. The computational overhead of SwinIR and the additional diffusion model is also shown in Table I.1, where the Floating Point Operations (FLOPs) are calculated with an input resolution of $1024{\times}1024$.

In summary, the diffusion model is currently not suitable for PCIE, but it could become a competitive solution in the future if PALHQ is expanded into a larger dataset and an efficient inference pipeline is proposed.

Appendix J Training Details

The proposed PART has $19.27$M parameters and takes $52$ hours to train for $200$k iterations with a batch size of $8$ on a single A800 GPU. Due to the dynamic convolution operations in the model, the small amount of training data, and the guidance of the PSF prior, the PSF-aware Transformer can be trained well in a small number of iterations. The training curve in Fig. J.1 shows that the model has converged by the end of training.

Appendix K The Analysis of Computational Overhead

TABLE K.1: The computational overheads of representative methods. The FLOPs, memory cost, and inference latency are calculated with the input resolution of $1024{\times}1024$ on a single A800 GPU.

Group	Method	PSNR	SSIM	LPIPS	FID	Params	FLOPs	Memory	Latency
Baselines	RRDB	32.716	0.9265	0.0469	3.704	16.72M	588.28G	0.58GB	0.15s
Baselines	SwinIR	32.913	0.9291	0.0446	3.670	11.97M	407.76G	0.65GB	0.97s
Baselines	GRL	32.369	0.9256	0.0457	4.234	20.27M	649.24G	1.23GB	1.37s
PSF-aware	RRDB+	32.816	0.9271	0.0456	3.746	18.61M	589.75G	0.83GB	0.27s
PSF-aware	GRL+	32.847	0.9292	0.0454	3.627	25.60M	653.43G	2.11GB	1.41s
PSF-aware	PART	33.143	0.9304	0.0435	3.571	19.27M	412.08G	1.90GB	1.17s
PSF-aware	PART-S	32.812	0.9278	0.0452	3.710	3.89M	77.47G	1.00GB	0.73s

The parameters, FLOPs (Floating Point Operations), memory cost, and inference latency of PSF-aware methods with their baselines are presented in Table K.1. The FLOPs, memory cost, and inference latency are calculated with the input resolution of $1024{\times}1024$ on a single A800 GPU, which only needs to be roughly scaled for other input resolutions.

As shown in Table K.1, the PSF-aware mechanisms introduce only negligible additional computational overhead ($0.25\%{\sim}1.06\%$ of FLOPs), while bringing significant improvements over the baselines. The increase in the number of parameters is mainly due to the prediction of the dynamic convolution kernel of each pixel, which introduces little computational overhead. The high latency is mainly caused by the transformer-based backbones, while the additional latency brought by the PSF-aware mechanisms is small ($0.04{\sim}0.2$ s). Benefiting from the efficiency and effectiveness of the PSF-aware mechanisms, our future work will focus on more efficient backbones, e.g., lightweight SR backbones [85], to achieve lightweight and high-quality panoramic imaging.

Moreover, we also release a lightweight version of PART, i.e., PART-S (with smaller depth and embedding dimension), considering the potential applications of PCIE in mobile and wearable terminals. With only $19.00\%$ of the FLOPs of the baseline SwinIR ($77.47$G versus $407.76$G), PART-S achieves performance comparable to SwinIR.
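The overhead figures quoted in this appendix can be re-derived from the FLOPs and latency columns of Table K.1. The short script below does so; pairing PART with SwinIR as its baseline is an inference from the quoted range and from the comparison made for PART-S, and the variable names are illustrative.

```python
# FLOPs (in G) and latency (in s) copied from Table K.1.
flops_g = {
    "RRDB": 588.28, "SwinIR": 407.76, "GRL": 649.24,
    "RRDB+": 589.75, "GRL+": 653.43, "PART": 412.08, "PART-S": 77.47,
}
latency_s = {
    "RRDB": 0.15, "SwinIR": 0.97, "GRL": 1.37,
    "RRDB+": 0.27, "GRL+": 1.41, "PART": 1.17, "PART-S": 0.73,
}

# (baseline, PSF-aware variant); PART vs. SwinIR is an assumed pairing.
pairs = [("RRDB", "RRDB+"), ("GRL", "GRL+"), ("SwinIR", "PART")]

def relative_increase(base, new):
    """Relative increase of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

flops_overhead = {new: relative_increase(flops_g[base], flops_g[new]) for base, new in pairs}
extra_latency = {new: latency_s[new] - latency_s[base] for base, new in pairs}
part_s_ratio = 100.0 * flops_g["PART-S"] / flops_g["SwinIR"]

for name in flops_overhead:
    print(f"{name}: +{flops_overhead[name]:.2f}% FLOPs, +{extra_latency[name]:.2f} s latency")
print(f"PART-S / SwinIR FLOPs: {part_s_ratio:.2f}%")
```

The extremes of the printed values match the $0.25\%{\sim}1.06\%$ FLOPs range and the $0.04{\sim}0.2$ s latency range stated in the text, and the last line reproduces the $19.00\%$ figure for PART-S.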