Minimalist and High-Quality Panoramic Imaging with PSF-aware Transformers

Qi Jiang1, Shaohua Gao1, Yao Gao, Kailun Yang2, Zhonghua Yi, Hao Shi, Lei Sun, and Kaiwei Wang2. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No. 12174341 and in part by Hangzhou SurImage Technology Company Ltd. Q. Jiang, S. Gao, Y. Gao, Z. Yi, H. Shi, L. Sun, and K. Wang are with the State Key Laboratory of Modern Optical Instrumentation and the National Engineering Research Center of Optical Instrumentation, Zhejiang University, Hangzhou 310027, China. K. Yang is with the School of Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University, Changsha 410082, China. 1Equal contribution. 2Corresponding authors: Kaiwei Wang and Kailun Yang. (E-mail: wangkaiwei@zju.edu.cn, kailun.yang@hnu.edu.cn.)
Abstract

High-quality panoramic images with a Field of View (FoV) of 360° are essential for contemporary panoramic computer vision tasks. However, conventional imaging systems come with sophisticated lens designs and heavy optical components. This disqualifies their usage in many mobile and wearable applications where thin and portable, minimalist imaging systems are desired. In this paper, we propose a Panoramic Computational Imaging Engine (PCIE) to achieve minimalist and high-quality panoramic imaging. With fewer than three spherical lenses, a Minimalist Panoramic Imaging Prototype (MPIP) is constructed based on the design of the Panoramic Annular Lens (PAL), but with low-quality imaging results due to aberrations and small image plane size. We propose two pipelines, i.e., Aberration Correction (AC) and Super-Resolution and Aberration Correction (SR&AC), to solve the image quality problems of MPIP, with imaging sensors of small and large pixel size, respectively. To leverage the prior information of the optical system, we propose a Point Spread Function (PSF) representation method to produce a PSF map as an additional modality. A PSF-aware Aberration-image Recovery Transformer (PART) is designed as a universal network for the two pipelines, in which the self-attention calculation and feature extraction are guided by the PSF map. We train PART on synthetic image pairs from simulation and put forward the PALHQ dataset to fill the gap of real-world high-quality PAL images for low-level vision. A comprehensive variety of experiments on synthetic and real-world benchmarks demonstrates the impressive imaging results of PCIE and the effectiveness of the PSF representation. We further deliver heuristic experimental findings for minimalist and high-quality panoramic imaging, in terms of the choices of prototype and pipeline, network architecture, training strategies, and dataset construction. Our dataset and code will be available at https://github.com/zju-jiangqi/PCIE-PART.

Index Terms:
Panoramic imaging, minimalist optical systems, computational imaging, vision transformer, point spread function.

I Introduction

Refer to caption
Figure 1: Illustration of the proposed MPIP and its key issue of low image quality, which is properly addressed with PSF-aware transformer: PART. (a) Minimalist Panoramic Imaging Prototype (MPIP); (b) Comparison between real products of conventional panoramic imaging systems and PAL-based MPIP; (c) Low-quality image captured by MPIP. (d) High-quality image recovered by PART. In this way, we realize minimalist and high-quality panoramic imaging with PSF-aware transformers.

Image processing of panoramic images with an ultra-wide Field of View (FoV) of 360° is growing popular for achieving a holistic understanding of the entire surrounding scene [1, 2, 3, 4]. While the 360° panoramas suffer from inherent defects of low angular resolutions and severe geometric image distortions, a variety of low-level vision works is conducted in terms of image super-resolution [5, 6, 7] and image rectification [8, 9, 10], to produce high-quality panoramas for photography and down-stream tasks. However, the image blur caused by optical aberrations of the applied lens is seldom explored.

Most contemporary works on panoramic images rest on the assumption that the optical system is aberration-free, so that the imaging result is clear and sharp. Conventional panoramic optical systems, while widely applied, come with notoriously sophisticated lens designs, composed of multiple sets of lenses with complex surface types [11, 12, 13], to reach high imaging quality. However, this assumption often fails as the demand for thin, portable imaging systems, i.e., Minimalist Optical Systems (MOS), grows stronger in mobile and wearable applications [14]. Without sufficient lens groups for aberration correction, the aberration-induced image blur is inevitable for MOS. In this case, the imaging quality drops significantly and often catastrophically, and the unsatisfactory imaging performance disqualifies its potential usage in upper-level applications. This raises the question of whether we can strike a fine balance between high-quality panoramic imaging and minimalist panoramic optical systems.

With the rapid development of digital image processing, Computational Imaging (CI) methods for MOS [15, 16, 17] appear as a preferred solution to this issue. These methods often propose optical designs with few necessary optical components to meet the basic demands of specific applications, e.g., the FoV, depth of field, and focal length, followed by an image post-processing model to recover the aberration-image. Recent research works [18, 19, 20] further design end-to-end deep learning frameworks for joint optimization of optical systems and image recovery networks. In this paper, based on the idea of computational imaging, we propose Panoramic Computational Imaging Engine (PCIE), a framework for minimalist and high-quality panoramic imaging, to solve the trade-off between high-quality panoramic imaging and minimalist panoramic optical systems as a whole, without sitting on only one of its sides.

Motivated by modern panoramic lens designs [12, 21, 22], PCIE builds on a Minimalist Panoramic Imaging Prototype (MPIP) shown in Fig. 1(a), which is composed of an essential panoramic head for 360° panoramic imaging and a relay lens group for aberration correction. Specifically, we select the structure of the Panoramic Annular Lens (PAL) [23, 24], a more compact solution to 360° panoramas, as an example for MPIP, where a catadioptric PAL head is equipped to replace the complex lens group [11] in the conventional fisheye lens [25, 26]. To achieve a minimalist design, the proposed MPIP is composed of 1 spherical lens for the PAL head, and 1 or 2 simple spherical lenses for the relay lens group, which can image over 360° FoV with only 40% of the number of lenses and 60% of the volume of conventional panoramic imaging systems, as shown in Fig. 1(b). However, as illustrated in Fig. 1(c), the uncorrected optical aberrations and the limited image plane size lead to image corruptions, i.e., aberration-induced spatially-variant image blur and low imaging resolution.

To address the issues of MPIP, engaged with the information of the Point Spread Function (PSF) from optical design, we propose the PSF-aware Aberration-image Recovery Transformer (PART): a transformer-based low-quality image recovery paradigm for MPIP. Different from previous transformer baselines, e.g., SwinIR [27], PART exploits the PSF, the forward function characterizing the aberration-induced blur, to attain better results. A PSF representation method is delivered to represent PSF kernels in the form of a feature map, which serves as an additional modality for the network. Based on the representation, we design two PSF-aware mechanisms inspired by the physical meanings of the aberration-induced blur.

Specifically, the PSF-aware Feature Modulator (PFM) builds on the idea of modeling the inverse process of degradation convolution of PSFs, where pixel-adaptive convolution kernels are learned from the PSF representation to modulate the feature map gradually during recovery. PFM is a plug-and-play PSF-aware mechanism that can be inserted into other recovery models. In addition, PSF-aware Mix-Attention Block (PMAB) is proposed as the basic unit of PART, which comprises: (1) Vanilla window attention of SwinIR [27] for capturing long-range dependency; (2) PSF-aware Varied-Size Attention (P-VSA), where diverse windows of varied sizes and locations are learned from the PSF representation to provide dynamic receptive fields, motivated by the varied PSF sizes in different FoVs; (3) PFM of small kernel size for enhancing the feature extraction of local details. With PART, the low-quality image captured by MPIP can be smoothly recovered (see Fig. 1(d)) for minimalist and high-quality panoramic imaging.

To facilitate the training of PART, wave-based imaging simulation with random perturbation [28] is utilized for generating clear-blur image pairs. To fill the gap of ground-truth images of PAL, we record a high-quality PAL image dataset named PALHQ through a well-designed PAL in varied scenes. Based on PALHQ, we set up two tasks to formalize the key issue of low-quality MPIP images: (1) Aberration Correction (AC) of high-resolution images taken by sensors with small pixel size, and (2) Super-Resolution and Aberration Correction (SR&AC) of low-resolution images from sensors with large pixel size. Then, representative models of image super-resolution (SR) [27, 29, 30, 31, 32, 33, 34], image deblurring (Deblur) [35, 36, 37, 38], and image restoration with PSF-aware mechanisms (PSF-aware) [28] are evaluated, where PCIE enables all models to produce impressive panoramic imaging results.

Furthermore, we manufacture an MPIP sample with better image quality and capture the real-world dataset RealMPAL to benchmark models on real-world scenes. Experimental results reveal that PFM enhances the performance of the baselines (see Fig. 2) and PART sets the state of the art on both synthetic and real-world benchmarks, where the PSF representation plays a significant role to enable effective PSF-aware mechanisms. We also conduct extensive experiments to investigate the potential of GAN-based training strategies and the effectiveness of PALHQ in PCIE. The generative model appears to be competitive for generating more realistic details if the artifacts can be well suppressed. Additionally, PALHQ serves as the cornerstone of PCIE for training a robust model for annular images. At a glance, we deliver the following contributions:

Refer to caption
Figure 2: The proposed plug-and-play PSF-aware mechanism, PFM, consistently and significantly improves the performance of several baseline models in two pipelines. “+” means the model inserted the PFM in the same way as PART.
  • We propose the Panoramic Computational Imaging Engine (PCIE), a novel framework for minimalist and high-quality panoramic imaging, as shown in Fig. 3, where a Minimalist Panoramic Imaging Prototype (MPIP) is designed for 360° panoramic imaging with an essential panoramic head and simple relay lens group.

  • We raise two pipelines to process low-quality MPIP images: Aberration Correction (AC) and Super-Resolution and Aberration Correction (SR&AC). The real-world panoramic image datasets PALHQ (high-quality) and RealMPAL (low-quality) are recorded for benchmarking the two pipelines, and are the first real-world PAL datasets for low-level vision.

  • We design a PSF representation method to represent the intensity and size distributions of PSF kernels in the form of a feature map, i.e., PSF map, which serves as an additional modality for the pipelines.

  • We further introduce the PSF-aware Aberration-image Recovery Transformer (PART) to process the low-quality images of MPIP, where the PSF-aware mechanisms guided by the PSF map are explored to enhance the recovery performance.

The experimental exploration of PCIE provides heuristic findings in terms of optical design, network architecture, training strategies, and dataset construction. We hope that PCIE can bring inspiration in both hardware system and algorithm aspects, for minimalist and high-quality panoramic imaging.

Refer to caption
Figure 3: Overview of the proposed Panoramic Computational Imaging Engine (PCIE) for minimalist and high-quality panoramic imaging. To achieve the goal of panoramic imaging with a minimalist system, the number of optical components and the radius of MPIP are designed to be small, which brings two key issues of low image quality: (1) aberration-induced blur due to lack of enough lenses for aberration correction and (2) low resolution caused by limited image plane size. We introduce the PART, which is trained on synthetic data pairs generated by imaging simulation, to recover the low-quality aberration image with the guidance of PSF information. “ISP” denotes Image Signal Processing.

II Related Work

II-A Image Processing of Panoramic Images

Recent research interest in panoramic images is booming for immersive visual experiences [11, 39]. Semantic segmentation [1, 40], depth estimation [2, 3], and visual Simultaneous Localization and Mapping (SLAM) [41, 42] are widely explored on panoramic images for a holistic understanding of the surrounding scene. To this intent, high-quality panoramic images are urgently required for robust performance. A considerable amount of work has been conducted to improve the image quality of panoramic images, such as super-resolution [5, 6, 7] and distortion correction [8, 9, 10].

However, the above image processing of panoramic images is based on the aberration-free images captured by the conventional panoramic lens, where multiple sets of lenses with complex surface types [11, 12, 13] are applied for high-quality imaging. This work focuses on capturing panoramic images with a Minimalist Optical System (MOS) composed of much fewer lenses for volume-limited applications, where we process the aberration images via computational methods.

II-B Computational Imaging for Minimalist Optical System

The aberration-induced image blur is inevitable for MOS due to insufficient lens groups for aberration correction. Recently, computational imaging methods for MOS appear as a preferred solution to this issue, where optical designs with few necessary optical components are equipped with image recovery pipelines for both minimalist and aberration-free imaging [16, 17, 43]. Some research works [18, 44, 45] even design end-to-end deep learning frameworks for joint optimization of MOS and post-processing networks to achieve the best match between MOS and recovery models to further improve the final imaging performance.

However, computational imaging for minimalist panoramic systems is scarcely explored. In a preliminary study, Jiang et al. [28] propose an Annular Computational Imaging (ACI) framework to break the optical limit of minimalist Panoramic Annular Lens (PAL), where the image processing is conducted on unfolded PAL images. To further develop a general framework for minimalist and high-quality panoramic imaging, this work fills the gap in high-quality PAL image datasets and designs the Panoramic Computational Imaging Engine (PCIE) directly on annular PAL images, in terms of both optical design and aberration-images recovery.

II-C Image Recovery of Aberration Images

The aberration-induced blur is always spatially-variant, i.e., Linear Shift Variant (LSV), due to the uneven thicknesses of the lenses. Several efforts have been made for the LSV system, spanning path-wise restoration [46, 47], experimental PSFs calibration and non-blind deconvolution [43, 48], and low-rank decomposition [49, 50], based on the degradation model of aberration-images [51, 52, 53].

Recent works tend to adopt data-driven learning-based image restoration networks [54, 38]. These methods typically use a U-shaped network with an encoder-decoder structure [16, 55, 56] to achieve more efficient and robust recovery results, which can also be easily inserted into an end-to-end framework for joint optimization. To break the bottleneck of data-driven methods under scarce data, the PSF information is explored to design physical-informed networks, where model-based methods are characterized by Convolutional Neural Networks (CNNs) for learning ill-posed terms [14, 57]. Explorations have also been made in [27, 35, 36, 58] to apply transformers for solving the inverse problem, leveraging their strong long-range modeling capabilities.

Differently, we make a pioneering effort and investigate the potential of transformer-based SR models in aberration correction rather than conventional Deblur models. Then, the PSFs are transformed into an additional modality of the aberration image, based on which we design PSF-aware mechanisms for achieving better results. The proposed PSF-aware Aberration-image Recovery Transformer (PART) is a successful attempt to engage PSF information in the representation learning stage of SR models for recovering aberration images.

The overview of PCIE is shown in Fig. 3. It provides a powerful framework for minimalist and high-quality panoramic imaging, where optical design (detailed in Sec. III) and learning-based model (presented in Sec. IV) are intertwined to achieve impressive imaging results.

III Minimalist Panoramic Imaging Prototype

In this section, we set up a universal prototype for minimalist panoramic imaging systems based on modern panoramic lens designs (Sec. III-A). To address the issues induced by the reduced lens numbers and limited image plane size, two settings of tasks and benchmarks are defined in Sec. III-B and Sec. III-C, respectively. In Sec. III-D, we describe the constructed imaging simulation model to generate synthetic image pairs for training learning-based methods.

III-A Optical Design

To boost scene understanding with larger FoV, panoramic optical systems are emerging, including fisheye optical systems, refractive panoramic systems, panoramic annular optical systems, etc. [11]. In most modern designs of panoramic lenses [12, 21, 22], a panoramic head is applied for collecting incident rays of 360° FoV, while a set of relay lenses is designed to bend the rays and correct the aberrations. Based on the structure, we propose the Minimalist Panoramic Imaging Prototype (MPIP), including an essential panoramic head and a simple relay lens group, as shown in Fig. 1(a).

Refer to caption
Figure 4: Two prototype samples of MPIP. Up: MPIP-P1, Down: MPIP-P2. (a) Optical path diagram. (b) Visualized PSF distributions. (c) The degraded checkerboard image patches of normalized FoVs 0.1, 0.6, and 0.9 captured by the two MPIP samples. The minimalist optical design brings spatially-variant aberration-induced blur, especially for MPIP-P2 equipped with fewer lenses.

Specifically, we adopt a more compact and efficient solution, i.e., the Panoramic Annular Lens (PAL) [23, 24], in MPIP samples, where a catadioptric PAL head is equipped for 360° annular imaging. For minimalist design and convenient manufacture, spherical lenses are applied in the relay lens group and the number is reduced to fewer than 3. To illustrate the imaging results of different relay lens groups, we define MPIP-P1 and MPIP-P2 in Fig. 4(a), whose relay lens group is composed of two lenses and a single lens, respectively.

The lack of enough lenses for aberration correction makes the imaging point spread from an ideal point, inducing spatially-variant PSFs with large kernel sizes, as shown in Fig. 4(b). The average geometric spot radius of MPIP-P1 is 13.78 μm, whereas that of MPIP-P2 is 46.26 μm. As a result, the captured images of MPIP suffer from spatially-variant aberration-induced blur, especially for MPIP-P2 with fewer lenses, as shown in Fig. 4(c).

III-B Definition of Tasks

In addition to the uncorrected optical aberrations, the limited image plane size due to the small aperture of MPIP presents the issue of image resolution. To fit the small image plane of the MPIP, an image sensor with a smaller pixel size can be applied to maintain high resolution, but it makes the system more sensitive to aberration-induced blur. As shown in Fig. 5(a), a diffused optical spot of fixed physical size affects more pixels on a sensor with smaller pixel size and higher resolution. The opposite solution with large pixel sizes is less sensitive to the diffused spot, but the reduced image resolution also degrades the images, which is especially harmful to panoramic images with large FoV [6].
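As a rough illustration of this sensitivity argument, the sketch below converts the average geometric spot radii reported in Sec. III-A into a footprint measured in pixels. The two pixel pitches (1.5 μm and 6.0 μm) are hypothetical values chosen only for illustration; they are not sensor specifications from this work, and the circular-area estimate is a simplification.

```python
import math

# Average geometric spot radii from Sec. III-A, in micrometers.
SPOT_RADIUS_UM = {"MPIP-P1": 13.78, "MPIP-P2": 46.26}

# Hypothetical pixel pitches in micrometers (assumed, not from the paper).
PIXEL_PITCH_UM = {"small-pixel sensor": 1.5, "large-pixel sensor": 6.0}


def spot_diameter_in_pixels(spot_radius_um: float, pixel_pitch_um: float) -> float:
    """Diameter of the blur spot expressed in pixel units."""
    return 2.0 * spot_radius_um / pixel_pitch_um


def pixels_covered(spot_radius_um: float, pixel_pitch_um: float) -> int:
    """Approximate number of pixels inside a circular spot (area ratio)."""
    spot_area = math.pi * spot_radius_um ** 2
    return math.ceil(spot_area / pixel_pitch_um ** 2)


for lens, radius in SPOT_RADIUS_UM.items():
    for sensor, pitch in PIXEL_PITCH_UM.items():
        diameter = spot_diameter_in_pixels(radius, pitch)
        count = pixels_covered(radius, pitch)
        print(f"{lens}, {sensor}: diameter ~{diameter:.1f} px, ~{count} px covered")
```

For a fixed spot size, the covered pixel count scales with the inverse square of the pixel pitch, which is why the small-pixel sensor is the more blur-sensitive option.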

To address this dilemma, we propose two pipelines for solving the contradictory problems, as shown in Fig. 5(b), where a learning-based model is applied to process different image recovery tasks. For image sensors with smaller pixel sizes, we define the Aberration Correction (AC) task, where the goal is to recover a clear image $x_{hq}\in\mathbb{R}^{H\times W\times 3}$ from a high-resolution input aberration-image $x_{ab}\in\mathbb{R}^{H\times W\times 3}$. For image sensors with larger pixel sizes, the Super-Resolution and Aberration Correction (SR&AC) task is raised to recover a high-resolution aberration-free image $x_{hq}\in\mathbb{R}^{H\times W\times 3}$ from a low-resolution input aberration-image $x_{lq}\in\mathbb{R}^{\frac{H}{s}\times\frac{W}{s}\times 3}$, where $s$ is the scale factor of SR.

Refer to caption
Figure 5: Illustration of two pipelines for processing aberration-images of MPIP. (a) Comparison of image sensors with different pixel sizes. For a diffused spot through the optical system with a fixed size, more pixels of the sensor with smaller pixel sizes and higher resolutions are affected. (b) Raised two tasks based on sensors with different pixel sizes: Aberration Correction (AC) and Super-Resolution and Aberration Correction (SR&AC). In summary, we target the recovery of a high-quality image from an aberration image of MPIP.

III-C PALHQ: Established Dataset of High-Quality PAL Images

The lack of high-quality image datasets for PAL comes as a bottleneck to the above tasks. A previous work on CI for PAL, ACI [28], unfolds the annular PAL images into perspective ones to utilize publicly available datasets, e.g., DIV2K [59]. However, the asymmetrical interpolation during unfolding induces extra image degradation, which further complicates the image degradation factors for MPIP. In addition, the annular image is more appealing for the simulation of aberrations in the original image plane and necessary for some vision tasks like PAL-based SLAM [41, 42]. As for existing benchmarks for processing panoramic images, e.g., the ODI-SR dataset [60] and the SUN360 panorama dataset [61], which are taken via fisheye cameras, the imaging process is also quite different from that of PAL [11]. These concerns raise an urgent request for high-quality panoramic annular image datasets.

To this intent, we propose PALHQ, a dataset of high-quality PAL images, to facilitate network training and evaluation of PAL-based low-level vision tasks. A well-designed PAL of 11 lenses and a Sony α6600 camera are applied to capture high-resolution PAL images with negligible primary aberrations. PALHQ contains 550 clear PAL images with a resolution of 3152×3152, covering rich and varied scenes of indoor, natural, urban, campus, and scenic spots. We divide PALHQ into 500 images for the training set and 50 images for the validation set (refer to the appendix for sample images of PALHQ). In PCIE, we benchmark both AC and SR&AC on PALHQ, where the corresponding aberration images are generated by the imaging simulation model depicted below. Furthermore, PALHQ can also be transformed into unfolded panoramas via equirectangular projection (ERP), which can support various panoramic image processing applications.

III-D Imaging Simulation Model

To quantitatively benchmark the raised two tasks and enable supervised training of learning-based models, paired aberration images and clear images are required. Following previous super-resolution works [62, 63] and CI works [28, 56], we construct an imaging simulation model to generate synthetic aberration-images in batches.

The wave-based simulation pipeline with random perturbation in [28] is adopted to generate multiple aberration distributions directly on clear annular PAL images. Specifically, the clear raw image $R$ is modulated by an optical system and then processed by the ISP $\Gamma(\cdot)$ to produce the final imaging result $A$:

$$A_{\theta}(x,y)=\Gamma\left[\left(\int r_{\lambda}\,R_{\theta}(x,y)\otimes K_{\theta}(x,y,\lambda)\,d\lambda\right)\downarrow+N\right], \tag{1}$$

where $r_{\lambda}$ is the wavelength response of the sensor. The noise $N$ and the sampling process $(\cdot)\downarrow$ of the image sensor are also included in the model. We divide the image into patches for patch-wise convolution with PSFs $K_{\theta}(x,y,\lambda)$ under different FoV $\theta$. Different from [28], the division of FoV is centrosymmetric for annular images, as shown in Fig. 4(b). Through the scalar diffraction integral [64], $K_{\theta}(x,y,\lambda)$ is calculated based on the wavefront $\Phi_{\theta}(x',y',\lambda)$ on the exit pupil plane, which is described mathematically by Zernike polynomials [65]:

$$\Phi_{\theta}(x',y',\lambda)=\sum_{n,m}C^{m}_{n}(\theta,\lambda)\,Z^{m}_{n}(x',y'), \tag{2}$$

where $C(\theta,\lambda)$ denotes the Zernike coefficients under FoV $\theta$ and wavelength $\lambda$, and $Z$ refers to polynomials of the coordinate $(x',y')$ on the exit pupil. The combination of different $m$ and $n$ represents different orders. Finally, we apply the random disturbance strategy in [28] to fine-tune the ideal $C(\theta,\lambda)$ from the Zemax software, generating synthetic aberration images with diverse aberration distributions.
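The structure of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal, single-channel illustration under stated assumptions and not the simulation code of this work: Gaussian kernels stand in for the wave-based PSFs, the wavelength integral and the ISP $\Gamma(\cdot)$ are omitted, the FoV is split into concentric rings normalized by the image corner distance, and the helper names (`gaussian_psf`, `convolve_same`, `simulate_aberration`) are hypothetical.

```python
import numpy as np


def gaussian_psf(sigma: float, size: int = 7) -> np.ndarray:
    """Normalized Gaussian kernel used as a stand-in for a real PSF (assumption)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def convolve_same(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """2D convolution with reflect padding; expects an odd-sized kernel."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="reflect")
    flipped = kernel[::-1, ::-1]
    out = np.zeros(img.shape, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out


def simulate_aberration(clear, psfs, scale=1, noise_std=0.0, rng=None):
    """Blur each concentric FoV ring with its own PSF, then sample and add noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = clear.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - (h - 1) / 2.0, xx - (w - 1) / 2.0)
    # Centrosymmetric FoV division: ring index grows with distance from the center.
    ring = np.minimum((radius / radius.max() * len(psfs)).astype(int), len(psfs) - 1)
    blurred = np.zeros((h, w), dtype=np.float64)
    for idx, kernel in enumerate(psfs):
        region = ring == idx
        blurred[region] = convolve_same(clear, kernel)[region]
    # Sensor sampling: average over scale x scale blocks.
    hs, ws = h // scale, w // scale
    cropped = blurred[:hs * scale, :ws * scale]
    sampled = cropped.reshape(hs, scale, ws, scale).mean(axis=(1, 3))
    noisy = sampled + rng.normal(0.0, noise_std, sampled.shape)
    return np.clip(noisy, 0.0, 1.0)


clear = np.random.default_rng(1).random((64, 64))
psfs = [gaussian_psf(s) for s in (0.8, 1.5, 2.5)]  # toy PSFs, one per FoV ring
x_ab = simulate_aberration(clear, psfs, scale=1, noise_std=0.01)  # AC-style input
x_lq = simulate_aberration(clear, psfs, scale=2, noise_std=0.01)  # SR&AC-style input
print(x_ab.shape, x_lq.shape)
```

Blurring the full image once per ring and compositing by mask is a simple way to approximate spatially-variant convolution; it trades extra computation for avoiding seams from independently padded patches.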

IV Low-Quality MPIP Images Recovery

In this section, we describe the proposed learning-based model to recover low-quality MPIP images, as shown in Fig. 6. The PSF information, characterizing the image degradation process, is represented as the PSF map, detailed in Sec. IV-A, serving as one additional modality for our model. With the PSF map, we design the PSF-aware Feature Modulator (PFM) and the PSF-aware Mix-Attention Block (PMAB), elaborated in Sec. IV-B and Sec. IV-C, respectively. Then, the PSF-aware Aberration-image Recovery Transformer (PART) is established and introduced in Sec. IV-D as a transformer-based paradigm for the raised two tasks.

Refer to caption
Figure 6: PART: Proposed PSF-aware Aberration-image Recovery Transformer. PART is established on a classical super-resolution paradigm [27, 30], incorporating stages of feature extraction, representation learning, and image reconstruction, for dealing with both AC and SR&AC. Task-Processing leverages the pixel-unshuffle operation [63] for AC to reduce the spatial size of high-resolution images, whereas no operation is entailed for SR&AC. PSF-aware Residual Transformer Block (PRTB) is the basic block of representation learning, where we design PSF-aware Feature Modulator (PFM) and PSF-aware Mix-Attention Block (PMAB) to learn spatially-variant degradation features with the guidance of PSF features. The mixing of Window-based Multi-head Self-Attention (W-MSA), PSF-aware Varied-Size Attention (P-VSA), and PFM enables PMAB to capture both global and local dependencies adaptively. For an intuitive understanding of the PSF map, we visualize it via a form of PSF distributions in Fig. 4.
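The pixel-unshuffle operation mentioned in the caption of Fig. 6 trades spatial resolution for channels without discarding any values. The NumPy sketch below shows the rearrangement and its inverse for a channel-last array; the downscale factor `r = 2` and the function names are assumptions for illustration, since the factor used in PART is not stated in this section.

```python
import numpy as np


def pixel_unshuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange (H, W, C) into (H/r, W/r, C*r*r) without losing values."""
    h, w, c = x.shape
    assert h % r == 0 and w % r == 0, "spatial size must be divisible by r"
    x = x.reshape(h // r, r, w // r, r, c)
    x = x.transpose(0, 2, 4, 1, 3)  # (H/r, W/r, C, r, r)
    return x.reshape(h // r, w // r, c * r * r)


def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Inverse rearrangement: (H, W, C*r*r) back to (H*r, W*r, C)."""
    h, w, c = x.shape
    assert c % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c // (r * r)
    x = x.reshape(h, w, c_out, r, r)
    x = x.transpose(0, 3, 1, 4, 2)  # (H, r, W, r, C)
    return x.reshape(h * r, w * r, c_out)


image = np.arange(4 * 4 * 3).reshape(4, 4, 3)
packed = pixel_unshuffle(image, r=2)
print(image.shape, "->", packed.shape)
```

Because the operation is a pure permutation, the network sees a smaller spatial grid for the AC task while every input pixel remains available in the channel dimension.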

IV-A The Representation of PSF Information

For non-blind optimization-based recovery methods in aberration correction, e.g., the Wiener filter [51], the PSFs $K$ of the system are exploited to predict the clear image $x_{hq}$ from the aberration-image $x_{ab}$ by deconvolution. However, this method often fails when the PSFs deviate from the design stage during manufacture and requires time-consuming strategies [43, 48] for processing complex spatially-variant blur. Data-driven learning-based models [27, 36], which can be plugged directly into existing end-to-end frameworks of lens design [18, 19, 20], have demonstrated more powerful abilities in image recovery, but may hit a bottleneck when the training data is scarce.

This motivates us to break the bottleneck by utilizing PSFs of the optical system in a learning-based model. The PSFs of $n$ sampled FoVs of the applied MPIP can be calculated based on the wavefront as depicted in Eq. (2):

$$K_{i}(x,y)=\mathcal{S}\left(\Phi_{i}(x',y')\right),\quad i=1,2,3,\cdots,n. \tag{3}$$

Here, $\mathcal{S}(\cdot)$ denotes the scalar diffraction integral (refer to [64] for more details). Previous methods tend to use the kernel size of $K_i$ to guide the network [28] or refine the ill-posed term in deconvolution through a learning-based model [57]. Although these attempts improve the recovery by exploiting $K_i$, the spatial intensity distribution of PSFs is not fully exploited to guide the deep feature extraction of the general image recovery paradigm.

To this end, we propose a PSF representation method to produce a PSF map, containing both the intensity and size distributions of PSF kernels, which is aligned with the image feature map. We first map the spatial PSFs $K_i\in\mathbb{R}^{k_i\times k_i\times 3}$ into the image feature shape $H\times W\times C'$, where $k_i$ is the kernel size of the PSF under the $i$-th FoV and $C'$ denotes the channels of the mapped PSFs, serving as an additional modality aligned with the aberration-image. As previously shown in [66], the spatial-to-channel arrangement helps transform spatially-variant kernels into a feature map. Similarly, to produce a PSF feature map, the spatial PSFs $K_i$ are arranged into the channel dimension. Concretely, for a pixel at $(x,y)$ of the image, we first calculate the vector $\overrightarrow{p}$ from the image center $(0,0)$ to $(x,y)$, and define the vertical unit vector $\overrightarrow{a}(1,0)$. The PSF of the corresponding FoV is located by $|\overrightarrow{p}|/\max(|\overrightarrow{p}|)$, and rotated by the angle $\arccos\left(\frac{\vec{p}\cdot\vec{a}}{|\vec{p}|\cdot|\vec{a}|}\right)$, producing the PSF $K_{x,y}$ of the pixel. For memory-friendly computation, we pad all $K_{x,y}$ into a unified size of $\max\limits_{i}k_i$ and compress them into $k'\times k'\times 1$ via adaptive average pooling:

$$\hat{K_{x,y}}=\mathrm{AveragePool}(\mathrm{padding}(K_{x,y})), \tag{4}$$

where the choice of compressed size $k'$ is ablated in Sec. V-F. $\hat{K_{x,y}}$ is then reshaped into $1\times 1\times(k'^{2})$ and inserted into each pixel to produce the PSF feature map $x_{int}\in\mathbb{R}^{H\times W\times(k'^{2})}$. In addition, since the PSF size information is lost during compression, we also generate the size distribution map $x_s\in\mathbb{R}^{H\times W\times 3}$ of RGB channels, where the value of each pixel represents the kernel size. Finally, the PSF map $x_{psf}$ is produced via Eq. (5):

$$x_{psf}=\mathrm{Concat}(x_{int},x_s), \tag{5}$$

and the visualized PSF map is shown in the appendix. The PSF map is an aligned modality of the aberration image that characterizes the image degradation over FoVs, based on which we design a PSF-aware transformer, as described in the next subsection.
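The construction of Eqs. (4) and (5) can be sketched as follows. This is a simplified illustration rather than the released implementation: it uses single-channel PSFs (so the size map has one channel instead of three), nearest-neighbour rotation, the image corner as $\max(|\overrightarrow{p}|)$, and a list `psfs` assumed to be ordered from the central FoV to the edge FoV. All function names are hypothetical.

```python
import numpy as np

def adaptive_avg_pool2d(k, out):
    """Average-pool a square (s, s) array to (out, out) with floor/ceil bin edges."""
    s = k.shape[0]
    pooled = np.empty((out, out))
    for i in range(out):
        r0, r1 = (i * s) // out, -((-(i + 1) * s) // out)   # floor(i*s/out), ceil((i+1)*s/out)
        for j in range(out):
            c0, c1 = (j * s) // out, -((-(j + 1) * s) // out)
            pooled[i, j] = k[r0:r1, c0:c1].mean()
    return pooled

def rotate_nearest(k, angle):
    """Rotate a square kernel about its center by `angle` radians (nearest neighbour)."""
    s = k.shape[0]
    c = (s - 1) / 2.0
    ys, xs = np.mgrid[0:s, 0:s]
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    # Inverse mapping: for every output pixel, look up its source location.
    xi = np.rint(cos_a * (xs - c) + sin_a * (ys - c) + c).astype(int)
    yi = np.rint(-sin_a * (xs - c) + cos_a * (ys - c) + c).astype(int)
    valid = (xi >= 0) & (xi < s) & (yi >= 0) & (yi < s)
    out = np.zeros_like(k)
    out[valid] = k[yi[valid], xi[valid]]
    return out

def build_psf_map(psfs, H, W, k_out=5):
    """Return x_psf of shape (H, W, k_out**2 + 1): pooled intensities plus kernel size."""
    n = len(psfs)
    k_max = max(p.shape[0] for p in psfs)
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    r_max = max(np.hypot(cy, cx), 1e-12)
    a = np.array([1.0, 0.0])                                  # reference unit vector
    x_int = np.zeros((H, W, k_out * k_out))
    x_s = np.zeros((H, W, 1))
    for y in range(H):
        for x in range(W):
            p = np.array([x - cx, y - cy])
            r = np.linalg.norm(p)
            idx = min(int(r / r_max * n), n - 1)              # FoV index from normalized radius
            angle = np.arccos(np.clip(p @ a / r, -1.0, 1.0)) if r > 0 else 0.0
            k = rotate_nearest(psfs[idx], angle)
            pad = k_max - k.shape[0]
            k = np.pad(k, ((pad // 2, pad - pad // 2),) * 2)  # zero-pad to the unified size
            x_int[y, x] = adaptive_avg_pool2d(k, k_out).ravel()
            x_s[y, x, 0] = psfs[idx].shape[0]                 # kernel size before padding
    return np.concatenate([x_int, x_s], axis=-1)
```

The per-pixel loop is written for readability; a practical version would precompute one pooled kernel per (FoV, angle) bin and index into it.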

IV-B PSF-aware Feature Modulator

CNN layers have shown impressive abilities of local feature extraction, but are restricted to the fixed spatially-invariant kernels. However, the mathematical imaging model in Eq. (1) reveals that the aberration-induced blur is only generated by convolution with spatially-variant PSF kernels, whose inverse solution cannot be modeled by the fixed convolution kernels [55, 56].

To extract adaptive image features with the guidance of spatially-variant PSF kernels, we propose the PSF-aware Feature Modulator (PFM), as shown in the lower left of Fig. 6. PFM builds on the idea of filter adaptive convolution [66, 67], where a kernel map of $H\times W\times(Ck^{2})$ is predicted from a feature map of $H\times W\times C$. Differently, in PFM, the kernel map $x_{kernel}$ is predicted from the features of the PSF map $x_{psf}$, which has been compressed into a form similar to $x_{kernel}$. We first apply $E_{psf}$, a $3\times 3$ convolution layer, to extract features of the PSF map $x_{psf}$, as depicted in Eq. (6):

$$x'_{psf}=E_{psf}(x_{psf}), \tag{6}$$

where $x'_{psf}\in\mathbb{R}^{H\times W\times C}$ is the extracted PSF feature map. Then, a lightweight kernel predictor composed of several convolution layers is proposed to output the kernel map $x_{kernel}\in\mathbb{R}^{H\times W\times(Ck^{2})}$ based on $x'_{psf}$, as in Eq. (7):

$$x_{kernel}=P(x'_{psf}). \tag{7}$$

To reduce the memory cost and inference latency of kernel prediction, the predictor $P$ computes the kernel map on the downsampled features (by $4\times 4$ average pooling). Because the PSF map shares a similar form with the kernel map, we further simplify $P$ so that only one Max Pooling layer and one residual block of $1\times 1$ convolution layers are applied to predict the kernel map $x'_{kernel}\in\mathbb{R}^{\frac{H}{8}\times\frac{W}{8}\times(Ck^{2})}$ at a smaller resolution. The final kernel map $x_{kernel}$ is then obtained by $\times 8$ upsampling via bilinear interpolation. Finally, we reshape $x_{kernel}$ into a list of per-pixel kernels of $k\times k\times C$ and apply them to the corresponding pixels of the image feature $x'_{img}$. PFM attempts to model the inverse process of the aberration-induced blur, i.e. deconvolution, which promotes the dynamic feature extraction of the aberration-image.
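The last step, applying a distinct $k\times k\times C$ kernel at every pixel, can be illustrated with the sketch below. It covers only the filter adaptive convolution itself, not the kernel predictor $P$; the channel layout of the kernel map (channel-major, then the $k\times k$ taps) and the zero padding at the border are assumptions, and the function name is hypothetical.

```python
import numpy as np

def filter_adaptive_conv(x_img, x_kernel, k=3):
    """Apply a per-pixel, per-channel k x k kernel.

    x_img:    (H, W, C) image features.
    x_kernel: (H, W, C * k * k) kernel map, assumed to be laid out as (C, k, k) per pixel.
    """
    H, W, C = x_img.shape
    kernels = x_kernel.reshape(H, W, C, k, k)
    r = k // 2
    padded = np.pad(x_img, ((r, r), (r, r), (0, 0)))      # zero padding at the border
    out = np.zeros((H, W, C), dtype=float)
    for dy in range(k):
        for dx in range(k):
            # Shifted view of the features, weighted by the matching kernel tap of each pixel.
            out += padded[dy:dy + H, dx:dx + W, :] * kernels[:, :, :, dy, dx]
    return out
```

With a kernel map whose center tap is 1 and all other taps are 0, the operation reduces to the identity, which is a convenient sanity check.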

IV-C PSF-aware Mix-Attention Block

We put forward the PSF-aware Mix-Attention Block (PMAB) as the basic unit of our PSF-aware transformer, to process aberration images assisted with the PSF map, as shown at the middle bottom of Fig. 6. The Window-based Multi-head Self-Attention (W-MSA) of the Swin-T block [27] is first adopted to be the baseline attention mechanism for modeling spatially-variant convolution and long-range dependency, which is also important for stable training of the network.

To address the drawback of the fixed window size in vanilla W-MSA, we further propose the PSF-aware Varied-Size Attention (P-VSA), shown on the lower right of Fig. 6. The vanilla varied-size attention [68] in high-level tasks predicts the sizes and locations of the windows from input features for computing self-attention on dynamic windows. Meanwhile, the kernel sizes of PSFs in different FoV regions reveal the severity of aberration-induced blur, which is relevant to the calculation of window-based self-attention. To adaptively modulate the windows according to the PSF kernels, we make use of the PSF map features $x'_{psf}$ to generate PSF-aware varied-size windows. Concretely, the scale $S$ and offset $O$ of the varied-size windows are predicted from $x'_{psf}$ by the Window Transform block, which is composed of a $1\times 1$ convolution layer. Then, we sample the projected key and value tokens $K,V$ of the image features $x'_{img}$ based on the transformed window to obtain $K_{vs},V_{vs}$. The cross-attention is computed between the query $Q$ of the default window and $K_{vs},V_{vs}$. The operation of P-VSA can be expressed as:

$$Q,K,V=\mathrm{Linear}(\mathrm{WinPar}(x'_{img})), \tag{8}$$
$$S,O=\mathrm{WinTrans}(x'_{psf}), \tag{9}$$
$$K_{vs},V_{vs}=\mathrm{Sample}(K,S,O),\mathrm{Sample}(V,S,O) \tag{10}$$
$$\mathrm{Attn}(Q,K_{vs},V_{vs})=\mathrm{Softmax}\left(\frac{QK_{vs}^{\top}}{\sqrt{d}}\right)V_{vs}, \tag{11}$$

where $\mathrm{WinPar}$ denotes the window partition operation of Swin-T and $d$ is the dimension of tokens.
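To make Eqs. (10) and (11) concrete, the sketch below computes the attention for one window and one head. It is a simplification under stated assumptions: the scale and offset are passed in directly instead of being predicted by the Window Transform block, the linear projections are omitted, and the key/value tokens are sampled with nearest-neighbour lookup and border clamping, whereas an actual implementation would more likely use bilinear sampling. All function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sample_window(feat, top, left, w, scale, offset):
    """Sample w x w tokens from a window that is scaled and shifted relative to the default one.

    feat: (H, W, d) token map; (top, left): corner of the default window;
    scale, offset: (vertical, horizontal) pairs playing the roles of S and O.
    """
    H, W, _ = feat.shape
    cy = top + (w - 1) / 2 + offset[0]
    cx = left + (w - 1) / 2 + offset[1]
    lin = np.arange(w) - (w - 1) / 2                      # token positions relative to the center
    ys = np.clip(np.rint(cy + lin * scale[0]), 0, H - 1).astype(int)
    xs = np.clip(np.rint(cx + lin * scale[1]), 0, W - 1).astype(int)
    return feat[np.ix_(ys, xs)].reshape(w * w, -1)

def pvsa_window(q_feat, k_feat, v_feat, top, left, w, scale, offset):
    """Attention between the default-window queries and the varied-size keys/values."""
    d = q_feat.shape[-1]
    Q = q_feat[top:top + w, left:left + w].reshape(w * w, d)
    K_vs = sample_window(k_feat, top, left, w, scale, offset)
    V_vs = sample_window(v_feat, top, left, w, scale, offset)
    attn = softmax(Q @ K_vs.T / np.sqrt(d))               # (w*w, w*w) attention weights
    return attn @ V_vs
```

With a scale of 1 and an offset of 0 the sampled window coincides with the default window, so the computation falls back to ordinary window attention.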

Additionally, some works [30, 32] apply channel-attention-based convolution blocks in parallel with the self-attention to enhance the representation ability of the network. We insert the proposed PFM into PMAB in the same parallel way, where the filter adaptive convolution mechanism can better model the spatially-variant blur compared to channel-attention-based convolution.

Finally, PMAB is the mixing of W-MSA and P-VSA with a parallel $1\times 1$ PFM. For the self-attention module, the image feature map $x'_{img}$ is equally split along the channel dimension and processed by parallel W-MSA and P-VSA, then concatenated along the channel dimension again. The modulated feature map by parallel PFM is multiplied by a constant $\alpha$, to be added to the result of self-attention and the original feature map as common practice for stable training [32]. The whole process of PMAB is computed as:

$${x'}^{(1)}_{img},{x'}^{(2)}_{img}=\mathrm{Split}(x'_{img}), \tag{12}$$
$$x_{attn}=\mathrm{Concat}(\mathrm{W\text{-}MSA}({x'}^{(1)}_{img}),\mathrm{P\text{-}VSA}({x'}^{(2)}_{img},x'_{psf})), \tag{13}$$
$$x_{mix}=x_{attn}+\alpha\,\mathrm{PFM}(x'_{img},x'_{psf})+x'_{img}, \tag{14}$$
$$y=x_{mix}+\mathrm{FFN}(x_{mix}), \tag{15}$$

where $\mathrm{FFN}$ is a common Feed Forward Network composed of a LayerNorm and a Multi-Layer Perceptron (MLP) layer.
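The data flow of Eqs. (12) to (15) can be written as a short composition. In this sketch the four sub-modules are passed in as plain callables, so it shows only how their outputs are mixed, not how they are implemented; the function name and the default value of `alpha` are assumptions, since the value of $\alpha$ is not given in this section.

```python
import numpy as np

def pmab(x_img, x_psf, w_msa, p_vsa, pfm, ffn, alpha=0.01):
    """Mix W-MSA, P-VSA, and PFM outputs; all sub-modules are caller-supplied callables.

    x_img: (H, W, C) image features; x_psf: PSF features aligned with x_img.
    alpha: assumed placeholder for the constant that scales the PFM branch.
    """
    C = x_img.shape[-1]
    x1, x2 = x_img[..., :C // 2], x_img[..., C // 2:]                  # Eq. (12): channel split
    x_attn = np.concatenate([w_msa(x1), p_vsa(x2, x_psf)], axis=-1)    # Eq. (13)
    x_mix = x_attn + alpha * pfm(x_img, x_psf) + x_img                 # Eq. (14)
    return x_mix + ffn(x_mix)                                          # Eq. (15)
```

Passing identity functions for the attention branches and zero-valued functions for PFM and FFN returns twice the input, which makes the two residual paths easy to verify.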

IV-D PSF-aware Aberration-image Recovery Transformer

Most previous networks [16, 55] for aberration correction utilize the architecture of image deblurring methods, i.e. U-Net. However, for MPIP images with a high resolution (e.g., $3K$), the U-Net methods incur unacceptable computational costs due to the large image sizes at shallow layers. Differently, we look into the tasks from the perspective of image super-resolution, which processes image features with low resolution and reconstructs the high-quality image via an upsampling module. The aberration-induced blur brings aliasing between pixels and losses of image details, which can also be interpreted as “low resolution”. Therefore, as shown in Fig. 6, the PSF-aware Aberration-image Recovery Transformer (PART) is set up based on the structure of SwinIR [27] and our proposed PSF-aware mechanisms.

A Task-Processing module is first applied to transform the input image and PSF map to a small spatial size, where pixel-unshuffle [63] is leveraged for AC and no operation is entailed for SR&AC. The PSF map is also concatenated with the aberration image as the input of the network. More precisely, PART contains three parts. (1) A feature extraction layer converts the input to image feature maps via a $3\times 3$ convolution. (2) The representation learning stage applies stacks of transformer-based blocks ending with a convolution layer to enrich the learned degradation information of aberration-induced blur progressively. We design the PSF-aware Residual Transformer Block (PRTB) with several PMAB layers and a convolution layer. The PFM is inserted into each PRTB to modulate the learned features and model the inverse process of the aberration-induced blur. We also implement PFM at the beginning and end of the representation learning stage, for adaptive feature extraction and feature fusion based on PSF information. (3) The image reconstruction module further fuses the extracted deep features and recovers a high-quality image with higher resolution. With PART, we can recover a high-quality aberration-free image $x_{hq}$ from either a high-resolution aberration image $x_{ab}$ or a low-resolution one $x_{lq}$, providing a general solution to AC and SR&AC:

$$x^{AC}_{hq}=\mathrm{PART}(x_{ab}),\quad x^{SR\&AC}_{hq}=\mathrm{PART}(x_{lq}). \tag{16}$$
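The pixel-unshuffle operation used by the Task-Processing module for AC trades spatial size for channels without discarding any pixel. The sketch below shows the rearrangement and its inverse on channel-last arrays; the function names and the channel ordering are assumptions, and the downscale factor `r` is left as a parameter because it is not stated explicitly in this section.

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Rearrange (H, W, C) into (H/r, W/r, C*r*r); every r x r block becomes extra channels."""
    H, W, C = x.shape
    if H % r or W % r:
        raise ValueError("H and W must be divisible by r")
    x = x.reshape(H // r, r, W // r, r, C)
    return x.transpose(0, 2, 4, 1, 3).reshape(H // r, W // r, C * r * r)

def pixel_shuffle(x, r):
    """Inverse of pixel_unshuffle: (h, w, C*r*r) back to (h*r, w*r, C)."""
    h, w, crr = x.shape
    C = crr // (r * r)
    x = x.reshape(h, w, C, r, r).transpose(0, 3, 1, 4, 2)
    return x.reshape(h * r, w * r, C)
```

Because the rearrangement is lossless, the reconstruction stage can restore the original resolution with the inverse operation, which is what makes the super-resolution paradigm applicable to the AC pipeline.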

V Experiments and Results

We conduct a comprehensive set of experiments to evaluate the proposed PCIE for minimalist and high-quality panoramic imaging. We first describe the implementation details of our work in Sec. V-A. The PCIE under different recovery models is then evaluated on both synthetic (Sec. V-B) and real (Sec. V-C) datasets. We further investigate the GAN-based training strategies for PCIE in Sec. V-D. At last, in Sec. V-E and Sec. V-F, ablation studies on training datasets and the architecture of PART are conducted.

V-A Implementation Details

TABLE I: Quantitative evaluation of PCIE with the AC pipeline on synthetic benchmarks with MPIP-P1 and MPIP-P2. We highlight the best and second results. The “*” for NAFNet and Restormer denotes that the cropping testing strategy is applied.

| Type | Method | PALHQ-SynMPIP-P1 PSNR↑ | PALHQ-SynMPIP-P1 SSIM↑ | PALHQ-SynMPIP-P1 LPIPS↓ | PALHQ-SynMPIP-P1 FID↓ | PALHQ-SynMPIP-P2 PSNR↑ | PALHQ-SynMPIP-P2 SSIM↑ | PALHQ-SynMPIP-P2 LPIPS↓ | PALHQ-SynMPIP-P2 FID↓ |
|---|---|---|---|---|---|---|---|---|---|
| SR | RRDB [29] | 32.716 | 0.9265 | 0.0469 | 3.704 | 26.996 | 0.8503 | 0.0896 | 17.413 |
| SR | RCAN [34] | 32.496 | 0.9257 | 0.0459 | 4.327 | 26.456 | 0.8443 | 0.0956 | 22.619 |
| SR | EDSR [33] | 32.868 | 0.9282 | 0.0449 | 3.770 | 26.951 | 0.8500 | 0.0889 | 17.871 |
| SR | SwinIR [27] | 32.913 | 0.9291 | 0.0446 | 3.670 | 26.935 | 0.8509 | 0.0884 | 17.630 |
| SR | EDT [31] | 32.929 | 0.9288 | 0.0450 | 3.658 | 27.055 | 0.8518 | 0.0888 | 16.976 |
| SR | HAT [32] | 32.925 | 0.9288 | 0.0447 | 3.748 | 26.827 | 0.8500 | 0.0889 | 18.498 |
| SR | GRL [30] | 32.369 | 0.9256 | 0.0457 | 4.234 | 26.268 | 0.8424 | 0.0943 | 22.463 |
| Deblur | HINet [37] | 32.238 | 0.9234 | 0.0476 | 4.159 | 26.401 | 0.8428 | 0.0933 | 22.341 |
| Deblur | NAFNet* [38] | 32.837 | 0.9274 | 0.0441 | 3.845 | 27.045 | 0.8514 | 0.0856 | 17.504 |
| Deblur | Restormer* [35] | 32.971 | 0.9287 | 0.0445 | 3.763 | 27.001 | 0.8510 | 0.0870 | 17.023 |
| Deblur | Uformer [36] | 32.999 | 0.9290 | 0.0442 | 3.672 | 27.133 | 0.8525 | 0.0866 | 16.693 |
| PSF-aware | PI2RNet [28] | 32.682 | 0.9268 | 0.0448 | 3.638 | 26.656 | 0.8471 | 0.0874 | 18.544 |
| PSF-aware | RRDB+ | 32.816 | 0.9271 | 0.0456 | 3.746 | 27.050 | 0.8505 | 0.0895 | 17.103 |
| PSF-aware | GRL+ | 32.847 | 0.9292 | 0.0454 | 3.627 | 27.020 | 0.8528 | 0.0864 | 16.281 |
| PSF-aware | PART (Ours) | 33.143 | 0.9304 | 0.0435 | 3.571 | 27.198 | 0.8540 | 0.0855 | 16.436 |

TABLE II: Quantitative evaluation of PCIE with the SR&AC pipeline on the synthetic benchmark PALHQ-SynMPIP-P1.

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ |
|---|---|---|---|---|
| RRDB [29] | 28.856 | 0.8758 | 0.0733 | 9.957 |
| RCAN [34] | 28.238 | 0.8686 | 0.0787 | 12.689 |
| EDSR [33] | 28.817 | 0.8759 | 0.0715 | 10.670 |
| SwinIR [27] | 28.985 | 0.8781 | 0.0714 | 9.938 |
| EDT [31] | 29.008 | 0.8777 | 0.0726 | 10.750 |
| HAT [32] | 28.921 | 0.8771 | 0.0727 | 10.141 |
| GRL [30] | 28.695 | 0.8753 | 0.0714 | 9.829 |
| RRDB+ | 29.044 | 0.8774 | 0.0724 | 10.068 |
| GRL+ | 28.757 | 0.8768 | 0.0716 | 9.709 |
| PART (Ours) | 29.310 | 0.8819 | 0.0681 | 9.648 |

Synthetic Datasets. We apply the collected PALHQ dataset for training and evaluation. Based on PALHQ, the aberration images of two prototypes, i.e. PALHQ-SynMPIP-P1 and PALHQ-SynMPIP-P2, are generated by the simulation model of Eq. (1). Following [28], we set the random range of disturbance as $25\%$ and generate $10$ virtual MPIP samples for the training set ($500$ images) and $4$ for the validation set ($50$ images) to simulate the synthetic-to-real gap. For image sensors, the MV-SUA1600C camera with a pixel size of $1.34\,\mu m$ and the MV-SUA133GC camera with a pixel size of $4\,\mu m$ are applied for the AC and SR&AC pipelines, respectively, and their ISP and wave response are simulated in the data generation. In addition, we use $\times 3$ bicubic downsampling to produce low-resolution aberration images for SR&AC, considering the sensors’ pixel sizes.

Real-world Datasets. As shown in Fig. 4, with only one more simple lens, the MPIP-P1 reveals much better image quality, which relieves the burden of the post-image processing pipelines. We manufacture MPIP-P1 and use it to record the RealMPIP3K-AC ($58$ images with a resolution of $2912\times 2912$) and RealMPIP1K-SR&AC ($64$ images with a resolution of $992\times 992$) with two cameras respectively, to provide real-world MPIP aberration-images for evaluating two pipelines of PCIE. We test models trained on PALHQ-SynMPIP-P1 (AC) and PALHQ-SynMPIP-P1 (SR&AC) with RealMPIP3K-AC and RealMPIP1K-SR&AC respectively.

Evaluation Metrics. For synthetic datasets with ground truth, PSNR and SSIM [69] are employed to evaluate the fidelity of the recovery results, whereas LPIPS [70] and FID [71] are employed to evaluate the perceptual quality.
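For reference, PSNR follows directly from the mean squared error between the recovered image and the ground truth. The sketch below assumes images scaled to $[0, 1]$ (so the peak value is 1); the function name and the `peak` parameter are illustrative, and the exact evaluation protocol (color space, border handling) is not specified in this section.

```python
import numpy as np

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")                  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

For example, a uniform error of 0.1 on a $[0, 1]$ image gives an MSE of 0.01 and therefore a PSNR of 20 dB.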

For real datasets without reference clear images, we employ no-reference metrics, i.e. NIQE and BRISQUE, to evaluate the image quality of MPIP images in terms of natural images. The qualitative visual results are also provided for an intuitive evaluation. However, NIQE [72] and BRISQUE [73] are built on the statistics of perspective natural images, which makes it challenging for them to assess MPIP images with the annular distribution of image content. Considering the specific tasks of correcting optical aberrations, we define the Optical-based Image Quality Evaluator (OIQE) for credible evaluation, based on the Modulation Transfer Function (MTF) of the imaging system calculated by a set of testing checkerboard images.

To be specific, we follow Spatial Frequency Response (SFR) [56] testing to calculate MTFs on image patches of “knife-edge” of different FoVs from different testing images. $MTF50$ and $MTFarea$ are used to characterize the MTF curves, where the former is the frequency at which the MTF drops to $50\%$ and the latter is the area under the MTF curve. We further define $OIQE50$ and $OIQEarea$ as the ratio of the average $MTF50$ and $MTFarea$ of the testing imaging pipeline to those of a well-designed panoramic imaging system. Accordingly, $OIQE$ is defined as:

$$OIQE=\frac{OIQE50+OIQEarea}{2}, \tag{17}$$

which measures the gap between the results of PCIE and conventional panoramic lenses in terms of MTF. OIQE is only applied in the AC pipeline due to its specific design for evaluating the ability of the model to remove aberration-induced blur.
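Given sampled MTF curves, Eq. (17) can be evaluated as sketched below. The sketch assumes MTF curves normalized to 1 at zero frequency and sampled on a shared, increasing frequency axis; the linear interpolation for $MTF50$, the trapezoidal rule for $MTFarea$, and all function names are choices made for illustration, and the SFR step that produces the curves is not covered.

```python
import numpy as np

def mtf50(freqs, mtf):
    """First frequency at which the MTF falls to 0.5 (linear interpolation between samples)."""
    freqs, mtf = np.asarray(freqs, dtype=float), np.asarray(mtf, dtype=float)
    below = np.nonzero(mtf <= 0.5)[0]
    if below.size == 0:
        return freqs[-1]                     # never drops to 0.5 within the sampled range
    j = below[0]
    if j == 0:
        return freqs[0]
    t = (mtf[j - 1] - 0.5) / (mtf[j - 1] - mtf[j])
    return freqs[j - 1] + t * (freqs[j] - freqs[j - 1])

def mtf_area(freqs, mtf):
    """Area under the MTF curve by the trapezoidal rule."""
    freqs, mtf = np.asarray(freqs, dtype=float), np.asarray(mtf, dtype=float)
    return float(np.sum((mtf[1:] + mtf[:-1]) / 2.0 * np.diff(freqs)))

def oiqe(freqs, test_curves, ref_curves):
    """Average of the MTF50 ratio and the MTF-area ratio (tested pipeline over reference system)."""
    oiqe50 = np.mean([mtf50(freqs, m) for m in test_curves]) / np.mean([mtf50(freqs, m) for m in ref_curves])
    oiqe_area = np.mean([mtf_area(freqs, m) for m in test_curves]) / np.mean([mtf_area(freqs, m) for m in ref_curves])
    return (oiqe50 + oiqe_area) / 2.0
```

A value of 1 means the tested pipeline matches the reference system on both characteristics; values below 1 indicate a remaining gap.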

In addition, with the testing checkerboard images of OIQE, we generate the ground-truth images through edge extraction and re-coloring following [55], so that the PSNR and SSIM can be applied as metrics in this setting.

Finally, we conduct a user study as a subjective evaluation method. The results of the User Study (U.S.) will be presented as the percentage of times that each method’s results were chosen as the best.

The implementation details of the ground-truth generation pipeline and user study are depicted in the Appendix. Based on the above evaluation pipelines and metrics, a comprehensive evaluation of competitive recovery models on real-world datasets will be presented in Section V-C.

Compared Methods. For the AC pipeline, as shown in Table I, we compare PART with representative state-of-the-art SR models (RRDB [29], RCAN [34], EDSR [33], SwinIR [27], EDT [31], HAT [32], and GRL [30]), along with Deblur methods (HINet [37], NAFNet [38], Restormer [35], and Uformer [36]). Image restoration models with PSF-aware mechanisms, i.e. RRDB+, GRL+, and PI2RNet [28], are also included in the comparison. Here, “+” means that the designed PFM is inserted into the method, where we select RRDB and GRL as the classical CNN-based and state-of-the-art transformer-based SR models to investigate the adaptability of PSF-aware mechanisms to different types of models. For the PSF-aware methods in the SR&AC pipeline, only RRDB+ and GRL+ are selected due to the specific task requirement for super-resolution, as shown in Table II.

All the models are retrained on PALHQ-SynMPIP-P1 and PALHQ-SynMPIP-P2 with their original optimizers, learning rates, and schedulers, where the number of training iterations and the batch size are set the same as PART for a fair comparison. Additionally, we apply task-processing for all the SR models the same as PART.

Training Details. The compressed kernel size $k'$ of the PSF map is set to $5$ in our experiments, where an ablation study is conducted in Sec. V-F. In addition, we set the kernel size $k$ of PFM to $3$ considering the computational efficiency. Following SwinIR [27], the PRTB number, PMAB number, channel number, attention head number, and window size are generally set to $6$, $6$, $180$, $6$, and $8$, respectively.

PART is trained with the L1 loss, while other loss functions are explored in Sec. V-D. We train the models with the Adam optimizer with an initial learning rate of $2e{-}4$ and a batch size of $8$ on a single A800 GPU. For data augmentation, random crop, flip, and rotation are applied, where the ground-truth crop size is $256\times 256$ for AC and $196\times 196$ for SR&AC to keep an image size of $64$ in the representation learning stage. The number of training iterations is set to $200k$ and the learning rate is halved at $100k$, $160k$, $180k$, and $190k$.
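The step schedule described above can be written as a small function of the iteration index. This is a sketch of the schedule only; whether the halving takes effect exactly at a milestone or one step later depends on the training framework, and the convention used here (at the milestone) as well as the function name are assumptions.

```python
def learning_rate(iteration, base_lr=2e-4,
                  milestones=(100_000, 160_000, 180_000, 190_000), gamma=0.5):
    """Learning rate at a given iteration: multiplied by `gamma` at every milestone passed."""
    passed = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** passed
```

Under this convention the rate starts at 2e-4, becomes 1e-4 at iteration 100k, and ends at 1.25e-5 for the last 10k iterations.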

Figure 7: Qualitative results of representative models on synthetic benchmarks for AC. We zoom in on image patches of different FoVs to show the details. (a) Results on PALHQ-SynMPIP-P1, where aberration-free images are produced by all models and PART provides more visually pleasant and clearer details. (b) Results on PALHQ-SynMPIP-P2. PCIE delivers clear aberration-free images, yet, with heavily corrupted detailed textures due to severe optical aberrations. Enabling much higher imaging quality with only one more spherical lens, MPIP-P1 is a superior choice for PCIE.

V-B Experiments on Synthetic Datasets

AC Pipeline. Table II shows numerical results of PCIE under different image recovery models on synthetic benchmarks of AC. Considering that the performance of NAFNet and Restormer is sensitive to input resolution [74, 75], the cropping testing strategy is applied for the two models (i.e., NAFNet* and Restormer*), which is described in the Appendix. We also present visual results of representative methods in Fig. 7. PCIE with most models achieves a PSNR over 32 dB on PALHQ-SynMPIP-P1 and over 26 dB on PALHQ-SynMPIP-P2, producing impressive panoramic imaging results via a minimalist optical system. Compared to Deblur methods, SR methods overall deliver better results, illustrating the effectiveness of the SR framework in aberration correction. PSF-aware methods further outperform their baselines. Specifically, PI2RNet exceeds HINet, PART surpasses SwinIR, and RRDB+ and GRL+ outperform their corresponding baselines by clear margins. We find that the models based on the window-attention mechanism (SwinIR, EDT, HAT, UFormer, GRL, and our proposed PART) achieve more competitive results than CNN-based models, as window-based self-attention can better model spatially-variant blur. Yet, the state-of-the-art SR model GRL performs poorly on the benchmarks, which we attribute to its stripe-based attention being difficult to adapt to MPIP images with annular distributions.

Overall, PART brings better results for PCIE, yielding state-of-the-art performance on the two benchmarks in terms of both fidelity-based metrics (PSNR and SSIM) and perceptual metrics (LPIPS and FID). While all the methods produce aberration-free visual results with some lost textures and artifacts, the image recovered by PART shows more visually pleasant details in all FoVs, as shown in Fig. 7(a).

Further, with only one more spherical lens, the PCIE results of MPIP-P1 outperform those of MPIP-P2 by a large margin. For example, the PSNR drops by 3.050 dB ∼ 6.101 dB when MPIP-P2 is used. As shown in Fig. 7(b), PCIE with MPIP-P2 delivers moderately clear aberration-free images. Yet, suffering from severe aberrations, its detailed textures are heavily corrupted, especially for large FoVs. In this sense, MPIP-P1 is a superior choice for PCIE to achieve minimalist and high-quality panoramic imaging.

SR&AC Pipeline. The quantitative evaluation of PCIE with the SR&AC pipeline is shown in Table II. Consistent with the observations in AC, the methods with window-based attention and PSF-aware mechanisms lead to better performance. PART sets the state of the art in the SR&AC task, achieving improvements over the second best, e.g. 0.266 dB in PSNR, 0.0038 in SSIM, LPIPS from 0.0714 to 0.0681 (about 5%), and FID from 9.709 to 9.648 (about 0.6%).
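For lower-is-better metrics such as LPIPS and FID, a relative improvement can be checked from the raw values as (old − new) / old. The following minimal sketch applies this to the two pairs of values quoted above; the helper name `relative_reduction` is hypothetical and not part of the released code.

```python
def relative_reduction(old: float, new: float) -> float:
    """Fractional reduction of a lower-is-better metric, (old - new) / old."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (old - new) / old


# Second-best vs. PART values quoted in the text for the SR&AC pipeline.
lpips_gain = relative_reduction(0.0714, 0.0681)
fid_gain = relative_reduction(9.709, 9.648)

if __name__ == "__main__":
    print(f"LPIPS relative reduction: {lpips_gain:.1%}")
    print(f"FID relative reduction:   {fid_gain:.1%}")
```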

Figure 8: Comparison between AC and SR&AC. The image patches are cropped from PALHQ-0532 (top) and PALHQ-0505 (bottom). We show the results of the proposed PART and its baseline SwinIR to illustrate the strengths of the AC pipeline, which produces richer and more realistic image details.

Comparing SR&AC (Table II) with AC (Table II), we observe that the loss of spatial resolution in aberration images causes significant deterioration of the imaging quality of PCIE, e.g., a PSNR change of −4.258 dB ∼ −3.674 dB. The visual quality comparison between the two pipelines is provided in Fig. 8, where the imaging results of AC reveal richer and more realistic details. In this case, AC is a more competitive pipeline for reconstructing high-resolution aberration-free images, where the real sampled pixels of the sensor offer more convincing imaging features than super-resolved ones despite more aberration-induced blur.

TABLE III: Quantitative evaluation of PCIE on the real-world benchmarks RealMPIP. The OIQE and PSNR/SSIM of the original aberration images are 55.22% and 16.215 dB/0.7995, respectively. The PSNR and SSIM are calculated on the generated checkerboard image pairs. U.S. denotes the result of the user study. We also list the rank on each metric in “()” and the Average Rank (A.R.) of each method for an intuitive evaluation. Column prefixes denote the benchmark: Checkerboard = RealMPIP3K-Checkerboard, AC = RealMPIP3K-AC, SR&AC = RealMPIP3K-SR&AC.

| Method | Checkerboard OIQE↑ | Checkerboard PSNR↑ | Checkerboard SSIM↑ | AC U.S.↑ | AC NIQE↓ | AC BRISQUE↓ | SR&AC U.S.↑ | SR&AC NIQE↓ | SR&AC BRISQUE↓ | A.R.↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| RRDB [29] | 66.94% (4) | 19.587 (3) | 0.8872 (4) | 51.91% (3) | 4.930 (8) | 45.692 (2) | 37.84% (5) | 4.848 (4) | 50.729 (5) | 4.2 |
| SwinIR [27] | 67.51% (3) | 18.991 (5) | 0.8863 (5) | 49.05% (4) | 4.816 (6) | 46.380 (5) | 38.92% (4) | 4.833 (2) | 50.555 (4) | 4.2 |
| GRL [30] | 64.63% (7) | 19.348 (4) | 0.8885 (3) | 36.43% (7) | 4.665 (1) | 46.587 (6) | 8.65% (6) | 4.824 (1) | 51.057 (6) | 4.6 |
| UFormer [36] | 68.55% (2) | 19.841 (1) | 0.8897 (2) | 61.43% (2) | 4.914 (7) | 45.427 (1) | n.a. | n.a. | n.a. | 2.5 |
| PI2RNet [28] | 64.86% (6) | 18.443 (7) | 0.8734 (7) | 42.14% (6) | 4.710 (3) | 47.148 (8) | n.a. | n.a. | n.a. | 6.2 |
| RRDB+ | 58.28% (8) | 18.458 (6) | 0.8758 (6) | 32.14% (8) | 4.783 (5) | 46.675 (7) | 62.16% (3) | 4.857 (5) | 50.383 (2) | 5.6 |
| GRL+ | 65.31% (5) | 18.193 (8) | 0.8690 (8) | 48.33% (5) | 4.724 (4) | 46.138 (4) | 68.92% (2) | 4.842 (3) | 50.061 (1) | 4.4 |
| PART | 77.87% (1) | 19.606 (2) | 0.8943 (1) | 78.57% (1) | 4.707 (2) | 45.968 (3) | 83.51% (1) | 4.933 (6) | 50.422 (3) | 2.2 |
Figure 9: Visual results of PCIE on real MPIP with state-of-the-art models and our proposed PART. Top three rows: results of the AC pipeline. Bottom two rows: results of the SR&AC pipeline. We choose the top-four performing methods to show the results, while UFormer is not applicable for SR&AC.

V-C Experiments on Real-World Datasets

As shown in Table III, PCIE with representative models contributes significantly to removing the aberration-induced blur of real-world MPIP images. Specifically, OIQE improves from 55.22% to 58.28% ∼ 77.87%, and PSNR/SSIM improves from 16.215 dB/0.7995 to 18.193 dB/0.8690 ∼ 19.841 dB/0.8943. The results on NIQE and BRISQUE show a large variance, which we attribute to the fact that these metrics are designed for perspective natural images rather than annular MPIP images. For a comprehensive and intuitive evaluation, we rank each method on each metric and provide the average rank (A.R.). In the real-world case, PART outperforms other models, achieving the best OIQE (77.87%) and the best A.R. (2.2). The subjective evaluation of the User Study (U.S.) also shows that PART delivers more visually pleasant panoramic images, with far higher selection rates. The visual results of PCIE on real-world scenes are provided in Fig. 9. PCIE enables most methods to deliver high-quality panoramic images with few aberrations and high resolution, where PART sets the state of the art in terms of higher contrast, sharper edges, and fewer artifacts. Additionally, consistent with experiments on synthetic data, the recovered images of the SR&AC pipeline reveal perceptually unpleasant artifacts.
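The average rank can be reproduced from the per-metric ranks listed in Table III. The sketch below is a minimal illustration, assuming that "n.a." entries are simply skipped when averaging (the listed A.R. values for UFormer and PI2RNet are consistent with this); the names `RANKS` and `average_rank` are hypothetical.

```python
# Per-metric ranks copied from the "()" entries of Table III, in column order.
# Methods without SR&AC results list only their six available ranks.
RANKS = {
    "RRDB":    [4, 3, 4, 3, 8, 2, 5, 4, 5],
    "SwinIR":  [3, 5, 5, 4, 6, 5, 4, 2, 4],
    "GRL":     [7, 4, 3, 7, 1, 6, 6, 1, 6],
    "UFormer": [2, 1, 2, 2, 7, 1],
    "PI2RNet": [6, 7, 7, 6, 3, 8],
    "RRDB+":   [8, 6, 6, 8, 5, 7, 3, 5, 2],
    "GRL+":    [5, 8, 8, 5, 4, 4, 2, 3, 1],
    "PART":    [1, 2, 1, 1, 2, 3, 1, 6, 3],
}


def average_rank(ranks: list[int]) -> float:
    """Mean of the available per-metric ranks (lower is better)."""
    return sum(ranks) / len(ranks)


# Rounded to one decimal, matching the precision of the A.R. column.
AVERAGE_RANKS = {m: round(average_rank(r), 1) for m, r in RANKS.items()}
BEST_METHOD = min(AVERAGE_RANKS, key=AVERAGE_RANKS.get)

if __name__ == "__main__":
    for method, ar in sorted(AVERAGE_RANKS.items(), key=lambda kv: kv[1]):
        print(f"{method:8s} A.R. = {ar}")
```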

V-D Investigation on GAN-based Training Strategies

To generate richer details for recovered images, we investigate GAN-based training strategies on the classical models RRDB and SwinIR, and on our PART. Following [76], the GAN-based loss functions in ESRGAN [29] and Local Discriminative Learning (LDL) [76] are adopted, where the former is a classical GAN-based framework and the latter is an improved strategy to remove artifacts. We take models trained with L1Loss, i.e. PSNR-oriented models, as pre-trained generators, and then apply the GAN and LDL loss functions separately to enable these networks to generate more textures. As shown in Table IV, on synthetic data, both GAN and LDL lead to a decrease in recovery accuracy (PSNR and SSIM), while bringing great gains under the perceptual quality metrics. LDL is a more competitive strategy that outperforms GAN with higher fidelity and fewer visual artifacts, especially with PART.

Regarding real-world data, we present the OIQE and qualitative results in Fig. 10. GAN-based training further contributes to the removal of the aberration-induced blur, achieving better OIQE with higher image contrast. Aside from this, GAN-based models deliver more realistic imaging results with richer textures, which also bring some perceptually unpleasant artifacts and fake details despite being trained with LDL. We point out that the GAN-based strategies offer the potential for learning a more realistic high-quality MPIP image. Still, the local statistics in LDL of perspective images may need to be adapted to annular images for better suppression of artifacts.

We have further explored other potential generative models, e.g., the diffusion model [77, 78]. Please refer to the Appendix for more results.

TABLE IV: Quantitative evaluation of GAN-based training on our benchmarks of AC and SR&AC. All metrics are reported on PALHQ-SynMPIP-P1.

| Task | Method | Training Strategy | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ |
|---|---|---|---|---|---|---|
| AC | RRDB [29] | PSNR-oriented | 32.716 | 0.9265 | 0.0469 | 3.704 |
| AC | RRDB [29] | +GAN [29] | 28.929 | 0.8840 | 0.0392 | 4.919 |
| AC | RRDB [29] | +LDL [76] | 31.864 | 0.9166 | 0.0338 | 4.559 |
| AC | SwinIR [27] | PSNR-oriented | 32.913 | 0.9291 | 0.0446 | 3.670 |
| AC | SwinIR [27] | +GAN [29] | 29.916 | 0.8920 | 0.0449 | 4.254 |
| AC | SwinIR [27] | +LDL [76] | 31.770 | 0.9130 | 0.0297 | 3.444 |
| AC | PART | PSNR-oriented | 33.143 | 0.9304 | 0.0435 | 3.571 |
| AC | PART | +GAN [29] | 30.965 | 0.9045 | 0.0410 | 4.402 |
| AC | PART | +LDL [76] | 31.854 | 0.9148 | 0.0264 | 3.541 |
| SR&AC | RRDB [29] | PSNR-oriented | 28.856 | 0.8758 | 0.0733 | 9.957 |
| SR&AC | RRDB [29] | +GAN [29] | 25.842 | 0.8276 | 0.0638 | 12.564 |
| SR&AC | RRDB [29] | +LDL [76] | 28.112 | 0.8633 | 0.0561 | 9.746 |
| SR&AC | SwinIR [27] | PSNR-oriented | 28.985 | 0.8781 | 0.0714 | 9.938 |
| SR&AC | SwinIR [27] | +GAN [29] | 26.596 | 0.8385 | 0.0634 | 12.155 |
| SR&AC | SwinIR [27] | +LDL [76] | 27.875 | 0.8575 | 0.0686 | 9.423 |
| SR&AC | PART | PSNR-oriented | 29.310 | 0.8819 | 0.0681 | 9.648 |
| SR&AC | PART | +GAN [29] | 28.382 | 0.8688 | 0.0682 | 8.897 |
| SR&AC | PART | +LDL [76] | 28.608 | 0.8720 | 0.0508 | 8.715 |
Figure 10: Evaluation of GAN-based training strategies on real-world data. We take the AC pipeline as an example, where image patches from RealMPIP3K-0031 and RealMPIP3K-0057 are presented.

V-E Effectiveness of PALHQ

The collected PALHQ demonstrates an impressive ability to train the model for recovering both synthetic and real-world MPIP images in previous experiments. In this section, we explore whether PALHQ is necessary for PCIE. As an alternative to PALHQ, we simulate the aberrations of MPIP-P1 directly on the publicly available perspective image dataset, i.e. Flickr2K [79], creating PanoFlickr2K for training.

We compare representative models trained on PanoFlickr2K and PALHQ on both synthetic and real-world benchmarks in Table V and Fig. 11. It becomes clear that PALHQ contributes significantly to high-quality panoramic imaging: the numerical results on all metrics are improved by a large margin, and the visual results are more perceptually pleasant, with sharper edges, fewer artifacts, and less noise.

TABLE V: Quantitative comparison between the effectiveness of PALHQ and the available HQ dataset in PCIE. All metrics are reported on PALHQ-SynMPIP-P1.

| Task | Method | Training Dataset | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ |
|---|---|---|---|---|---|---|
| AC | UFormer [36] | PanoFlickr2K | 32.340 | 0.8253 | 0.0484 | 4.558 |
| AC | UFormer [36] | PALHQ | 32.999 | 0.9290 | 0.0442 | 3.672 |
| AC | RRDB [29] | PanoFlickr2K | 32.128 | 0.9235 | 0.0508 | 4.717 |
| AC | RRDB [29] | PALHQ | 32.716 | 0.9265 | 0.0469 | 3.704 |
| AC | SwinIR [27] | PanoFlickr2K | 32.292 | 0.9255 | 0.0485 | 4.521 |
| AC | SwinIR [27] | PALHQ | 32.913 | 0.9291 | 0.0446 | 3.670 |
| AC | PART | PanoFlickr2K | 32.498 | 0.9259 | 0.0480 | 4.348 |
| AC | PART | PALHQ | 33.143 | 0.9304 | 0.0435 | 3.571 |
| SR&AC | RRDB [29] | PanoFlickr2K | 27.943 | 0.8688 | 0.0906 | 12.276 |
| SR&AC | RRDB [29] | PALHQ | 28.856 | 0.8758 | 0.0733 | 9.957 |
| SR&AC | SwinIR [27] | PanoFlickr2K | 28.129 | 0.8705 | 0.0880 | 12.064 |
| SR&AC | SwinIR [27] | PALHQ | 28.985 | 0.8781 | 0.0714 | 9.938 |
| SR&AC | PART | PanoFlickr2K | 28.558 | 0.8748 | 0.0854 | 12.078 |
| SR&AC | PART | PALHQ | 29.310 | 0.8819 | 0.0681 | 9.648 |
Figure 11: PALHQ vs. PanoFlickr2K. The top row: illustration of PanoFlickr2K. We simulate the aberrations and FoV distributions of MPIP on perspective images of Flickr2K. Bottom two rows: image patches from RealMPIP3K-0020 and RealMPIP3K-0027. We take PART as an example to show the results trained on different datasets.

V-F Ablation Study

We conduct ablation studies to investigate how PSF-aware mechanisms contribute to high-quality MPIP image reconstruction. In all cases, the experiments are implemented with the AC pipeline on PALHQ-SynMPIP-P1, evaluated by PSNR and SSIM, and set up on the baseline model SwinIR.

Physical Information. As reported in Table VI, the different types of physical information are each concatenated with the input image for an intuitive evaluation. The PSF map contains rich information characterizing aberration-induced blur, providing better results compared to the FoV map. We then set $k'$ to 5 as the optimal value. A larger $k'$ tends to improve the model’s scores, but the performance saturates when $k'$ is too large, as the map then carries redundant and sparse information.

PSF-aware Mechanisms. Table VII shows that all the designed PSF-aware mechanisms contribute to reaching better scores, i.e., improvements of 0.230 dB in PSNR and 0.0013 in SSIM when all are used. Among the individual mechanisms, PFM attains the highest gains of 0.201 dB in PSNR and 0.0012 in SSIM. In addition, the performances of RRDB+ and GRL+ in Table II and Table II verify the consistent effectiveness of the plug-and-play PFM in other models, bringing improvements of 0.100 dB ∼ 0.478 dB in PSNR and 0.0006 ∼ 0.0036 in SSIM on SynMPIP-P1. Regarding the attention block, PMAB with 1×1 PFM and P-VSA enables adaptive self-attention guided by PSF information, outperforming the vanilla window-based self-attention.

Position of PFM. We further investigate the optimal position to insert PFM. As shown in Table VIII, we apply PFM after the feature extraction, at the end of each PRTB, and before the image reconstruction, for ablations. Applying PFM on shallow features yields more competitive results than applying it before reconstruction, while increasing the number of PFMs during the representation learning stage also leads to significant improvements. Using PFM in all ablated positions reaches the best performance, which is also corroborated by the observation in omnidirectional image super-resolution [6].

Effectiveness of PSF Representation. Table IX reports several possible PSF-aware mechanisms along with their vanilla versions without the guidance of the PSF map. The deformable (DConv and DeformSwin), FAC, and VSA mechanisms all deliver even worse performance compared to the baseline (−0.957 dB ∼ −0.202 dB in PSNR, −0.0086 ∼ −0.0021 in SSIM), which illustrates that the image-only network is unable to implicitly learn the complex spatial distribution of aberrations, leading to unreliable predictions of offsets, convolution kernels, and varied-size windows. Serving as a key modality, the PSF representation, i.e., the PSF map, which contains information on the intensity and size distribution of the PSF kernels, enables several potential PSF-aware mechanisms to achieve superior performance. Specifically, the guidance of the PSF representation brings improvements of 0.261 dB ∼ 1.158 dB in PSNR and 0.0027 ∼ 0.0098 in SSIM over the vanilla mechanisms.
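The ranges quoted above can be recomputed directly from the entries of Table IX. The following sketch is only a bookkeeping aid: the variable names are hypothetical, and pairing FAC with PFM as its PSF-guided counterpart is our reading of the table layout rather than something the table states explicitly.

```python
# (PSNR, SSIM) entries copied from Table IX.
BASELINE = (32.913, 0.9291)
# vanilla mechanism -> (vanilla scores, PSF-guided scores)
PAIRS = {
    "DConv":      ((32.258, 0.9237), (33.064, 0.9303)),  # vs. P-DConv
    "FAC":        ((31.956, 0.9205), (33.114, 0.9303)),  # vs. PFM (assumed pairing)
    "DeformSwin": ((32.711, 0.9270), (32.972, 0.9297)),  # vs. P-DeformSwin
    "VSA":        ((32.496, 0.9253), (32.999, 0.9297)),  # vs. P-VSA
}

# Change of each vanilla mechanism relative to the baseline.
vanilla_psnr = [round(v[0] - BASELINE[0], 3) for v, _ in PAIRS.values()]
vanilla_ssim = [round(v[1] - BASELINE[1], 4) for v, _ in PAIRS.values()]

# Gain of each PSF-guided mechanism over its vanilla version.
guided_psnr = [round(g[0] - v[0], 3) for v, g in PAIRS.values()]
guided_ssim = [round(g[1] - v[1], 4) for v, g in PAIRS.values()]

if __name__ == "__main__":
    print("vanilla vs. baseline, PSNR:", min(vanilla_psnr), "to", max(vanilla_psnr))
    print("vanilla vs. baseline, SSIM:", min(vanilla_ssim), "to", max(vanilla_ssim))
    print("guided vs. vanilla,  PSNR:", min(guided_psnr), "to", max(guided_psnr))
    print("guided vs. vanilla,  SSIM:", min(guided_ssim), "to", max(guided_ssim))
```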

TABLE VI: Ablations on Physical Information.

| Physical Information | $k'$ | PSNR | SSIM |
|---|---|---|---|
| w/o | - | 32.913 | 0.9291 |
| FoV map | - | 32.992 | 0.9293 |
| PSF map | 1 | 33.012 | 0.9300 |
| PSF map | 5 | 33.021 | 0.9301 |
| PSF map | 9 | 33.021 | 0.9299 |
TABLE VII: Ablations on PSF-aware Mechanisms.

| PSF-aware Mechanism | Params | PSNR | SSIM |
|---|---|---|---|
| w/o | 11.97M | 32.913 | 0.9291 |
| concat | 12.02M | 33.021 | 0.9301 |
| PFM | 16.72M | 33.114 | 0.9303 |
| PMAB (1×1 PFM) | 14.32M | 33.069 | 0.9302 |
| PMAB (P-VSA) | 12.18M | 32.999 | 0.9297 |
| PMAB (both) | 14.53M | 33.082 | 0.9303 |
| all | 19.27M | 33.143 | 0.9304 |
TABLE VIII: Ablations on the Position of PFM.

| Position | Params | PSNR | SSIM |
|---|---|---|---|
| w/o | 11.97M | 32.913 | 0.9291 |
| first conv | 12.61M | 33.032 | 0.9297 |
| PRTB | 15.54M | 33.071 | 0.9299 |
| last conv | 12.61M | 32.971 | 0.9298 |
| all | 16.72M | 33.114 | 0.9303 |
TABLE IX: Ablations on the effectiveness of PSF representation. DConv: Deformable convolution [80], DeformSwin: Deformable Swin transformer [6], “P-”: the offsets are predicted from the PSF feature, FAC: Filter Adaptive Convolution [66, 67].

| Method | Params | PSNR | SSIM |
|---|---|---|---|
| baseline | 11.97M | 32.913 | 0.9291 |
| w/ DConv | 14.71M | 32.258 | 0.9237 |
| w/ P-DConv | 14.71M | 33.064 | 0.9303 |
| w/ FAC | 16.72M | 31.956 | 0.9205 |
| w/ PFM | 16.72M | 33.114 | 0.9303 |
| w/ DeformSwin | 13.21M | 32.711 | 0.9270 |
| w/ P-DeformSwin | 13.21M | 32.972 | 0.9297 |
| w/ VSA | 12.18M | 32.496 | 0.9253 |
| w/ P-VSA | 12.18M | 32.999 | 0.9297 |

V-G Summary

The extensive experiments illustrate the critical points in the proposed PCIE for achieving minimalist and high-quality panoramic imaging. We summarize the following primary findings of our experiments:

  • The proposed PCIE presents impressive high-quality imaging results, where the MPIP-P1 and AC pipeline are superior choices for delivering aberration-free panoramic images with much more realistic details.

  • In PCIE, we find that window-attention-based models achieve better results. Furthermore, PSF-aware mechanisms are effective for improving the performance of SR models, where the proposed PSF-aware transformer, i.e. PART, sets the state of the art.

  • The PSF representation plays a significant role in PSF-aware mechanisms, facilitating effective learning of the inverse process of the aberration-induced blur.

  • Regarding the training strategies, GAN-based methods contribute to more realistic recovered images, but with some visually unpleasant artifacts and fake details. The generative model appears to be more competitive in PCIE if a good balance is struck when generating rich details and suppressing artifacts.

  • Compared with adapting perspective images, the collected high-quality panoramic annular image dataset, i.e. PALHQ, brings considerable improvements. PALHQ serves as the cornerstone of our PCIE for training a robust model to process MPIP images.

We hope that the PCIE can bring inspiration from optical design, network architecture, sensor choice, data preparation, and training strategies, for minimalist and high-quality panoramic imaging in mobile and wearable applications.

VI Conclusion and Discussion

VI-A Conclusion

In this paper, we design PCIE to present a general solution to minimalist and high-quality panoramic imaging. Based on the idea of PAL, the MPIP is proposed for 360° panoramic imaging with fewer than three lenses. Then, learning-based models, which are trained on synthetic aberration images from simulation, are applied to address the aberration-induced blur and low resolution of MPIP images. A new dataset, PALHQ, is collected to fill the gap of high-quality PAL images for low-level vision. We explore utilizing PSF information of the optical system to improve the performance of models and design a PSF-aware transformer, PART, with PSF-aware mechanisms. The plug-and-play mechanism PFM can enhance modern SR models for removing aberration-induced blur, while PART with PMAB delivers state-of-the-art performance on both synthetic and real-world benchmarks. Extensive experiments are conducted to investigate how to improve PCIE, providing heuristic findings for constructing a computational-imaging-based minimalist panoramic system with impressive imaging quality, in terms of optical design, network architecture, sensor selection, training strategies, and data preparation.

VI-B Discussion and Future Work

There are still some limitations in PCIE, which call for further investigation toward extremely high-quality imaging. First, the PSF-aware mechanisms are designed in a straightforward way, which improves the performance, yet with extra parameters and computational overhead. More efficient and effective PSF-aware architectures or training strategies are expected to further enhance the performance. Meanwhile, the improvements on CNN-based models are less pronounced than those on transformer models. We are interested in the design of learnable PSF representations, PSF-aware dynamic, deformable, and dilated convolutions, or PSF-aware varied-shape window attention for better exploitation of PSF information. Second, although we investigate state-of-the-art GAN-based training strategies, there remains open research space for further suppressing artifacts. Aside from this, the results of PCIE on real-world data are not as good as on synthetic data, where artifacts and fake details exist in some recovered images. The considerable synthetic-to-real gap calls for future research on domain adaptation. The number of images in PALHQ is also limited due to the difficulties of capturing high-quality PAL images under various scenes. We intend to design a hybrid training approach to take advantage of the large data size of the publicly available perspective datasets while improving the training with PALHQ. Finally, we will focus on an end-to-end framework for joint optimization of the MPIP design and the recovery model, aiming at a more general engine for minimalist and high-quality panoramic imaging.

References

  • [1] K. Yang, X. Hu, and R. Stiefelhagen, “Is context-aware CNN ready for the surroundings? Panoramic semantic segmentation in the wild,” TIP, vol. 30, pp. 1866–1881, 2021.
  • [2] K. Tateno, N. Navab, and F. Tombari, “Distortion-aware convolutional filters for dense prediction in panoramic images,” in ECCV, vol. 11220, 2018, pp. 732–750.
  • [3] Z. Shen, C. Lin, K. Liao, L. Nie, Z. Zheng, and Y. Zhao, “PanoFormer: Panorama transformer for indoor 360° depth estimation,” in ECCV, vol. 13661, 2022, pp. 195–211.
  • [4] K. Liao, X. Xu, C. Lin, W. Ren, Y. Wei, and Y. Zhao, “Cylin-Painting: Seamless 360° panoramic image outpainting and beyond,” TIP, vol. 33, pp. 382–394, 2024.
  • [5] Y. Yoon, I. Chung, L. Wang, and K.-J. Yoon, “SphereSR: 360° image super-resolution with arbitrary projection via continuous spherical image representation,” in CVPR, 2022, pp. 5667–5676.
  • [6] F. Yu, X. Wang, M. Cao, G. Li, Y. Shan, and C. Dong, “OSRT: Omnidirectional image super-resolution with distortion-aware transformer,” in CVPR, 2023, pp. 13 283–13 292.
  • [7] X. Sun et al., “OPDN: Omnidirectional position-aware deformable network for omnidirectional image super-resolution,” in CVPRW, 2023, pp. 1293–1301.
  • [8] S. Yang, C. Lin, K. Liao, and Y. Zhao, “FishFormer: Annulus slicing-based transformer for fisheye rectification with efficacy domain exploration,” arXiv preprint arXiv:2207.01925, 2022.
  • [9] S. Yang, C. Lin, K. Liao, C. Zhang, and Y. Zhao, “Progressively complementary network for fisheye image rectification using appearance flow,” in CVPR, 2021, pp. 6348–6357.
  • [10] S. Yang, C. Lin, K. Liao, and Y. Zhao, “Innovating real fisheye image correction with dual diffusion architecture,” in ICCV, 2023, pp. 12 653–12 662.
  • [11] S. Gao, K. Yang, H. Shi, K. Wang, and J. Bai, “Review on panoramic imaging and its applications in scene understanding,” TIM, vol. 71, pp. 1–34, 2022.
  • [12] D. Cheng, C. Gong, C. Xu, and Y. Wang, “Design of an ultrawide angle catadioptric lens with an annularly stitched aspherical surface,” OE, vol. 24, no. 3, pp. 2664–2677, 2016.
  • [13] S. Gao, E. A. Tsyganok, and X. Xu, “Design of a compact dual-channel panoramic annular lens with a large aperture and high resolution,” AO, vol. 60, no. 11, pp. 3094–3102, 2021.
  • [14] Q. Jiang et al., “Computational optics meet domain adaptation: Transferring semantic segmentation beyond aberrations,” arXiv preprint arXiv:2211.11257, 2022.
  • [15] Y. Peng, Q. Fu, F. Heide, and W. Heidrich, “The diffractive achromat: Full spectrum computational imaging with diffractive optics,” TOG, vol. 35, no. 4, pp. 1–11, 2016.
  • [16] Y. Peng, Q. Sun, X. Dun, G. Wetzstein, W. Heidrich, and F. Heide, “Learned large field-of-view imaging with thin-plate optics,” TOG, vol. 38, no. 6, pp. 1–14, 2019.
  • [17] X. Li, J. Suo, W. Zhang, X. Yuan, and Q. Dai, “Universal and flexible optical aberration correction using deep-prior based deconvolution,” in ICCV, 2021, pp. 2593–2601.
  • [18] Q. Sun, C. Wang, Q. Fu, X. Dun, and W. Heidrich, “End-to-end complex lens design with differentiable ray tracing,” TOG, vol. 40, no. 4, pp. 1–13, 2021.
  • [19] C. Wang, N. Chen, and W. Heidrich, “dO: A differentiable engine for deep lens design of computational imaging systems,” TCI, vol. 8, pp. 905–916, 2022.
  • [20] X. Yang, Q. Fu, and W. Heidrich, “Curriculum learning for ab initio deep learned refractive optics,” arXiv preprint arXiv:2302.01089, 2023.
  • [21] K. Zhang, X. Zhong, L. Zhang, and T. Zhang, “Design of a panoramic annular lens with ultrawide angle and small blind area,” AO, vol. 59, no. 19, pp. 5737–5744, 2020.
  • [22] J. Wang, J. Bai, K. Wang, and S. Gao, “Design of stereo imaging system with a panoramic annular lens and a convex mirror,” OE, vol. 30, no. 11, pp. 19 017–19 029, 2022.
  • [23] P. Greguss, “Panoramic imaging block for three-dimensional space,” Jan. 28 1986, US Patent 4,566,763.
  • [24] I. Powell, “Panoramic lens,” AO, vol. 33, no. 31, pp. 7356–7361, 1994.
  • [25] S. Thibault, J. Gauvin, M. Doucet, and M. Wang, “Enhanced optical design by distortion control,” in SPIE, vol. 5962, 2005, pp. 307–314.
  • [26] D. Geng, H.-t. Yang, C. Mei, and Y.-h. Li, “Optical system design of space fisheye lens and performance analysis,” in SPIE, vol. 10462, 2017, pp. 1276–1282.
  • [27] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “SwinIR: Image restoration using swin transformer,” in ICCVW, 2021, pp. 1833–1844.
  • [28] Q. Jiang, H. Shi, L. Sun, S. Gao, K. Yang, and K. Wang, “Annular computational imaging: Capture clear panoramic images through simple lens,” TCI, vol. 8, pp. 1250–1264, 2022.
  • [29] X. Wang et al., “ESRGAN: Enhanced super-resolution generative adversarial networks,” in ECCVW, vol. 11133, 2018, pp. 63–79.
  • [30] Y. Li et al., “Efficient and explicit modelling of image hierarchies for image restoration,” in CVPR, 2023, pp. 18 278–18 289.
  • [31] W. Li, X. Lu, J. Lu, X. Zhang, and J. Jia, “On efficient transformer and image pre-training for low-level vision,” arXiv preprint arXiv:2112.10175, 2021.
  • [32] X. Chen, X. Wang, J. Zhou, Y. Qiao, and C. Dong, “Activating more pixels in image super-resolution transformer,” in CVPR, 2023, pp. 22 367–22 377.
  • [33] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in CVPRW, 2017, pp. 1132–1140.
  • [34] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in ECCV, vol. 11211, 2018, pp. 294–310.
  • [35] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in CVPR, 2022, pp. 5718–5729.
  • [36] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, “Uformer: A general U-shaped transformer for image restoration,” in CVPR, 2022, pp. 17 662–17 672.
  • [37] L. Chen, X. Lu, J. Zhang, X. Chu, and C. Chen, “HINet: Half instance normalization network for image restoration,” in CVPRW, 2021, pp. 182–192.
  • [38] L. Chen, X. Chu, X. Zhang, and J. Sun, “Simple baselines for image restoration,” in ECCV, vol. 13667, 2022, pp. 17–33.
  • [39] H. Ai, Z. Cao, J. Zhu, H. Bai, Y. Chen, and L. Wang, “Deep learning for omnidirectional vision: A survey and new perspectives,” arXiv preprint arXiv:2205.10468, 2022.
  • [40] J. Zhang, K. Yang, C. Ma, S. Reiß, K. Peng, and R. Stiefelhagen, “Bending reality: Distortion-aware transformers for adapting to panoramic semantic segmentation,” in CVPR, 2022, pp. 16 896–16 906.
  • [41] H. Chen, W. Hu, K. Yang, J. Bai, and K. Wang, “Panoramic annular SLAM with loop closure and global optimization,” AO, vol. 60, no. 21, pp. 6264–6274, 2021.
  • [42] Z. Wang, K. Yang, H. Shi, P. Li, F. Gao, and K. Wang, “LF-VIO: A visual-inertial-odometry framework for large field-of-view cameras with negative plane,” in IROS, 2022, pp. 4423–4430.
  • [43] C. J. Schuler, M. Hirsch, S. Harmeling, and B. Schölkopf, “Non-stationary correction of optical aberrations,” in ICCV, 2011, pp. 659–666.
  • [44] Y. Liu, C. Zhang, T. Kou, Y. Li, and J. Shen, “End-to-end computational optics with a singlet lens for large depth-of-field imaging,” OE, vol. 29, no. 18, pp. 28 530–28 548, 2021.
  • [45] G. Côté, F. Mannan, S. Thibault, J.-F. Lalonde, and F. Heide, “The differentiable lens: Compound lens search over glass surfaces and materials for object detection,” in CVPR, 2023, pp. 20 803–20 812.
  • [46] H. Trussell and B. Hunt, “Image restoration of space variant blurs by sectioned methods,” in ICASSP, vol. 3, 1978, pp. 196–198.
  • [47] J. Kim, A. Tsai, M. Cetin, and A. S. Willsky, “A curve evolution-based variational approach to simultaneous image restoration and segmentation,” in ICIP, vol. 1, 2002, pp. I–I.
  • [48] E. Kee, S. Paris, S. Chen, and J. Wang, “Modeling and removing spatially-varying optical blur,” in ICCP, 2011, pp. 1–8.
  • [49] Y. Xue, Q. Yang, G. Hu, K. Guo, and L. Tian, “Deep-learning-augmented computational miniature mesoscope,” Optica, vol. 9, no. 9, pp. 1009–1021, 2022.
  • [50] L. Denis, E. Thiébaut, F. Soulez, J.-M. Becker, and R. Mourya, “Fast approximations of shift-variant blur,” IJCV, vol. 115, pp. 253–278, 2015.
  • [51] N. Wiener, Extrapolation, interpolation, and smoothing of stationary time series: with engineering applications.   MIT press Cambridge, MA, 1949, vol. 113, no. 21.
  • [52] W. H. Richardson, “Bayesian-based iterative method of image restoration,” JOSA, vol. 62, no. 1, pp. 55–59, 1972.
  • [53] L. B. Lucy, “An iterative technique for the rectification of observed distributions,” The Astronomical Journal, vol. 79, p. 745, 1974.
  • [54] J. Sun, W. Cao, Z. Xu, and J. Ponce, “Learning a convolutional neural network for non-uniform motion blur removal,” in CVPR, 2015, pp. 769–777.
  • [55] S. Chen, H. Feng, K. Gao, Z. Xu, and Y. Chen, “Extreme-quality computational imaging via degradation framework,” in ICCV, 2021, pp. 2612–2621.
  • [56] S. Chen, T. Lin, H. Feng, Z. Xu, Q. Li, and Y. Chen, “Computational optics for mobile terminals in mass production,” TPAMI, vol. 45, no. 4, pp. 4245–4259, 2023.
  • [57] T. Lin, S. Chen, H. Feng, Z. Xu, Q. Li, and Y. Chen, “Non-blind optical degradation correction via frequency self-adaptive and finetune tactics,” OE, vol. 30, no. 13, pp. 23 485–23 498, 2022.
  • [58] Q. Ma, J. Jiang, X. Liu, and J. Ma, “Learning a 3D-CNN and transformer prior for hyperspectral image super-resolution,” Information Fusion, vol. 100, p. 101907, 2023.
  • [59] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study,” in CVPRW, 2017, pp. 1122–1131.
  • [60] X. Deng, H. Wang, M. Xu, Y. Guo, Y. Song, and L. Yang, “LAU-net: Latitude adaptive upscaling network for omnidirectional image super-resolution,” in CVPR, 2021, pp. 9189–9198.
  • [61] J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba, “Recognizing scene viewpoint using panoramic place representation,” in CVPR, 2012, pp. 2695–2702.
  • [62] K. Zhang, J. Liang, L. Van Gool, and R. Timofte, “Designing a practical degradation model for deep blind image super-resolution,” in ICCV, 2021, pp. 4771–4780.
  • [63] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data,” in ICCVW, 2021, pp. 1905–1914.
  • [64] E. Huggins, “Introduction to fourier optics,” Physics Teacher, vol. 45, no. 6, pp. 364–368, 2007.
  • [65] V. N. Mahajan, “Zernike circle polynomials and optical aberrations of systems with circular pupils,” AO, vol. 33, no. 34, pp. 8121–8124, 1994.
  • [66] D. Li et al., “Involution: Inverting the inherence of convolution for visual recognition,” in CVPR, 2021, pp. 12 321–12 330.
  • [67] Y. Jiang, B. Wronski, B. Mildenhall, J. T. Barron, Z. Wang, and T. Xue, “Fast and high quality image denoising via malleable convolution,” in ECCV, vol. 13678, 2022, pp. 429–446.
  • [68] Q. Zhang, Y. Xu, J. Zhang, and D. Tao, “VSA: Learning varied-size window attention in vision transformers,” in ECCV, vol. 13685, 2022, pp. 466–483.
  • [69] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” TIP, vol. 13, no. 4, pp. 600–612, 2004.
  • [70] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.
  • [71] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local nash equilibrium,” in NeurIPS, vol. 30, 2017, pp. 6626–6637.
  • [72] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” SPL, vol. 20, no. 3, pp. 209–212, 2012.
  • [73] A. Mittal, A. Moorthy, and A. Bovik, “Referenceless image spatial quality evaluation engine,” in ACSSC, vol. 38, 2011, pp. 53–54.
  • [74] X. Chu, L. Chen, C. Chen, and X. Lu, “Improving image restoration by revisiting global information aggregation,” in ECCV, 2022, pp. 53–71.
  • [75] L. Beyer et al., “FlexiViT: One model for all patch sizes,” in CVPR, 2023, pp. 14 496–14 506.
  • [76] J. Liang, H. Zeng, and L. Zhang, “Details or artifacts: A locally discriminative learning approach to realistic image super-resolution,” in CVPR, 2022, pp. 5647–5656.
  • [77] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in NeurIPS, vol. 33, 2020, pp. 6840–6851.
  • [78] A. Graikos, N. Malkin, N. Jojic, and D. Samaras, “Diffusion models as plug-and-play priors,” in NeurIPS, vol. 35, 2022, pp. 14 715–14 728.
  • [79] R. Timofte et al., “NTIRE 2017 challenge on single image super-resolution: Methods and results,” in CVPRW, 2017, pp. 1110–1121.
  • [80] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable ConvNets V2: More deformable, better results,” in CVPR, 2019, pp. 9300–9308.
  • [81] C. Kamann and C. Rother, “Benchmarking the robustness of semantic segmentation models,” in CVPR, 2020, pp. 8825–8835.
  • [82] L. Hoyer, D. Dai, and L. Van Gool, “DAFormer: Improving network architectures and training strategies for domain-adaptive semantic segmentation,” in CVPR, 2022, pp. 9914–9925.
  • [83] A. Jaiswal, X. Zhang, S. H. Chan, and Z. Wang, “Physics-driven turbulence image restoration with stochastic refinement,” in ICCV, 2023, pp. 12 136–12 147.
  • [84] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” TPAMI, vol. 45, no. 4, pp. 4713–4726, 2023.
  • [85] X. Zhang, H. Zeng, S. Guo, and L. Zhang, “Efficient long-range attention network for image super-resolution,” in ECCV, vol. 13677, 2022, pp. 649–667.

Appendix A Sample Images of PALHQ

We show the shooting device and sample images of PALHQ in Fig. A.1. This dataset of high-quality PAL images covers a wide variety of scenes. The PAL provides 360° imaging of the surroundings, but with a blind area in the center of the image due to the reflective surface in the central FoV. As illustrated on the left of Fig. A.1, the PAL is usually pointed toward the sky in use, so the blind area has little influence on the acquisition of panoramic information. PALHQ serves as the cornerstone of our PCIE for training a robust model to process MPIP images. Additionally, PALHQ can be converted to unfolded panoramas via equirectangular projection (ERP) for various other panoramic image processing applications. Other minimalist PAL designs, e.g., those using fewer lenses or metasurfaces, would also benefit from PALHQ for training learning-based recovery models to improve imaging quality.

Refer to caption
Figure A.1: The shooting device and sample images of PALHQ. With a well-designed PAL of 11 lenses and a Sony α6600 camera (on the left), we capture high-quality panoramic images covering various scenes including indoor, natural, urban, campus, and scenic spots (on the right).

Appendix B Visualization of the PSF Map

The process of producing PSF maps is visualized in Fig. B.1. As described in Sec. IV-A, we locate the FoV corresponding to the target pixel and obtain its R/G/B PSFs. They are then rotated according to the pixel location and compressed into a shape of $k' \times k' \times 1$ via adaptive average pooling, where $k'$ is set to 5 as an example. Finally, we reshape the compressed kernel into the channel dimension and assign it to the pixel to produce the PSF map.

Refer to caption
Figure B.1: The visualization of compressing PSFs into PSF maps.
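The compression steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the released implementation: the function names, the linear radius-to-FoV lookup, the nearest-neighbour rotation, its sign convention, and the averaging of the R/G/B PSFs into one channel are all hypothetical choices made here for brevity.

```python
import numpy as np

def adaptive_avg_pool2d(kernel, out_size):
    """Average-pool a 2-D array to (out_size, out_size) with adaptive bin edges."""
    h, w = kernel.shape
    out = np.empty((out_size, out_size), dtype=np.float64)
    for i in range(out_size):
        r0 = (i * h) // out_size              # floor(i * h / out_size)
        r1 = -(-(i + 1) * h // out_size)      # ceil((i + 1) * h / out_size)
        for j in range(out_size):
            c0 = (j * w) // out_size
            c1 = -(-(j + 1) * w // out_size)
            out[i, j] = kernel[r0:r1, c0:c1].mean()
    return out

def rotate_nearest(kernel, angle_rad):
    """Rotate a square kernel about its centre with nearest-neighbour sampling."""
    k = kernel.shape[0]
    c = (k - 1) / 2.0
    ys, xs = np.mgrid[0:k, 0:k]
    # Inverse mapping: for every output pixel, look up the source pixel.
    xr = np.cos(angle_rad) * (xs - c) + np.sin(angle_rad) * (ys - c) + c
    yr = -np.sin(angle_rad) * (xs - c) + np.cos(angle_rad) * (ys - c) + c
    xi, yi = np.rint(xr).astype(int), np.rint(yr).astype(int)
    valid = (xi >= 0) & (xi < k) & (yi >= 0) & (yi < k)
    out = np.zeros_like(kernel)
    out[valid] = kernel[yi[valid], xi[valid]]
    return out

def build_psf_map(psfs, height, width, k_out=5):
    """psfs: (n_fov, k, k, 3) PSF library indexed by FoV -> (height, width, k_out**2) map."""
    n_fov = psfs.shape[0]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    r_max = np.hypot(cy, cx)
    psf_map = np.zeros((height, width, k_out * k_out))
    for y in range(height):
        for x in range(width):
            # Assumption: FoV index grows linearly with the radius from the image centre.
            r = np.hypot(y - cy, x - cx)
            fov = min(int(r / r_max * n_fov), n_fov - 1)
            angle = np.arctan2(y - cy, x - cx)
            kernel = psfs[fov].mean(axis=-1)   # assumption: merge R/G/B by averaging
            kernel = rotate_nearest(kernel, angle)
            psf_map[y, x] = adaptive_avg_pool2d(kernel, k_out).ravel()
    return psf_map
```

For example, a library of 4 PSFs of size 11×11×3 and a 6×6 image yields a map of shape (6, 6, 25) when `k_out=5`, i.e. 25 extra channels per pixel. The per-pixel loop is written for readability; a practical version would precompute one compressed kernel per FoV and only rotate per pixel.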

Appendix C Pipeline for OIQE Calculation

The detailed pipeline for calculating the defined OIQE is shown in Fig. C.1. We capture 8 testing images of the checkerboard under different ISP settings with our MPIP-P1 and AC pipeline. To evaluate different models on aberration correction in real-world scenes, the processed testing images are fed into the OIQE pipeline, where sample knife-edges from different FoVs are cropped for SFR testing. By aggregating results over different shooting settings and FoVs, OIQE gives a more credible evaluation of the ability to correct optical aberrations.

Refer to caption
Figure C.1: Pipeline for OIQE calculation. ESF: Edge Spread Function. LSF: Line Spread Function.
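The ESF-to-LSF-to-MTF chain named in the caption can be illustrated with a minimal NumPy sketch. It assumes a one-dimensional edge profile that has already been extracted from a cropped knife-edge, and it omits the slanted-edge oversampling used in standard SFR tools; the function name `mtf_from_esf` is hypothetical, and the aggregation of MTF values into the OIQE score is defined in the main paper and is not reproduced here.

```python
import numpy as np

def mtf_from_esf(esf):
    """Return (frequencies in cycles/pixel, normalised MTF) from a 1-D edge profile."""
    esf = np.asarray(esf, dtype=np.float64)
    lsf = np.diff(esf)                    # LSF is the derivative of the ESF
    mtf = np.abs(np.fft.rfft(lsf))        # MTF is the magnitude of the LSF spectrum
    mtf = mtf / mtf[0]                    # normalise so that MTF(0) = 1
    freqs = np.fft.rfftfreq(lsf.size, d=1.0)
    return freqs, mtf
```

A perfect step edge gives a delta-like LSF and therefore a flat MTF of 1 at all frequencies, while a blurred edge gives an MTF that falls off with frequency, which is the behaviour an aberration-correction model is expected to improve.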

Appendix D Implementation Details of the Simulation Model

In this section, we describe specifically how aberration images are generated based on the simulation model and the Zemax software. To apply Eq. (1) for simulating aberrations, a set of PSFs under all FoVs of the target optical lens is required. We input the structure of MPIP into Zemax, then calculate the Zernike standard coefficients under different FoVs (128 FoVs from the minimum to the maximum FoV) and wavelengths (31 wavelengths from 400 nm to 700 nm), where the first 37 polynomials are kept as a common practice. In this way, a Zernike coefficient matrix with a shape of $31 \times 128 \times 37$ is produced. Then, we plug the coefficients into Eq. (2) to describe the wavefronts under all FoVs and wavelengths. The random disturbance strategy is applied here to perturb the coefficients, yielding multiple virtual aberration distributions. Before calculating PSFs, we also need the spot diagram and illumination distribution of the MPIP from Zemax, where the spot sizes determine the kernel sizes of the PSFs (the ratio of the spot size to the pixel size), and the illumination provides the relative amplitude of the PSFs. Finally, the wavefronts are transformed to PSFs $K_{\theta}(x,y,\lambda)$ via Eq. (D.1) to Eq. (D.3):

$$\mathcal{P}_{\theta}(x,y,\lambda)=P(x,y)\,e^{\mathrm{i}\Phi_{\theta}(x,y,\lambda)}, \tag{D.1}$$
$$E_{\theta}(x,y,\lambda)=\frac{E_{0}}{\lambda d}\iint \mathcal{P}_{\theta}(x',y',\lambda)\,e^{-\mathrm{i}\frac{2\pi}{\lambda d}(x'x+y'y)}\,dx'\,dy', \tag{D.2}$$
$$K_{\theta}(x,y,\lambda)=\left|E_{\theta}(x,y,\lambda)\right|^{2}, \tag{D.3}$$

where $P(x,y)$ is the circ function and $d$ is the distance from the exit pupil to the image plane. With multiple sets of $K_{\theta}(x,y,\lambda)$ under different random disturbances, we generate aberration images of multiple virtual MPIP samples via Eq. (1). The high-quality MPIP images are first transformed to raw images by an inverse ISP (gamma decompression, inverse color correction matrix, and inverse white balance), and the aberrated raw images are then processed by the ISP (mosaicing, noise addition, demosaicing, white balance, color correction matrix, and gamma compression) to obtain the final results.
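As a numerical illustration of Eq. (D.1) to Eq. (D.3), the sketch below evaluates the Fourier integral with a discrete FFT for a single FoV and wavelength. It is a simplified sketch under assumptions: the wavefront phase $\Phi_{\theta}$ is passed in directly as an array in radians rather than built from Zernike coefficients, the constant $E_{0}/(\lambda d)$ is absorbed by normalising the PSF to unit energy, and the spot-size-based kernel scaling and illumination weighting are not modelled. The function name and the `pad` parameter are hypothetical.

```python
import numpy as np

def psf_from_phase(phase, pad=2):
    """Compute a unit-energy PSF from a square pupil-phase array (radians)."""
    n = phase.shape[0]
    y, x = np.mgrid[-1:1:n * 1j, -1:1:n * 1j]
    aperture = (x ** 2 + y ** 2 <= 1.0).astype(np.float64)  # circ function P(x, y)
    pupil = aperture * np.exp(1j * phase)                    # Eq. (D.1)
    # Eq. (D.2) up to a constant factor: zero-padded 2-D Fourier transform.
    # Padding at the array end only adds a linear phase, which Eq. (D.3) discards.
    field = np.fft.fftshift(np.fft.fft2(pupil, s=(pad * n, pad * n)))
    psf = np.abs(field) ** 2                                 # Eq. (D.3)
    return psf / psf.sum()
```

With a zero phase the result is the diffraction-limited pattern peaked at the array centre; adding a defocus-like phase such as `4.0 * (x**2 + y**2)` spreads the energy and lowers the peak, which is the kind of FoV-dependent degradation the simulation is meant to reproduce.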

Appendix E Implementation Details of Model Testing

During the testing (inference) stage, the input is the full-resolution panoramic image, except for Restormer and NAFNet. For global self-attention-based methods, e.g., Restormer, performance is sensitive to the image resolution, and the same resolution is required during testing and training to maintain consistently high performance [74, 75]. Moreover, in our tasks, a larger testing input contains more complex spatially variant degradation (covering more FoVs), which widens the gap with the training data. Consequently, in Table I of the paper, the results of Restormer are those under the cropping testing strategy, where the input image is cropped into overlapping patches of $256 \times 256$. The same is true for NAFNet. To further illustrate this issue, we test Restormer and NAFNet under different input crop sizes, as shown in Table E.1. When the training crop size is $256 \times 256$, the performance of the models drops significantly as the testing crop size increases from $256 \times 256$ to $3152 \times 3152$ (the full resolution).

TABLE E.1: The impacts of input resolution on Restormer and NAFNet. The models are trained on $256 \times 256$ image patches and tested with different resolutions. We take the results on SynMPIP-P1 as an example. The results in the table are PSNR/SSIM.

| Method | Input resolution 3152 | Input resolution 1024 | Input resolution 256 |
| --- | --- | --- | --- |
| Restormer | 27.424/0.8826 | 30.333/0.9088 | 32.971/0.9287 |
| NAFNet | 28.494/0.8853 | 30.481/0.9099 | 32.837/0.9274 |
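The cropping testing strategy can be sketched as follows. This is a minimal sketch, not the evaluation code used for the tables: the overlap width, the uniform averaging of overlapping regions, and the function name `tiled_inference` are assumptions, and `model` stands for any callable that maps an H×W×C patch to a restored patch of the same shape.

```python
import numpy as np

def tiled_inference(image, model, patch=256, overlap=32):
    """Run `model` on overlapping patches of an H x W x C image and average overlaps."""
    h, w = image.shape[:2]
    stride = patch - overlap
    ys = list(range(0, max(h - patch, 0) + 1, stride))
    xs = list(range(0, max(w - patch, 0) + 1, stride))
    # Add a final tile flush with the border if the regular grid does not reach it.
    if ys[-1] + patch < h:
        ys.append(h - patch)
    if xs[-1] + patch < w:
        xs.append(w - patch)
    out = np.zeros_like(image, dtype=np.float64)
    weight = np.zeros((h, w, 1), dtype=np.float64)
    for y in ys:
        for x in xs:
            tile = image[y:y + patch, x:x + patch]
            out[y:y + patch, x:x + patch] += model(tile)
            weight[y:y + patch, x:x + patch] += 1.0
    return out / weight
```

Keeping the test patch size equal to the training crop size is the point of this strategy: each patch then spans roughly the same range of FoVs, and hence the same degradation variety, that the model saw during training.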

Appendix F Ground-Truth Generation Pipeline for Checkerboard Images.

Capturing real data with Ground Truth (GT) is challenging in the computational imaging field, and no reliable data acquisition pipeline is available in related work. Photographing GT images displayed on a screen with the optical system under test could be a solution [16]. However, a gap still exists between the screen and the real image. Moreover, for a special panoramic system such as MPIP, no suitable screen is available for capturing paired panoramic images.

Consequently, we make an early effort to generate GT images based on captured special patterns. For a black-and-white geometric pattern, e.g., the checkerboard, degraded by aberrations, we only need to extract its edges and re-color each part according to its original distribution to generate its GT pattern. This method was previously applied in [55] to generate checkerboard pairs for training a degradation network. In our case, we crop patches of checkerboard test images captured by MPIP under different FoVs and generate the corresponding GTs by the above method, as shown in Fig. F.1. We then only need to crop the patches of the same area from the imaging results of PCIE and calculate error metrics, e.g., PSNR and SSIM, against the corresponding GTs. The checkerboard testing set of RealMPAL consists of 7 paired images, covering checkerboard patches under different FoVs and ISP settings.
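A rough stand-in for the re-coloring idea is sketched below for a single grayscale patch. It is not the procedure of [55] or the exact pipeline of Fig. F.1: instead of explicit edge extraction it separates dark and bright squares with an iteratively refined threshold, and it re-colors each region with its own mean intensity, which is one possible reading of "according to its original distribution". The function names are hypothetical, and the patch is assumed to contain both dark and bright squares.

```python
import numpy as np

def two_level_gt(patch, iters=20):
    """Re-color a blurred black-and-white patch with two flat intensity levels."""
    gray = np.asarray(patch, dtype=np.float64)
    t = gray.mean()
    for _ in range(iters):
        mask = gray > t
        dark, bright = gray[~mask].mean(), gray[mask].mean()
        t = 0.5 * (dark + bright)          # midpoint between the two region means
    mask = gray > t
    return np.where(mask, gray[mask].mean(), gray[~mask].mean())

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images with values in [0, peak]."""
    mse = np.mean((np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Given a recovered patch `rec` cropped from the same area, `psnr(rec, two_level_gt(captured))` then plays the role of the error metric on real-world data.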

However, the pipeline is only a preliminary experiment and still has some weaknesses. For example, the coloring method for GT is worth further investigation, because even the checkerboard captured by a well-designed PAL is not as ideal as the GT. The PSNR and SSIM between recovered images and GT might therefore not fully reflect the aberration correction ability of the model. In comparison, the OIQE, defined based on the optical metric MTF, is more credible and suitable for evaluating the aberration correction task.

Refer to caption
Figure F.1: The illustration of the Ground-Truth (GT) generation and quantitative testing pipelines for testing checkerboard images. With the generated GT checkerboard images, we can calculate the error metrics, e.g., PSNR and SSIM, on real-world data.

Appendix G Implementation Details of User Study

To conduct a subjective evaluation of the imaging results of PCIE in real-world scenes, we randomly sample 10 images from RealMPAL3K and 10 images from RealMPAL1K for the AC and SR&AC pipelines, respectively. 42 volunteers are invited to participate in the survey, in which they go through the imaging results of all the methods in Table III and select the half with the best image quality. The final statistic is the percentage of times each method is selected, reported as U.S. in Table III.

Appendix H Failure Case of PSF-aware Mechanisms

From the quantitative results on synthetic datasets, RRDB+ delivers better results than RRDB on most metrics and tasks. However, compared to the significant improvements that the PSF-aware mechanisms bring to SwinIR and GRL, the improvements to RRDB are limited, and FID even becomes worse in some cases (AC-SynMPIP-P1 and SR&AC-SynMPIP-P1). The change in OIQE from 66.94% to 58.28% also illustrates the limitations of PFM on RRDB. This is a failure case of the PSF-aware mechanisms.

We speculate that this is caused by the limited robustness of CNN-based models to the domain shift of testing data [14, 81, 82]. In our evaluation, for both synthetic and real data, the aberration distribution, i.e., the PSF distribution, is slightly different from the standard distribution used in the PSF representation, which is often the case in real-world scenes due to manufacturing and assembly errors of the lens. In this case, the model has to infer the actual distribution from the standard PSF distribution to guide the PFM. Consequently, the CNN-based model is seriously affected by the domain gap, leading to unreliable prediction of the dynamic kernels in PFM and thus worse performance.

Appendix I More Results of Generative Models.

Beyond the GAN-based training strategies, the recently developed diffusion models [77, 78, 83] show a strong ability to generate realistic images with rich details. Consequently, we further explore the potential of diffusion models in our AC task.

TABLE I.1: The quantitative results and computational overhead of SwinIR and corresponding generative models. The number of denoising steps of SR3 is set to 10 to make sure that the model can converge. The parameters and FLOPs of SR3 are multiplied by 10 considering the steps.

| Method | PSNR | SSIM | LPIPS | FID | Params | FLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| SwinIR | 32.913 | 0.9291 | 0.0446 | 3.670 | 11.97M | 407.76G |
| SwinIR+GAN | 29.916 | 0.8920 | 0.0449 | 4.254 | 11.97M | 407.76G |
| SwinIR+LDL | 31.770 | 0.9130 | 0.0297 | 3.444 | 11.97M | 407.76G |
| SwinIR+SR3 | 31.281 | 0.8412 | 0.4110 | 17.956 | 11.97M+13.85M | 407.76G+3481G |

Following [83], we refine the PSNR-oriented SwinIR model with the diffusion model SR3 [84], where the images recovered by SwinIR serve as the condition of the diffusion model, since PALHQ is too small to train a diffusion model from scratch. The experimental results of the refined model are shown in Fig. I.1 and Table I.1. Unlike GAN and LDL, the diffusion model (SwinIR+SR3) does not improve the perceptual metrics but instead leads to worse performance.

It is known that diffusion models require training on large datasets with many denoising steps (2000 as common practice) for good performance [77, 84]. When the dataset is small, the number of denoising steps should be kept small to make sure that the model can converge during training (10 steps in our case). However, so few steps mean a weak denoising ability, leading to the poor performance of the diffusion model (residual noise and color deviation). Moreover, for aberration correction of high-resolution panoramic images, the computational overhead is considerable, and the additional denoising steps are unacceptable. The computational overhead of SwinIR and the additional diffusion model is also shown in Table I.1, where the Floating Point Operations (FLOPs) are calculated with an input resolution of $1024 \times 1024$.

In summary, the diffusion model is currently not suitable for PCIE, but it could become a competitive solution in the future if PALHQ is extended into a larger dataset and an efficient inference pipeline is proposed.

Refer to caption
Figure I.1: The qualitative results of the refined SwinIR model with SR3. Due to the small denoising steps in our case, the results of SR3 suffer from residual noise and color deviation.

Appendix J Training Details.

The proposed PART has 19.27M parameters and takes 52 hours to train for 200k iterations with a batch size of 8 on a single A800 GPU. Owing to the dynamic convolution operations in the model, the small amount of training data, and the guidance of the PSF prior, the PSF-aware Transformer can be trained well in a small number of iterations. The training curve in Fig. J.1 shows that the model has converged by the end of training.

Refer to caption
Figure J.1: The training loss curve of the PART.

Appendix K The Analysis of Computational Overhead.

TABLE K.1: The computational overheads of representative methods. The FLOPs, memory cost, and inference latency are calculated with the input resolution of $1024 \times 1024$ on a single A800 GPU.

| Group | Method | PSNR | SSIM | LPIPS | FID | Params | FLOPs | Memory | Latency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | RRDB | 32.716 | 0.9265 | 0.0469 | 3.704 | 16.72M | 588.28G | 0.58GB | 0.15s |
| Baselines | SwinIR | 32.913 | 0.9291 | 0.0446 | 3.670 | 11.97M | 407.76G | 0.65GB | 0.97s |
| Baselines | GRL | 32.369 | 0.9256 | 0.0457 | 4.234 | 20.27M | 649.24G | 1.23GB | 1.37s |
| PSF-aware | RRDB+ | 32.816 | 0.9271 | 0.0456 | 3.746 | 18.61M | 589.75G | 0.83GB | 0.27s |
| PSF-aware | GRL+ | 32.847 | 0.9292 | 0.0454 | 3.627 | 25.60M | 653.43G | 2.11GB | 1.41s |
| PSF-aware | PART | 33.143 | 0.9304 | 0.0435 | 3.571 | 19.27M | 412.08G | 1.90GB | 1.17s |
| PSF-aware | PART-S | 32.812 | 0.9278 | 0.0452 | 3.710 | 3.89M | 77.47G | 1.00GB | 0.73s |

The parameters, FLOPs (Floating Point Operations), memory cost, and inference latency of the PSF-aware methods and their baselines are presented in Table K.1. The FLOPs, memory cost, and inference latency are measured with an input resolution of $1024 \times 1024$ on a single A800 GPU and can be roughly scaled for other input resolutions.

As shown in Table K.1, the PSF-aware mechanisms introduce only negligible additional computational overhead (0.25% to 1.06% of FLOPs), while bringing significant improvements over the baselines. The increase in the number of parameters is mainly due to the prediction of the dynamic convolution kernel of each pixel, which introduces little computational overhead. The high latency is caused by the transformer-based backbones, while the additional latency brought by the PSF-aware mechanisms is small (0.04 s to 0.2 s). Benefiting from the efficiency and effectiveness of the PSF-aware mechanisms, our future work will focus on more efficient backbones, e.g., lightweight SR backbones [85], to achieve lightweight and high-quality panoramic imaging.

Moreover, we also release a lightweight version of PART, i.e., PART-S (with smaller depth and embedding dimension), considering the potential applications of PCIE in mobile and wearable terminals. With only 19.00% of the computational overhead, PART-S achieves performance comparable to the baseline SwinIR.
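The percentages quoted in this appendix can be reproduced from the FLOPs and latency columns of Table K.1 with a few lines of arithmetic. The pairing of each PSF-aware model with its baseline (RRDB+ with RRDB, GRL+ with GRL, PART and PART-S with SwinIR) is inferred here from the reported ranges, and the helper names are hypothetical.

```python
# FLOPs (G) and latency (s) copied from Table K.1.
flops = {"RRDB": 588.28, "RRDB+": 589.75, "GRL": 649.24, "GRL+": 653.43,
         "SwinIR": 407.76, "PART": 412.08, "PART-S": 77.47}
latency = {"RRDB": 0.15, "RRDB+": 0.27, "GRL": 1.37, "GRL+": 1.41,
           "SwinIR": 0.97, "PART": 1.17}
pairs = [("RRDB", "RRDB+"), ("GRL", "GRL+"), ("SwinIR", "PART")]

def extra_flops_percent(base, new):
    """Additional FLOPs of `new` relative to `base`, in percent."""
    return (flops[new] - flops[base]) / flops[base] * 100.0

def extra_latency(base, new):
    """Additional inference latency of `new` over `base`, in seconds."""
    return latency[new] - latency[base]

overheads = {new: round(extra_flops_percent(base, new), 2) for base, new in pairs}
delays = {new: round(extra_latency(base, new), 2) for base, new in pairs}
part_s_ratio = round(flops["PART-S"] / flops["SwinIR"] * 100.0, 2)
```

Under this pairing, the smallest and largest FLOPs overheads come from RRDB+ and PART, the smallest and largest latency increases come from GRL+ and PART, and `part_s_ratio` is the share of SwinIR's FLOPs used by PART-S.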