
I have an image rendered by a simulated camera in Unity. Using this image and the known (virtual) camera pose it corresponds to, I would like to render a new image from another viewpoint. Having read a few sources (mainly Szeliski's book), I know this is possible with a planar homography only if the scene is planar; otherwise one needs a full 3D warp, which requires depth information and multiple reference frames. In my scenario, I assume the camera pose change is so small that the scene/object can be treated as planar.

As an example, I placed a cube into a Unity scene, with the cube center at the world coordinates $(0,0,0)$ (see the scene here). I captured the current image (at camera pose $M_1$), then moved the camera to pose $M_2$ and captured the desired image (which I use as a reference to evaluate the correctness of the warped image), recording both camera poses ($M_1, M_2$). Here is the mathematical model I use to compute the planar homography (theory based on this paper, which is also referenced by the OpenCV tutorial "Homography from the camera displacement"):

To transform a 3D point from camera frame 1 to camera frame 2: \begin{align} \sideset{^{c_2}}{_{c_1}}M = \sideset{^{c_2}}{_{o}}M\cdot \sideset{^{o}}{_{c_1}}M = \sideset{^{c_2}}{_{o}}M\cdot \left(\sideset{^{c_1}}{_{o}}M\right)^{-1}=\begin{bmatrix} \sideset{^{c_2}}{_{o}}R &\sideset{^{c_2}}{_{o}}t \\ 0_{1\times3} & 1 \end{bmatrix}\cdot\begin{bmatrix} \sideset{^{c_1}}{_{o}}R^T &-\sideset{^{c_1}}{_{o}}R^T\,\sideset{^{c_1}}{_{o}}t \\ 0_{1\times3} & 1 \end{bmatrix} = \begin{bmatrix} \sideset{^{2}}{_1}R &\sideset{^2}{_1}t \\ 0_{1\times3} & 1 \end{bmatrix} \end{align}

Then, the Euclidean homography is \begin{align} \sideset{^{2}}{_1}H = \sideset{^{2}}{_1}R + \frac{\sideset{^{2}}{_1}t\cdot n^T}{d} \end{align}

where $n$ is the normal vector of the plane and $d$ is the distance from camera 1 to the plane. A derivation of this using the plane equation is given here.
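For completeness, the derivation is short: for a point $X_1$ on the plane, expressed in camera frame 1, the plane equation gives $n^T X_1 = d$, i.e. $\frac{n^T X_1}{d} = 1$, so

\begin{align} X_2 = \sideset{^{2}}{_1}R\, X_1 + \sideset{^2}{_1}t = \sideset{^{2}}{_1}R\, X_1 + \sideset{^2}{_1}t\,\frac{n^T X_1}{d} = \left(\sideset{^{2}}{_1}R + \frac{\sideset{^2}{_1}t\cdot n^T}{d}\right) X_1 = \sideset{^{2}}{_1}H\, X_1 \end{align}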

Finally, the projective homography is computed using the calibration matrix $K$: $$\mathbf{G} = \gamma \mathbf{K}\mathbf{H}\mathbf{K}^{-1}$$

where $\gamma$ is a scale factor. After normalization (scaling), $\mathbf{G}$ is used to warp the image with the OpenCV function warpPerspective.

My implementation is based on the OpenCV tutorial mentioned above. Here is my code in Python:

import cv2
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.transform import Rotation as R
np.set_printoptions(suppress=True)


def build_camera_matrix(f, w, h, sx, sy):
    """Create a camera intrinsics matrix K using the physical camera parameters from Unity

    Parameters
    ----------
    f : float
        Focal length in mm
    w : int
        Image width
    h : int
        Image height
    sx : float
        Width of the camera sensor in mm
    sy : float
        Height of the camera sensor in mm
    """

    K = np.eye(3)
    # Convert the focal length from millimeters to pixels
    fx = f * w / sx
    fy = f * h / sy
    K[0, 0] = fx
    K[1, 1] = fy
    K[0, 2] = w / 2
    K[1, 2] = h / 2
    return K


if __name__ == '__main__':
    img1_path = 'cube1.png'    # The image to be warped
    img2_path = 'cube2.png'    # Reference image i.e. the warped image should match this.
    img1 = cv2.imread(img1_path)
    img2 = cv2.imread(img2_path)
    cv2.imshow('img1', img1)
    cv2.waitKey(0)
    cv2.imshow('img2', img2)
    cv2.waitKey(0)
    img1_img2_combined = cv2.addWeighted(img1, 0.5, img2, 0.5, 0)
    cv2.imshow('img1 + img2', img1_img2_combined)
    cv2.waitKey(0)

    # Camera intrinsics parameters from Unity physical camera
    f = 25.0
    sx, sy = 25.34, 14.25

    # Image resolution
    w, h = 1920, 1080

    K = build_camera_matrix(f, w, h, sx, sy)
    print("K =\n", K)
 
    # transform.position and transform.rotation from Unity 
    # Quaternion order: (x,y,z,w)
    
    # Translations: y inverted to convert from Unity (left-handed y up) coordinate system to OpenCV (right-handed, y down)
    t1 = np.array([1.0, -1.0, 4.0]).reshape((3, 1))
    t2 = np.array([1.2, -1.0, 4.5]).reshape((3, 1))  # cube 2

    print("t1 =\n", t1)
    print("t2 =\n", t2)

    # Quaternions (q_x and q_z inverted, see the edit at the end of the question)
    q1 = np.array([-0.000, 0.993, 0.122, 0.000])
    q2 = np.array([0.002, 0.994, 0.109, -0.022])

    # Create rotation matrices
    R1 = R.from_quat(q1).as_matrix()
    R2 = R.from_quat(q2).as_matrix()

    print("R1 =\n", R1)
    print("R2 =\n", R2)

    # Normal vector
    n = np.array([0, 0, 1]).reshape((3, 1))
    n1 = R1 @ n
    print("n1 =\n", n1)

    # d = distance from the plane to t1
    d = n1.T.dot(t1)
    print("d =\n", d)

    R12 = R2 @ R1.T
    t12 = R2 @ (-R1.T @ t1) + t2
    print("R12 =\n", R12)
    print("t12 =\n", t12)

    # Compute homography
    H_euc = R12 + ((t12 @ n1.T) / d)
    H = K @ H_euc @ np.linalg.inv(K)

    # Normalize
    H_euc /= H_euc[2, 2]
    H /= H[2, 2]

    print("Euclidean Homography:\n", H_euc)
    print("Homography from absolute camera poses:\n", H)

    # Warp the current image 
    img1_warp = cv2.warpPerspective(img1, H, (w, h), flags=cv2.INTER_CUBIC)

    cv2.imshow('Warped img1', img1_warp)
    cv2.waitKey(0)
     
    # Overlay the warped image to the reference image
    img2_overlay_img1_warped = cv2.addWeighted(img2, 0.5, img1_warp, 0.5, 0)
    cv2.imshow('img2 + warped img1', img2_overlay_img1_warped)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
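For reference, the relative pose (R12, t12) can also be obtained by composing full 4x4 homogeneous matrices, which serves as a sanity check on the pose composition above; the helper below is only illustrative and not part of the script:

import numpy as np

def relative_pose(R1, t1, R2, t2):
    """Compute (R12, t12) of c2_M_c1 = c2_M_o @ inv(c1_M_o) using 4x4 homogeneous matrices."""
    M1 = np.eye(4)                       # c1_M_o
    M1[:3, :3] = R1
    M1[:3, 3] = t1.ravel()
    M2 = np.eye(4)                       # c2_M_o
    M2[:3, :3] = R2
    M2[:3, 3] = t2.ravel()
    M12 = M2 @ np.linalg.inv(M1)         # c2_M_c1
    return M12[:3, :3], M12[:3, 3].reshape((3, 1))

# This should agree with R12 = R2 @ R1.T and t12 = R2 @ (-R1.T @ t1) + t2 in the script above.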

Implementation notes:

Since Unity uses a left-handed coordinate system (y pointing up) and OpenCV a right-handed one (y pointing down), I invert the sign of the y-coordinate of the translation vector and convert the quaternion accordingly, as suggested here (see also the edit at the end of the question).

To get the calibration matrix, I use a physical camera in Unity and compute the focal length in pixels as follows:

$$ f_x = \frac{f \cdot w}{s_x}, \qquad f_y = \frac{f \cdot h}{s_y} $$

where $f$ is the focal length in mm; $s_x$, $s_y$ are the width and height of the (virtual) image sensor in mm; and $w$, $h$ are the width and height of the image in pixels. Since my image resolution is $1920\times1080$ (aspect ratio 16:9), I set $s_x=25.34$ and $s_y=14.25$ to obtain an image sensor with the same aspect ratio. Otherwise, the Unity camera applies some distortion to fit the resolution gate to the "film gate", depending on how the "gate fit" property is set. By giving both gates the same aspect ratio, I avoid dealing with the scaling caused by the gate fit. In short, the values 25.34 and 14.25 could be any other pair that yields the same aspect ratio; for simplicity, I just picked the values of a random (real) camera listed in this table on Wikipedia.
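For concreteness, with $f=25$ mm, $s_x=25.34$ mm, $s_y=14.25$ mm and a $1920\times1080$ image, this gives approximately

$$ f_x = \frac{25 \cdot 1920}{25.34} \approx 1894.2 \text{ px}, \qquad f_y = \frac{25 \cdot 1080}{14.25} \approx 1894.7 \text{ px}, \qquad c_x = 960, \qquad c_y = 540. $$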

Problems and questions:

Unfortunately, with this approach I get a warped image that is quite different from the desired image (corresponding to camera pose $M_2$). Below you can see the current and desired images overlaid, and the overlay of the desired image with the warped image. Clearly, the warped image is way off target.

Left: Current+desired, right: desired+warped

After checking the implementation many times, I started to think that something in my assumptions or understanding might be wrong. Hence, my questions are:

  1. Does it make sense to set the parameters of the camera intrinsics matrix $K$ as described above?
  2. Is there something I'm missing when converting between the two coordinate systems? I'm quite certain that the $y$ coordinate of the translation vector should be inverted, but I have some doubts about converting the quaternion from one coordinate system to the other (although I checked this).

EDIT on Question 2: After reading the linked answer more carefully, I realized that the rotation direction around the axes must also be inverted, in addition to inverting the $y$ coordinate. So a quaternion $\mathbf{q}=(q_x,q_y,q_z,q_w)$ becomes $\mathbf{q'}=(-q_x,q_y,-q_z,q_w)$. $q_y$ keeps its sign because it is negated twice: once to account for the opposite direction of the y-axis between the two coordinate systems, and once to account for the reversed rotation direction. I have updated the code accordingly.
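To make the conversion rule explicit, here is a minimal sketch (the function name unity_pose_to_opencv is only illustrative and not part of the script above):

import numpy as np
from scipy.spatial.transform import Rotation as R

def unity_pose_to_opencv(position, quaternion):
    """Convert a Unity pose (left-handed, y up) to the OpenCV convention (right-handed, y down).

    position   : (x, y, z) from transform.position
    quaternion : (q_x, q_y, q_z, q_w) from transform.rotation
    """
    x, y, z = position
    q_x, q_y, q_z, q_w = quaternion
    t = np.array([x, -y, z]).reshape((3, 1))   # invert the y coordinate
    q = np.array([-q_x, q_y, -q_z, q_w])       # invert q_x and q_z (axis flip + reversed rotation direction)
    return t, R.from_quat(q).as_matrix()

# Presumably the Unity pose behind t1/q1 in the script above:
# t1, R1 = unity_pose_to_opencv((1.0, 1.0, 4.0), (0.000, 0.993, -0.122, 0.000))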

  • It looks like the worst part of the error in your warped image is an offset. I would suggest you try correcting for that offset by hand with a few images and see what you get. I suspect you're going to be disappointed in your quest to do it this way -- if you get it working to your satisfaction with simple images, set up a scene with two or three objects, overlapping and at different depths, and pan through it. Even for small angles I think you'll see a dramatic -- and dramatically disappointing -- difference.
    – TimWescott
    Commented Sep 15, 2021 at 15:11
  • This won't work with objects at different depths; the assumption behind planar homography is that the scene/object is planar. Of course, a cube has a certain depth, but in my app there is a human model, which is much closer to being planar, and only small camera movements are involved. Anyway, larger camera movements would cause other problems like disocclusions etc. I will try shifting manually for a few images to see whether the offset is consistent and what other distortion remains after the offset is corrected (see the sketch below). What causes this offset is still not clear to me, though. Commented Sep 15, 2021 at 15:32
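A minimal sketch of the manual shift mentioned in the comments, reusing img1, H, w and h from the script in the question (dx and dy are hypothetical offsets to be tuned by eye, not measured values):

import cv2
import numpy as np

# Hand-picked pixel offset (example values only)
dx, dy = 40, -15

# Compose the offset with the homography so that a single warp applies both
T = np.array([[1, 0, dx],
              [0, 1, dy],
              [0, 0, 1]], dtype=np.float64)
img1_warp_shifted = cv2.warpPerspective(img1, T @ H, (w, h), flags=cv2.INTER_CUBIC)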
