You can think of the object being photographed as a collection of light-emitting points. The light emitted from each point is spherically symmetric, at least over the small angle that we see.
If you sample one of those point sources at two points, and the source lies exactly on the perpendicular bisector plane of the segment between the sampling points, then light of the same amplitude and phase will reach both sampling points at the same time. If the source is slightly off that plane, then the matching light will reach the two sampling points at slightly different times. That timing (phase) difference is what lets you disentangle the light coming from different directions.
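The geometry here is easy to check numerically. Below is a minimal sketch (all positions and distances are made up for illustration) that computes the difference in light travel time from a point source to two sampling points: zero when the source sits on the perpendicular bisector plane, nonzero when it is off it.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def arrival_delay(source, p1, p2):
    """Difference in light travel time from `source` to sampling
    points p1 and p2 (positive if p2 receives the wavefront later)."""
    d1 = np.linalg.norm(source - p1)
    d2 = np.linalg.norm(source - p2)
    return (d2 - d1) / C

# Two sampling points 10 km apart on the x-axis.
p1 = np.array([-5_000.0, 0.0, 0.0])
p2 = np.array([+5_000.0, 0.0, 0.0])

# Source on the perpendicular bisector plane (x = 0): equal distances, zero delay.
on_plane = np.array([0.0, 0.0, 1.0e9])
# Source slightly off that plane: a small but measurable delay.
off_plane = np.array([1.0e6, 0.0, 1.0e9])

print(arrival_delay(on_plane, p1, p2))   # 0.0
print(arrival_delay(off_plane, p1, p2))  # nonzero
```

Even a tiny angular offset produces a delay of many wave periods at radio frequencies, which is why the technique is so sensitive to direction.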
I don't know anything about real-world VLBI, but in principle the way you extract an image from the raw data is by simulating a camera. You introduce electromagnetic waves with the strengths you measured at the spacetime locations where you measured them, and simulate them reflecting off a planet-sized mirror (or refracting through a planet-sized lens) and hitting a detector. You can show that light from any point in your source object will be focused to a point on the detector by your mirror/lens: it constructively interferes there and destructively interferes everywhere else on the detector. By linearity, the sum of the light from all the source points forms an image of the source.
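That "simulated camera" idea can be demonstrated in a toy model (this is only a sketch of the principle, not real VLBI processing; the wavelength, aperture size, and source position are all invented). Record the complex field of a monochromatic point source at scattered sampling points, then back-propagate the samples to a candidate source position by conjugating the propagation phase. The terms add coherently only when the candidate matches the true source:

```python
import numpy as np

WAVELEN = 0.001                 # 1 mm observing wavelength (illustrative)
K = 2 * np.pi / WAVELEN         # wavenumber

rng = np.random.default_rng(0)
N = 200
# N sampling points scattered across a 10 km "aperture" in the z = 0 plane.
aperture = np.column_stack([
    rng.uniform(-5_000, 5_000, size=N),
    rng.uniform(-5_000, 5_000, size=N),
    np.zeros(N),
])

source = np.array([0.0, 0.0, 1.0e7])  # point source far above the aperture

def field_at(points, src):
    """Complex amplitude of a spherical wave from src at each point."""
    r = np.linalg.norm(points - src, axis=1)
    return np.exp(1j * K * r) / r

measured = field_at(aperture, source)

def refocus(samples, candidate):
    """Back-propagate the samples to a candidate position by undoing the
    propagation phase; the sum is coherent only at the true source."""
    r = np.linalg.norm(aperture - candidate, axis=1)
    return np.abs(np.sum(samples * np.exp(-1j * K * r)))

on_target = refocus(measured, source)
off_target = refocus(measured, source + np.array([50.0, 0.0, 0.0]))
print(on_target, off_target)  # bright peak at the true source, dim elsewhere
```

Scanning `refocus` over a grid of candidate positions and, by linearity, summing the contributions of many source points is exactly the "image of the source" forming on the simulated detector.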