Photometric Stereo Depth Reconstruction

1. Background and motivation

Traditional bone registration relies on manually selecting optical markers and CT fiducials. This is slow and error-prone. The goal here is to recover a dense depth map of the surgical surface directly from images, so the depth map can be registered to a pre-operative CT scan automatically.

Commercial inspiration: MedicalTek's MonoStereo system in Taiwan converts 2D endoscope video into 3D in real time using a single-image inverse-square + Lambertian model plus learned priors. The classical multi-light version of the same idea, photometric stereo, recovers true surface normals algebraically from three or more images, then integrates them into a depth map. That is the technique implemented here.

2. Physical model

2.1 Lambertian reflection

A matte surface scatters incoming light equally in all directions. Its observed brightness depends only on the angle between the surface normal and the light direction:

I = ρ · L · cos(θ)

where ρ is albedo (surface reflectivity), L is light intensity, and θ is the angle between surface normal n and light direction L. This is Lambert's cosine law.

2.2 Inverse-square falloff

For a point light source, intensity at the surface drops with the square of distance because the same energy spreads over a sphere of area 4πr²:

I ∝ 1 / r²

2.3 Combined endoscope model (single light, coaxial)

When the light source is on the camera axis (e.g. endoscope tip), the two combine:

I(p) = ρ · L · cos(θ) / r²

Inverting this for r gives the single-image depth estimator. Useful but limited: it conflates ρ, cos(θ), and r² into a single brightness number.

2.4 Photometric stereo (Woodham 1980)

With three images of the same surface lit from three known different directions, the per-pixel equation system becomes:

I_k = ρ · (n · L_k)    for k = 1, 2, 3

Stacking into matrix form:

[I₁]   [L₁ᵀ]
[I₂] = [L₂ᵀ] · (ρn)
[I₃]   [L₃ᵀ]

Let g = ρn. Then g = L⁻¹ · I, computed once per pixel by a single matrix multiply. From g:

Albedo: ρ = |g|
Surface normal: n = g / |g|

Critical advantage over the single-image method: distance does not appear in these equations. Photometric stereo recovers the surface orientation directly, with no inverse-square assumption.

2.5 Recovering depth from normals

Surface normals give the gradient of the depth function, not depth itself. The relationship (with nz > 0 pointing toward the camera and z increasing away from camera):

dz/dx_image = +nx / nz
dz/dy_image = −ny / nz   ← sign flip: image y points down, world Y points up

Integrate these gradients across the image to recover z(x, y) up to an additive constant. Sign convention bit us during development — see section 6.

2.6 Absolute scale anchoring (coaxial light)

Pure photometric stereo produces depth up to an additive offset. To get real centimeters, take one additional image with a coaxial light source (light on the camera axis) and use the inverse-square law at the brightest pixel:

r_near = sqrt(ρ · L / I_brightest)

This single measurement fixes the offset for the entire depth map.

3. Pipeline

The implementation has five stages:

Render or capture three lit images of the same scene with three different known light directions.
Solve photometric stereo per pixel: g = L⁻¹ · I → albedo, surface normal.
Integrate surface normals into a relative depth map (weighted least-squares Poisson solve).
Calibrate absolute scale using a coaxial-light reference image (inverse-square at the brightest point).
Visualize as a heatmap with numerical labels in cm.

Figure: Stage 1 — the three input images of the same sphere lit from three different known directions.

4. Code structure (photometric_stereo_sphere.py)

4.1 `make_sphere_scene(size, radius_cm, center_dist_cm, albedo, light_dirs)`

Generates a synthetic test scene:

A sphere of radius 1.5 cm centered at depth 4.0 cm from the camera
300×300 image, ~60 pixels per cm
Three lit images using directional lights at [−1, 1, 1.5], [1, 1, 1.5], [0, −1, 1.5] (unit-normalized)
Returns: list of RGB images, ground truth depth map, ground truth normals, foreground mask, light direction matrix

4.2 `solve_photometric_stereo(images, light_dirs, mask)`

The core Woodham step. Stack three luminance images into a (3, N) matrix where N = H × W. Compute L⁻¹ once (3×3 inverse), then G = L⁻¹ · I_stack recovers g = ρn for every pixel in a single matrix multiply. Split into albedo = |g| and normal = g / |g|.

Figure: Stage 2 — recovered surface normals vs ground truth, plus the recovered albedo (uniform gray, as expected for a uniform-reflectivity sphere).

4.3 `integrate_normals_to_depth(normals, mask, cm_per_pixel)`

Sparse weighted least-squares Poisson solver. For each masked pixel, write two equations:

nz · (z[y, x+1] − z[y, x]) = +nx · cm_per_pixel
nz · (z[y+1, x] − z[y, x]) = −ny · cm_per_pixel

The nz weighting downweights silhouette pixels where nz ≈ 0 and the slope estimate nx/nz would otherwise blow up. Anchor one pixel to 0 to make the system non-singular. Build as a sparse CSR matrix, solve with scipy.sparse.linalg.lsqr.

4.4 `calibrate_scale_with_coaxial(z_relative, mask, coaxial_image, albedo, light_power_cm2)`

Fixes the additive offset. Take the top 0.1% brightest pixels in the coaxial image (where cos θ ≈ 1, so they are the nearest points). Average their red-channel intensity → r_near = sqrt(ρL / I). Set the offset so the integrated depth at those bright pixels equals r_near. Also handles sign flip if photometric stereo returned a flipped-orientation solution.

4.5 `render_coaxial(normals, mask, albedo_map, light_power_cm2, z_true, noise)`

Helper that synthesizes the calibration image used by calibrate_scale_with_coaxial. For real hardware, this image comes from the actual scope.

4.6 Visualization helpers

annotate_sphere, labeled, and the figure-generating code in main() produce five output PNGs with depth labels, comparison plots, and a 3D surface rendering.

5. Light direction calibration

In the synthetic demo, light directions are hardcoded because we both generate the images and solve them. In a real rig you must measure them. Four standard methods:

5.1 Chrome sphere calibration (gold standard)

Place a shiny ball of known size in the scene. The specular highlight on the ball pinpoints the light source position. For a highlight at pixel (x, y) on a sphere with center (cx, cy) and radius r pixels:

nx = (x − cx) / r
ny = −(y − cy) / r       ← image-y flip
nz = sqrt(1 − nx² − ny²)
L = 2(n · V)n − V        ← mirror reflection of view direction V = [0, 0, 1]

Do this once per light. Re-calibrate whenever any light moves.

5.2 Known-geometry object

Image a calibration target whose shape you know. Solve for the L that best explains the observed brightness pattern. A 3-parameter nonlinear least-squares fit.

5.3 Direct physical measurement

If the lights are fixed in a rig (LED ring, scope tip), measure their positions in millimeters from a reference point and convert to camera coordinates. This is what surgical scopes effectively do — LED positions are known from manufacturing.

5.4 Uncalibrated photometric stereo

Solve for lights and surface jointly. Suffers from the generalized bas-relief ambiguity — multiple (lights, surface) pairs explain the same images. Needs additional constraints (integrability, prior on one normal, multiple objects). Active research area.

5.5 Geometric requirement

The three light directions must form a linearly independent set for L⁻¹ to exist. Lights all roughly above the object → L becomes near-singular → solver explodes on noise. Recommended configuration: roughly 120° apart in azimuth, all tilted 30–45° above the horizon.

6. Bugs encountered and fixed

6.1 Unit mismatch in integration (factor of ~60,000 error)

Symptom: Integrated depth span was 153 cm when the true span was 1.5 cm.

Cause: Wrote forward differences as z[y, x+1] − z[y, x] = −nx/nz (unitless slope) instead of = −nx/nz · cm_per_pixel. The integration accumulated slope × pixel-count, producing depth in “pixels” instead of cm.

Fix: Multiply gradient RHS by cm_per_pixel everywhere. For our 300-pixel image with 1.5 cm sphere radius spanning 0.3 × W pixels, cm_per_pixel = 1.5 / 90 ≈ 0.0167.

6.2 RGB vs luminance in coaxial calibration

Symptom: Coaxial calibration produced r_near = 3.09 cm when true depth at brightest pixel was 2.52 cm.

Cause: Computed luminance with weighted RGB sum (0.15R + 0.70G + 0.15B), but the tissue tint applied to the rendered image scaled G and B by 0.55. So luminance was 0.6175× the raw intensity. The calibration was applying sqrt(ρL / I_luminance) instead of sqrt(ρL / I_raw).

Fix: Use the red channel directly for coaxial calibration. R preserves the raw intensity in our tissue-tinted rendering.

6.3 Sign convention on y-axis (factor of 2 shape error)

Symptom: Even with true normals, integration produced span 3.45 cm vs true 1.5 cm — the shape was stretched.

Cause: Textbook photometric stereo writes dz/dx = −nx/nz, dz/dy = −ny/nz for a coordinate system where z increases toward the camera and y increases upward. Our convention has z increasing away from the camera and image y increasing downward. The two sign flips do not cancel — they affect different equations.

Fix: A sign sweep on true normals showed (+nx, −ny) produces the correct shape:

sx=+1, sy=+1: best slope=0.000,  error after best-fit=0.296 cm  ← wrong
sx=+1, sy=−1: best slope=0.995,  error after best-fit=0.013 cm  ← CORRECT
sx=−1, sy=+1: best slope=−0.995, error after best-fit=0.013 cm
sx=−1, sy=−1: best slope=0.000,  error after best-fit=0.296 cm  ← wrong

Settled on dz/dx_img = +nx/nz, dz/dy_img = −ny/nz.

6.4 Silhouette instability

Symptom: Near the sphere silhouette, nz → 0, so −nx/nz and −ny/nz blow up. Tried clamping nz with safe_nz = max(nz, 1e-3) — produced huge spurious gradients at the boundary that corrupted the entire integration.

Fix: Reformulate the gradient equations in their undivided form:

nz · (z[y, x+1] − z[y, x]) = +nx · cm_per_pixel

Now silhouette pixels (nz ≈ 0) contribute negligibly to the least-squares system, exactly proportional to how reliable they are. No clamping needed.

7. Results

Synthetic sphere, radius 1.5 cm, center at depth 4.0 cm:

Metric	Value
True depth range	2.500 – 4.000 cm (span 1.5 cm)
Estimated depth range	2.418 – 3.987 cm
Mean absolute error	0.082 cm (2.75% of mean depth)
Normal recovery error	0.041 (vs unit vectors)

Figure: final depth heatmap with five sample points labeled in cm — nearest pixel ~2.5 cm (deep blue), silhouette edges ~4.0 cm (red/orange).

Figure: estimated depth (left) vs ground truth (right) with a shared colorbar; mean absolute error 0.082 cm.

Figure: reconstructed 3D surface of the visible hemisphere of the sphere.

Reference: the academic photometric-endoscopy paper from Batlle, Montiel & Tardós (arXiv:2204.09083) reports 7% mean error in vivo. Our 2.75% on a synthetic sphere with known lights is in the same ballpark, slightly better as expected because the scene is noise-free and the lights are perfectly known.

8. Limitations and next steps

8.1 What this implementation does well

Recovers absolute-scale depth from three lit images plus a coaxial reference
Robust to silhouette boundaries via weighted Poisson integration
Handles arbitrary masks (works on any object outline, not just spheres)
Real-units output (cm) via the coaxial calibration step

8.2 What it does NOT handle

Specular highlights on wet/shiny surfaces — would need a detection-and-masking pre-step (the earlier single-image version does this; can be ported)
Cast shadows — pixels in shadow violate I = ρ · n · L. Standard fix: detect dark pixels per-image, exclude them from that light's equation, fall back to two-light solve where needed
Non-Lambertian materials — only the diffuse term is modeled
Unknown light directions — current code assumes calibrated lights (see section 5)
Moving lights or moving camera between frames — assumes the three exposures are temporally coregistered

8.3 Next milestones for Lumina Bone

Hardware light calibration — implement chrome-sphere calibration routine (section 5.1) and integrate into capture workflow.
Real image testing — capture three lit images of a bone phantom and run the pipeline end-to-end.
CT registration — implement ICP between the photometric depth map and the pre-op CT surface.
Shadow/specular handling — port the masking code from the single-image pipeline.
Latency optimization — current sparse-solve takes a few seconds at 300×300; for real-time use, port to GPU (cuSPARSE) or use a multigrid Poisson solver.

9. Files in the repository

File	Purpose
`photometric_depth.py`	Single-image inverse-square + Lambertian pipeline. Handles specular masking, edge-aware smoothing, and stereo view synthesis. Input: one image.
`photometric_heatmap.py`	Wrapper around the single-image pipeline that produces labeled heatmaps with cm values.
`photometric_stereo_sphere.py`	Main file. Three-light photometric stereo on a synthetic sphere. Produces five output PNGs including the labeled depth heatmap and 3D surface.

9.1 How to run

pip install numpy scipy matplotlib pillow
python photometric_stereo_sphere.py

Outputs five PNGs to ./outputs/ (path is configurable at the top of main()):

ps_01_inputs.png — the three lit input images
ps_02_normals.png — recovered vs ground-truth surface normals
ps_03_depth_heatmap.png — depth heatmap with numerical labels in cm
ps_04_comparison.png — estimated vs ground-truth depth side by side
ps_05_3d_surface.png — 3D surface plot of the reconstructed sphere

10. References

Woodham, R. J. (1980). Photometric method for determining surface orientation from multiple images. Optical Engineering 19(1).
Batlle, V. M., Montiel, J. M. M., & Tardós, J. D. (2022). Photometric single-view dense 3D reconstruction in endoscopy. arXiv:2204.09083.
Frankot, R. T. & Chellappa, R. (1988). A method for enforcing integrability in shape from shading algorithms. IEEE PAMI 10(4).
MedicalTek (MDTK) MonoStereo product documentation: medicaltek.biz