Case Studies

Real bugs we've found and fixed in paper implementations. Names shared with client permission.

Quick Fix · 4 hours to fix · Percepto AI

NeRF Ray Sampling Bug

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (Mildenhall et al., 2020)

Problem

Novel view synthesis produced blurry artifacts at object boundaries. Training metrics looked correct but visual quality degraded at inference.

Root Cause

The open-source implementation sampled depths along each ray at fixed uniform positions instead of using the stratified sampling with hierarchical refinement described in Section 5.2. Additionally, the near/far bounds were hardcoded instead of being scene-adaptive.

Fix

Implemented proper stratified sampling with coarse-to-fine hierarchy. Added dynamic near/far bound computation based on scene bounding box. Added a regression test comparing ray sample distributions against the paper's Figure 5.
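
For readers reproducing this, here is a minimal sketch of what the corrected sampler does, assuming a PyTorch codebase. The function names, shapes, and the 1e-5 guard are illustrative choices, not the client's actual code; scene-adaptive near/far bounds would be computed from the scene bounding box before calling stratified_depths.

```python
import torch

def stratified_depths(near, far, n_rays, n_samples):
    # NeRF Sec. 5.2: partition [near, far] into n_samples bins and draw one
    # uniform sample per bin, instead of reusing the same fixed depths.
    edges = torch.linspace(0.0, 1.0, n_samples + 1)
    lo = near + (far - near) * edges[:-1]   # lower bin edges
    hi = near + (far - near) * edges[1:]    # upper bin edges
    u = torch.rand(n_rays, n_samples)       # independent per-ray jitter
    return lo + (hi - lo) * u               # depths, shape (n_rays, n_samples)

def hierarchical_depths(bin_edges, coarse_weights, n_fine):
    # Coarse-to-fine refinement: normalize the coarse network's weights into
    # a piecewise-constant PDF over the bins, then inverse-transform sample.
    pdf = coarse_weights / coarse_weights.sum(-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)
    u = torch.rand(*coarse_weights.shape[:-1], n_fine)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, bin_edges.shape[-1] - 1)
    c_lo, c_hi = cdf.gather(-1, idx - 1), cdf.gather(-1, idx)
    frac = (u - c_lo) / (c_hi - c_lo).clamp_min(1e-5)
    b_lo, b_hi = bin_edges.gather(-1, idx - 1), bin_edges.gather(-1, idx)
    return b_lo + frac * (b_hi - b_lo)
```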

Impact

PSNR improved from 26.3 to 31.1 on the Blender dataset, matching paper-reported results.

Deep Diagnosis · 3 days to fix · SynthLab

Diffusion Model Gradient Computation

Denoising Diffusion Probabilistic Models (Ho et al., 2020)

Problem

Generated images had persistent high-frequency noise that didn't decrease with more sampling steps. FID scores were 3x worse than reported.

Root Cause

The noise schedule implementation used a linear beta schedule, while the model variant being reproduced calls for the cosine schedule introduced in the Improved DDPM follow-up. The variance was also computed with off-by-one timestep indexing, so the error compounded over the sampling trajectory.

Fix

Replaced linear beta schedule with cosine schedule from the Improved DDPM paper. Fixed timestep indexing in variance computation. Added validation tests comparing alpha_cumprod values against Table 1 in the paper.
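
A minimal sketch of the corrected schedule, written in NumPy for clarity; the function names are ours, while the 0.008 offset and 0.999 clip follow the Improved DDPM paper.

```python
import numpy as np

def cosine_alpha_cumprod(T, s=0.008):
    # Improved DDPM cosine schedule: alpha_bar(t) follows a squared cosine,
    # normalized so that alpha_bar(0) = 1.
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]                          # shape (T + 1,)

def cosine_betas(T, s=0.008, max_beta=0.999):
    ab = cosine_alpha_cumprod(T, s)
    betas = 1.0 - ab[1:] / ab[:-1]           # beta_t = 1 - ab_t / ab_{t-1}
    return np.clip(betas, 0.0, max_beta)     # clip to avoid blow-up near t = T

# Posterior variance (DDPM Eq. 7): note alpha_bar_{t-1} in the numerator.
# Using alpha_bar_t there is exactly the off-by-one described above.
T = 1000
ab = cosine_alpha_cumprod(T)
betas = cosine_betas(T)
posterior_var = betas * (1.0 - ab[:-1]) / (1.0 - ab[1:])
```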

Impact

FID improved from 35.2 to 11.8 on CIFAR-10, within 5% of paper-reported results.

Deep Diagnosis · 2 days to fix · LangCore

Transformer Attention Scaling

Attention Is All You Need (Vaswani et al., 2017)

Problem

Custom Transformer variant showed training instability after 10K steps. Loss would spike and never recover, despite careful learning rate tuning.

Root Cause

The attention scaling factor was applied after softmax instead of before, causing extreme values in the attention weights. Combined with float16 training, this led to overflow in gradient computation. The reference implementation also had a subtle bug where the causal mask was applied after scaling rather than before.

Fix

Moved scaling factor before softmax as specified in Equation 1 of the paper. Reordered causal mask application. Added numerical stability checks and attention weight distribution logging.
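
A minimal sketch of the corrected ordering, in PyTorch: both the 1/sqrt(d_k) scale and the causal mask hit the logits before the softmax. The float32 softmax is our suggested guard for mixed-precision training, not something from the paper.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # Scaled dot-product attention (Vaswani et al., Eq. 1):
    # softmax(QK^T / sqrt(d_k)) V, with masking done on the logits.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    T = scores.size(-1)
    future = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), 1)
    scores = scores.masked_fill(future, float("-inf"))
    # Under float16 training, computing the softmax in float32 avoids the
    # kind of overflow described above.
    weights = F.softmax(scores.float(), dim=-1).to(v.dtype)
    return weights @ v
```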

Impact

Training stabilized for 100K+ steps. Perplexity on WikiText-103 matched the target of 18.3.

Full Pipeline Review · 8 days to fix · MedScan

Full Pipeline: Vision Transformer for Medical Imaging

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2020)

Problem

ViT fine-tuned on chest X-rays showed 15% lower accuracy than expected from the paper. The team had spent 6 weeks tuning hyperparameters with no improvement.

Root Cause

Multiple compounding issues:

(1) Patch embedding used overlapping patches instead of non-overlapping.
(2) Position embeddings were randomly initialized instead of being interpolated from the pretrained model.
(3) The classification token was appended instead of prepended.
(4) Layer norm was applied in pre-norm style when the pretrained weights used post-norm.

Fix

Comprehensive pipeline audit and 4 targeted fixes aligned with Sections 3.1-3.3 of the paper. Added a paper-alignment test suite that validates each component against expected intermediate shapes and value ranges.
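
Two of the fixes are easy to illustrate. The PyTorch sketch below shows position-embedding interpolation and class-token prepending; the function names and the bicubic mode are our illustrative choices, and the checkpoint layout with the class token's embedding at index 0 is the common convention, assumed here.

```python
import torch
import torch.nn.functional as F

def adapt_pos_embed(pos_embed, new_grid):
    # pos_embed: (1, 1 + old_grid**2, dim) from the pretrained checkpoint,
    # with the [class] token's embedding at index 0.
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pe.shape[1] ** 0.5)
    dim = patch_pe.shape[-1]
    # 2D interpolation of the patch position embeddings (ViT Sec. 3.2's
    # higher-resolution recipe) instead of random re-initialization.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

def with_cls_token(patch_tokens, cls_token):
    # The [class] token is *prepended*, so the head reads position 0.
    return torch.cat([cls_token.expand(patch_tokens.size(0), -1, -1),
                      patch_tokens], dim=1)
```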

Impact

Accuracy improved from 78.2% to 93.1%, exceeding the fine-tuning baseline in Table 5.

Have a similar issue?

Book Free Consultation