computer vision
20 October 2024
6 min read

Computer vision in real-world applications

Computer vision benchmarks have become almost embarrassingly easy for modern neural networks. ImageNet top-1 accuracy has crossed 90 percent. Object detection on COCO is faster and more accurate than human annotators under controlled conditions. Yet walk into any real deployment - a construction site, a hospital corridor, the loading dock of a distribution centre - and the gap between benchmark performance and operational reliability becomes immediately apparent.

The reality gap: datasets vs. real environments

Every dataset is a frozen sample of the world at a particular place and time. ImageNet images were scraped from the early web - mostly JPEG-compressed photographs taken in reasonable lighting by people who intended their subjects to be visible. The distribution of lighting conditions, viewpoints, occlusions, and backgrounds in a real industrial or outdoor environment is radically different.

This mismatch, often called domain shift or the reality gap, manifests in predictable ways. A model trained on clean indoor images of household objects fails on the same objects under fluorescent light or partial shadow. A pedestrian detector trained on daytime urban footage degrades at dusk, in rain, or with unusual clothing. Benchmark accuracy, measured on held-out samples from the same distribution as training data, tells you almost nothing about real-world performance.

Closing this gap requires either collecting deployment-domain data at scale - expensive, slow, sometimes impossible - or building representations that are inherently more invariant to irrelevant factors. Data augmentation helps: randomly perturbing brightness, contrast, blur, and geometric transforms during training forces the model to ignore those variations. Domain adaptation techniques go further, explicitly aligning feature distributions between source and target domains without requiring labels in the target domain.
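To make the augmentation idea concrete, here is a minimal pure-Python sketch of a photometric perturbation. The function name, parameter ranges, and flat list of pixel values standing in for an image are illustrative, not any particular library's API; real pipelines use the equivalent routines in torchvision, Albumentations, or similar.

```python
import random

def augment_brightness_contrast(pixels, brightness=0.3, contrast=0.3, rng=None):
    """Randomly perturb brightness and contrast of a flat list of greyscale
    pixel values in [0, 255] (an illustrative stand-in for an image)."""
    rng = rng or random.Random()
    b = rng.uniform(-brightness, brightness) * 255   # additive brightness shift
    c = 1.0 + rng.uniform(-contrast, contrast)       # multiplicative contrast gain
    return [min(255.0, max(0.0, c * p + b)) for p in pixels]

# A fresh random perturbation on every pass forces the model to treat
# brightness and contrast as nuisance variation rather than signal.
image = [0.0, 64.0, 128.0, 255.0]
augmented = augment_brightness_contrast(image, rng=random.Random(0))
```

Geometric transforms (random crops, flips, small rotations) follow the same pattern: sample a perturbation per example, apply it, and let the loss do the rest.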

Spatial and temporal understanding: 6D pose and motion prediction

Recognising that an object is present in a scene is only the first step for most robotics applications. Manipulation requires knowing where the object is in three-dimensional space and how it is oriented - its full 6D pose (three translational and three rotational degrees of freedom). This is substantially harder than 2D detection.

Iterative methods like DeepIM or the more recent FoundPose refine an initial pose estimate through repeated comparison with a rendered model, converging on a precise estimate over several iterations. Direct regression methods predict pose in a single forward pass, trading accuracy for speed. Point cloud-based approaches - operating on depth sensor output rather than RGB images - sidestep some of the texture and lighting sensitivity inherent in appearance-based methods at the cost of requiring depth hardware.
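To make the six degrees of freedom concrete, here is a small pure-Python sketch that assembles a rigid-body transform from three rotation angles and three translation components and applies it to a point. The Euler-angle convention used (Rz·Ry·Rx) is just one of several; real pipelines typically represent rotation with quaternions or rotation matrices from a library rather than hand-rolled trigonometry.

```python
import math

def pose_matrix(rx, ry, rz, tx, ty, tz):
    """Build a 4x4 homogeneous transform from three Euler angles (radians)
    and a translation: the six degrees of freedom of a rigid-body pose."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    # R = Rz @ Ry @ Rx (one common convention; conventions vary by library)
    R = [
        [cz*cy, cz*sy*sx - sz*cx, cz*sy*cx + sz*sx],
        [sz*cy, sz*sy*sx + cz*cx, sz*sy*cx - cz*sx],
        [-sy,   cy*sx,            cy*cx],
    ]
    return [R[0] + [tx], R[1] + [ty], R[2] + [tz], [0.0, 0.0, 0.0, 1.0]]

def apply_pose(T, point):
    """Transform a 3D point by the pose T."""
    x, y, z = point
    return tuple(T[i][0]*x + T[i][1]*y + T[i][2]*z + T[i][3] for i in range(3))

# A 90-degree yaw plus a one-metre translation along x:
T = pose_matrix(0.0, 0.0, math.pi / 2, 1.0, 0.0, 0.0)
p = apply_pose(T, (1.0, 0.0, 0.0))   # (1,0,0) rotates to (0,1,0), then x shifts by 1
```

Estimating these six numbers from an image, accurately enough to close a gripper on the object, is the hard part that the methods above address.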

Temporal understanding adds a further dimension. A robot navigating alongside humans needs to predict not just where people are now but where they are likely to be in the next two seconds. Optical flow provides dense motion estimation between frames. Learned trajectory prediction models condition on observed history to produce probabilistic forecasts over future positions. The interaction between prediction uncertainty and robot motion planning is an area of intense current research, particularly for shared human-robot workspaces.
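The simplest baseline that learned trajectory predictors are measured against is constant-velocity extrapolation. The sketch below is illustrative: the function name, fixed sampling interval, and single-mode output are assumptions, and real predictors produce probabilistic, often multi-modal forecasts.

```python
def constant_velocity_forecast(history, horizon, dt=0.1):
    """Extrapolate a 2D track under a constant-velocity assumption.
    `history` is a list of (x, y) positions sampled every `dt` seconds;
    returns predicted positions for `horizon` future steps."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    return [(x1 + vx * dt * k, y1 + vy * dt * k) for k in range(1, horizon + 1)]

# A pedestrian walking at ~1.5 m/s along x, forecast two seconds ahead:
track = [(0.0, 0.0), (0.15, 0.0)]
future = constant_velocity_forecast(track, horizon=20, dt=0.1)
```

The gap between this baseline and a learned model is largest exactly where it matters most: turns, stops, and interactions with other agents.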

Simulation breakthroughs closing the gap

The cost of acquiring labelled real-world data has driven substantial investment in photorealistic simulation. NVIDIA Isaac Sim, Blender-based pipelines, and game-engine-derived environments can now render scenes that are difficult to distinguish from photographs under close inspection - and they can do so with automatic, perfect ground-truth annotation for depth, surface normals, segmentation masks, and 6D object poses.

Domain randomisation is the complementary technique: rather than making simulation as realistic as possible, randomise every parameter - lighting, texture, camera position, physics properties - across a vast range. Training on this aggressively varied distribution produces representations that are robust to any specific configuration because they have learned to ignore all of it. The underlying object structure becomes the only consistent signal, and it is exactly the signal that transfers to the real world.
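In code, domain randomisation amounts to sampling every scene parameter from a wide distribution before each render. The sketch below is a hypothetical configuration sampler: the parameter names and ranges are illustrative placeholders, and in a real pipeline each value would drive the renderer's own API (Isaac Sim, Blender, or a game engine).

```python
import random

def sample_scene_params(rng):
    """Sample one randomised rendering configuration. Every parameter the
    model should learn to ignore gets drawn from a deliberately wide range."""
    return {
        "light_intensity": rng.uniform(100.0, 2000.0),   # lux
        "light_azimuth_deg": rng.uniform(0.0, 360.0),
        "texture_id": rng.randrange(5000),               # random distractor texture
        "camera_distance_m": rng.uniform(0.3, 3.0),
        "camera_pitch_deg": rng.uniform(-60.0, 60.0),
        "friction": rng.uniform(0.1, 1.5),               # physics property
    }

rng = random.Random(42)
configs = [sample_scene_params(rng) for _ in range(10_000)]  # one per training scene
```

Because no single configuration repeats, the only thing consistent across all ten thousand renders is the object itself.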

The combination of photorealistic simulation and domain randomisation is not a solved problem - domain gaps remain - but the practical sim-to-real transfer results published in recent years would have seemed implausible a decade ago. Policies trained entirely in simulation are being deployed on physical robots in unstructured environments with meaningful success rates.

Speed vs. accuracy trade-offs in deployment

Academic benchmarks optimise for accuracy. Deployment optimises for the product of accuracy, latency, power consumption, and hardware cost. These objectives are in fundamental tension, and navigating that tension is where most of the engineering work in applied computer vision actually happens.

A surgical robot can afford relatively high latency - human surgeons operate on timescales of hundreds of milliseconds. An autonomous vehicle has a hard latency budget set by physics: at 120 km/h, 100 ms of processing delay corresponds to more than three metres of travel. An embedded drone may have a power budget of two watts for all computation, constraining not just the model architecture but the precision of arithmetic.
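The latency arithmetic is worth making explicit: the distance covered during a processing delay is simply speed times time.

```python
def latency_travel_distance(speed_kmh, latency_ms):
    """Distance travelled during a processing delay: d = v * t,
    converting km/h to m/s and milliseconds to seconds."""
    return (speed_kmh / 3.6) * (latency_ms / 1000.0)

d = latency_travel_distance(120, 100)   # metres travelled in 100 ms at 120 km/h
```

At 120 km/h this comes to roughly 3.3 metres, which is why the perception stack's end-to-end latency, not just the model's FLOP count, is the number that matters.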

Model compression techniques - pruning, quantisation, knowledge distillation - reduce model size and inference time with controlled accuracy loss. Neural architecture search can find efficient architectures that occupy better points on the Pareto frontier of accuracy versus efficiency than hand-designed networks. Hardware-aware design, co-optimising model and inference hardware together, represents the current frontier of deployment efficiency.
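Magnitude pruning, the simplest of the compression techniques mentioned, can be sketched in a few lines. Treat this as illustrative: real pipelines prune structured groups of weights (channels, heads) so the hardware actually benefits, and fine-tune afterwards to recover accuracy.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of a flat weight list.
    Ties at the threshold may zero slightly more than the target fraction."""
    k = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k > 0 else float("-inf")
    return [0.0 if abs(w) <= threshold else w for w in weights]

# Pruning half of a toy weight vector keeps only the large-magnitude entries:
pruned = magnitude_prune([0.9, -0.05, 0.4, 0.01, -0.7, 0.02], sparsity=0.5)
```

Quantisation and distillation attack the same accuracy-versus-cost trade-off from different directions: fewer bits per weight, and a smaller student network trained to mimic a larger teacher.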

Continuous learning in the field

A model deployed today will encounter distribution shift tomorrow. Lighting conditions change with seasons. New product variants arrive on the factory floor. Human clothing and behaviour evolve. A static model that cannot adapt will degrade over time; the question is whether the degradation is gradual and detectable or sudden and catastrophic.

Continual learning - updating a deployed model on new data without catastrophic forgetting of old capabilities - is one of the most challenging open problems in the field. Elastic weight consolidation, progressive neural networks, and replay-based methods each offer partial solutions with different computational and memory requirements. None provides a complete answer for the open-world, open-ended deployment scenarios that real robots inhabit.
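Replay-based methods are the easiest of the three to sketch: keep a bounded buffer of past examples and rehearse them alongside new data so old capabilities are revisited rather than overwritten. The reservoir-sampled buffer below is a minimal illustration, not any specific published method.

```python
import random

class ReplayBuffer:
    """Reservoir-sampled buffer of past training examples. `add` keeps each
    example seen so far with equal probability once the buffer is full."""
    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = rng or random.Random()

    def add(self, example):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            # Reservoir sampling: replace a random slot with probability capacity/seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = example

    def sample(self, n):
        return self.rng.sample(self.buffer, min(n, len(self.buffer)))

buf = ReplayBuffer(capacity=100, rng=random.Random(0))
for example in range(1000):
    buf.add(example)
batch = buf.sample(32)   # mixed into each new training batch alongside fresh data
```

The memory cost is the obvious catch: the buffer competes for the same storage budget as everything else on the device.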

The practical solution for many deployments is human-in-the-loop correction: flag low-confidence predictions for human review, collect the corrected labels, retrain periodically. This is operationally expensive and introduces latency between encountering a new pattern and adapting to it. Closing that loop more tightly - ideally toward online adaptation in real time - is where the field is heading, even if it has not yet arrived.
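The first step of that loop - routing low-confidence predictions to a review queue - is straightforward to sketch. The threshold, labels, and function name below are hypothetical; in practice the operating point is tuned per deployment against the cost of review versus the cost of an error.

```python
def triage_predictions(predictions, threshold=0.8):
    """Split model outputs into auto-accepted and flagged-for-review.
    `predictions` is a list of (label, confidence) pairs."""
    accepted, review_queue = [], []
    for label, conf in predictions:
        (accepted if conf >= threshold else review_queue).append((label, conf))
    return accepted, review_queue

# Illustrative detections from a loading-dock camera:
preds = [("pallet", 0.97), ("person", 0.55), ("forklift", 0.82), ("person", 0.40)]
accepted, review_queue = triage_predictions(preds)
```

The corrected labels coming back from the review queue become the retraining set; shortening the time between flagging and retraining is the whole game.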