Saturday, March 7, 2026

Two Agents, One Codebase: An F1 Race Team Approach to Porting ACOLITE to Rust

In Formula 1, every team fields two drivers. Not as a backup plan – as a strategy. One driver pushes the pace, forcing rivals to respond. The other holds position, manages tyres, and covers the alternative strategy. They share telemetry, they share a garage, but they are running different races on the same track. The team wins when both cars score points, not when one driver tries to do everything.

Porting a scientific Python codebase to Rust feels remarkably similar. You need the aggressive driver – the one who charges into unfamiliar code and lays down fast laps of Rust implementation. And you need the calculating driver – the one who reads the data, watches for degradation, and calls out when the numerical precision is drifting. Two AI coding agents, paired like Norris and Piastri, sharing a codebase but operating on different parts of the problem.

The Starting Grid: Why Rust for ACOLITE?

ACOLITE is RBINS’ atmospheric correction toolkit for aquatic remote sensing. It handles everything from Landsat and Sentinel-2 to hyperspectral sensors like PACE OCI (286 bands) and PRISMA (239 bands). The Dark Spectrum Fitting (DSF) algorithm is elegant – image-based, no external atmospheric inputs – but in Python, processing a full PACE scene involves reading 291 NetCDF variables, interpolating multi-dimensional LUTs, and correcting each pixel’s reflectance through a chain of gas transmittance, Rayleigh scattering, and aerosol models. On a decent machine, this takes around 230 seconds.

The seed was planted at FOSS4G 2025 in Auckland when Leo Hardtke ran a tutorial on Earth Observation processing with Rust. It was plagued by Nix environment issues (as I noted in my conference write-up), but when the code ran, it was fast. Zero-cost abstractions and fearless concurrency are not just slogans at that point – they are wall-clock seconds you are not spending waiting for your atmospheric correction to finish.

I had also been watching Rob Woodcock’s acolite-mp branch, which tackled the same performance problem from within Python. His approach was clever: per-band parallelism with memory budgets tuned to cloud CPU-to-RAM ratios (2, 4, or 8 GiB per core), replacing NumPy’s interpolation with the multithreaded pyinterp, and carefully managing the GIL contention that Python’s threading model inflicts on you. He got Sentinel-2 from 791s down to 197s and Landsat from 312s to 99s on a 24-core i9 – roughly a 3-4x speedup.

But the GIL is still there. The memory model is still Python’s. And as Rob himself noted, “further performance improvements are possible but require more extensive changes to the file handling” and “there is a fair amount of GIL contention which limits threading being caused by some structural choices in the implementation.” At some point, you are fighting the language rather than the problem.

Rust sidesteps all of this. No GIL. No garbage collector. Rayon gives you data-parallel iterators that map across bands or tiles with work-stealing. Memory usage is deterministic and known at compile time – you can profile it statically before deploying, which is a sentence that makes no sense in Python-land but is table stakes in systems programming.

The Pit Crew: Two Agents via ACP

Here is where the teammate analogy really kicks in. In F1, a team with only one driver is not half a team – it is no team at all. You cannot run a split strategy with a single car. You cannot use one driver to hold up a rival while the other pulls a gap. The performance of the pair exceeds the sum of the individuals because they create options that a solo driver simply cannot.

Porting 40,000+ lines of scientific Python to Rust is the same. A single AI agent writing Rust will drift – the implementation slowly diverging from Python’s numerical behaviour until your reflectance values are off by just enough to be scientifically useless. You need the second driver to keep it honest.

The solution I landed on was a multi-agent orchestration harness using the Agent Client Protocol (ACP), a JSON-RPC 2.0 protocol over NDJSON stdio that lets coding agents communicate in a structured way:

Agent     Role                                                                  F1 Equivalent
Kiro      Executor – writes Rust code, runs tests, reads files                  Lead driver – pushes the pace, sets fast laps
Copilot   Proposer – reviews output, suggests next steps, cross-checks Python   Second driver – covers the strategy, watches the gaps
Human     Approver – filters proposals before dispatch                          Team principal – makes the call on when to pit

The workflow per sensor port looks like this:

  1. Human provides --task to the orchestrator (tools/agent_harness.py)
  2. Kiro receives the task via ACP session/prompt and starts writing code
  3. Kiro streams output via session/update chunks
  4. Output goes to Copilot for review against the Python source
  5. Copilot proposes ACTION: lines – “fix the gas transmittance interpolation order”, “the Rayleigh LUT needs pressure stacking”
  6. Human approves or rejects
  7. Approved actions go back to Kiro
  8. Repeat until regression tests pass or maximum cycles reached
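
The wire format is simple enough to sketch. ACP is JSON-RPC 2.0 over NDJSON stdio, and the method name session/prompt is the one from the workflow above – but the exact params payload shown here is an illustrative assumption, not the actual harness schema:

```python
import json

def acp_request(method, params, msg_id):
    """Frame one ACP message as a single NDJSON line:
    a JSON-RPC 2.0 object, newline-terminated, written to stdio."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": msg_id,
        "method": method,
        "params": params,  # payload shape is illustrative
    }) + "\n"

# Step 2 of the workflow: dispatch the approved task to the executor.
line = acp_request("session/prompt", {"prompt": "Port the Sentinel-3 OLCI loader"}, 1)
msg = json.loads(line)  # the agent at the other end of stdio parses it back
```

One object per line is the whole framing story – no length prefixes, no delimiter escaping – which is what makes it easy to pipe between two agent processes.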

This is not vibe coding. This is a two-car team running a split strategy.

Think about how McLaren or Red Bull operate. The lead driver qualifies on pole and sets the pace in clean air. The second driver starts on a different tyre compound, runs a longer first stint, and emerges from the pits into a different part of the field. They are solving complementary problems – one optimises for raw speed, the other for strategic coverage. Neither is redundant.

Kiro is the lead driver. It attacks the Rust implementation aggressively – writing loaders, porting DSF algorithms, wiring up rayon parallelism. It sets fast laps. It also occasionally bins it into the gravel trap by hallucinating a NumPy broadcasting rule that does not exist in ndarray.

Copilot is the second driver. It reads the Python source, cross-references the Rust output, and spots where the gap to parity is growing. “The gas transmittance interpolation order is wrong” is exactly the kind of radio call a second driver makes – not flashy, but it prevents a DNF.

The human is the team principal. You do not override the drivers on every corner, but you make the strategic calls: do we pit now and fix this RMSE regression, or do we push on and address it in the next stint? Is a 0.002 RMSE difference in Sentinel-2 reflectance acceptable? (It is – that is within float32 precision.) When do we switch from tiled DSF to fixed DSF mode for this sensor?

Together, they converge faster than either alone, for the same reason that two cars gathering tyre data in free practice gives the team more information than one car doing twice as many laps.

The Telemetry: Regression Tests Against Real Data

In F1, both drivers generate telemetry. The team overlays their data – braking points, throttle application, cornering speed – to find where one is faster and why. The overlay is the truth. Not the driver’s feeling, not the engineer’s simulation, but what the car actually did on the track.

Regression tests are our telemetry overlay. The Python ACOLITE output is Driver 1’s trace. The Rust output is Driver 2’s. We overlay them pixel-by-pixel, band-by-band, and look at the delta. When the traces diverge, something real has changed and we need to understand whether it is a genuine improvement or an error we need to correct.

There are currently 141 Python regression tests that compare Rust output against Python output pixel-by-pixel across real satellite scenes:

  • Landsat 8/9: 13 regression + 13 Rust-vs-Python + 7 benchmark tests
  • Sentinel-2 A/B: 19 regression + 15 Rust-vs-Python + 9 benchmark tests
  • PACE OCI: 17 regression + 14 Rust-vs-Python + 12 DSF comparison + 12 ROI + 10 full-scene tests

The tolerances are tight. Sentinel-2 achieves RMSE < 0.002 (physics-equivalent). Landsat gets RMSE < 0.02. PACE full-scene (1710 x 1272 pixels x 291 bands) hits mean RMSE of 0.004 with 100% of pixels within 0.05 of Python. Correlation coefficients are R > 0.999 across all sensors.

These are not toy tests on synthetic data. They run against actual L1 scenes downloaded from USGS and NASA. When the tests break, something real is wrong.
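
The overlay itself is plain arithmetic. This is a minimal sketch of the per-band comparison – the function and field names are illustrative, not the actual test code – computing the three numbers the tolerances above are stated in:

```python
import math

def overlay_delta(python_band, rust_band, rmse_tol=0.002, abs_tol=0.05):
    """Pixel-by-pixel overlay of one band: Python reference vs Rust output.
    Returns RMSE, worst-case delta, and the share of pixels within the
    absolute tolerance -- the quantities the regression tests assert on."""
    assert len(python_band) == len(rust_band)
    sq_err, worst, within = 0.0, 0.0, 0
    for p, r in zip(python_band, rust_band):
        d = abs(p - r)
        sq_err += d * d
        worst = max(worst, d)
        within += d <= abs_tol
    n = len(python_band)
    rmse = math.sqrt(sq_err / n)
    return {"rmse": rmse, "max_delta": worst,
            "pct_within": 100.0 * within / n, "pass": rmse < rmse_tol}
```

In the real suite this runs per band across full scenes; the defaults here mirror the Sentinel-2 RMSE bound and the PACE 0.05 absolute bound quoted above.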

The Performance Gap: Where the Seconds Go

Sensor            Scene Size                Rust   Python   Speedup
Landsat 8         62M px x 7 bands          66s    180s     2.7x
Landsat 9         62M px x 7 bands          56s    180s     3.2x
Sentinel-2 A      30M px x 11 bands         52s    182s     3.5x
Sentinel-2 B      30M px x 11 bands         64s    173s     2.7x
PACE OCI (full)   1710 x 1272 x 291 bands   84s    230s     2.7x

The PACE result is particularly satisfying. The key optimisation was switching from 291 per-band NetCDF reads to 3 bulk detector reads, then applying rayon-parallel atmospheric correction across tiles. Load is 12 seconds, AC is 34 seconds, write is 35 seconds. That write phase for a 291-band hyperspectral cube goes to GeoZarr V3 with gzip compression – try doing that in a Python event loop without your memory allocator throwing a tantrum.
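
The tiling structure is easy to sketch, even though the real implementation is Rust with rayon. Here a Python thread pool stands in for rayon's work-stealing iterator, and the per-tile correction is a placeholder for the real gas transmittance / Rayleigh / aerosol chain – every name in this sketch is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def correct_tile(tile):
    """Placeholder per-tile atmospheric correction: the real pipeline
    applies gas transmittance, Rayleigh and aerosol terms per pixel."""
    return [0.9 * v for v in tile]

def parallel_correct(band, tile_size=4, workers=4):
    """Split a band into tiles and correct them in parallel; the thread
    pool stands in for rayon's data-parallel, work-stealing iterator."""
    tiles = [band[i:i + tile_size] for i in range(0, len(band), tile_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        corrected = list(pool.map(correct_tile, tiles))  # order-preserving
    return [px for tile in corrected for px in tile]
```

The structural point is the same in both languages: read big contiguous buffers once, then let independent tiles saturate the cores – rather than paying I/O and dispatch overhead 291 times.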

Energy Efficiency: The Fuel Strategy Nobody Talks About

Here is the part where I get philosophical – and where the F1 analogy turns from metaphor into mirror.

Formula 1 underwent a fuel efficiency revolution in 2014. The FIA introduced hybrid power units, capped fuel flow at 100 kg/hour (monitored 2,200 times per second), and forced teams to extract maximum performance from minimum fuel. The result was not slower cars – it was faster cars that used less. The 2026 regulations go further: fossil carbon is prohibited entirely, the MGU-K will deliver three times the electrical power (350kW vs today’s 120kW), producing up to 1,000 horsepower while burning sustainable fuel. Less fuel, more power. That is not a trade-off – it is an engineering constraint that drives innovation.

The same constraint applies to scientific computing; we just pretend it does not. Cloud computing bills are denominated in dollars, but the underlying unit is energy. Every CPU cycle your atmospheric correction burns is energy drawn from a power grid somewhere. When you are processing continental-scale Sentinel-2 archives or the full PACE ocean colour mission, that energy adds up. Python is the V10 era of scientific computing – glorious, unrestricted, and profligate with resources.

Rust is the hybrid power unit. Its advantage is not just speed – it is energy per unit of work. A 3x speedup roughly translates to using a third of the compute time, which means a third of the energy, a third of the carbon footprint, and a third of your AWS bill. The Rust Foundation and others have pointed to studies showing compiled languages like Rust and C using an order of magnitude less energy than interpreted languages for equivalent workloads. Just as F1 teams discovered that fuel efficiency constraints forced them to build fundamentally better engines, switching to Rust forces you to think about memory layout, allocation patterns, and data flow in ways that Python’s garbage collector lets you ignore – until the bill arrives.

And here is the irony that would make an F1 sustainability officer wince: Earth observation processing is meant to monitor the planet’s health. Burning excess energy to do it is like running your emissions-monitoring car on leaded fuel. F1 recognised that the sport’s 20-car grid is only 1% of its total carbon footprint, but pursued fuel efficiency anyway because the technology trickles down. The same logic applies to EO processing pipelines. The individual savings per scene are modest, but at continental archive scale they compound – just like how F1’s hybrid innovations now power road cars from Ferrari’s SF90 to the electric components in every modern turbo engine.

Static memory profiling makes this tangible. In Rust, I can tell you at compile time that a Sentinel-2 full-scene atmospheric correction will peak at approximately N gigabytes of memory, because the allocations are deterministic. In Python, you find out at runtime – usually when the OOM killer visits your pod. F1 teams know their fuel load to the gram before the formation lap. Rust gives you the same certainty for compute.
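
The back-of-envelope version of that estimate is just arithmetic over the scene dimensions. The buffer-copy count below is an assumed figure for illustration – input TOA, corrected output, scratch – not a measured number from the port:

```python
def peak_memory_gib(width, height, bands, dtype_bytes=4, copies=3):
    """Deterministic peak-memory estimate for a scene's working set.
    'copies' is an assumed count of simultaneous buffers (input TOA,
    corrected output, scratch), here purely for illustration."""
    return width * height * bands * dtype_bytes * copies / 2 ** 30

# The full PACE scene from the benchmark table: 1710 x 1272 px, 291 bands.
pace_gib = peak_memory_gib(1710, 1272, 291)
```

The point is not the specific number but that the calculation is closed-form: in Rust the allocation pattern makes it honest, whereas in Python the interpreter and garbage collector add a fudge factor you only discover at runtime.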

Kubernetes 1.35 and Vertical Pod Autoscaling

This deterministic memory behaviour dovetails nicely with Kubernetes 1.35’s improvements to Vertical Pod Autoscaler (VPA). VPA watches your pod’s actual resource usage and adjusts CPU and memory requests/limits accordingly. When your workload has predictable resource usage – as Rust workloads tend to – VPA converges quickly to the right allocation instead of oscillating between OOM kills and wasted headroom.

For a processing pipeline that ingests satellite scenes of varying sizes (a Landsat scene is 62 million pixels across 7 bands; a PACE scene is 2.2 million pixels across 291 bands), VPA can right-size pods per sensor type. Rust’s static memory profile means the VPA recommendations stabilise fast, which means tighter bin-packing, which means more scenes processed per node, which means lower cost per scene.

Compare this to Python pods where memory usage is non-deterministic, garbage collection spikes are unpredictable, and the VPA has to overprovision to avoid OOM. The 2 GiB/core cloud ratio that Rob’s acolite-mp was carefully designed around becomes less of a constraint when your language does not waste half of it on interpreter overhead.

Out-of-Band Development: Preventing Merge Conflicts with Upstream

One design decision I am particularly happy with is keeping the Rust port on a separate feature branch (feature/rust-port) and treating it as out-of-band from the Python codebase. ACOLITE upstream is actively maintained by Quinten Vanhellemont at RBINS, with regular additions of new sensors, algorithm refinements, and bug fixes. A traditional “rewrite in Rust” approach would create an immediate fork that diverges with every upstream commit.

Instead, the Rust code lives in src/, benches/, and tests/ directories that do not exist in upstream Python ACOLITE. The Python code in acolite/ stays untouched. The regression tests are the synchronisation mechanism – they import both the Python ACOLITE modules and the compiled Rust binary, run the same scene through both, and compare outputs.

When upstream adds a new sensor or changes a gas transmittance coefficient, the regression tests fail in the Rust port. That failure is the trigger: it goes into the agent harness as a --task, Kiro investigates the numerical difference, Copilot cross-references the upstream commit, and the fix lands in Rust without touching a single Python file. No merge conflicts. No rebasing nightmares. Just tests that enforce parity.

This is how you keep an acceleration layer in sync with a moving target – you do not try to merge them. You test them against each other.

What Is Next: The Gap to Full Sensor Parity

The roadmap has the current state at 48 Rust tests, 141 Python regression tests, and three sensors fully validated (Landsat 8/9, Sentinel-2 A/B, PACE OCI). The architecture – loader, AC, writer – is clean and extensible. But three sensors out of 30+ is a qualifying lap, not a race win. Here is what closing the gap to full ACOLITE parity actually looks like.

Sensor Coverage: 3 down, 30+ to go

Python ACOLITE supports a sprawling constellation of sensors. The Rust port has ticked off the three highest-priority ones, but the remaining fleet breaks into tiers:

Tier 1 – Near-term (shared loader patterns exist):

Sensor            Bands   Loader Type   Blocker
Sentinel-3 OLCI   21      NetCDF        Sensor def exists, needs full pipeline
PRISMA            239     HDF5          Shares pattern with PACE
DESIS             235     HDF5          Shares pattern with PACE
EnMAP             224     HDF5          Shares pattern with PACE
EMIT              285     NetCDF        Similar to PACE OCI

These are the low-hanging fruit. The PACE port proved out the NetCDF and hyperspectral GeoZarr writer path; PRISMA/DESIS/EnMAP share the HDF5 loader pattern. Each is a well-scoped --task for the agent harness – Kiro writes the loader and wires up the AC pipeline, Copilot validates against Python output on a reference scene.

Tier 2 – Medium-term (new loader work required):

Sensor                         Bands   Notes
Landsat 5 TM / 7 ETM+          7-8     Older calibration metadata formats
PlanetScope (Dove/SuperDove)   4-8     Commercial format, GeoTIFF based
WorldView-2/3                  6-29    Multi-resolution, pan-sharpening
Pleiades                       5-7     DIMAP format
QuickBird-2                    5       Legacy but still used
VIIRS (NPP/J1/J2)              22      Swath-based HDF5, three platforms
Aqua/Terra MODIS               36      HDF4/HDF-EOS
GOCI-2                         12      Korean ocean colour mission

Each of these needs a dedicated loader – different metadata formats, different calibration approaches, different file layouts. The atmospheric correction core (DSF, gas transmittance, Rayleigh, aerosol models) is shared, but getting the radiometrically calibrated top-of-atmosphere reflectance array into the pipeline is the per-sensor work.
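
Structurally, each new sensor is a loader behind a common boundary, with the shared AC core consuming a uniform TOA reflectance array. A Python sketch of that boundary (the Rust port presumably expresses it as a trait; every name here is illustrative):

```python
from abc import ABC, abstractmethod

class SensorLoader(ABC):
    """Per-sensor boundary: metadata formats and radiometric calibration
    differ per sensor; the shared AC core consumes a uniform output."""

    @abstractmethod
    def read_metadata(self, path: str) -> dict:
        """Parse the sensor's own metadata format (MTL, DIMAP, HDF attrs...)."""

    @abstractmethod
    def toa_reflectance(self, path: str) -> list:
        """Return calibrated top-of-atmosphere reflectance, ready for AC."""

class DemoLoader(SensorLoader):
    """Stand-in showing the shape of one sensor's implementation."""
    def read_metadata(self, path: str) -> dict:
        return {"sensor": "demo", "bands": 7}
    def toa_reflectance(self, path: str) -> list:
        return [[0.1, 0.1], [0.1, 0.1]]  # placeholder 2x2 single-band array
```

Once a sensor satisfies this boundary, DSF, gas transmittance, Rayleigh and aerosol handling come for free – which is why the per-sensor work is loaders, not algorithms.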

Tier 3 – Geostationary and niche (lowest priority for aquatic applications):

GOES ABI, Himawari AHI, MTG-I FCI, SEVIRI, Sentinel-3 SLSTR, AMAZONIA-1 WFI, CHRIS, HYPERION, HICO, HyperField, HYPSO, Tanager. Some of these (HYPERION, HICO) are decommissioned but their archives are still processed. Others (Tanager at 420 bands, HYPSO at 120) are newer hyperspectral missions that would benefit most from Rust’s performance advantage.

Beyond Loaders: The Algorithm Gap

Sensor parity is not just about reading files. Python ACOLITE has several processing features the Rust port does not yet implement:

  • ROI subsetting: Limit processing to a bounding box or polygon – critical for operational workflows that do not need a full scene
  • Ancillary data retrieval: NCEP ozone, pressure, and wind speed from NASA OBPG; currently the Rust port uses default values
  • DEM-derived pressure: Copernicus DEM at 30/90m for surface pressure estimation in mountainous coastal regions
  • Glint correction: Sun glint removal for low-latitude ocean scenes
  • RAdCor adjacency correction: The physics-based adjacency effect correction developed under the STEREO program
  • TACT thermal processing: Surface temperature from Landsat thermal bands via libRadtran – this one is architecturally interesting because it requires calling an external Fortran radiative transfer code
  • Interface reflectance (rsky): Sky reflection correction at the air-water interface
  • L2W water products: Chlorophyll-a (OC algorithms), TSS (Nechad, Dogliotti), turbidity, Secchi depth – the derived products that downstream scientists actually use

The L2W gap is the most consequential. Most ACOLITE users do not care about surface reflectance per se; they want chlorophyll maps or turbidity time series. Until the Rust port can produce L2W outputs, it remains a fast atmospheric correction engine rather than a complete aquatic remote sensing toolkit.
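
To make the L2W gap concrete: the OC-family chlorophyll algorithms mentioned above are polynomials in the log10 blue/green reflectance ratio. This sketch shows the functional form only – the coefficients are illustrative placeholders, not the operational OC3/OC4 values:

```python
import math

def oc_chlorophyll(blue, green, coeffs=(0.3, -2.9, 1.7, -0.6, -1.0)):
    """OC-style band-ratio chlorophyll-a: a polynomial in the log10
    blue/green reflectance ratio, evaluated in log space and then
    exponentiated. Coefficients here are placeholders for illustration."""
    r = math.log10(blue / green)                       # band-ratio index
    log_chl = sum(c * r ** i for i, c in enumerate(coeffs))
    return 10.0 ** log_chl                             # mg m^-3
```

Numerically this is trivial; the hard part of L2W in Rust is everything upstream of it – which is exactly why it only makes sense as the final milestone.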

The Realistic Path

Closing this gap is not a sprint; it is an endurance race – appropriately enough. The agent harness makes each sensor port a repeatable, testable unit of work. The pattern is established: write the loader, wire it into the AC pipeline, run regression tests against Python, fix the deltas, validate on real data. Each sensor port takes the agents a day or two of focused work plus human review.

At the current pace, Tier 1 sensors are within reach in the near term. Tier 2 will follow as the loader library matures. The algorithm features (ancillary data, glint, TACT) are orthogonal to sensor coverage and can be developed in parallel. L2W is the final milestone – when the Rust port can ingest a Sentinel-2 scene and produce a chlorophyll-a map that matches Python to within measurement uncertainty, the port will be race-ready for production.

Each of these is a --task for the agent harness. Two drivers, one constructor’s championship. The lead driver pushes into unfamiliar sensor territory, the second driver validates against Python, and the telemetry overlay catches every divergence before it compounds into a retirement.

If the intersection of Rust, Earth observation, and AI-assisted development interests you, the code is all on GitHub. Feel free to ping me with ideas, bug reports, or competing approaches – especially if you have a cleverer way to handle the N-dimensional LUT interpolation. That one was a fun 3 days of Rapid Rust Rewrite fuelled by AI Amphetamine Analogs.