Friday, December 12, 2025

Refactoring the AusTender Scraper: From Colly to OCDS

The AusTender analyser started life as a straight HTML scraper built with Colly, walking the procurement portal page by page. It worked, but it was always one redesign away from a slow death: layout shifts, odd pagination edges, and the constant need to throttle hard so I could sleep at night.

Then the Australian Government exposed an Open Contracting Data Standard (OCDS) API. That changed the whole game. Instead of scraping tables and div soup, I can treat the portal like a versioned data feed.

Part of why I care: I am kind of fascinated by government spending as a system. Budgets read like a mixture of engineering constraints and political storytelling, and I keep wanting to trace the thread from “budget line item” to “actual contract award” without hand-waving. The Treasurer’s Final Budget Outcome release (2022-23, “first surplus in 15 years”) is exactly the sort of headline that makes me want to drill down into the mechanics (see “Final Budget Outcome shows first surplus in 15 years”).

So the redesign in austender_analyser does three things differently:

  1. Fetch via OCDS, not HTML: Reduce breakage by consuming the API’s canonical JSON, not scraped pages.
  2. Persist to Ducklake: Store releases, parties, and contracts in Ducklake so you can query locally without rerunning the whole pipeline. This does not quite work yet; I am treating it as a learning exercise with Ducklake. It is much easier to learn on a real problem than on toy demo datasets.
  3. Treat caching as optional: Counterintuitively, the local cache is sometimes slower than pulling fresh data. Ducklake’s startup and query overhead can outweigh a simple, parallelized upstream call. The new design keeps the cache but makes it opt-in and measurable.

If you prefer Python, the upstream API team ships a reference walkthrough in the austender-ocds-api repo (see also the SwaggerHub docs and an example endpoint like findById).

Figure: Early KPMG scrape results (2023)

Why move off Colly?

  • Scraping HTML is like doing accounting by screenshot. OCDS is the ledger export.
  • Less breakage: OCDS is documented and versioned; DOM scraping is brittle.
  • Faster iteration: You model on structured data immediately, not after a fragile extraction layer.
  • Clear rate behavior: You can respect API limits without guessing at dynamic page loads.

Why keep Ducklake in the loop?

Ducklake is the reproducibility knob. It lets me freeze a snapshot, replay transforms, and run offline queries when I am iterating on analysis (or when the upstream is slow, or when I just do not want to be a bad citizen).

But caches are not free. Ducklake has startup and query overhead, and that can be slower than simply pulling fresh JSON in parallel. So the pipeline treats Ducklake like a tool, not a religion: measure the latency, pick the faster path, keep an escape hatch when you need repeatability.
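
The "measure the latency, pick the faster path, keep an escape hatch" idea can be sketched roughly as follows. This is a minimal Python illustration, not the analyser's actual code: `timed`, `choose_path`, `load_from_cache`, and `fetch_fresh` are hypothetical names, and the two loaders are stand-ins that only sleep to simulate cache startup and an upstream call.

```python
import statistics
import time
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def choose_path(
    paths: Dict[str, Callable[[], Any]],
    mode: str = "auto",
    trials: int = 3,
) -> Tuple[str, Dict[str, float]]:
    """Return the name of the path to use and the median latency per path.

    mode="auto" measures every path and picks the fastest. Any other
    value forces that path (the escape hatch) and skips measurement.
    """
    if mode != "auto":
        if mode not in paths:
            raise KeyError(f"unknown path: {mode}")
        return mode, {}
    medians = {
        name: statistics.median(timed(fn)[1] for _ in range(trials))
        for name, fn in paths.items()
    }
    return min(medians, key=medians.get), medians


# Hypothetical stand-ins: real versions would query the local store
# or call the upstream API.
def load_from_cache() -> list:
    time.sleep(0.05)  # simulated cache startup and query overhead
    return ["release-a"]


def fetch_fresh() -> list:
    time.sleep(0.005)  # simulated upstream round trip
    return ["release-a"]


paths = {"cache": load_from_cache, "fresh": fetch_fresh}
winner, medians = choose_path(paths)
print(winner, medians)
```

Passing `mode="cache"` or `mode="fresh"` skips the measurement entirely, which is the behaviour you would want when repeatability matters more than speed.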

Figure: Reindex disk usage

Current flow

  • Pull OCDS releases in batches, keyed by release date and procurement identifiers.
  • Normalize the JSON into Ducklake tables (releases, awards, suppliers, items).
  • Emit lightweight summaries for quick diffing between runs.

Figure: KPMG contracts flood view
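
The normalise-and-summarise steps of the flow above can be sketched like this. It is a hedged Python illustration rather than the analyser's real code: the field names (`ocid`, `awards`, `value`, `suppliers`, `items`) follow the general OCDS release schema, but the exact shape of AusTender's payload is an assumption, the sample release is invented, and plain lists of dicts stand in for Ducklake tables.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def normalize_release(release: Dict[str, Any]) -> Dict[str, List[Row]]:
    """Flatten one OCDS release into rows for four tables."""
    tables: Dict[str, List[Row]] = {
        "releases": [], "awards": [], "suppliers": [], "items": [],
    }
    ocid = release.get("ocid")
    tables["releases"].append(
        {"ocid": ocid, "release_id": release.get("id"), "date": release.get("date")}
    )
    for award in release.get("awards", []):
        award_id = award.get("id")
        value = award.get("value") or {}
        tables["awards"].append({
            "ocid": ocid,
            "award_id": award_id,
            "amount": value.get("amount"),
            "currency": value.get("currency"),
        })
        for supplier in award.get("suppliers", []):
            tables["suppliers"].append({
                "ocid": ocid,
                "award_id": award_id,
                "supplier_id": supplier.get("id"),
                "name": supplier.get("name"),
            })
        for item in award.get("items", []):
            tables["items"].append({
                "ocid": ocid,
                "award_id": award_id,
                "item_id": item.get("id"),
                "description": item.get("description"),
            })
    return tables


def summarize(tables: Dict[str, List[Row]]) -> Dict[str, int]:
    """Row counts per table: a cheap summary to diff between runs."""
    return {name: len(rows) for name, rows in tables.items()}


# Invented sample release, for illustration only.
sample_release = {
    "ocid": "ocds-example-0001",
    "id": "release-1",
    "date": "2025-12-01T00:00:00Z",
    "awards": [{
        "id": "award-1",
        "value": {"amount": 125000.0, "currency": "AUD"},
        "suppliers": [{"id": "supplier-1", "name": "Example Pty Ltd"}],
        "items": [{"id": "item-1", "description": "Consulting services"}],
    }],
}

tables = normalize_release(sample_release)
summary = summarize(tables)
print(summary)
```

Carrying `ocid` and `award_id` on every child row keeps the tables joinable without relying on insertion order.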

Lessons learned

  • A stable API beats heroic HTML scraping almost every time, even in the age of AI and tools like Firecrawl (https://www.firecrawl.dev/).
  • Caches are not free; measure them. Sometimes stressing the upstream lightly is faster and still acceptable within published rate limits.
  • Keep escape hatches: allow forcing cache use, bypassing it, and snapshotting runs for reproducibility.

Next steps. Going deeper: tighten validation against the OCDS schema, add minimal observability (latency histograms for API vs cache), and ship a “fast path” mode that only hydrates the fields needed for high-level spend dashboards. Going broader: find and build API and web aggregators for Australian state tender sites (e.g. VicTender) and international ones.

Saturday, December 6, 2025

Solar Ceilings and Compounding Dreams

It is fashionable to wave away physical constraints with vague references to solar abundance and human ingenuity. Yet every balance sheet eventually meets a balance of energy. Solar photons may shower Earth with roughly 170,000 terawatts, but financial markets expect growth that compounds on top of itself forever. The math linking those stories rarely appears in the same paragraph—so let’s put them together.

Setting the Stage

I keep coming back to Tom Murphy’s dialogue in Exponential Economist Meets Finite Physicist. In Act One, Murphy plots U.S. energy use from 1650 onward, and it traces a remarkably straight exponential line at ~3% per year. Economists in the conversation shrug; after all, 2–3% feels modest. But compounding at even 2.3% multiplies energy demand by ten every century, and at 3% the factor is closer to nineteen. Our economic models implicitly assume something even more optimistic (8–10% returns in equity markets, pension targets, and venture decks) without asking what energy supply function supports that.
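
The compounding claims are easy to check numerically. The snippet below is a small Python sketch (the function names are mine, not Murphy's) that turns an annual growth rate into a per-century multiplier and a doubling time.

```python
import math


def growth_factor(rate: float, years: int = 100) -> float:
    """Multiplier after compounding `rate` annually for `years` years."""
    return (1 + rate) ** years


def doubling_time(rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)


def doublings_per_century(rate: float) -> float:
    """How many doublings fit into 100 years at this rate."""
    return 100 / doubling_time(rate)


for rate in (0.023, 0.03, 0.08):
    print(
        f"{rate:.1%}: x{growth_factor(rate):,.1f} per century, "
        f"doubling every {doubling_time(rate):.1f} years, "
        f"{doublings_per_century(rate):.1f} doublings per century"
    )
```

At 2.3% the century multiplier comes out just under ten, at 3% it is about nineteen, and at 8% it is in the low thousands, which is the gap between energy history and financial expectations that the rest of this post leans on.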

Thermodynamic Guardrails

Murphy distills the second law of thermodynamics into plain language:

“At a 2.3% growth rate (conveniently chosen to represent a 10× increase every century), we would reach boiling temperature in about 400 years… Even if we don’t have a name for the energy source yet, as long as it obeys thermodynamics, we cook ourselves with perpetual energy increase.”

That thought experiment matters less for the literal 400-year timer and more because it shows energy growth must decelerate to avoid turning Earth into a heat engine. Solar panels, fusion, space mirrors … pick your technology. The waste heat still has to radiate away. We cannot spreadsheet, app, and AI our way around the Stefan–Boltzmann law and black-body radiation.
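
For a feel of where a figure like "about 400 years" can come from, here is a back-of-the-envelope Python sketch. It is my own reconstruction under stated assumptions, not necessarily Murphy's exact method: radiated power scales as T⁴ (Stefan–Boltzmann), Earth currently absorbs about 70% of the incident sunlight, the mean surface temperature is about 288 K, the greenhouse effect stays unchanged, and human energy use starts near 20 TW.

```python
import math

SOLAR_INCIDENT_TW = 170_000  # incident sunlight, the figure used in this post
ALBEDO = 0.3                 # assumption: ~30% reflected straight back
T_NOW_K = 288.0              # assumption: present mean surface temperature
T_BOIL_K = 373.15            # boiling point of water at sea level
HUMAN_TW = 20.0              # assumption: present human primary energy use


def years_until(target_k: float, growth: float = 0.023) -> float:
    """Years of exponential energy growth until waste heat alone lifts
    the equilibrium surface temperature to target_k."""
    absorbed = SOLAR_INCIDENT_TW * (1 - ALBEDO)
    # Radiated power goes as T^4, so the total power the surface must
    # shed scales by (T_target / T_now)^4 relative to today.
    required_total = absorbed * (target_k / T_NOW_K) ** 4
    extra_needed = required_total - absorbed
    return math.log(extra_needed / HUMAN_TW) / math.log(1 + growth)


years_to_boil = years_until(T_BOIL_K)
print(f"Years to boiling at 2.3% growth: {years_to_boil:.0f}")
```

The result lands in the neighbourhood of four centuries, and the logarithm explains why it is insensitive to the starting point: doubling or halving today's energy use shifts the answer by only about thirty years.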

Solar Arithmetic vs Demand Curves

Let’s grant the optimists a heroic build-out: cover about 0.3% of Earth’s land area (roughly 500,000 km²) with 20%-efficient photovoltaic arrays, assume a generous 200 W/m² of average insolation, and we net roughly 20 TW, about the entire human primary energy demand today. That is fantastic news for decarbonization, but it is not a blank check for compounding GDP. If demand keeps growing at 3%, we would need 20 TW × (1.03)ⁿ after n years, in perpetuity. Within 250 years we’d be trying to harvest more than 30,000 TW, a sizeable fraction of all the sunlight that reaches Earth and orders of magnitude more land, materials, storage, and transmission than our initial miracle project. Solar abundance is real; solar infinity is fiction.
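
The arithmetic in this section can be parameterised so the inputs are easy to vary. The Python sketch below is an illustration with my own function names. It assumes a land area of about 149 million km² (a figure from outside this post) and reads the 200 W/m² as average insolation that the panels convert at 20% efficiency.

```python
LAND_AREA_M2 = 1.49e14  # assumption: ~149 million km² of land


def pv_output_tw(
    land_fraction: float,
    insolation_w_m2: float = 200.0,
    efficiency: float = 0.20,
) -> float:
    """Average electrical output in TW for a given share of land."""
    watts = LAND_AREA_M2 * land_fraction * insolation_w_m2 * efficiency
    return watts / 1e12


def land_fraction_for(
    target_tw: float,
    insolation_w_m2: float = 200.0,
    efficiency: float = 0.20,
) -> float:
    """Share of land needed to average target_tw of output."""
    return target_tw * 1e12 / (LAND_AREA_M2 * insolation_w_m2 * efficiency)


def demand_tw(start_tw: float, rate: float, years: int) -> float:
    """Demand after compounding `rate` annually for `years` years."""
    return start_tw * (1 + rate) ** years


fraction_for_20_tw = land_fraction_for(20.0)
demand_in_250_years = demand_tw(20.0, 0.03, 250)
print(f"Land share for 20 TW: {fraction_for_20_tw:.2%}")
print(f"Demand after 250 years at 3%: {demand_in_250_years:,.0f} TW")
print(f"Land share for that demand: {land_fraction_for(demand_in_250_years):.0%}")
```

Changing the land fraction, insolation, or efficiency shows how sensitive the headline figure is to each input, while the last line shows how quickly compounding demand outruns any fixed build-out.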

Finance Is an Energy IOU

Money is a claim on future work, and work requires energy. When pensions assume 7–8% annual returns, when startups pledge 10× growth, and when national budgets bake in permanent productivity gains, they are effectively promising that future societies will deliver 2–3 doublings of net energy per century. If we instead hit a solar plateau, because land, materials, or social license cap expansion, those financial promises become unmoored. We can pretend that virtual goods, algorithmic trading, or luxury desserts (to borrow Murphy’s Act Four anecdote) deliver infinite utility without added energy, but the chefs, coders, and data centers still eat, commute, and cool their CPUs, GPUs, and tensor processors. The intangible economy rides on a very tangible energy base.

Rewriting the Business Plan

Accepting a solar ceiling does not doom us to stagnation. It just forces different design constraints:

  • grow quality, not quantity: prioritize outcomes per unit energy … do proof of useful work rather than roll the dice and gamble.
  • align finance with expected energy supply rather than mythical exponentials … and I am not talking of wasting energy on crypto.
  • treat efficiency gains as buying time, not as a perpetual motion machine … if you learnt enough physics in high school to reject the perpetual motion machine but have been lulled into expecting perpetual 8% returns from the financial markets, there is some serious cognitive dissonance to resolve.
  • embed thermodynamic literacy in economic education so debates start from the same math.

Murphy ends his essay noting that growth is not a “good quantum number.” It is not conserved. Our job is to craft institutions, portfolios, and narratives that can thrive when net energy flattens, because physics already told us that day will arrive long before our spreadsheets hit overflow errors.

Darwin 2022 - Ruminations Compendium

Collected reflections from the July 2022 Darwin trip, gathered so that a narrative of adaptation, organisational change, and expansion can live in a single place.

July 19 – Lemmings And Launchpads

“There is no exception to the rule that every organic being naturally increases at so high a rate, that if not destroyed the earth would soon be covered by the progeny of a single pair. Even slow-breeding man has doubled in twenty-five years, and at this rate in a few thousand years there would literally be no standing room for his progeny.” (Charles Darwin)

Like the lemming marching and diving into the ocean to self‑regulate, humanity plunges itself into vices of its own creation: alcohol, drugs, violence, and greed. Perhaps the next plunge is into the real ocean or into the vacuum of space, chasing more room in which to stand or float. Failure in harsh environments creates room by removing weaker individuals, or greater resilience by rewarding the most adaptable. Colonial Australia itself was founded on such selection—the most adaptable individuals and the strictest rule enforcers reshaped an unforgiving frontier.

July 20 – Organisational Evolution In Flight

“Seeing that a few members of such water-breathing classes as the Crustacea and Mollusca are adapted to live on the land, and seeing that we have flying birds and mammals, flying insects of vast diversified types, and formerly had flying reptiles, it is conceivable that flying fish, which now glide far through air, slightly rising and falling by the aid of their fluttering fins, might have been modified into perfectly winged animals.” (Charles Darwin)

The ability to skim over water for a few metres comes from external tweaks, but the ability to cross the Pacific like a bar-tailed godwit comes from internal rewiring: hollow bones, high metabolism, and a brain with a built‑in compass. Organisations face the same distinction. A brief digital-transformation spasm can bolt on an app or a website, yet sustaining that flight demands internal metamorphosis and a sense of direction from leadership. Caterpillars become butterflies through wholesale change, and so must companies that aspire to be more than flying fish.

July 23 – Questions For The Corporate Naturalist

  1. Where are the transitional forms?
    Organisations with no lines on the org chart operate as pure adhocracy. Hidden behind corporate veils, they are like pupae in cocoons, waiting to emerge in a more defined shape.
  2. How can specialised organs evolve?
    Marketing machines, technology muscle, sales teeth, enterprise-planning backbone, analyst frontal lobes—each department is an organ honed for a specific survival task.
  3. Is behaviour or instinct inheritable?
    Culture answers this. The rituals, stories, and incentives that survive layoffs and leadership changes become the genetic code of the firm.
  4. Why are some species sterile when crossed, while others are fertile?
    Some mergers and acquisitions thrive; others fail because the two organisational genomes cannot integrate and diverge instead of hybridising.

July 24 – Conquering New Lands

“He who believes in the struggle for existence and in the principle of natural selection, will acknowledge that every organic being is constantly endeavouring to increase in numbers; and that if any one being vary ever so little, either in habits or structure, and thus gain an advantage over some other inhabitant of the country, it will seize on the place of that inhabitant, however different it may be from its own place.” (Charles Darwin)

International expansion is a contest for ecological niches. Bringing hard‑won optimisations from one country to another is a bid to displace incumbents. The organisations that vary—by process, by product, by mindset—claim new ground first.