Sydney's public institutions are sitting on millions of duplicate digital images — redundant scans, re-uploaded photographs, and copied heritage records — and the tools being deployed to clean them up remain years behind those adopted in comparable cities like Amsterdam, Toronto, and Singapore. The gap is widening, and archivists say the cost is more than aesthetic.
The problem took hold after the COVID-era digitisation rush. Between 2020 and 2023, Australian cultural institutions accelerated bulk scanning programs to keep collections accessible during lockdowns. The State Library of New South Wales, the City of Sydney's heritage registry, and the National Archives' Sydney repository all expanded their digital holdings rapidly. The inevitable result: vast quantities of near-identical files stored across multiple servers, catalogued under different identifiers, sometimes with conflicting metadata. Researchers and software systems alike now struggle to distinguish the authoritative image from the redundant one.
How Sydney Compares to Amsterdam and Singapore
Amsterdam's Rijksmuseum completed a deduplication overhaul of its public image database in late 2024, deploying perceptual hashing, a technique that identifies visually identical or near-identical images regardless of file format or minor compression differences, across roughly 900,000 digitised objects. The process cut its active image repository by an estimated 18 percent and freed up server capacity that now supports higher-resolution public downloads. Singapore's National Heritage Board published a roadmap in early 2025 committing to AI-assisted deduplication across its network of nine museums by the end of 2026.
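For readers unfamiliar with the technique, the idea behind perceptual hashing can be sketched in a few lines of Python. This is a simplified "average hash", not the method the Rijksmuseum used (which the institution has not been described as publishing here), and it assumes each image has already been shrunk to a tiny grayscale grid. The function names, the 4x4 sample grids, and the `max_distance` threshold are all illustrative assumptions.

```python
from typing import List

Grid = List[List[int]]  # assumed input: a small grayscale grid, values 0-255


def average_hash(pixels: Grid) -> int:
    """Pack one bit per pixel: 1 if brighter than the grid's mean, else 0."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Grid, b: Grid, max_distance: int = 2) -> bool:
    """Treat two grids as duplicates if their hashes differ in few bits.

    max_distance is a hypothetical threshold; a real archive would tune it.
    """
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance


# Hypothetical sample data: a scan, a slightly noisier re-save, and a
# visually different image (dark and bright halves swapped).
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
recompressed = [
    [12, 18, 198, 213],
    [14, 27, 203, 214],
    [13, 21, 204, 210],
    [17, 30, 207, 219],
]
different = [
    [200, 210, 10, 20],
    [205, 215, 15, 25],
    [202, 212, 12, 22],
    [208, 218, 18, 28],
]

print(is_near_duplicate(original, recompressed))  # True
print(is_near_duplicate(original, different))     # False
```

Small pixel-level changes from recompression leave the hash untouched, while a genuinely different image flips most of its bits. Production tools add a downscaling step and often a frequency transform, but the comparison by bit distance works the same way.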
Sydney has no equivalent published commitment at the state or city level. The City of Sydney Council operates its heritage photo library through its Eora/Sydney portal, but a council spokesperson confirmed in written correspondence earlier this year that no automated deduplication program is currently active. The State Library's digital collections team, based at its Macquarie Street building in the CBD, has acknowledged the problem internally but has not outlined a funded solution in its public annual reporting. The library's 2024-25 annual report, the most recent available, lists digital preservation as a strategic priority without specifying deduplication as a discrete line item or allocating a dedicated budget figure.
Toronto offers the most instructive comparison. The Toronto Public Library partnered with the University of Toronto's iSchool in 2023 to pilot a community-sourced image verification project across its 500,000-item digital archive. Volunteers flagged suspected duplicates through a public interface, with librarians making final calls. The program processed around 60,000 images in its first year at minimal cost — primarily staff time — and the library has since expanded it citywide. The model is low-tech and scalable, and it worked.
Western Sydney's Heritage Records Face Particular Pressure
The issue isn't confined to inner-city institutions. Parramatta City Council, now absorbed into the Cumberland and Greater Parramatta council areas following boundary changes, generated large volumes of digitised heritage imagery during the lead-up to the Western Sydney Powerhouse museum opening. Those records are held across at least three separate platforms, according to information published in council planning documents. Deduplication has not been listed as a project deliverable in any publicly available council budget.
The financial stakes are real. Cloud storage isn't free. A conservative estimate from research and advisory firm Gartner, published in its 2025 data management benchmarking report, puts the average cost of storing redundant enterprise data at between 25 and 30 percent of total storage spend. For a mid-sized public archive holding ten petabytes, that translates to hundreds of thousands of dollars annually in avoidable expense.
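The arithmetic behind that figure can be checked with a rough sketch. The 25 to 30 percent range and the ten-petabyte archive come from the paragraph above; the storage price of 0.02 dollars per gigabyte per month is purely an assumed illustrative rate, not a quoted tariff, and actual pricing varies by provider, tier, and currency.

```python
def redundant_storage_cost(
    total_petabytes: float,
    price_per_gb_month: float,
    redundant_share: float,
) -> float:
    """Estimate annual spend attributable to redundant data.

    Uses decimal units (1 PB = 1,000,000 GB) and assumes redundant data
    accounts for a fixed share of total storage spend.
    """
    total_gb = total_petabytes * 1_000_000
    annual_spend = total_gb * price_per_gb_month * 12
    return annual_spend * redundant_share


ASSUMED_PRICE = 0.02  # hypothetical rate in dollars per GB per month

low = redundant_storage_cost(10, ASSUMED_PRICE, 0.25)
high = redundant_storage_cost(10, ASSUMED_PRICE, 0.30)
print(f"Avoidable cost per year: {low:,.0f} to {high:,.0f} dollars")
```

Under that assumed rate, total spend comes to 2.4 million dollars a year and the redundant share lands between 600,000 and 720,000 dollars, in line with the "hundreds of thousands" estimate. A cheaper archival tier would lower the figure proportionally.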
Archivists and digital preservation specialists who work with NSW institutions point to a structural reason for Sydney's lag: digitisation budgets are typically project-based and tied to specific collections, while deduplication is cross-collection work that doesn't fit neatly into a funding category. No one owns the problem.
The practical path forward isn't complicated. Toronto's volunteer-assisted model can be replicated cheaply. Open-source perceptual hashing tools — including pHash and ImageHash, both freely available — are already used by cultural institutions in Europe and North America. The State Library, the City of Sydney's digital team at Town Hall House on George Street, and Parramatta's heritage office could each begin pilot programs without new legislation or large capital outlays. The question is whether any of them will schedule the work before the next digitisation grant arrives and the duplicate count climbs higher.