Sydney's public institutions are sitting on tens of thousands of duplicate digital images — files copied, re-saved and re-uploaded across shared drives, cloud platforms and legacy archive systems over more than two decades — and the cleanup bill is only growing. The problem, which IT managers at several Greater Sydney councils have been quietly documenting since at least 2024, has become harder to ignore as storage costs rise and digital transformation projects accelerate ahead of expected infrastructure investment linked to the Western Sydney Airport precinct at Badgerys Creek.
The timing matters. The NSW Government's digital records obligations under the State Records Act 1998 require agencies to maintain accurate, retrievable archives. But when a single photographic asset exists in four or five near-identical versions across different departmental folders — a common scenario in organisations that transitioned from physical to digital workflows without a unified naming convention — retrieval becomes slower, audit trails become murkier, and storage budgets stretch further than they need to.
What the Data Actually Shows
Research published by the Australian Computer Society in 2025 found that duplicate files, images in particular, account for between 20 and 30 per cent of total data volume in typical government and not-for-profit digital repositories. Apply that range to a mid-sized Sydney council managing, say, 40 terabytes of archived records, and you are looking at roughly 8 to 12 terabytes of redundant data. At current enterprise cloud storage pricing on Australian-hosted platforms, that is avoidable expenditure that recurs every year and multiplies with each backup and replica kept of the same files.
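The arithmetic behind that estimate is easy to check. What follows is a minimal sketch, not a costing model: the function names and the `rate_per_tb_month` and `copies` parameters are illustrative, and no actual price is implied, because the rate depends on the provider, the storage tier and the contract.

```python
from typing import Tuple


def redundant_volume_tb(total_tb: float,
                        low_share: float = 0.20,
                        high_share: float = 0.30) -> Tuple[float, float]:
    """Return the (low, high) estimate of duplicate data in terabytes."""
    return total_tb * low_share, total_tb * high_share


def annual_cost(volume_tb: float, rate_per_tb_month: float, copies: int = 1) -> float:
    """Yearly cost of holding volume_tb, counting every stored copy.

    rate_per_tb_month is a placeholder to be filled from the organisation's
    own contract; copies (backups, replicas) is an assumption, not a figure
    taken from the article.
    """
    return volume_tb * rate_per_tb_month * 12 * copies


low, high = redundant_volume_tb(40)
print(f"Redundant data: {low:.0f} to {high:.0f} TB")  # prints "Redundant data: 8 to 12 TB"
```

Passing an organisation's own per-terabyte rate and number of retained copies to `annual_cost` turns the volume range into a dollar range that can be checked against a budget line.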
The City of Sydney Council's digital asset management program, centred on its operations at Town Hall House on Kent Street in the CBD, has been working through a rationalisation process since mid-2025. The council's records management framework, publicly documented in its annual reports, identifies photographic assets relating to public events, infrastructure inspections, and heritage documentation as the categories most likely to contain duplicates. Heritage photography is a particular problem: images of terraces in Surry Hills or industrial buildings in Alexandria get re-scanned, re-cropped and re-saved every time a planning application references them.
The State Library of NSW on Macquarie Street holds one of the largest publicly accessible photographic collections in the Southern Hemisphere. Its digitisation program has processed more than 500,000 images since 2010. Librarians and archivists working on the collection have described — in publicly available project documentation — a persistent challenge with near-duplicate images: photographs taken seconds apart on the same roll of film, or scanned multiple times at different resolutions, creating storage and cataloguing overhead that compounds over time.
The Fix Is Algorithmic, But Implementation Is Patchy
Automated deduplication tools have existed for years. Software using perceptual hashing, a technique that identifies visually similar images even when the underlying files differ byte for byte after resizing, recompression or renaming, can process thousands of images per hour and flag near-duplicates for human review. Several Australian technology vendors, including companies with offices in the Ultimo and Pyrmont tech corridor, offer these tools as part of broader digital asset management suites.
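For readers who want to see what perceptual hashing looks like in practice, here is a sketch of one of its simplest variants, the average hash. It is not the method used by any vendor or agency named here. To stay dependency-free it treats an image as a nested list of grayscale values; a real pipeline would first decode files with an imaging library. The 8-by-8 grid and the 5-bit threshold are illustrative assumptions.

```python
from typing import List

Grid = List[List[int]]  # grayscale pixels (0-255), one inner list per row


def shrink(pixels: Grid, size: int = 8) -> List[float]:
    """Downsample to size x size by averaging rectangular blocks.

    Assumes the image is at least size pixels wide and high.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    return cells


def average_hash(pixels: Grid, size: int = 8) -> int:
    """One bit per cell: 1 if the cell is brighter than the mean of all cells."""
    cells = shrink(pixels, size)
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Grid, b: Grid, threshold: int = 5) -> bool:
    """Flag a pair for human review when their hashes differ in few bits."""
    return hamming(average_hash(a), average_hash(b)) <= threshold


# A horizontal gradient, the same gradient "re-scanned" at half the
# resolution, and an unrelated vertical gradient.
original = [[x * 8 for x in range(32)] for _ in range(32)]
resized = [[x * 16 for x in range(16)] for _ in range(16)]
different = [[y * 8 for _ in range(32)] for y in range(32)]

print(is_near_duplicate(original, resized))    # True
print(is_near_duplicate(original, different))  # False
```

Because the hash is built from coarse brightness patterns rather than file bytes, the half-resolution copy matches the original while the unrelated image does not. Production tools generally use more robust variants, but the compare-by-bit-distance idea is the same.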
The problem is not the technology. It is the workflow that comes after the algorithm flags a match. Someone has to decide which version of a duplicate is the canonical one, update all internal links and metadata references, and then formally archive or delete the redundant file under records retention schedules. For organisations running lean IT teams — which describes most of the 128 councils across NSW — that human review step has no dedicated budget line.
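Software can prepare that review step, though it cannot replace it. The sketch below is built on stated assumptions: the `ImageVersion` fields, the file paths and the tie-break rule (most pixels first, then earliest creation date) are all hypothetical. Its output is only a proposal for a records officer to confirm under the applicable retention schedule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ImageVersion:
    path: str
    width: int
    height: int
    created: date


def propose_canonical(group: List[ImageVersion]) -> Tuple[ImageVersion, Dict[str, str]]:
    """Suggest a canonical file plus a redundant-path -> canonical-path map.

    Heuristic (an assumption, not a records rule): prefer the version with
    the most pixels, and break ties with the earliest creation date.
    """
    canonical = min(group, key=lambda v: (-(v.width * v.height), v.created))
    redirects = {v.path: canonical.path for v in group if v.path != canonical.path}
    return canonical, redirects


# Hypothetical group of versions flagged as near-duplicates of one photograph.
group = [
    ImageVersion("planning/terrace_crop.jpg", 1200, 800, date(2019, 3, 4)),
    ImageVersion("heritage/terrace_scan.tif", 6000, 4000, date(2012, 7, 19)),
    ImageVersion("events/terrace_copy.tif", 6000, 4000, date(2021, 1, 8)),
]
canonical, redirects = propose_canonical(group)
print(canonical.path)  # heritage/terrace_scan.tif
```

The `redirects` map is the part that saves reviewer time: once a human approves the canonical choice, it lists every internal link and metadata reference that needs to be repointed before the redundant files are archived or deleted.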
The NSW Digital Information Security Policy, updated in 2022, encourages agencies to adopt data minimisation practices, which in principle covers removing unnecessary duplicates. But the policy does not set mandatory reduction targets or timelines for image-specific deduplication, leaving individual agencies to move at their own pace.
Practically, organisations that want to start cutting into the problem now should begin with their highest-volume, highest-cost storage buckets — typically image libraries tied to planning, events, or marketing functions — and run a perceptual hash audit before the end of the 2026 financial year. The longer the delay, the more deeply embedded the duplicates become in cross-referenced databases, making later removal progressively more expensive and time-consuming.
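One way to structure such an audit is to hash every image in the chosen bucket once, then cluster the files whose hashes sit within a few bits of each other. The sketch below assumes the 64-bit hashes have already been produced by whichever perceptual hashing tool is in use. The function name, the 5-bit threshold and the sample paths and hash values are made up for illustration. The pairwise comparison is quadratic, so a very large library would need an index rather than this loop.

```python
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def group_near_duplicates(hashes: Dict[str, int], threshold: int = 5) -> List[List[str]]:
    """Cluster paths whose hashes differ by at most threshold bits (union-find)."""
    paths = sorted(hashes)
    parent = {p: p for p in paths}

    def find(p: str) -> str:
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving keeps lookups short
            p = parent[p]
        return p

    for i, a in enumerate(paths):
        for b in paths[i + 1:]:
            if hamming(hashes[a], hashes[b]) <= threshold:
                parent[find(b)] = find(a)

    clusters: Dict[str, List[str]] = {}
    for p in paths:
        clusters.setdefault(find(p), []).append(p)
    # Only clusters with more than one member need human review.
    return [members for members in clusters.values() if len(members) > 1]


# Illustrative, hand-written hash values.
sample = {
    "a.jpg": 0x0F0F0F0F0F0F0F0F,
    "b.jpg": 0x0F0F0F0F0F0F0F0E,
    "c.jpg": 0x00000000FFFFFFFF,
    "d.jpg": 0x0F0F0F0F0F0F0F0C,
}
print(group_near_duplicates(sample))  # [['a.jpg', 'b.jpg', 'd.jpg']]
```

Each cluster becomes one item in the review queue, which keeps the human workload proportional to the number of distinct photographs rather than the number of files.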