Author: Kate Follington
A pilot project using Google Gemini Pro to describe and caption old photographs
In 2024 Public Record Office Victoria (PROV) began a proof-of-concept project to determine the viability of using Artificial Intelligence (AI), in collaboration with human reviewers, to help transcribe, caption and describe historic photographs at scale. My team and I explore innovative ways to enable access to digitised records, and we are often frustrated by the limited results our photographic search page delivers when searching by subject, and by the inaccessibility of some of the images. In the past, collecting agencies have sometimes crowdsourced improvements to their photographic metadata using public volunteers, but generative AI (tools that generate content) may offer a similar solution at scale. Human review coupled with automation is the key to an efficient process with quality output.
Photographic records of government, like the one above, are sometimes transferred in their thousands to state archives like PROV with limited metadata. It was, and still is, common for major capital projects like a new bridge or an underground rail loop to be photographed in detail by government staff photographers for reporting or engineering purposes, but the resulting collections of images are not always described at item level with helpful keywords. For example, searchable metadata like 'old cargo truck', or even, dare I say, old-fashioned 'turn signals' (note the metal hand at the window in the image above) aren't usually listed with the photos.
Archivists or volunteers are then needed to transcribe photographic notes or add keywords manually while digitising the photos, so that researchers can find them when they search the online catalogue. This step may or may not happen if the resources aren't available.
Adding keywords and description at scale
This article explains how we have integrated the image annotation application Labellerr, produced by the development team Tensor Matics and built around the AI tool Google Gemini Multi-modal Pro, into our workflows, and how we're using it in a new pilot project.
Released in 2023, Google's multi-modal AI models (Gemini Pro) enable a range of automated functionality: identifying objects in images, generating image descriptions and keywords, completing data analysis and even transcription, all from the same prompt or request. In short, a more complex query can deliver a more complex response. The output format of the response (e.g. JSON or text) can also be requested, so we can transfer the new data into our catalogue more easily. This project is a pilot at PROV to see if we can produce helpful descriptive metadata at scale, coupled with human review, to describe photos using an online application with AI integrated. Following the ethical considerations and documentation required under Government AI policies, we kicked off with a collection of photos about Melbourne's port.
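To give a sense of what a multi-modal request like this looks like, the sketch below sends an image and a combined prompt to a Gemini model using Google's Python SDK (google-generativeai). It is a minimal illustration only; the model name, file name and prompt wording are assumptions, not the exact configuration used in the pilot, which runs inside Labellerr rather than through our own code.

```python
# A minimal sketch of a multi-modal Gemini request: one prompt asking for
# keywords, a description and a transcription from a single photograph.
# Model name, API key and file name are illustrative placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")            # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")    # any multi-modal Gemini model

photo = Image.open("harbour_trust_negative_0001.jpg")  # hypothetical scan
prompt = (
    "Give detailed keywords about the photo in Melbourne. "
    "Give a short description, maximum 40 words. "
    "Transcribe any handwritten headings and the words underneath them. "
    "Format the response as key-value pairs with line breaks between each pair."
)

response = model.generate_content([prompt, photo])
print(response.text)  # e.g. "Keywords: ...\nDescription: ...\nDate: ..."
```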
Melbourne Harbour Trust photo collection of Melbourne port activity
The pilot is centred around photos from the Melbourne Harbour Trust, which once governed the Melbourne port and produced an historic collection of photographs now held at the state archives. One photographic series, VPRS 8363, is a good example of undescribed images. It was chosen for this project because it includes 4,000 photographs with very little descriptive data listed or supplied with the photos; a subject listing is about all we have. The images are of ships and port activity within Melbourne's docks from the 1950s to the 1970s. There is, however, information scribbled by hand directly onto the negative index sheet.
The handwritten information on the negative index sheet is insightful. It includes a subject classification taken from a list of 150 subjects ranging from A1: Aerials to VC: Victoria Dock; the date; a negative number; a location; and notes, which are also to be transcribed and will often name the ship or vessel in the photo.
A concurrent project is digitising the photos, so that the metadata produced by the AI can be paired with the digitised images by matching the negative number.
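As a rough illustration of that pairing step, the sketch below joins an exported metadata file to a listing of digitised image files on a shared negative number. The file and column names are hypothetical; the real export and digitisation listings will differ.

```python
# Sketch only: join AI-generated metadata to digitised image files on the
# negative number. Column and file names are hypothetical.
import pandas as pd

ai_metadata = pd.read_csv("labellerr_export.csv")       # includes a "Neg. No." column
digitised = pd.read_csv("digitised_image_listing.csv")  # includes "negative_number" and "filename"

# Normalise the negative numbers before joining (e.g. "NONEG. 75058" -> "75058").
ai_metadata["neg_key"] = ai_metadata["Neg. No."].str.extract(r"(\d+)", expand=False)
digitised["neg_key"] = digitised["negative_number"].astype(str).str.extract(r"(\d+)", expand=False)

paired = digitised.merge(ai_metadata, on="neg_key", how="left")
paired.to_csv("paired_for_catalogue_ingest.csv", index=False)
```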
Why Labellerr?
Labellerr is not as sophisticated, nor as pretty, as some off-the-shelf image annotation workflow applications. We did trial others and found that the inability to curate the AI prompt to suit the image requirements, and the complexity of the workflow in some cases, would be a barrier for our volunteers using an AI prompt for the first time. We also needed a workflow that allowed us to modify the prompt on the fly to correct the image caption per photo if need be, a user management system that could handle many users, and the ability to apply an AI prompt in a 'batch' across a large set of hundreds or even thousands of photos.
While in future projects we may wish to train an AI tool with our own data sets and create our own large language model (LLM), as we have with our handwriting transcription project, for this pilot our needs were simpler. We wanted a workflow to ingest large data sets, curate a prompt per project, batch-apply the AI prompt, review and correct the output with multiple users, and then export as CSV or JSON. Labellerr is a bit clunky, but overall it suited our needs and matched the user story we had co-designed with our volunteers.
We also selected Labellerr because of the multi-modal generative AI model it uses, Google Gemini Pro, which is currently considered equal in quality of output to other generative AI tools like ChatGPT. It is part of a suite of products produced by Google DeepMind and comes with considerable documentation articulating its ethical and quality standards, although details of its training data set are scant. In addition, the Tensor Matics Labellerr development team were keen to learn about the needs of archives that manage image collections and agreed to add new functionality to the interface: large-scale batch application of the AI prompt (modifiable at item level) and, importantly, display of existing metadata fields from the catalogue data to help inform the AI-generated result. Even when all we have is the name of the producing agency, it still tells the AI describing tool that the location is Melbourne.
The user experience improvements were a minimal cost on top of the monthly licensing agreement and the Tensor Matics team were co-operative and efficient to work with.
Many of the image annotation workflow applications on the market are more commonly used by medical imaging or insurance companies for mass identification of similar subject matter, so this way of using the tool was a novel approach.
Large scale ingest
The ingest process can run via a cloud account, via manual transfer, or by using our public API to pull data directly from our collection management system; the API was our preferred method. To enable an API ingest process you must work with the Labellerr development team to supply the correct query string and preferred metadata fields. We chose five additional fields, such as existing description, title and date, out of more than a hundred available in our archival management system. Any contextual information related to the image will improve the keywords and description the AI generates. Manual upload, which we had to do for the Melbourne Harbour Trust project because there was no transcribed data or metadata, only displays the filename in the interface. We found during our testing phase that any additional metadata markedly improved the AI-generated description.
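The sketch below shows the general shape of that API ingest idea: request records for a series from a public collection API and keep a handful of metadata fields to accompany each image. The endpoint, query parameters and field names are hypothetical placeholders, not our actual API or the exact query string agreed with Labellerr.

```python
# Rough sketch of pulling records and a few context fields from a public
# collection API. Endpoint, parameters and field names are hypothetical.
import requests

API_URL = "https://api.example-archive.gov.au/search"  # placeholder endpoint
params = {"series": "VPRS 8363", "rows": 200}           # illustrative query string

records = requests.get(API_URL, params=params, timeout=30).json().get("records", [])

FIELDS = ["title", "description", "date", "location", "agency"]  # extra context fields
for record in records:
    context = {field: record.get(field, "") for field in FIELDS}
    # This context would accompany the image into the annotation tool, so the
    # AI prompt has more to work with than just a filename.
    print(record.get("identifier", ""), context)
```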
Designing the workflow and curating the prompt
After importing your data set you can design your labelling workspace and curate your ideal prompt, and the generated metadata can then be exported as a CSV file (among other export formats). Our data sets are roughly 4,000 images per project.
We divided our project into sub-sets of 200 images and, while this is small, it has turned out to be helpful as we modify the prompt to best suit the images within a set. A good example of this is the 'Class' field on the top left of each image (see above), which refers to a subject. Each AI prompt could include either all the subject listings we had, or simply those relevant to a small subset. So we decided to create subsets based on those subjects only and modified the prompt for each group, e.g. A1 are aerial photographs, C1 are cargo related, and so on.
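As a rough illustration of what this per-subset prompt tailoring looks like, the snippet below keeps a base prompt and appends a subject-specific hint keyed by class code. The codes and wording are illustrative only, not our exact production prompt.

```python
# Illustrative only: a base prompt plus subject-specific additions keyed by
# the 'Class' code used to create each subset.
BASE_PROMPT = (
    "Give detailed keywords about the photo in Melbourne. "
    "Give a short description, maximum 40 words. "
    "Format the response as key-value pairs with line breaks between each pair."
)

SUBSET_HINTS = {
    "A1": "These are aerial photographs of the port and surrounding suburbs.",
    "C1": "These are cargo-related photographs of wharves, cranes and goods.",
}

def prompt_for_subset(class_code: str) -> str:
    """Return the base prompt with the subject hint for this subset appended."""
    hint = SUBSET_HINTS.get(class_code, "")
    return f"{BASE_PROMPT} {hint}".strip()

print(prompt_for_subset("A1"))
```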
Multiple requests per prompt
The curated prompt for this pilot project includes seven specific requests, including that the result be written in key-value pairs (divided by a semi-colon) so the exported content can be more easily imported into the collection management system.
The sample prompt we are using:
- Give detailed keywords about the photo in Melbourne.
- Give a short description maximum 40 words.
- Then, transcribe all the headings written on the photograph and the words underneath each heading. For the date use this format YEAR-MONTH-DAY e.g. 1964-9-19.
- Then, find the classification code written on the photograph and identify the matching code from the subject listings provided below, and identify the subject e.g. Aerials A1, Cargo C1
- Always write using the en-GB variety of English
- Write the response in a text format
- Format the response as key-value pairs with line breaks between each pair.
Sample result
The AI-generated description has proven to be pretty good at delivering keywords and searchable description, as well as transcribing the photographer's hand-scribbled notes like the date, location or ship name (note: the result below is not specific to the image above, but is a typical result pre-review).
Keywords: Victoria, Melbourne, Yarra River, cityscape, industry, aerial, port, rail yard, 1970s, black and white, archive, history, heritage.
Description: Black and white aerial photograph of Melbourne and the upper reaches of the Yarra River towards the port in February 1975.
Class.: A. 1
Neg. No.: NONEG. 75058
Date: 1975-02-27
Location: UPPER REACHES OF PORT.
Notes: Aerials A.1: Aerial views
Subject: Aerial views
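A key-value response like the one above is straightforward to split into fields before import into the catalogue. The sketch below is a minimal example of that step; it assumes each pair sits on its own line as requested in the prompt, whereas real responses vary and are corrected during human review.

```python
# Minimal sketch: split a key-value AI response into fields and write a CSV row.
import csv

response_text = """Keywords: Victoria, Melbourne, Yarra River, aerial, port
Description: Black and white aerial photograph of Melbourne, February 1975.
Date: 1975-02-27
Location: UPPER REACHES OF PORT."""

record = {}
for line in response_text.splitlines():
    if ":" in line:
        key, value = line.split(":", 1)   # split on the first colon only
        record[key.strip()] = value.strip()

with open("ai_descriptions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=record.keys())
    writer.writeheader()
    writer.writerow(record)
```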
Hallucinatory description and learnings
The natural concern on this project is the creation of hallucinatory description, in other words wildly inaccurate description generated by the AI. So far, the keywords and the transcription of the handwritten data are proving to be fairly accurate overall, except for the odd misinterpretation of cursive writing.
However, in one set a significant number of images (1 in 10) were described with the wrong city name. The AI-generated content made an assumption about location based on the ship name: for example, the cargo ship on the negative index may have been the Tokyo Bay, and the AI model presumed the port was in Japan and described a Japanese harbour.
To fix this error we modified the prompt to include a location name, 'Give detailed keywords about the photo in Melbourne', and then reapplied it across the whole set of photos, which improved the description. Simple location data also improved the names of other locations identified in the photo and along the coastline.
Human review
The human review process allows up to ten volunteers to work on a single project, either as a labeller or a reviewer, and the images appear one by one for their review. The first person reviews and corrects the AI-generated content, and the reviewer then checks that person's first pass. Any description questioned by both can be 'rejected', and the project manager will determine the best caption to use.
One of the other key learnings is to give volunteers or staff very detailed instructions on the ideal format and on which content can be ignored, even if it's wrong. AI-generated content will produce inconsistent data, for example sometimes calling Port Phillip a bay and occasionally a harbour, or transcribing the headings on the negative index in full or abbreviated form, e.g. Class. or Classification. For our purposes these aren't as important as making sure the keywords are right and the date, location or ship name is correct. The metadata field headings can be cleaned up quickly pre-ingest, but to volunteers, who are used to structured data sets, these details became lengthy discussion points.
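That pre-ingest clean-up of heading variants is a small, mechanical step. The sketch below shows one way it could be done, mapping the variants the model produces onto the catalogue's field names; the mapping here is illustrative rather than our full list.

```python
# Sketch of pre-ingest clean-up: map inconsistent AI-generated headings
# (e.g. "Class." vs "Classification") onto catalogue field names.
HEADING_MAP = {
    "class.": "classification",
    "classification": "classification",
    "neg. no.": "negative_number",
    "negative number": "negative_number",
    "date": "date",
    "location": "location",
    "notes": "notes",
}

def normalise_headings(record: dict) -> dict:
    """Rename inconsistent AI-generated headings to catalogue field names."""
    return {HEADING_MAP.get(key.strip().lower(), key.strip().lower()): value
            for key, value in record.items()}

print(normalise_headings({"Class.": "A1", "Neg. No.": "75058", "Date": "1975-02-27"}))
```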
Timing and cost
Two volunteers are reviewing roughly 30 sets of photographs (100-200 photos per set). Each volunteer reviews and corrects around 100 photos per sitting. The batch application of the AI tool, to generate the description, takes around 12 minutes to process 200 photos. So an estimated return on investment might be around 10 days to describe 4,000 images, though realistically longer on volunteer time (which is broken up with coffee and conversation). The licence to use Labellerr costs over $5,000 USD a year, with an additional fee for the use of the AI models based on tokens, estimated by Labellerr at $2,500 USD for 75,000 images (roughly three US cents per image). Overall we had to balance the cost of developing a product ourselves, with user management, AI API integration and multiple workflows, against an off-the-shelf product we could update with functionality fit for purpose for archival images.
For any further inquiries about this project please don't hesitate to contact Head of Audience Engagement and project manager Kate Follington or volunteer manager Amanda Nichols via the media inbox, media@prov.vic.gov.au
Get started with Labellerr here.
Material in the Public Record Office Victoria archival collection contains words and descriptions that reflect attitudes and government policies at different times which may be insensitive and upsetting.
Aboriginal and Torres Strait Islander Peoples should be aware the collection and website may contain images, voices and names of deceased persons.
PROV provides advice to researchers wishing to access, publish or re-use records about Aboriginal Peoples.