
Author: Kate Follington

A pilot project using Google Gemini Pro 

 

In 2024 Public Record Office Victoria (PROV) began a project to explore the viability of using Artificial Intelligence (AI) in collaboration with human reviewers. The aim was to transcribe, caption and add keywords to historic photographs at scale.

 

My team and I explore innovative ways to enable access to digitised records, and we are often frustrated by the limited results our photographic search page delivers when searching by subject, and by the inaccessibility of some of the images. In the past, collecting agencies have sometimes crowdsourced improvements to their photographic metadata using public volunteers; generative AI (tools that generate content) may offer a similar solution, at scale. Human review coupled with automation is the key to an efficient process with quality output.

 

VPRS 8363 Cargo 1966 Black and white image Melbourne Docks
Sample image from the Melbourne Harbour Trust negatives index of their historic photographic collection, VPRS 8363, showing a cargo truck on the docks in 1966. The series arrived at PROV with no transcribed data or other descriptive listing.

 

Government photographic collections

 

Photographic records of government, like the one above, are sometimes transferred in their thousands to state archives like PROV with limited metadata. It was, and still is, common for major capital projects like a new bridge or an underground rail loop to be photographed in detail by Government staff photographers for reporting or engineering purposes, but the resulting images are not always described at item level with helpful keywords. For example, searchable metadata like 'old cargo truck', or even, dare I say, old fashioned 'turn signals' (note the metal hand at the window in the image above) isn't usually listed with the photos.

Archivists or volunteers are needed to transcribe photographic notes or add keywords manually while digitising the photos, so that researchers can find them when they search the online catalogue. This step may be skipped if the resources aren't available.

 

Adding keywords and description at scale to historic photographs

 

This article explains how we have used Labellerr, an image annotation application produced by the development team Tensor Matics and built on Google Gemini Multi-Modal Pro, to increase the pace of our workflows, and how we're using it in a new pilot project.

The pilot will test whether we can produce helpful descriptive metadata at scale, coupled with human review, using an online application with AI integrated. After working through the ethical considerations and documentation required under Government AI policies, we kicked off with a collection of photos about Melbourne's port.

Released in 2023, Google's multi-modal Gemini Pro models enable a range of automated functionality: they can identify objects in images, generate image descriptions and keywords, complete data analysis and even transcription, all from the same prompt or request.

In short, a more complex query can deliver a more complex response. The output format of the response (e.g. JSON or text) can also be requested, so we can transfer the new data into our catalogue more easily.
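
To illustrate, here is a minimal sketch of the kind of multi-modal request described above, using Google's google-generativeai Python SDK. The prompt wording, file name and API key are illustrative assumptions, not our production configuration.

```python
# A minimal sketch of a multi-modal Gemini request: one prompt asking for
# transcription, keywords and a description, with the output format specified.
# The file name and key below are placeholders, not PROV's actual setup.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential

model = genai.GenerativeModel("gemini-pro-vision")
photo = Image.open("negative_71077.jpg")  # hypothetical digitised negative

prompt = (
    "Transcribe any handwritten notes on this photograph, "
    "then give keywords and a short description. "
    "Return the response as JSON with the keys "
    "'transcription', 'keywords' and 'description'."
)

response = model.generate_content([prompt, photo])  # image + text in one request
print(response.text)
```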

 

Melbourne Harbour Trust Collection

 

The pilot is centred on photos from the Melbourne Harbour Trust, which once governed the Port of Melbourne and produced an historic collection of photographs now held at the state archives.

One photographic series, VPRS 8363, is a good example of undescribed images. It was chosen for this project because it includes 4,000 photographs with very little descriptive data. A subject listing is about all the collections team have to work with. The images are of ships and port activity within Melbourne's docks from the 1950s to the 1970s. 

There is, however, information scribbled by hand directly onto the negative index sheet. 

The handwritten information on the negative index sheet is insightful. It includes: a subject classification taken from a list of 150 subjects, ranging from A1: Aerials to VC: Victoria Dock; the date; a negative number; a location; and notes, which often name the ship or vessel in the photo.

A concurrent project is digitising the photos, so the metadata produced by the AI can be paired with each photo by matching the negative number.
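
In practice this pairing is a simple join on the negative number. The sketch below shows one way it might be done; the file paths and CSV column headings are assumptions for illustration, not our actual schema.

```python
# Illustrative sketch: pair AI-generated metadata with digitised scans
# by negative number. Paths and column names are assumed, not PROV's schema.
import csv
from pathlib import Path

# e.g. scans named "71077.tif", keyed by negative number
scans = {p.stem: p for p in Path("scans").glob("*.tif")}

with open("ai_metadata.csv", newline="") as f:
    for row in csv.DictReader(f):
        neg_no = row["negative_number"]
        if neg_no in scans:
            print(f"{scans[neg_no]} -> {row['description']}")
```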

 

VPRS 8363 Aerial view over Yarraville Black and white 1963
A sample aerial view photograph over Yarraville from the Melbourne Harbour Trust collection of historic photographs. The photographer wrote information about the image on the photographic negative, 1968.

 

Why are we using Labellerr? 

 

Labellerr is not as sophisticated, nor as pretty, as some off-the-shelf image annotation workflow applications. What we needed was:

  • a workflow that allowed us to modify the prompt on the fly and correct image captions individually
  • a user management system that could handle many users reviewing the AI descriptions
  • manual metadata input
  • batch application of the AI prompt across large sets of hundreds or even thousands of photos
  • ingest of large data sets manually or using our public API
  • export of the data as a CSV or JSON file.

While in future projects we may wish to train an AI tool with our own data sets and create our own large language model (LLM), as we have with our handwriting transcription project, for this pilot our needs were simpler.

Labellerr is a bit clunky but overall it suited our needs and matched the user story we had co-designed with our volunteers.

In addition, the Tensor Matics development team were keen to learn about the needs of archives that manage image collections, and agreed to add new functionality to the interface: large-scale batch application of the AI prompt, and display of existing metadata fields from the catalogue data to help inform the AI-generated result.

 

Why are we using Google Gemini Multi-Modal Pro? 

 

We also selected Labellerr because of the multi-modal generative AI model it uses, Google Gemini Pro, which is currently considered comparable in output quality to other generative AI tools like ChatGPT.

It is part of a suite of products produced by Google DeepMind and comes with considerable documentation articulating its ethical and quality output standards, although details of the training data set are scant.

 

 

Sample image from VPRS 8363 black and white image of port
Sample aerial photograph from the Harbour Trust collection within the Labellerr workflow application, with additional metadata displayed to the right of the image.

 

Large scale ingest

 

The ingest process can run via a cloud account, via manual transfer, or by using our public API to pull data directly from our collection management system; this was our preferred method.

To enable an API ingest process you must work with the Labellerr development team to supply the correct query string and preferred metadata fields. We chose five additional fields, such as existing description, titles and date, out of more than a hundred available in our archival management system.

Any contextual information related to the image will improve the keywords and description the AI generates.
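
As a rough illustration of the API-driven ingest, the sketch below pulls existing catalogue metadata for a series before handing it to the annotation tool. The endpoint, parameters and field names are invented for the example; they are not PROV's or Labellerr's documented interfaces.

```python
# Hedged sketch of pulling catalogue metadata over a public API before ingest.
# The endpoint and parameter names below are hypothetical.
import requests

ENDPOINT = "https://api.example-archive.gov.au/search"  # hypothetical URL

resp = requests.get(ENDPOINT, params={
    "series": "VPRS 8363",   # the Harbour Trust negatives
    "fields": "title,description,date,negative_number,location",
    "rows": 200,             # one work set at a time
})
resp.raise_for_status()

for record in resp.json().get("records", []):
    print(record.get("negative_number"), record.get("title"))
```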

 

 

Ingest process for data sets using Labellerr application
You can ingest data using a variety of import methods, including your own public API.

 

 

Designing the workflow and curating the prompt

After importing your data set you can design your labelling workspace and curate your ideal prompt; the generated metadata can then be exported as a CSV file (among other export formats). Our data sets are roughly 4,000 images per project.

We divided our project into sub-sets of 200 images and, while this is small, it has turned out to be helpful because we can modify the prompt to best suit the images within each set.

Multiple requests per prompt 

The prompt we landed on to deliver an accurate result was trialled over a month (the subject classifications were modified per set):

  1. Transcribe all the headings written on the photograph and the words underneath each heading. For the date use this format YEAR-MONTH-DAY e.g. 1964-9-19.
  2. Then, find the classification code written on the photograph, match it to a code from the subject listings below, and identify the subject, for example:

    Cranes, Floating C.4a 

    Cranes, Container C.4b

  3. Then, give detailed keywords about the photo taken in Melbourne.
  4. Then, give a short description maximum 40 words.
  5. Always write using the en-GB variety of English.
  6. Write the response in a text format. 
  7. Format the response as key-value pairs with line breaks between each pair.
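
Because steps 6 and 7 ask for plain-text key-value pairs, the response is easy to parse programmatically before review and export. Below is a small sketch of what that parsing might look like; the helper name and sample text are ours, not part of the project's tooling.

```python
# Parse a "Key: Value" per-line response, as requested in prompt steps 6-7,
# into a dict ready for CSV export. parse_response is a hypothetical helper.
def parse_response(text: str) -> dict:
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

sample = """Neg. No: 71077
Classification Code: C.4b
Subject: Cranes, Container
Description: Black and white photograph of a container crane."""

print(parse_response(sample)["Subject"])  # -> Cranes, Container
```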

 

Sample curated prompt and designing Labellerr workflow page
This is the annotation page where a user can add metadata fields for manual input and include a field for the AI description.

 

Sample result 

The AI-generated output proved good at delivering keywords and searchable descriptions, as well as transcribing the photographer's hand-scribbled notes, such as the date, location or ship name.


 

Sample image from series VPRS 8363 with an AI-generated description shown alongside the original description.

 

An unreviewed sample result 

Class: C. 4 B Neg. No: 71077 Date: Location: I WEST SWANSON Notes: FLINDERS BAY Classification Code: C.4b Subject: Cranes, Container Keywords: Container Crane, Container Ship, Flinders Bay, West Swanson Dock, Melbourne, Port, Shipping, Maritime, Cargo Description: Black and white photograph of a container crane loading containers onto the container ship "Flinders Bay" at West Swanson Dock in Melbourne.


Hallucinatory description and learnings 

 

The natural concern on this project is the creation of hallucinatory descriptions: wildly inaccurate descriptions generated by AI. So far the keywords and the transcription of the handwritten data have been accurate overall, apart from the odd misinterpretation of cursive writing.

However, in one set a significant number of images (1 in 10) were described with the wrong city. The AI-generated content assumed a location based on the ship; for example, if the ship was the Tokyo Bay, the model presumed the port was in Japan.

To fix this error we modified the prompt to include the location 'Melbourne' and reapplied it across the same set of photos, which improved the descriptions.

This simple location data also improved the identification of other correct locations in the photos and along the coastline.

 

Human review 

 

The human review process allows up to ten volunteers to work on a single project, either as a labeller or a reviewer, and the images appear one by one for their review.

One of the other key learnings is to give volunteers or staff very detailed instructions on the ideal format and on which content can be ignored, even if it's wrong.

AI-generated content will produce inconsistent data: for example, sometimes calling Port Phillip a bay and occasionally a harbour, or transcribing the headings on the negative index in full or abbreviated form, e.g. 'Class.' or 'Classification'.

For our purposes, these inconsistencies aren't as important as making sure the underlying description is accurate. The metadata field headings can be cleaned up quickly post review (see the sketch below), but to volunteers, who are used to structured data sets, these details became lengthy discussion points.
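
That post-review clean-up can be as simple as mapping the AI's heading variants onto canonical field names. The mapping below is illustrative, built only from the variants mentioned in this article rather than a complete list.

```python
# Sketch of post-review clean-up: normalise inconsistent AI heading variants
# to canonical field names. The variant list is illustrative, not exhaustive.
HEADING_MAP = {
    "class": "Classification Code",
    "class.": "Classification Code",
    "classification": "Classification Code",
    "classification code": "Classification Code",
    "neg. no": "Negative Number",
    "neg no": "Negative Number",
}

def normalise(fields: dict) -> dict:
    return {HEADING_MAP.get(k.strip().lower(), k): v for k, v in fields.items()}

print(normalise({"Class.": "C.4b", "Neg. No": "71077"}))
# -> {'Classification Code': 'C.4b', 'Negative Number': '71077'}
```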

Sample AI generated description reviewed by volunteers ready for export to the catalogue

 

Business case 

Two volunteers are reviewing roughly 30 sets of photographs (100-200 images per set). Each volunteer reviews and corrects around 100 photos per sitting. Batch application of the AI tool to generate the descriptions takes around 12 minutes per 200 photos, so the AI processing for all 4,000 images amounts to only about four hours. An estimated return on investment might therefore be around 10 days to describe 4,000 images, realistically longer on volunteer time (which is broken up with coffee and conversation).

The licence to use Labellerr is annual, at over US$5,000 a year, with an additional token-based fee for the use of the AI models; Labellerr estimated this at US$2,500 for 75,000 images.

Overall, we had to balance the cost of developing a product ourselves, with user management, API integration and multiple workflows, against an off-the-shelf product we could update with our own functionality, fit for purpose for archival images.

For any further inquiries about this project please don't hesitate to contact Head of Audience Engagement and project manager Kate Follington or volunteer manager Amanda Nichols via the media inbox, media@prov.vic.gov.au 

Get started with Labellerr here.

 

Material in the Public Record Office Victoria archival collection contains words and descriptions that reflect attitudes and government policies at different times which may be insensitive and upsetting.

Aboriginal and Torres Strait Islander Peoples should be aware the collection and website may contain images, voices and names of deceased persons.

PROV provides advice to researchers wishing to access, publish or re-use records about Aboriginal Peoples.