I've been thinking about principles for documenting where AI was used to create or enhance metadata records. Ideally, one could also document the tool and version used: that's always helpful context, and it could also flag where re-doing automatic text recognition (ATR) with newer tools might make a big difference.
It's become timely in conversations with the oral history team and the Digital Curator for ATR at the BL. I'd posted about it in various places, including the jiscmail 'AI in CH' list (AI in cultural heritage, not to be confused with AI4LAM!). It's hard to link to threads on jiscmail, so, as most of the posts are mine and the other is a helpful post from the always-helpful Stephen McConnachie, I've copied them here to make them shareable:
Mia Ridge, 2025-11-13: Documenting automatic text transcription tools in catalogues/metadata/displays?
I have a question from our Digital Curator for Automatic Text Recognition that we're hoping others working with digitised texts have some thoughts on.
Do you know of any standards or processes for providing information about the use of AI/ML tools to create transcriptions for metadata for collections management or public interfaces? Or, given the wider focus of this group, the use of any other AI/ML tools to create or enhance metadata records?
I know that some libraries (like the BnF's Gallica) share OCR error rates in public interfaces, and others include information about the software/software version/date of processing in ALTO files, but is everyone doing this in a slightly bespoke way, or are any shared conventions or standards emerging?
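For anyone who hasn't poked inside those ALTO files: the convention is that the software name, version and processing date sit in the Description block at the top of the file. A minimal Python sketch of reading them back out, assuming an ALTO v4 namespace and an illustrative filename:

```python
# Minimal sketch: pull ATR/OCR provenance out of an ALTO file's Description
# block. Element names follow the ALTO v4 schema; the namespace version and
# the filename below are assumptions for illustration.
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

def atr_provenance(alto_path):
    """Return the software name, version and processing date recorded in an ALTO file."""
    root = ET.parse(alto_path).getroot()
    step = root.find("alto:Description/alto:OCRProcessing/alto:ocrProcessingStep", ALTO_NS)
    if step is None:
        return None
    def text(xpath):
        el = step.find(xpath, ALTO_NS)
        return el.text if el is not None else None
    return {
        "software": text("alto:processingSoftware/alto:softwareName"),
        "version": text("alto:processingSoftware/alto:softwareVersion"),
        "date": text("alto:processingDateTime"),
    }

print(atr_provenance("page_0001.alto.xml"))
```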
I've had related conversations with the British Library's Metadata Standards team about recording the use of AI/ML to enhance metadata in MARC fields for printed heritage items, but many of the items we're looking at might be newspapers and periodicals that are catalogued differently, as well as sound/AV and manuscript/archive files.
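On the MARC side, one existing hook is field 883 (Metadata Provenance, originally defined for machine-generated metadata), which can carry a generation process, a confidence value, a date and an agency. A rough sketch of what such a field might look like; the process name and values are purely illustrative, not an agreed convention:

```python
# Rough sketch of a MARC 21 field 883 (Metadata Provenance) string.
# First indicator 0 = fully machine-generated; subfields per MARC 21:
# $a generation process, $c confidence value, $d generation date, $q agency.
# The process name, confidence and agency code below are illustrative.
def provenance_883(process, confidence, date, agency):
    return f"883 0# $a {process} $c {confidence} $d {date} $q {agency}"

print(provenance_883("entity linking (tool X v1.2)", "0.97", "20250212", "Uk"))
```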
Stephen McConnachie, 2025-11-13 Re: Documenting automatic text transcription tools in catalogues/metadata/displays?
I'm not aware of any emerging or established standards for documenting and contextualising text extraction via OCR or similar – would certainly be keen to find out though, so thanks very much for raising it.
One thing that I am aware of, and aiming to implement in our speech-to-text subtitle output for our a/v collection, is the FADGI recommendations for VTT file creation: https://www.digitizationguidelines.gov/guidelines/FADGI_WebVTT_embed_guidelines_v0.1_2024-04-15.pdf
Some of its modelling might be portable to text-extraction metadata capture, I guess.
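WebVTT's NOTE comment blocks are one place that kind of information can sit, before the cues, so it travels with the subtitle file itself. A minimal Python sketch; the key names below are my own illustration, and the FADGI document above specifies its own fields and layout:

```python
# Minimal sketch: write a WebVTT file with provenance carried in NOTE blocks
# before the cues. Tool/model names, dates and the output filename are
# illustrative; see the FADGI guidelines for their recommended fields.
provenance = {
    "Transcription tool": "example-stt v2.1",
    "Model": "example-model-large",
    "Date processed": "2025-02-12",
    "Workflow": "automatic speech-to-text, unreviewed",
}

lines = ["WEBVTT", ""]
for key, value in provenance.items():
    lines += [f"NOTE {key}: {value}", ""]
lines += ["00:00:00.000 --> 00:00:04.000", "Example cue text from the transcript.", ""]

with open("interview_0001.vtt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))
```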
Mia Ridge, 2026-01-09 Re: Documenting automatic text transcription tools in catalogues/metadata/displays?
On a slightly related note, I realised that Axiell have been generating 'disclaimers' when writing back records updated with AI / named entity recognition with Wikidata, as described at https://help.collections.axiell.com/en/Topics/AI%20and%20Collections.htm
'A disclaimer is added to every keyword extracted in this way, indicating that AI was used in the matching process: the disclaimer identifies the origin of the keyword, the field and character position it was extracted from, its confidence score, and the date and time of extraction.'
The sample text says: 'This was added using AI-based entity linking. Entity text between character 66 and 71 in the 'description' field, occurrence 1. Entity label: PERSON. Confidence score: 97.85%. Date/time of extraction job creation: 2025-02-12T08:46:25. Entity UUID: a63853c8-dca4-4dc9-9ec0-07eb36eb8843.'
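That kind of per-entity statement is straightforward to generate from typical NER/entity-linking output. A minimal Python sketch that produces a similar string; the function and its parameters are my own illustration, not Axiell's implementation:

```python
# Minimal sketch: build an Axiell-style per-entity 'disclaimer' from typical
# entity-linking output. The function name and parameters are illustrative.
from datetime import datetime, timezone
from uuid import uuid4

def entity_disclaimer(field, occurrence, start, end, label, confidence, entity_uuid=None):
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return (
        "This was added using AI-based entity linking. "
        f"Entity text between character {start} and {end} in the '{field}' field, "
        f"occurrence {occurrence}. Entity label: {label}. "
        f"Confidence score: {confidence:.2%}. "
        f"Date/time of extraction job creation: {ts}. "
        f"Entity UUID: {entity_uuid or uuid4()}."
    )

print(entity_disclaimer("description", 1, 66, 71, "PERSON", 0.9785))
```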
Mia Ridge, 2026-01-22 Re: Documenting automatic text transcription tools in catalogues/metadata/displays?
Adding to this thread in case people are interested…
Owen King got in touch via the AI4LAM Slack (join link) to share an update from the AI4LAM Speech-to-Text WG: 'we have been trying to find a standard way to record metadata about the provenance, especially the AI tooling used, of our audio transcripts. It seems to me that almost all of it applies to ATR and HTR transcripts as well. This is our current draft. Is this of any help?'
And while I'm here, is anyone working on the ethics of oral history recordings online and AI transcription? I had an interesting conversation with a European university librarian who reported that their oral history recordings were behind a login, but that browsers were now offering to transcribe the recordings, raising new questions about data scraping etc.