Public broadcasters face a great responsibility in providing services to a wide and diverse audience. One of the cornerstones of public services is accessibility. While accessibility is often seen as services to people with disabilities (e.g. visual or hearing impairments), it also includes linguistic accessibility, which means content needs to be provided in multiple languages (Hirvonen and Kinnunen 2021). However, linguistic accessibility is limited by necessity, as public broadcasters do not have the resources to make all their content multilingual. One possible way to approach linguistic accessibility in media could be to use technology like automatic speech recognition (ASR) and machine translation (MT) to provide automatic interlingual subtitles for audiovisual content.
Automatic subtitle translation has been explored since the 1990s (e.g. Popowich et al. 2000; Piperidis et al. 2004; Volk et al. 2010), but has received increased research interest in recent years following developments in neural MT and automatic speech translation. Much of the early work focused on machine translating intralingual subtitles (or closed captions) created by humans, although some initiatives like the MUSA project combined MT with ASR (Piperidis et al. 2004). While text-based MT of intralingual subtitles has also been examined in more recent work (e.g. Bywood et al. 2017; Matusov et al. 2019; Koponen et al. 2020a, 2020b), research has increasingly turned towards automatic speech translation and subtitling (e.g. Di Gangi et al. 2019; Karakanta et al. 2020). Some have suggested that subtitles are particularly suited for MT because they generally consist of short sentences and relatively simple structures (Popowich et al. 2000; Volk et al. 2010). Volk et al. (2010) present a successful example of implementing a system for machine translating subtitles in the workflows of a subtitling company. However, the closely related language pairs (Swedish into Norwegian and Danish) made the situation particularly favourable (Volk et al. 2010: 56; see also Bywood et al. 2017: 499-500 for observations regarding different language pairs).
Conversely, some subtitling features may be challenging for automation. Subtitles are intended as a written representation of speech, and therefore often contain idiomatic and colloquial expressions, grammatical irregularities, hesitations, interruptions and ellipsis, all of which have been found problematic for MT (see Popowich et al. 2000; Burchardt et al. 2016; Bywood et al. 2017). Subtitles also involve different genres covering a wide range of domains, and data from one genre may not be directly applicable to another (Burchardt et al. 2016). For example, automatic subtitle translation may struggle more with unscripted broadcasts than with scripted dialogue (Bywood et al. 2017), or with content involving creative use of language, such as comedies (Volk et al. 2010; Matusov et al. 2019). Features like unscripted speech also pose difficulties for ASR when used to create intralingual subtitles (e.g. Vitikainen and Koponen 2021), which can be another source of accuracy errors in the final translated subtitles.
Difficulties are also caused by technical restrictions regarding the number of characters displayed on screen and the display speed. For example, the Finnish guidelines specify that subtitles should contain one or two lines, and the subtitle speed should be no greater than 12-14 characters per second (Käännöstekstitysten laatusuositukset 2020). This generally requires condensing the speech, which verbatim transcriptions generated by ASR cannot provide (Karakanta et al. 2020: 210). Segmenting subtitles along syntactic and semantic boundaries to support readability is also challenging to automate (Matusov et al. 2019: 85). Based on their comparison of different approaches, Karakanta et al. (2020) propose that end-to-end systems, which generate interlingual subtitles directly from the speech rather than through an intermediate ASR transcript, have promise for producing subtitles that meet the guidelines.
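To make these constraints concrete, the following minimal sketch checks a single subtitle against the line-count and display-speed limits mentioned above. It is an illustration only: the `Subtitle` class, the `check_subtitle` function, the use of 14 characters per second as the upper limit, and the decision to count spaces as characters are all assumptions made for this example, not details taken from the Finnish guidelines or from any system discussed here.

```python
from dataclasses import dataclass

# Assumed limits for illustration: at most two lines, and the upper end
# of the 12-14 characters-per-second range cited in the text.
MAX_LINES = 2
MAX_CPS = 14.0


@dataclass
class Subtitle:
    """Hypothetical container for one timed subtitle."""
    start: float       # display start time in seconds
    end: float         # display end time in seconds
    lines: list[str]   # text lines shown on screen


def chars_per_second(sub: Subtitle) -> float:
    """Display speed: total characters divided by display duration.

    Spaces are counted as characters here, which is an assumption.
    """
    duration = sub.end - sub.start
    if duration <= 0:
        raise ValueError("subtitle must have a positive duration")
    return sum(len(line) for line in sub.lines) / duration


def check_subtitle(sub: Subtitle,
                   max_lines: int = MAX_LINES,
                   max_cps: float = MAX_CPS) -> list[str]:
    """Return a list of human-readable problems (empty if none)."""
    problems = []
    if not 1 <= len(sub.lines) <= max_lines:
        problems.append(f"expected 1-{max_lines} lines, got {len(sub.lines)}")
    cps = chars_per_second(sub)
    if cps > max_cps:
        problems.append(f"display speed {cps:.1f} cps exceeds {max_cps} cps")
    return problems
```

A verbatim ASR transcript would typically fail the speed check whenever people speak quickly, which is why condensing the speech is usually required.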
Both the focus group discussions and the questionnaire were structured around short video clips of Finnish-language news and current affairs programmes that were automatically subtitled in English. The first focus group was shown one five-minute clip from the beginning of a current affairs programme. In the questionnaire, the respondents were shown two three-minute clips: a news broadcast segment and a shortened version of the current affairs clip from the first focus group. The second focus group was shown the same two clips as were used in the questionnaire, although with slightly different subtitles (see below).
An in-depth discussion of the interlingual subtitle generation pipeline is not within the scope of this article; rather, we provide a brief top-level view of the approach. First, we used the ASR system by the project partner Lingsoft combined with automatic timecoding and segmentation by the project partner Limecraft to automatically generate intralingual Finnish subtitles, which were then pre-processed to reconstruct sentences, standardise punctuation and casing, and fix some ASR features (e.g. abbreviations) which caused recurring translation problems. For MT into English, we used a subtitle translation model based on the transformer implementation of Marian and trained on all available Finnish-to-English data from OPUS (excluding a small development set sampled from OpenSubtitles). Finally, machine-translated sentences were post-processed and fit back into the original timed segments. For details, see Laaksonen et al. (2021).
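The final step, fitting a translated sentence back into the original timed segments, can be sketched as follows. This is not the method used in the project, which is described in Laaksonen et al. (2021). It is a deliberately simple stand-in that splits the translated words across the segments in proportion to the length of the source text in each segment, leaving the timecodes untouched. The `Segment` class and the `fit_translation` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical timed subtitle segment."""
    start: float  # seconds
    end: float    # seconds
    text: str


def fit_translation(segments: list[Segment], translation: str) -> list[Segment]:
    """Distribute a translated sentence over the source segments.

    Assumption: each segment receives a share of the translated words
    proportional to its share of the source characters. Timings are
    copied unchanged from the source segments.
    """
    words = translation.split()
    total_chars = sum(len(seg.text) for seg in segments)
    if total_chars == 0:
        raise ValueError("source segments contain no text")

    fitted = []
    consumed = 0      # number of translated words already placed
    seen_chars = 0    # source characters covered so far
    for i, seg in enumerate(segments):
        seen_chars += len(seg.text)
        if i == len(segments) - 1:
            boundary = len(words)  # last segment takes the remainder
        else:
            boundary = round(len(words) * seen_chars / total_chars)
        fitted.append(Segment(seg.start, seg.end,
                              " ".join(words[consumed:boundary])))
        consumed = boundary
    return fitted
```

A purely proportional split like this ignores syntactic and semantic boundaries, so it can break a phrase across two segments.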
For the first focus group and the questionnaire, subtitles were generated without any human input. It was apparent from the quality of the subtitles that they were not professionally created. The subtitles were frequently out of synchrony with the spoken dialogue, and their segmentation did not follow sentence structures, making them challenging to follow and creating a sense of rush, even though the duration and display speed of the subtitles were largely within the norms for professional subtitles. In addition, some of the language was awkward and confusing, and there were occasional mistranslations or other distortions of meaning. Nevertheless, the subtitles provided the gist of the clips, and it was possible to follow the narrative with their help.
In the following analysis, we first discuss responses to questions regarding comprehension and appreciation of the automatic interlingual subtitles in both the focus groups and the questionnaire. Then, we explore how the cognitive load caused by reading these subtitles was addressed by the study participants. Finally, we discuss the acceptability of the subtitles. On the one hand, we describe potential use contexts suggested by the study participants and reasons why they might use automatic subtitles. On the other hand, we examine potential obstacles that may limit the usefulness of automatic subtitles.
The questionnaire respondents were asked three Likert-scale questions concerning their appreciation of the subtitles for each clip. The current affairs video (questions 12-14) received mean scores of 2.7 (median 3) for pleasantness, 3.6 (median 4) for usefulness and 3.0 (median 3) for accuracy. The news video (questions 21-23) was rated more positively, with mean scores of 3.5 (median 4) for pleasantness, 3.8 (median 4) for usefulness, and 3.6 (median 4) for accuracy. In other words, the experience of watching the clips, the current affairs clip in particular, was not very enjoyable, but the respondents rated the subtitles as reasonably useful and accurate. Responses to the open questions reinforce the sense that the viewing experience was uncomfortable. Some responses stated that automated subtitles could be useful, but a larger number of comments expressed misgivings about the quality and the technology, and many suggested that subtitles always need human involvement, such as post-editing. The following improvement suggestion demonstrates some of these negative feelings:
One recurrent theme arising from the focus group and questionnaire data is that viewers assess the cognitive load caused by automated subtitles as higher than that of professionally made subtitles. Focus group participants frequently mentioned that viewing the video clips was demanding, because it required concentration, and because it was challenging to divide attention between the subtitles and the rest of the programme. One participant in the second focus group described this process as follows:
Although many study participants mentioned potential uses for automatic interlingual subtitles, many also pointed out that the quality was not yet sufficient. In addition, many focus group participants simply stated that they would prefer subtitles that have been created or at least post-edited or checked by humans. Most questionnaire respondents also indicated a preference for human-made subtitles (question 28), as seen in Figure 1.
The cognitive load caused by the subtitles emerged as another obstacle for their use. Research on subtitle synchronisation has found that breaking the synchrony (delayed, extended or shortened display time) increases cognitive load of subtitle processing even if viewers are not consciously aware of the problems (Lång et al. 2013). In our study, participants appeared very aware of this, commenting on synchrony issues and the ensuing mental effort frequently in both the focus groups and the questionnaire. It should be noted that their observations are based on short clips: trying to follow a long programme would certainly be even more mentally taxing. Improving subtitle timing as well as other factors affecting cognitive load is therefore an important direction for future research.
The viewer reactions were somewhat ambivalent, even contradictory, and did not suggest an easy answer to how well automatic interlingual subtitles could facilitate linguistic accessibility. While viewers expressed some conditional acceptance, additional work is needed to reach genuinely usable levels of quality. For any scenario where the intention is to provide subtitles in fully automated form, further work should prioritise technical solutions that affect the cognitive load caused by subtitles, such as reducing the amount of text, improving segmentation and synchronisation, and smoothing out target language readability issues. Ensuring the accuracy of the translation is of course vital, but even a factually accurate translation would not be fully acceptable if the cognitive load remains high. However, reaching suitably high quality for all use purposes may not be feasible through automation alone, and the need for post-editing must be considered carefully.