Julia Ive – Research Associate @ University of Sheffield
External Knowledge in Machine Translation: Case Studies of Multimodal Machine Translation and Automatic Post-Editing (presentation)
In this talk, we will look at two representative cases of using external information in Machine Translation (MT): (1) Multimodal MT (MMT), where external context comes from an additional modality, usually images, and (2) Automatic Post-Editing (APE), where the system learns to correct MT output using human post-edits. In both cases the additional information is helpful only in specific situations, e.g., ambiguous source words for MMT and terminological choices for APE. Hence we will talk about two deep learning approaches that use this information selectively: (1) an approach based on deliberation networks for MMT that builds on the strengths of a text-only model and refines its translations only when needed, based on left and right target-side context as well as visual information, and (2) a dual-source transformer-based APE architecture with a copying mechanism, where the network can generate new words or copy words from either the source-language text or the original machine translation. This approach helps to avoid overcorrecting high-quality machine translations and to preserve terminology (or jargon) across languages.
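To make the "generate or copy" idea concrete, here is a minimal sketch of one common way such a choice can be expressed: a pointer-generator-style mixture of a vocabulary distribution and two attention distributions. The abstract does not say how the talk's architecture parameterizes copying, so the function name `mix_generate_and_copy`, the three-way gate, and the toy numbers are all assumptions made for illustration.

```python
def mix_generate_and_copy(gate, p_vocab, src_tokens, src_attn, mt_tokens, mt_attn):
    """Combine generation and copying into one distribution over output words.

    gate      -- (g_gen, g_src, g_mt), assumed non-negative and summing to 1
    p_vocab   -- dict mapping vocabulary words to generation probabilities
    src_attn  -- attention weights over src_tokens (assumed to sum to 1)
    mt_attn   -- attention weights over mt_tokens (assumed to sum to 1)
    """
    g_gen, g_src, g_mt = gate
    # Start with the scaled "generate a new word" distribution.
    out = {word: g_gen * p for word, p in p_vocab.items()}
    # Add probability mass for copying each source-language token.
    for token, weight in zip(src_tokens, src_attn):
        out[token] = out.get(token, 0.0) + g_src * weight
    # Add probability mass for copying each token of the original MT output.
    for token, weight in zip(mt_tokens, mt_attn):
        out[token] = out.get(token, 0.0) + g_mt * weight
    return out


# Toy example: the gate strongly favours copying from the MT output.
dist = mix_generate_and_copy(
    gate=(0.2, 0.1, 0.7),
    p_vocab={"the": 0.5, "house": 0.5},
    src_tokens=["das", "Haus"],
    src_attn=[0.5, 0.5],
    mt_tokens=["the", "home"],
    mt_attn=[0.1, 0.9],
)
best = max(dist, key=dist.get)
```

In this toy setting, a gate that puts most of its weight on the MT output leaves a good translation largely untouched, which is the intuition behind avoiding overcorrection.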
David Filip – Research Fellow @ ADAPT
Inline Markup (presentation)
There seems to be a divide between the content management (including translation and localisation) and publishing industry world and the NLP research world. In the simplest possible terms, the industry does worry about inline markup, while the typical NLP researcher, including most of the MT folks, doesn’t. What is bitext and what is parallel data? How are these handled in the different worlds? What does it mean to “clean” parallel data? Unfortunately, almost invariably, cleaning parallel data in the NLP world means just killing and throwing away all inline markup. But if someone cared to specify a format with inline markup, chances are that there is critical data in that markup. It can be terminology or entity annotations, it can be links to useful resources or references, it can be a thousand things. If you don’t care about it in your process, there are better ways to handle it than throwing it away (and losing the capability to reintroduce it). What about masking it, remembering the inline positions, and being able to reintroduce it when it turns out that someone does care? Too wild an idea? Let’s try and discuss 😉
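The masking idea can be sketched in a few lines. This is only an illustration under stated assumptions: inline markup is taken to be any XML-like tag, the `__TAGn__` placeholder format and the function names are hypothetical, and the text is assumed not to already contain such placeholders. Real formats with inline markup have richer inline models than this regex captures.

```python
import re

# Assumption: any XML/HTML-like tag counts as inline markup.
TAG_RE = re.compile(r"<[^<>]+>")
# Hypothetical placeholder format; assumed not to occur in the original text.
PLACEHOLDER_RE = re.compile(r"__TAG(\d+)__")


def mask_inline_markup(segment):
    """Replace each inline tag with a numbered placeholder and remember the tags."""
    tags = []

    def stash(match):
        tags.append(match.group(0))
        return f"__TAG{len(tags) - 1}__"

    return TAG_RE.sub(stash, segment), tags


def unmask_inline_markup(masked, tags):
    """Reintroduce the remembered tags wherever their placeholders survived."""
    return PLACEHOLDER_RE.sub(lambda m: tags[int(m.group(1))], masked)


segment = 'Press <b>Save</b> to store the <term id="t1">bitext</term>.'
masked, tags = mask_inline_markup(segment)
restored = unmask_inline_markup(masked, tags)
```

Because the placeholders travel with the text, a downstream process that ignores markup can still hand back a segment from which the original tags are recoverable, provided it keeps the placeholders in place.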
On the Integration of (extra-) Linguistic Information in Neural Machine Translation: A Case Study of Gender (presentation)
Establishing the discrepancies between the strengths of statistical approaches to machine translation and the way humans translate has been the starting point of our research. During this talk, we will cover some topics related to the integration of specific (extra-) linguistic features into Neural Machine Translation, with a focus on gender-related problems.
Our work addresses research questions that revolve around the complex relationship between linguistics and machine translation in general. Taking linguistic theory as a starting point, we examine to what extent that theory is reflected in current systems. We identify linguistic information that automatic translation systems lack but need in order to produce more accurate translations, and we integrate additional features into the existing pipelines. We use the lack of necessary gender information in translation as a starting point but believe the issues touched upon extend to many other translational difficulties. While discussing the gender-related issues, we identify overgeneralization, or ‘algorithmic bias’, as a potential drawback of neural machine translation and link it to more general issues related to the implicitness/explicitness of different languages.