Blogging the thesis entry 3: How are people adapting machine translation?

This post is part of a series summarizing my PhD thesis, “Domain adaptation for neural machine translation”. Here I review existing ways to adapt translation systems to new domains.

Danielle Saunders
6 min read · Mar 3, 2021


In the previous post in this series, I talked about how to put together a machine translation system, from start to finish. In brief:

  • Prepare your training data — your example sentences and translations.
  • Choose a neural network structure that can map from a sentence in one language to a sentence in another.
  • Decide on a training method for that neural network.
  • Use the trained model to perform inference and translate new sentences.

What if we already have a neural machine translation model, and we want to adapt it to a new domain? Say, from translating scientific papers to translating Medium articles.

For each one of those elements — data preparation, neural network structure, training scheme, and inference scheme — researchers have explored many options for effective adaptation. I review these options in my second literature review chapter, and in this post.

Data preparation

Maybe the most important element of adapting a machine translation model is choosing the adaptation data. If you don’t have any examples of Medium article translations for the scientific-paper translator to learn from, there’s little hope of adapting it to translate Medium articles.

The possible data scenarios are as follows, in approximate order of most to least straightforward:

  • Do you have plenty of reliable “in-domain” examples? These would be sentences with good translations that you know are from your target domain — Medium articles in this example. This is the most convenient situation.
  • Maybe you don’t have reliable examples, but you do have lots of noisy in-domain examples. “Noise” could mean that some example sentence pairs contain features you don’t want to translate, like numbers or URLs, or that the two sides aren’t actually good translations of each other. You can use data cleaning approaches to make good use of this noisy data.
  • What if you only have very few in-domain examples? You probably trained your original model with lots of translation examples from various domains. You can use data extraction techniques on this much bigger set to find sentences whose vocabulary overlaps with the few in-domain examples (see the sketch just after this list). Those might be good enough to improve adaptation.
  • Say you only have a handful of in-domain examples, and either you’re lacking “bilingual” training data (sentences and translations), or what you have is definitely not in-domain. It’s often much easier to access monolingual data: sentences in one of the languages you’re translating between. Extract some in-domain monolingual sentences, then use machine translation to back-translate or forward-translate them, giving synthetic sentences in the other language. These won’t be perfect, but they may be sufficient to adapt the model.
  • What if you have absolutely no data that is right for your in-domain purpose? If you know exactly what you want and you only need a small number of examples, you can construct fully synthetic datasets to cover the domain, for example by using dictionary lookup to fill in simple templates. These are limited in application, but can be useful for pinpointing problem scenarios, like translating specific terminology.
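
To make the data extraction idea above concrete, here is a minimal sketch of vocabulary-overlap selection: score each sentence in a large general-domain corpus by how much of its vocabulary appears in a small in-domain sample, and keep the high-scoring ones. The file names, whitespace tokenization, and threshold are illustrative choices, not taken from the thesis.

```python
# Minimal sketch of vocabulary-overlap data selection. File names, the
# whitespace tokenization, and the 0.5 threshold are illustrative choices.

def vocab(path):
    """Collect the set of lowercased whitespace tokens in a text file."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            words.update(line.lower().split())
    return words

def overlap_score(sentence, in_domain_vocab):
    """Fraction of a sentence's tokens that appear in the in-domain vocabulary."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(t in in_domain_vocab for t in tokens) / len(tokens)

in_domain_vocab = vocab("medium_sample.en")   # the few in-domain examples

with open("big_general_corpus.en", encoding="utf-8") as src, \
     open("selected_for_adaptation.en", "w", encoding="utf-8") as out:
    for line in src:
        if overlap_score(line, in_domain_vocab) > 0.5:  # keep high-overlap sentences
            out.write(line)
```

In practice you would keep the matching target sentences as well, so that the selected data stays bilingual.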

One more thing to note is that even if you’re in the best-case scenario — lots of trustworthy examples of data from your target domain — you may still want to explore other options on this list. Why? Because having more data is usually good for quality, and synthetic data in particular can be carefully constructed to address specific translation problems, like gender bias.
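
To make the template idea concrete, here is a minimal sketch of fully synthetic data built by dictionary lookup: fill the same slot in a source template and a target template with matching glossary entries. The glossary, templates, and language pair are invented for illustration; real templates would also need to handle things like articles and agreement in the target language.

```python
# Minimal sketch of fully synthetic data from templates plus dictionary lookup.
# The glossary entries and templates are invented; a real template set would
# also need to handle morphology, articles, and agreement in the target language.

glossary = {                     # source term -> target term (English -> German)
    "attention": "Aufmerksamkeit",
    "probability": "Wahrscheinlichkeit",
    "grammar": "Grammatik",
}

src_template = "The paper discusses {term}."
tgt_template = "Der Artikel behandelt {term}."

synthetic_pairs = [
    (src_template.format(term=src), tgt_template.format(term=tgt))
    for src, tgt in glossary.items()
]

for src, tgt in synthetic_pairs:
    print(src, "|||", tgt)
```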

Neural network structure

At this point you have a choice to make about your neural network: are you going to treat all domains the same, or try to handle them differently within your model?

If you want to handle each domain differently, that might involve changing the neural network itself. For example, you might need to:

  • Have the network read in a domain label for each sentence (sketched just after this list)
  • Define a discriminator subnetwork that can learn to classify each input into the right domain
  • Use a different set of parameters when adapting to in-domain sentences
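
One common way to realise the first option, reading in a domain label, is simply to prepend a reserved domain tag to each source sentence so the network learns to condition on it. Below is a generic sketch of that trick, not necessarily the exact setup of any particular paper; the tags and example sentences are invented.

```python
# Minimal sketch of the domain-label idea: prepend a reserved tag token to each
# source sentence so the network can condition on the domain. Tags and example
# sentences are invented for illustration.

def tag_source(sentence, domain):
    """Prepend a domain token such as <science> or <medium> to the source text."""
    return f"<{domain}> {sentence}"

training_examples = [
    ("Proteins fold into complex three-dimensional structures.", "science"),
    ("Thanks for reading, and feel free to leave a comment!", "medium"),
]

for sentence, domain in training_examples:
    print(tag_source(sentence, domain))

# At test time you pick the tag that best matches the text you want to translate.
```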

It’s worth bearing in mind that these approaches all implicitly assume that domains are distinct. In practice domains may be related, and learning to translate one might help with others.

It could also be that a new sentence to translate doesn’t fall exactly into one of the categories previously seen, in which case trying to label it or use the “right” part of the network might cause problems.

My work focuses on cases where domains of language are overlapping and difficult to pinpoint, so architectural changes are not my focus. However, they are still a growing area of research.

Training scheme

Once the data is chosen and the neural network is defined, the adaptation process itself still involves some design choices.

The main risk when adapting a neural network to new data is known as catastrophic forgetting. If a neural network’s parameters are adjusted to fit sentences that are too different from what it has seen before, it can simply “forget” its previous abilities.

There are a few options here:

  • Keep all the parameters close to their original values (sketched just after this list). They can vary, but not too much. Intuitively, if the network parameters stay similar, the network shouldn’t behave too differently.
  • Keep some parameters close to their original values, and let others vary. Choose what to vary through experimentation, or by a measure of how important the parameters are to current translation performance.
  • Define a “curriculum” for adaptation. When a human learns, they are normally presented with a series of related ideas. There’s some suggestion that a curriculum can also work for machine translation adaptation: gradually varying the domain of data shown to a model according to how well the model is performing.
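
As a concrete example of the first option, keeping parameters close to their original values, one simple recipe is to add a penalty on the distance from the pretrained weights to the usual translation loss. Below is a rough PyTorch-style sketch under that assumption; the penalty weight is a placeholder, and importance-weighted variants such as elastic weight consolidation scale each parameter’s penalty by how much it mattered to the original domain.

```python
# Rough PyTorch-style sketch of "keep parameters close to their original values":
# the adaptation loss is the normal translation loss plus an L2 penalty pulling
# each weight back towards its pretrained value. The model, data, and penalty
# weight are placeholders.
import torch

def adaptation_loss(model, original_params, translation_loss, penalty_weight=0.01):
    """Translation loss plus an L2 penalty towards the pretrained weights."""
    penalty = 0.0
    for name, param in model.named_parameters():
        penalty = penalty + ((param - original_params[name]) ** 2).sum()
    return translation_loss + penalty_weight * penalty

# Before adapting, snapshot the pretrained weights once:
# original_params = {n: p.detach().clone() for n, p in model.named_parameters()}
```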

Test-time inference scheme

Say you’ve taken the most straightforward approach and simply adapted a separate model to each domain. You can still drastically improve in-domain performance through how you approach translating each unseen test sentence.

There are a few approaches:

  • Pick one model and use it to translate the sentences that seem to be from the same domain. But it’s not necessarily obvious which model will do the best job.
  • Use one generic model to translate a sentence, and then “rescore” it using another in-domain model. In a sense the in-domain model gets to have a second opinion about the translation.
  • Use all of your models in an ensemble, combining their abilities to produce one translation. But how much weight should you put on the abilities of any one model? A standard approach effectively “averages” the translations from a number of models, but this makes little sense when the models have very different domain translation capabilities (see the sketch just after this list).
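
To make the ensemble weighting question concrete, here is a minimal sketch of what happens at each decoding step: every model proposes a probability distribution over the next target word, and the ensemble combines them with per-model weights. Uniform weights give the standard “averaged” ensemble; the toy numbers below are invented.

```python
# Minimal sketch of decoding-time ensembling: combine each model's distribution
# over the next target word using per-model weights. Uniform weights recover the
# standard "averaged" ensemble; the toy distributions below are invented.
import numpy as np

def ensemble_next_word_probs(distributions, weights):
    """Weighted combination of next-word distributions from several models."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalise the model weights
    stacked = np.stack(distributions)          # shape: (num_models, vocab_size)
    return weights @ stacked                   # weighted average over models

generic_model = np.array([0.4, 0.3, 0.2, 0.1])    # toy 4-word vocabulary
in_domain_model = np.array([0.1, 0.1, 0.2, 0.6])

print(ensemble_next_word_probs([generic_model, in_domain_model], [1, 1]))      # plain average
print(ensemble_next_word_probs([generic_model, in_domain_model], [0.2, 0.8]))  # trust the in-domain model more
```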

The standard solution to both these problems is to use the model(s) matching the domain of a given test sentence. But how do you know which model that is?

In some cases you can rely on an external label. For example, you might be translating a set of sentences that you happen to know are all from Medium articles. In other cases, you might have a “validation set” of sentences that you know are similar to the ones you will be translating, and you can use that to decide which models to use (sketched below). Neither of these is a very common scenario for commercial or industry machine translation products. I’ll talk about my own approaches to the problem in a later post.
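
If you do have a validation set, one straightforward way to use it is to translate it with each candidate model and keep the model that scores best. The sketch below assumes a hypothetical translate(model, sentences) helper standing in for your toolkit’s decoding function, and uses sacreBLEU as one possible quality metric.

```python
# Sketch of picking a model with a validation set: translate the validation
# sentences with every candidate and keep the highest-scoring one.
# `translate(model, sentences)` is a hypothetical stand-in for whatever decoding
# function your toolkit provides; sacreBLEU is just one possible metric.
import sacrebleu

def pick_best_model(models, val_sources, val_references, translate):
    best_name, best_score = None, float("-inf")
    for name, model in models.items():
        hypotheses = translate(model, val_sources)   # list of translated strings
        score = sacrebleu.corpus_bleu(hypotheses, [val_references]).score
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```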

So far, I’ve reviewed the literature behind both neural machine translation generally and domain adaptation specifically. The rest of the thesis — and the rest of this series — covers original research into domain adaptation for NMT. The next post will focus on the possibilities and pitfalls of adapting translations simply by changing the adaptation data.
