dcyphr | Multilingual Unsupervised Sentence Simplification


Progress in sentence simplification has been slow because supervised data is scarce, particularly in languages other than English. Martin et al. sought to change that by using unsupervised mining techniques to automatically create training corpora for simplification in multiple languages from raw Common Crawl web data. The authors used a controllable generation mechanism that adjusts attributes such as length and lexical complexity, so the mined paraphrase corpora can be used to train a simplification system in any language. They also used multilingual unsupervised pretraining to build strong models, and showed that training on mined data outperforms the best results obtained with supervised corpora. The authors evaluated their fully unsupervised approach on English, French, and Spanish simplification benchmarks and reached state-of-the-art performance.


The authors aim to improve sentence simplification models and extend them beyond English to any language.


Text simplification is the act of reducing the complexity of a sentence's vocabulary and structure without losing its original meaning, while improving its readability and ease of understanding. Simplification has many societal benefits, such as increasing accessibility for people with cognitive disabilities, non-native speakers, and children with reading difficulties. Research has mostly focused on English because data is hard to find in other languages. This work leverages unsupervised data to train sentence simplification systems in multiple languages. The researchers train simplification models that control attributes such as length, lexical complexity, and syntactic complexity. With these techniques, they reach state-of-the-art results in multiple languages using only mined data, which alleviates the need for language-specific supervised corpora.


A Transformer sequence-to-sequence model trained on the mined corpus performed better than the same model trained on supervised data, by 5.32 SARI on ASSET and 2.25 SARI on TURKCORPUS. Previous simplification models did not make enough modifications to the source sentence; the researchers' model showed that pushing models to rewrite the input was beneficial.

BART combined with ACCESS gave the researchers' best-performing models (44.15 on ASSET and 42.62 SARI). Trained on unsupervised data alone, the model gained +2.52 SARI over the previous model and reached 40.85 SARI on TURKCORPUS with random seeds, versus the previous model's 41.38 SARI with its best seeds.

The model successfully reduced sentence length, split complex sentences into multiple shorter sentences, and used simpler vocabulary. It also removed unnecessary content and content within parentheses.

A Transformer sequence-to-sequence model trained on the researchers' mined data achieved strong results in French and even stronger results in Spanish.

mBART gave a +8.25 SARI increase in languages other than English but a loss in English, which may be because the model must handle 25 languages instead of one.

The researchers found that large amounts of data (millions of paraphrases) must be mined to achieve more accurate results. They also found that pretraining increased the fluency and meaning preservation of the generated output and let models reach state-of-the-art results more quickly.


These results mean that the researchers' models outperform previous models, thanks to their training approach: mined paraphrases, pretraining, and unsupervised data. Their models also performed better in previously studied languages other than English (French and Spanish). However, to achieve these results, large amounts of paraphrases must be mined.


The researchers used the controllable sentence simplification model ACCESS, which learns to control length, amount of paraphrasing, lexical complexity, and syntactic complexity, to train on mined paraphrase corpora and dynamically create simplifications at inference time. The mined sequences span multiple sentences so that rewriting operations such as sentence splitting or fusion are represented, and they are then filtered to remove noisy text (too much punctuation or low language model probability).
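
The punctuation part of this filtering step can be illustrated with a minimal sketch. The threshold value and the function names below are hypothetical, chosen only for illustration; the summary does not state the actual cutoff, and the language-model-probability filter is left out because it would require a trained model.

```python
import string

# Hypothetical threshold: the actual cutoff is not given in this summary.
MAX_PUNCT_RATIO = 0.2


def punctuation_ratio(text: str) -> float:
    """Fraction of non-space characters that are ASCII punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # treat empty text as noise
    return sum(c in string.punctuation for c in chars) / len(chars)


def keep_sequence(text: str, max_ratio: float = MAX_PUNCT_RATIO) -> bool:
    """Return True when the sequence passes the punctuation filter."""
    return punctuation_ratio(text) <= max_ratio


candidates = [
    "The cat sat on the mat.",
    "*** >>> click here!!! <<< ***",
]
# Keep only the sequences that do not look like punctuation-heavy noise.
kept = [s for s in candidates if keep_sequence(s)]
```

A ratio-based rule is used here so that long, well-formed sequences are not penalized just for containing more punctuation marks in absolute terms.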

The paraphrase corpora are mined automatically by first computing n-dimensional sentence embeddings for each sequence using LASER, a multilingual sentence embedding model trained to map sentences with similar meanings to nearby points in the same embedding space. Faiss, a library for efficient similarity search over large collections of vectors, is used to index all of these sentence embeddings so that nearest neighbors can be found quickly.
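
To make the mining idea concrete, here is a small sketch of nearest-neighbor paraphrase mining over sentence embeddings. It is not the authors' pipeline: hand-written toy vectors stand in for LASER embeddings, a brute-force loop stands in for a Faiss index, and the similarity threshold of 0.9 is a hypothetical value picked for the example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Dot product of two unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def mine_pairs(vectors, threshold=0.9):
    """Pair each sequence with its nearest neighbor if they are similar enough.

    Brute-force search, O(n^2); a vector index would replace the inner loop
    at web scale. The threshold is an assumption for this sketch.
    """
    unit = [normalize(v) for v in vectors]
    pairs = set()
    for i, query in enumerate(unit):
        best_j, best_sim = None, -1.0
        for j, cand in enumerate(unit):
            if j == i:
                continue  # a sequence is not its own paraphrase
            sim = cosine(query, cand)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None and best_sim >= threshold:
            pairs.add((min(i, best_j), max(i, best_j)))
    return sorted(pairs)


# Toy 3-dimensional "embeddings": rows 0 and 1 point in nearly the same direction.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
paraphrase_pairs = mine_pairs(toy_embeddings)
```

Only sequences 0 and 1 are close enough to be paired; the other two have no sufficiently similar neighbor and are left unmatched.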

Models are then trained on the paraphrase data to produce simplifications by applying the ACCESS control mechanism, which builds on advances in controllable text generation. Control tokens provide oracle information about the target sequence, such as the amount of compression relative to the original sequence, and are given to the model at training time. The model learns to rely on these tokens, which lets the researchers select the control values that achieve the best SARI on the validation set and keep those values fixed for the test set.
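
As a rough illustration of the control-token idea, the sketch below computes a character-length compression ratio between a source and its target and prepends it to the source as a special token. The token spelling `<NbChars_…>`, the 0.05 bucket size, and the helper names are assumptions made for this example, not details confirmed by the summary.

```python
def bucket(value: float, step: float = 0.05) -> float:
    """Round a ratio to the nearest multiple of `step` (assumed bucket size)."""
    return round(round(value / step) * step, 2)


def char_ratio(source: str, target: str) -> float:
    """Compression of the target relative to the source, in characters."""
    return len(target) / len(source)


def add_control_token(source: str, target: str = None, ratio: float = None) -> str:
    """Prepend a length control token to the source.

    Training time: pass `target`, and the oracle ratio is computed from it.
    Inference time: pass a fixed `ratio` chosen on the validation set.
    The token format is hypothetical.
    """
    if ratio is None:
        ratio = char_ratio(source, target)
    return f"<NbChars_{bucket(ratio):.2f}> {source}"


# Training example: the ratio comes from the known target (23 / 36 characters).
train_input = add_control_token(
    "The feline was resting upon the rug.",
    target="The cat sat on the rug.",
)

# Inference example: no target exists, so a fixed control value is supplied.
test_input = add_control_token("The feline was resting upon the rug.", ratio=0.8)
```

Because the model sees the true ratio during training, changing the value at inference time is what steers how aggressively the output is compressed.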

The models also build on pretraining with noising functions such as span-based masking or shuffling sentence order, where the model learns to reconstruct the original text. Leveraging this pretraining further extends the unsupervised approach to text simplification.
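
The two noising functions named above can be sketched in a few lines. This is a simplified illustration only: the span position and length are fixed by hand here rather than sampled, and the `<mask>` token string and function names are assumptions for the example.

```python
import random

MASK = "<mask>"  # assumed spelling of the mask token


def mask_span(tokens, start, length):
    """Replace tokens[start:start + length] with a single mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]


def shuffle_sentences(sentences, rng):
    """Return a shuffled copy of the sentence list, leaving the input untouched."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled


tokens = "the quick brown fox jumps over the lazy dog".split()
# Hide the three-token span "brown fox jumps" behind one mask token.
masked = mask_span(tokens, start=2, length=3)

sentences = ["First sentence.", "Second sentence.", "Third sentence."]
# A seeded generator keeps the example reproducible.
shuffled = shuffle_sentences(sentences, random.Random(0))
```

In both cases the training target is the original, uncorrupted text, so the model has to recover missing words and restore sentence order.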


The researchers proposed an unsupervised approach to text simplification that combines controllable generation mechanisms and pretraining with large-scale mining of paraphrases from the web. Together these achieve state-of-the-art results that surpass previously published models in English, French, and Spanish. The researchers plan to extend the approach to more languages and more types of simplification.