Download CEFR readability datasets

As part of our pilot project for the European Language Grid, EDIA has developed several datasets for training AI models to classify texts by CEFR readability level. The datasets consist of texts from various sources, each labelled with its CEFR readability level.

Please fill in the form below to get access to the datasets. They are available under a CC BY-NC licence, for non-commercial, academic use only.

Citing

When citing these resources in your research, please use:

Breuker, M. (2023). CEFR Labelling and Assessment Services. In: Rehm, G. (eds) European Language Grid. Cognitive Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-17258-8_16

Readability API

Based on these datasets, we have created several CEFR text classification models, which can be used through our Readability API. For more information, see our developer documentation.