Image by Joshua Hoehne

Detecting syntactic differences

This was the PhD project of Martin Kroon, embedded in the Leiden Data Science Research Programme, co-supervised by prof. dr. Sjef Barbiers, prof. dr. Jan Odijk and me. The goal was to detect cross-linguistic syntactic differences automatically.

Summary

We investigated how parallel corpora (e.g. translations of the Bible or texts from the European Parliament) can be used to automatically detect syntactic differences between languages.

We first developed a filter to remove 'free translations', i.e. sentence pairs that are syntactically incomparable. We then pursued two approaches: one based on the minimum description length (MDL) principle and one based on alignment. Both turned out to be effective at detecting syntactic differences.

Related papers

Kroon, M., Barbiers, S., Odijk, J., & van der Pas, S. (2020). Detecting syntactic differences automatically using the Minimum Description Length principle. Computational Linguistics in the Netherlands Journal, 10, 109-127. [link]

Kroon, M., Barbiers, S., Odijk, J., & van der Pas, S. (2019). A filter for syntactically incomparable parallel sentences. Linguistics in the Netherlands, 36(1), 147-161. [link]