The Linguistic Annotation of the Russian-Chinese Parallel Corpus

A Russian-Chinese parallel corpus is a database of Russian and Chinese sentences paired with their translations. Such a database allows searching for words and collocations in context. The Russian-Chinese parallel corpus of the RNC (or ruzhcorp) is the only Russian-Chinese parallel corpus on the Russian web that has linguistic annotation and a user-friendly interface. It is a powerful tool both for teaching Russian and Chinese and for analyzing them linguistically. We will present two aspects of enhancing the linguistic annotation that we are currently working on.

1. Kirill I. Semenov (IITP RAS), Aleksandra O. Piskunova (NRU HSE).
Chinese word segmentation.

Word segmentation is a fundamental task in the automatic analysis of Chinese text. The task is challenging because the notion of a "word" is less well established for Chinese than for European languages. We will present our comparative analysis of widely used word segmentation standards: which criteria they are based on, how linguistically consistent they are, and where they are applied. We will also present our research on how well word segmenters detect the Russian loanwords that abound in the Chinese texts of our corpus.
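
To make the loanword problem concrete, here is a minimal sketch of one classic dictionary-based technique, forward maximum matching. It is not the segmenter or the standard discussed in the talk; the function name, the toy lexicons and the example sentence are all assumptions made for illustration. It shows how a transliterated Russian loanword falls apart into single characters when the segmenter's lexicon does not contain it.

```python
def forward_max_match(text, lexicon, max_len=5):
    """Greedy longest-match segmentation (illustrative only).

    At each position the longest lexicon entry is taken; if nothing
    matches, the single character is emitted as its own token.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy data: "He likes to drink vodka".
sentence = "他喜欢喝伏特加"
base_lexicon = {"他", "喜欢", "喝"}
extended_lexicon = base_lexicon | {"伏特加"}  # loanword added to the lexicon

without_loanword = forward_max_match(sentence, base_lexicon)
with_loanword = forward_max_match(sentence, extended_lexicon)

print(" | ".join(without_loanword))  # 他 | 喜欢 | 喝 | 伏 | 特 | 加
print(" | ".join(with_loanword))     # 他 | 喜欢 | 喝 | 伏特加
```

Statistical and neural segmenters do not depend on a fixed lexicon in this way, but they can show a similar effect on loanwords that are rare in their training data.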

2. Anastasia A. Politova (Nanjing University).
Word-to-word alignment of the Russian-Chinese parallel corpus.

The algorithm being developed for the Corpus will show the most likely matches for the words and expressions entered in the search bar. The most accurate matches will be displayed first. Approximate matches will be displayed below them: these are words and expressions that can be translated the same way in some contexts, but for which the searched translation is not the main one. We will talk about what the team of "aligners" is doing, how the algorithm is being developed, how the "gold" standard for alignment, which is used to evaluate the quality of the algorithm's output, is created, what difficulties our team has encountered, and what linguistic solutions we have reached so far.
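
As an illustration of how a gold standard can be used for evaluation, the sketch below compares predicted alignment links with manually annotated ones and reports precision, recall and F1. The abstract does not say which metric the team uses, so the choice of metric, the function name and the toy sentence pair are assumptions. Each link is a pair (index of the Russian token, index of the Chinese token).

```python
def alignment_scores(predicted, gold):
    """Precision, recall and F1 of predicted alignment links against gold links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy pair: "Я люблю чай" / "我 喜欢 茶" ("I like tea").
gold_links = {(0, 0), (1, 1), (2, 2)}
# A hypothetical system output that wrongly links "люблю" to "茶"
# and misses the link "чай" - "茶".
predicted_links = {(0, 0), (1, 1), (1, 2)}

precision, recall, f1 = alignment_scores(predicted_links, gold_links)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Two of the three predicted links are correct and two of the three gold links are found, so all three scores equal 2/3 for this toy pair.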