Letter Frequency Analysis of Languages Using Latin Alphabet

  • Gintautas Grigas Vilnius University
  • Anita Juškevičienė Vilnius University Institute of Data Science and Digital Technologies
Keywords: diacritics, keyboard layout, Latin script, letter frequency, language statistics, language similarity


Evaluating the peculiarities of alphabets, particularly the frequency of letters, is essential when designing keyboards, analysing texts, designing alphabet-based games, and performing text mining. It is therefore important to determine what might be useful for designers of text input tools and of other technologies that depend on sets of letters. Knowledge of features shared by different languages makes it possible to draw on the experience gained with other languages. An increasing number of texts is now published on the Internet. To compare adequately the frequencies of letters in different languages as used online, Wikipedia texts were selected as the source material for the investigation. This paper presents the Method of the Adjacent Letter Frequency Differences in the frequency line, which helps to evaluate frequency breakpoints. It serves as a uniform evaluation criterion for 25 main languages using Latin script and highlights the similarities and differences among them. The research focuses on letter frequency analysis in the region of rarely used native letters and frequently used foreign letters in a particular language. Letter frequency is one of the factors that determine the location of the keys for language-specific letters on the keyboard.
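
The idea of looking at differences between adjacent letters in the frequency line can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the method in detail, so the sketch assumes that the frequency line is the list of letters sorted by descending relative frequency, that the difference is the plain absolute gap between neighbours, and that a breakpoint is the pair with the largest gap. The function names `letter_frequencies`, `adjacent_differences`, and `largest_breakpoint` are hypothetical.

```python
import unicodedata
from collections import Counter


def letter_frequencies(text):
    """Return (letter, relative frequency) pairs, most frequent first."""
    # NFC normalisation so that a letter with a diacritic stored as
    # base letter + combining mark is counted as a single letter.
    normalized = unicodedata.normalize("NFC", text).lower()
    letters = [ch for ch in normalized if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    # Ties are broken alphabetically to keep the ordering deterministic.
    return sorted(
        ((ch, n / total) for ch, n in counts.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


def adjacent_differences(freq_line):
    """Return (letter, next_letter, frequency gap) for each adjacent pair."""
    return [
        (freq_line[i][0], freq_line[i + 1][0], freq_line[i][1] - freq_line[i + 1][1])
        for i in range(len(freq_line) - 1)
    ]


def largest_breakpoint(freq_line):
    """Return the adjacent pair with the largest gap (assumed breakpoint rule)."""
    diffs = adjacent_differences(freq_line)
    if not diffs:
        return None  # fewer than two distinct letters: no breakpoint
    return max(diffs, key=lambda d: d[2])


if __name__ == "__main__":
    sample = "Ačiū labai, čia yra trumpas pavyzdys."  # illustrative text only
    line = letter_frequencies(sample)
    for left, right, gap in adjacent_differences(line):
        print(f"{left} -> {right}: {gap:.4f}")
    print("largest gap:", largest_breakpoint(line))
```

With plain absolute gaps the largest difference tends to fall among the most frequent letters, so a study of the rare-letter region would likely restrict the search to the tail of the line or use relative differences; either choice would be a further assumption beyond what the abstract states.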

Comparison of Danish and Irish language unigrams