QUALICO 2018

International Quantitative Linguistics Conference

July 5–8, Wrocław, Poland

Keynote lectures


Prof. Łukasz Dębowski
Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland

The Puzzling Entropy Rate of Human Languages

Abstract
In this talk, we will look back at research on the entropy rate of human languages. The entropy rate is the limiting amount of unpredictability of a random process, measured in bits per unit of the process. Shannon (1951) provided the first estimate of the entropy of texts in English, the famous 1 bit per letter. His method, based on guessing by human subjects, was adopted by many researchers and improved by Cover and King (1978). In contrast, Ziv and Lempel (1977), Brown et al. (1992), and Gao et al. (2008) proposed computational methods of entropy rate estimation based on universal data compression, statistical language models, and match lengths, respectively. These computational methods were applied to large corpora (Takahira et al., 2016) and to many languages (Bentz et al., 2017). This success story takes an unexpected twist, however. Hilberg (1990) conjectured that Shannon's estimate of the entropy rate of English is an artifact caused by a slow power-law convergence of the estimates, and that the actual entropy rate of human languages could equal zero. Versions of this hypothesis were considered by Ebeling and Nicolis (1991) and Crutchfield and Feldman (2003). Moreover, Dębowski (2015) observed experimentally that texts in human languages obey a power-law logarithmic growth of maximal repetition, which implies that conditional Rényi entropy rates are zero, as proved for stationary processes by Dębowski (2017). This property does not generalize to the entropy rate defined by Shannon. Indeed, Takahira et al. (2016), investigating very large corpora in several languages, announced that the entropy estimates follow a power-law convergence but that the limiting Shannon entropy rate is close to Shannon's original estimate. Thus, constructing mathematical models of processes with a positive Shannon entropy rate and a zero Rényi entropy rate is an interesting open problem with possible applications to linguistics.
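For reference, the Shannon entropy rate of a stationary process (X_1, X_2, ...) is defined as

    h = lim_{n→∞} H(X_1, X_2, ..., X_n) / n,

the asymptotic number of bits of fresh information per symbol. Hilberg's hypothesis amounts to the claim that the block entropy H(X_1, ..., X_n) grows roughly as a power law A·n^β with β ≈ 1/2, so finite-text estimates of h can stay deceptively far above their limit.

As a minimal sketch of the compression-based estimation mentioned above, in the spirit of Ziv and Lempel (1977): any lossless compressor yields an upper bound on the entropy rate, namely the compressed size in bits divided by the input length. The snippet below is an illustrative sketch rather than any published estimator; it uses Python's standard zlib module (a Lempel-Ziv-style DEFLATE compressor), and the function name and toy input are assumptions for illustration. On realistically sized texts the bound converges only slowly, which is exactly the effect Hilberg pointed to.

    import zlib

    def lz_entropy_upper_bound(text: str) -> float:
        """Crude upper bound on the Shannon entropy rate, in bits per
        character: compressed size in bits divided by the text length.
        A universal compressor approaches the true entropy rate only
        in the limit of very long, stationary inputs."""
        data = text.encode("utf-8")
        compressed = zlib.compress(data, 9)  # highest compression level
        return 8 * len(compressed) / len(text)

    # Toy usage: a highly repetitive string compresses far below
    # Shannon's ~1 bit per letter; genuine corpora compress far less.
    sample = "the quick brown fox jumps over the lazy dog " * 500
    print(f"{lz_entropy_upper_bound(sample):.3f} bits/char")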

Keywords:
Shannon entropy rate, Rényi entropy rate, human languages

About the author

Łukasz Dębowski received the M.Sc. degree in theoretical physics from the University of Warsaw, Warsaw, Poland, in 1994, the Ph.D. degree in computer science from the Polish Academy of Sciences, Warsaw, Poland, in 2005, and the habilitation degree in computer science from the Polish Academy of Sciences, Warsaw, Poland, in 2015. He visited the Institute of Formal and Applied Linguistics at Charles University in 2001, the Santa Fe Institute in 2002, and the School of Computer Science and Engineering at the University of New South Wales in 2006. Moreover, he held a post-doctoral research position with the Centrum Wiskunde & Informatica from 2008 to 2009 and a visiting professor position with the Department of Advanced Information Technology at Kyushu University in 2015. He is currently an Associate Professor at the Institute of Computer Science of the Polish Academy of Sciences. His research interests include information theory and statistical modelling of natural language.