International Quantitative Linguistics Conference

July 5-8, Wrocław, Poland

Keynote lectures

prof. Łukasz Dębowski
Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland

The Puzzling Entropy Rate of Human Languages

In this talk, we will look back at research on the entropy rate of human languages. The entropy rate is the limiting amount of unpredictability of a random process, measured in bits per unit of the process. Shannon (1951) provided the first estimate of the entropy of texts in English, the famous 1 bit per letter. His method, based on guessing by human subjects, was followed by many researchers and improved by Cover and King (1978). In contrast, Ziv and Lempel (1977), Brown et al. (1992), and Gao et al. (2008) proposed computational methods of entropy rate estimation based on universal data compression, statistical language models, and match lengths, respectively. Computational methods of entropy estimation were applied to large corpora (Takahira et al., 2016) and many languages (Bentz et al., 2017). This success story takes a twist, however. Hilberg (1990) conjectured that Shannon’s estimate of the entropy rate of English is an artifact caused by a slow power-law convergence of the estimates, whereas the actual entropy rate of human languages could equal zero. Some versions of this hypothesis were considered by Ebeling and Nicolis (1991) and Crutchfield and Feldman (2003). Dębowski (2015) also observed experimentally that texts in human languages obey a power-law logarithmic growth of maximal repetition, which implies that conditional Rényi entropy rates are zero, as proved for stationary processes by Dębowski (2017). This property does not generalize to the entropy rate defined by Shannon. Indeed, Takahira et al. (2016), investigating very large corpora for several languages, announced that the entropy estimates follow a power-law convergence but that the limiting Shannon entropy rate is close to Shannon’s original estimate. Thus, constructing mathematical models of processes with a positive Shannon entropy rate and a zero Rényi entropy rate is an interesting open problem with possible applications to linguistics.
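To make the compression-based approach concrete, here is a minimal sketch (not any of the cited authors' actual experimental setups) of estimating bits per character with an off-the-shelf compressor: since a universal compressor's output length upper-bounds the description length of the source, compressed bits divided by text length gives a crude finite-sample estimate of the entropy rate, in the spirit of Ziv and Lempel (1977). The function name and example strings are illustrative only.

```python
import zlib

def compression_entropy_estimate(text: str) -> float:
    """Crude entropy-rate estimate in bits per character,
    obtained from the size of the zlib-compressed text."""
    data = text.encode("utf-8")
    compressed = zlib.compress(data, 9)  # maximum compression level
    return 8 * len(compressed) / len(text)

# Highly repetitive text should yield a lower estimated rate
# than ordinary running text of the same length.
repetitive = "ab" * 44
running = "the quick brown fox jumps over the lazy dog " * 2
assert compression_entropy_estimate(repetitive) < compression_entropy_estimate(running)
```

Note that such estimates converge to the true entropy rate only in the infinite-text limit, which is precisely why the slow power-law convergence discussed in the abstract matters.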

Shannon entropy rate, Rényi entropy rate, human languages

About the author

Łukasz Dębowski received the M.Sc. degree in theoretical physics from the University of Warsaw, Warsaw, Poland, in 1994, the Ph.D. degree in computer science from the Polish Academy of Sciences, Warsaw, Poland, in 2005, and the habilitation degree in computer science from the Polish Academy of Sciences, Warsaw, Poland, in 2015. He visited the Institute of Formal and Applied Linguistics at Charles University in 2001, the Santa Fe Institute in 2002, and the School of Computer Science and Engineering at the University of New South Wales in 2006. Moreover, he held a post-doctoral research position with the Centrum Wiskunde & Informatica from 2008 to 2009 and a visiting professor position with the Department of Advanced Information Technology at Kyushu University in 2015. He is currently an Associate Professor with the Institute of Computer Science of the Polish Academy of Sciences. His research interests include information theory and statistical modelling of natural language.



prof. Nicola Ferro
Department of Information Engineering of the University of Padua, Italy

From Systems to Components: Breaking Down Performance and Discovering Interactions

Information Access Systems are pipelines typically composed of several components developed by neighbouring disciplines, e.g. information retrieval, natural language processing, and computational linguistics. When it comes to evaluating such components, you are often faced with two choices: either evaluate them in isolation for some specific feature, e.g. precision in over/under-stemming for a stemmer, or evaluate them in full pipelines, e.g. the effectiveness of a whole IR system when using a stemmer. In both cases, you can neither determine the contribution and importance of the single components to the overall performance nor properly study and assess the interactions among components.
We will discuss a new methodology based on Grid-of-Points and General Linear Models which allows us to break down overall system performance into the contributions of the constituent components and to study their interactions. We will show how to apply this methodology first to the case of English retrieval and then to multilingual retrieval. We will then discuss how this methodology could be exploited to better understand how components from different disciplines work together, e.g. Word Sense Disambiguation within IR pipelines. Finally, we will present a visual analytics tool to explore and better understand how components work together.
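The idea of breaking a grid of system scores into component effects can be sketched ANOVA-style: each cell of the grid (one full pipeline) is modelled as a grand mean plus main effects of each component plus an interaction term. The following is a minimal illustration, not the speaker's actual tooling; the component names and MAP scores below are invented for the example.

```python
# Hypothetical Grid-of-Points: one MAP score per (stemmer, ranking model) pipeline.
grid = {
    ("none",   "bm25"): 0.21, ("none",   "lm"): 0.19,
    ("porter", "bm25"): 0.25, ("porter", "lm"): 0.22,
}

stemmers = sorted({s for s, _ in grid})
models = sorted({m for _, m in grid})

# Grand mean over all pipelines in the grid.
grand = sum(grid.values()) / len(grid)

# Main effect of each component level: its marginal mean minus the grand mean.
stem_eff = {s: sum(grid[(s, m)] for m in models) / len(models) - grand
            for s in stemmers}
model_eff = {m: sum(grid[(s, m)] for s in stemmers) / len(stemmers) - grand
             for m in models}

# Interaction: how much each cell deviates from the purely additive prediction.
interaction = {(s, m): grid[(s, m)] - (grand + stem_eff[s] + model_eff[m])
               for s in stemmers for m in models}
```

In this toy decomposition, a positive `stem_eff["porter"]` says that the Porter stemmer helps on average across ranking models, while nonzero interaction terms flag component pairs whose combined behaviour is not explained by their separate contributions.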

About the author

Nicola Ferro (http://www.dei.unipd.it/~ferro/) is associate professor in computer science at the University of Padua, Italy. His research interests include information retrieval, its experimental evaluation, multilingual information access and digital libraries. He is the coordinator of the CLEF evaluation initiative, which involves more than 200 research groups world-wide in large-scale IR evaluation activities. He was the coordinator of the EU Seventh Framework Programme Network of Excellence PROMISE on information retrieval evaluation. He is associate editor of ACM TOIS and was general chair of ECIR 2016.