Can computers learn language like humans? This study explores how unsupervised learning, guided by Minimum Description Length (MDL) analysis, can model the morphological segmentation of natural language. Working with corpora of varying sizes from European languages, the research develops a set of heuristics that rapidly build a probabilistic morphological grammar. Each modification that the heuristics propose is evaluated with MDL, which determines whether it should be adopted. The resulting grammar closely mirrors analyses developed by human morphologists, which suggests that MDL lets the learner navigate the vast space of possible language structures efficiently. The study also examines how this method of grammatical analysis relates to the evaluation metrics used in early generative grammar, bridging computational and theoretical linguistics. The results indicate that machines can learn morphological structure without explicit instruction, with implications for natural language processing, computational linguistics, and our understanding of the cognitive processes involved in language acquisition. They also open avenues for future research in automated language learning and grammatical analysis.
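The accept-or-reject step described above can be illustrated with a minimal sketch. It is not the paper's actual formulation: the cost model here (a flat per-letter cost for listing each distinct morph, plus a unigram code for the corpus) and the names `description_length`, `accept_if_shorter`, and `BITS_PER_LETTER` are assumptions made purely for illustration.

```python
import math
from collections import Counter

# Assumption: 26 letters plus one end-of-morph marker, each equally costly.
BITS_PER_LETTER = math.log2(27)

def description_length(segmented_corpus):
    """Return model bits + data bits for a segmented corpus.

    segmented_corpus: list of tuples of morphs, one tuple per word.
    """
    counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(counts.values())
    # Model cost: spell out each distinct morph once, plus an end marker.
    model_bits = sum((len(m) + 1) * BITS_PER_LETTER for m in counts)
    # Data cost: each morph token costs -log2 of its relative frequency.
    data_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return model_bits + data_bits

def accept_if_shorter(current, proposed):
    """Adopt a proposed analysis only if it shortens the description."""
    if description_length(proposed) < description_length(current):
        return proposed
    return current

# Toy corpus (hypothetical): three stems, each with four suffixes.
stems = ["jump", "walk", "talk"]
suffixes = ["", "s", "ed", "ing"]
unsegmented = [(s + x,) for s in stems for x in suffixes]
segmented = [(s, x) if x else (s,) for s in stems for x in suffixes]

best = accept_if_shorter(unsegmented, segmented)
```

In this toy corpus the stem-plus-suffix analysis is adopted because the shared stems and suffixes are listed only once, which saves more bits in the model than the extra morph tokens cost in the data. A split that shares nothing, such as cutting "cat" and "dog" into arbitrary pieces, lengthens the description and is rejected.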
Published in Computational Linguistics, a leading journal in the field, this paper fits the journal's focus on computational approaches to language. By exploring unsupervised learning techniques, it builds on existing literature and offers new insight into morphological grammar development and its relationship to early generative grammar.