Can machines learn to categorize text as well as humans? This study reports extensive experiments on automated rule induction for large document collections, with the goal of discovering classification patterns for document categorization and personalized filtering. It shows that machine-generated decision rules can perform comparably to human-engineered systems while using the same rule-based representation. On the Reuters collection benchmark, the induced rules reach a recall/precision breakeven point of 80.5%, up from the previously reported 67% and a significant gain over other machine-learning techniques. The study also compares methodological alternatives in high-dimensional feature spaces: universal versus local dictionaries, and binary versus frequency-related features. The results indicate that machine learning can automate text categorization with far less human involvement, with implications for information retrieval, document management, and the development of intelligent systems.
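For readers unfamiliar with the metric, the recall/precision breakeven point is the operating point at which precision and recall are equal as the decision criterion is varied. The following is a minimal sketch of how such a figure could be computed. It assumes each document has a numeric score for a topic and that a threshold on that score is swept; the function names, the scores, and the thresholding scheme are illustrative assumptions, not the procedure used in the study, which may have traced the tradeoff differently for its rule sets.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when documents scoring >= threshold get the topic.

    scores: hypothetical per-document scores (assumption, not from the study)
    labels: 1 if the document truly belongs to the topic, else 0
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    # Conventions for empty denominators (assumed for this sketch).
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def breakeven_point(scores, labels):
    """Approximate the breakeven point over all observed thresholds.

    Returns the mean of precision and recall at the threshold where the
    two values are closest; when they are exactly equal, that common
    value is the breakeven point.
    """
    best_gap, best_value = None, None
    for t in sorted(set(scores), reverse=True):
        p, r = precision_recall(scores, labels, t)
        gap = abs(p - r)
        if best_gap is None or gap < best_gap:
            best_gap, best_value = gap, (p + r) / 2
    return best_value
```

With a discrete set of thresholds, precision and recall may never coincide exactly, which is why the sketch falls back to the closest pair and averages them; interpolating between neighboring operating points is another common choice.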
Published in ACM Transactions on Information Systems, this research fits the journal's focus on information retrieval, text processing, and intelligent systems. By presenting an automated approach to text categorization, the study contributes to the advancement of information systems technologies and their applications.