\title{What Can We Do With Small Corpora? Document Categorization Via Cross-Entropy}

\author{\large \bf Patrick Juola \\
Department of Experimental Psychology \\
University of Oxford \\
({\sc patrick.juola@psy.ox.ac.uk}) }

\begin{abstract}
A possible problem with many of the large-corpus techniques used for document categorization or similarity judgements is the very fact that they require large corpora for reliability. A powerful test against the distilled wisdom of hundreds of millions or billions of words may be of limited use when only a few thousand characters are available. This paper describes an information-theoretic model, based on a new method for estimating entropy, that is able to produce remarkably accurate judgements of language or even of authorship from relatively tiny corpora. Given a sample of a single document not much longer than this abstract, the technique is capable of error-free inference of the authorship of some of the {\em Federalist Papers} by estimating the similarity between such samples and the documents in question. The efficiency and generality of this technique suggest that it might be applied with good effect to a host of other problems.
\end{abstract}