20,000 Ways Not to Do Authorship Attribution -- and a Few that Work
Patrick Juola // Duquesne University // juola@mathcs.duq.edu

Authorship attribution is of course a critical problem in validating documents, not only for forensic purposes, but for many other scholarly and investigative purposes as well. "Statistical" analysis (attribution by statistical identification and analysis of patterns, whether of rare forms, function word distributions, or other features) has been deployed using literally thousands of different methods (Rudman, 1998), and results have generally been "better than chance," but little comparative data is available to identify which methods work particularly well and under what conditions (see also Juola, 2006).

The JGAAP (Java Graphical Authorship Attribution Program, www.jgaap.com) system is a modular software system for authorship analysis that has been specifically designed to allow large-scale comparative testing of authorship attribution methods. Given its variety of user-selectable preprocessors, feature/event sets, and analytic methods, we estimate that the current version of JGAAP is capable of analyzing documents in more than 20,000 different ways.

We are in the process of testing these combinations on the Ad-hoc Authorship Attribution Competition test corpus (see Juola, 2004) to see which of them qualify as "best practices" and what accuracy can be expected under a variety of circumstances. At this writing, we have tested more than 300 of these methods, and research continues. We are also exploring aspects of particularly well-performing methods in the hope of providing guidance for the development of new methods.

Some preliminary findings include:

* The seminal word list of Mosteller and Wallace does not generally perform well for texts other than the Federalist Papers.
* Introduction of a small number of character errors (as exemplified by modern OCR systems) does not substantially reduce accuracy with most methods.
* Symmetric ("commutative") distance-based methods tend to outperform asymmetric ones.
* Linear classifiers such as LDA tend to outperform nonlinear classifiers despite the apparent oversimplicity of the underlying model.
* Character-based methods tend to outperform word-based ones for authorship attribution in Chinese.
* Cosine distance (normalized dot product) and simple event-based Kullback-Leibler divergence tend to be the best-performing distance measures for distance-based nearest-neighbor methods.

We hope to share these findings and to demonstrate the program, with the aim of making a useful tool freely available to the forensic linguistics community.
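
To make the distance-based nearest-neighbor approach in the findings above concrete, the following is a minimal Python sketch of a preprocess / event set / distance pipeline. It is not JGAAP code and does not reproduce JGAAP's implementation: the function names, the choice of character trigrams as the event set, the additive smoothing constant, and the direction of the Kullback-Leibler divergence (unknown text relative to known sample) are all illustrative assumptions. It also assumes every text is at least `n` characters long.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Event set (assumed): overlapping character n-grams of a
    lower-cased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def relative_freqs(counts, vocab, alpha=0.5):
    """Additively smoothed relative frequencies over a shared vocabulary.
    Smoothing (alpha is an assumed value) keeps every probability positive,
    which the KL divergence below requires."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {event: (counts.get(event, 0) + alpha) / total for event in vocab}


def cosine_distance(p, q):
    """One minus the normalized dot product of two frequency vectors."""
    dot = sum(p[e] * q[e] for e in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); note that it is asymmetric."""
    return sum(p[e] * math.log(p[e] / q[e]) for e in p)


def attribute(unknown, known, distance=cosine_distance, n=3):
    """Nearest-neighbor attribution: return the label of the known sample
    closest to the unknown text, together with all distances."""
    unknown_counts = char_ngrams(unknown, n)
    known_counts = {label: char_ngrams(text, n) for label, text in known.items()}
    # Build one vocabulary so all frequency vectors share the same events.
    vocab = set(unknown_counts)
    for counts in known_counts.values():
        vocab |= set(counts)
    p = relative_freqs(unknown_counts, vocab)
    scores = {
        label: distance(p, relative_freqs(counts, vocab))
        for label, counts in known_counts.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy, invented samples; real use needs far longer texts per author.
    samples = {
        "author_a": "It was the best of times, it was the worst of times.",
        "author_b": "Call me Ishmael. Some years ago, never mind how long.",
    }
    disputed = "It was the age of wisdom, it was the age of foolishness."
    for measure in (cosine_distance, kl_divergence):
        best, scores = attribute(disputed, samples, distance=measure)
        print(measure.__name__, best, scores)
```

Swapping the `distance` argument, the value of `n`, or the normalization inside `char_ngrams` changes one stage of the pipeline without touching the others, which is the kind of modular variation that makes large-scale comparison of methods practical.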