"JGAAP4.0 -- A Revised Authorship Attribution Tool" Patrick JUOLA; John NOECKER, Jr; Mike RYAN; and Sandy SPEER Duquesne University // juola@mathcs.duq.edu (Software Demo/Poster) Authorship Attribution (Juola, 2006) can be defined as the inference of the author or her characteristics by examining documents produced by that person. For some time, we have been working on a system (JGAAP -- Java Graphical Authorship Attribution Program) to use advanced statistics to perform this task while not demanding a high degree of expertise from the user (Juola, et al., 2008). With the recent release of JGAAP 3.2 and the near-term planned release of JGAAP 4.0, we are finally confident that we have a production quality system for general-purpose use. We now report (and demonstrate) these recent improvements. JGAAP now incorporates nearly 20 different analytic methods (including eight different distance-based nearest-neighbor algorithms), more than 20 different event sets and models ranging from character- and word- based N-grams to reaction times, and several different preprocessors incorporating a wide variety of different document types including remote (Web-accessible) files and text extraction from different formats. We estimate that JGAAP is capable of performing more than 20,000 different types of analysis for authorship attribution or similar text classification tasks, with more being added as development continues. Other improvements include: * GUI improvements to enhance user-friendliness * Enhanced graphical output capabilities * Full report generation capacity for scholarly inspection of the results * Creation of a command-line interface * Automatic batch processing capacity for large-scale comparative testing * Incorporation of the AAAC (Juola, 2004) test corpus into the demo for comparative testing purposes * Dynamic loading of new methods to encourage new development We are finally able to perform large-scale comparative analyses of different processing methods. 

We include here a short list of some JGAAP-related findings (published, submitted, or in preparation):

* Introduction of a small number of character errors (as exemplified by modern OCR systems) does not substantially reduce accuracy with most methods.
* Symmetric ("commutative") distance-based methods tend to outperform asymmetric ones.
* Linear classifiers such as LDA tend to outperform nonlinear classifiers despite the apparent oversimplicity of the underlying model.
* Character-based methods tend to outperform word-based ones for authorship attribution in Chinese.
* Both cosine distance (normalized dot product) and simple event-based Kullback-Leibler divergence tend to be the best-performing methods for distance-based nearest-neighbor methods.
* The seminal word list of Mosteller and Wallace does not generally perform well for texts other than the Federalist Papers.

Some of our findings have been submitted under separate cover to this conference, but we hope to present a summary of the major results achieved by June 2009, along with a demonstration of the newest version of the program. We also hope to provide examples of the sorts of analysis that have been performed with JGAAP (and invite cooperation from interested researchers for further study).

Finally, we hope to demonstrate some example ad-hoc analyses during the session; it should be possible, for example, to demonstrate in less than ten minutes that "document length" or "words that are palindromes" do not perform well as event/feature sets. While this is perhaps not interesting in itself (no sensible person has proposed palindromes for authorship attribution), it clearly illustrates the ease of use and of result generation.
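
To make the two distances named in the findings and the palindrome example concrete, here is a minimal Python sketch. It is an illustration under assumptions rather than JGAAP's implementation: all function names are hypothetical, the tokenization is deliberately crude, and the `epsilon` smoothing in the Kullback-Leibler divergence is our own choice for handling events that occur in only one document.

```python
import math
from collections import Counter


def word_events(text):
    """Split on whitespace, strip common punctuation, and lower-case."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def palindrome_events(text):
    """Ad-hoc event set: only the words that read the same in both directions."""
    return [w for w in word_events(text) if len(w) > 1 and w == w[::-1]]


def histogram(events):
    """Relative frequency of each event."""
    counts = Counter(events)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}


def cosine_distance(p, q):
    """One minus the normalized dot product; symmetric in p and q."""
    dot = sum(p[e] * q[e] for e in p if e in q)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # an empty histogram shares nothing with any other
    return 1.0 - dot / (norm_p * norm_q)


def kl_divergence(p, q, epsilon=1e-6):
    """D(p || q) over the union of events; asymmetric in p and q.

    Additive epsilon smoothing (an assumption of this sketch) keeps the
    logarithm finite when an event is missing from one histogram.
    """
    events = set(p) | set(q)

    def smooth(h):
        total = sum(h.get(e, 0.0) + epsilon for e in events)
        return {e: (h.get(e, 0.0) + epsilon) / total for e in events}

    ps, qs = smooth(p), smooth(q)
    return sum(ps[e] * math.log(ps[e] / qs[e]) for e in events)
```

Swapping the arguments leaves `cosine_distance` unchanged but generally changes `kl_divergence`, which is the symmetric versus asymmetric distinction drawn in the findings. Palindromes are rare in ordinary prose, so `palindrome_events` typically yields a very sparse histogram for a document.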