Cross-linguistic Transferrence of Authorship Attribution, or Why English-Only Prototypes Are Acceptable Patrick JUOLA Duquesne University Authorship Attribution (Juola, 2008) can be defined as the inference of the author or her characteristics by examining documents produced by that person. It is of course fundamental to the humanities; the more we know about a person's writings, the more we know about the person and vice versa. It is also a very difficult task. Recent advances in corpus linguistics have shown that it is possible to do this task automatically by computational statistics. A key question, however, in any statistical study (not just of text statistics) is whether the data or methods will transfer from one domain to another. Statistical analyses hinge on assumptions which may or may not be met by different languages, and of course, data which is representative of one domain is highly unrepresentative of any other, by definition. A method that performs well in one area may fail miserably in another. As an example, a part-of-speech tagger with 96% accuracy on newswire data may achieve only 50% on chat logs. (Craig Martell, p.c., 2008) In light of this finding, cross-problem transferrence can be a major problem for statistical authorship attribution. Should we expect an authorship attribution system that performs well on, say, English, to also perform well on Dutch, Serbian, or Chinese? Alternatively, is it reasonable for a scholar of, say, Polish poetry to have confidence in methods that have been tested to perform well, but only on English documents? This, of course, is a major problem, especially with problems of "forensic" interest where the accuracy rate is one of the major considerations regarding the evaluation or even admissibility of evidence. The JGAAP software framework (Juola et al., 2006) in conjunction with the Ad-hoc Authorship Attribution Competition corpus (Juola, 2004) provide us with some preliminary results. JGAAP (described elsewhere in this volume) is a modular, Java-based program capable of performing thousands of different types of authorship attribution methods on a well-defined corpus. The AAAC comprises 13 authorship attribution problems in a variety of languages and genres. This setup makes large-scale performance comparisons among categories practical. For example, 8 of the 13 AAAC problems involved English text (in some form or another), but 5 involved other languages such as French, Dutch, Latin, or Serbian/Slavonic. If authorship attribution did not transfer well, we would expect to see little correlation between the average performance of a method on English texts and its performance in other languages, as high-performing methods would not necessarily remain high-performing in other environments. Conversely, if we see a high degree of correlation across languages, this argues for a high degree of transfer. As part of some other large-scale technical comparisons (this volume), we have gathered 281 separate analyses of the AAAC data using a variety of preprocessors, event set models, and analytic methods, ranging from simple lexical statistics or nearest neighbor histogram measures to complex machine learning models such as support vector machines on word or character n-grams of various sizes. In this database, the correlation between a method's average performance in English and non-English was 0.6680, a highly significant (p < 0.0001) result. More tellingly, the coefficient of determination (r^2) was 0.4462, meaning that approximately 45% of the variation in performance of an algorithm across non-English data could be explained simply by a measure of its performance on English-only data; variations in genre, register, or even variations across the broad category of languages-that-aren't-English have only a little more effect in total. (We should add that work is ongoing and the 281 analyses will undoubtedly be expanded in the next several months; these numbers are therefore only preliminary and updated results will be presented.) Similarly, we can divide the AAAC problems into "large" problems (those with training samples of more than 100,000 characters each) and "small" ones. (Of the 8 English problems, 4 were large; of the five non-English problems, 3 were large.) Across the same 281 analyses, the correlation between performance on large problems and small problems was again highly significant (r=0.6061, r^2 = 0.3674), meaning that almost 37% of the variation in performance was explained by predicted performance on other sizes. This provides strong evidence, then, that good algorithms for authorship in one domain will also be good algorithms for authorship in other domains. In particular, we have hopes that a set of best practices established by looking at one particular data set will be a set of at least "good" practices in other domains, and can be a useful starting point in a search for domain-specific best practices in other, less studied or novel, domains. Unfortunately, the way that the AAAC data was structured prevents direct comparisons of accuracy (although it is hard to imagine ways to establish that two authorship attribution tasks are "comparably difficult" to enable such direct comparisons). Of course, at one level, one could make the argument that a bad algorithm for English should not be expected miraculously to perform better when transferred to a language the original designer cannot speak or read. But it is still encouraging to find that a good algorithm for English can be expected to perform well in that same unknown language. REFERENCES: Juola, Patrick. (2004). "Ad-Hoc Authorship Attribution Competition" ALLC/ACH 2004 Conference Abstracts. Gothenberg: University of Gothenberg. Juola, Patrick. (2006/8). Authorship Attribution. Foundations and Trends in Information Retrieval 1:3. Delft:NOW Publishing. Juola, Patrick, John Sofko, and Patrick Brennan. (2006). "A Prototype for Authorship Attribution Studies." Literary and Linguistic Computing 21:169-78