"Conjecture Generation in the Digital Humanities" Patrick JUOLA, Ashley BERNOLA Duquesne University // juola@mathcs.duq.edu Digital scholarship has been very helpful in the development of humanities research (Juola, 2008a), primarily by automating processes such as communication, text processing, and search and permitting scholars to concentrate on analysis and explanation. However, when computers attempt to generate meaning, to perform analysis themselves, the results are usually less than satisfactory and don't actually explain much. An example of analytic failure can be seen in the reception of "nontraditional" (i.e. statistical, computer-aided) authorship attribution. It is now unquestionable that computers can infer authorship attributes with high accuracy (see Juola, 2008b), but the accurate inference processes tend not to inform us about the actual authors (Craig, 1999). Argamon (2006) has provided a theoretical analysis of one particular method, but in the unfamiliar and ``inhuman'' language of statistics, which again sheds little light on authorial language and authorial thought. By contrast, studies of gender differences in language (e.g., Coates, 2004) offer not only lists of differences, but explanations in terms of the social environment. In fact, the interesting part of scholarship is not in mere observation, but in the refinement and explanation. This suggests a relatively novel model for computer/human interaction in scholarship, one in which the computer is used to identify patterns that are passed to human experts for validation and explanation. While not widely used, this model has been successfully applied in mathematics by the _Graffiti_ program (Fajtlowicz, 1986). This program generates conjectures (randomly) of the form X < Y (or similar forms such as X < Y + Z) where X, Y, and Z are "graph invariants," simple numeric properties of graphs such as average number of edges per node, number of colors necessary to color the graph like a map, average distance between nodes, number of different paths between nodes, and so forth. Graffiti then compares this conjecture against a library of graphs to see if it is true for all the graphs in the library. Of course, "true for all the graphs in the library" is not proof of universal truth, but it provides evidence in support. If the conjecture is thus supported, Graffiti publishes the conjecture, and professional mathematicians are invited to prove (or disprove) it. Since inception, Graffiti has developed and published more than 1000 conjectures and inspired more than 100 papers. This model can easily be applied to the humanities. Application in the field of text analysis is straightforward; we need analogues to "graph invariants." Such "text invariants" might include the frequency of specific words, phrases, or structures in particular text types. A simple conjecture might be that "color adjectives" are more common than "size adjectives," or that "verbs of motion" are more common in male-written novels than in female-written poetry. Of course, text invariants are not limited to token frequency analysis; any "property" that can be assessed via computer analysis would be a possibility. If one can find a way to determine a text's eroticism, degree of animacy, personification, etc., any of those would be potential features. In addition to text invariants, we need a framework for conjectures (in analogy to the X < Y + Z framework described above). Simple comparisons are likely to find uninteresting conjectures (prepositions are more common than proper nouns). A more interesting conjecture (in the author's opinion) would be something like "color adjectives are more common than size adjectives in women's writing, but the reverse holds in men's writing." Such a finding would clearly show a relationship, as yet unexplained, between adjective choice and gender. If this were true -- why? It obviously says something about gender and culture, but what? Here is where a traditional humanist could take advantage of the ability of a computer to read and analyze a huge amount of data very quickly. In general, conjectures of the form "X > Y in texts of category A, but Y > X in texts of category B" (where A and B are non-overlapping categories, ideally pulled from the document's metadata) are likely to be of interest to category A/B specialists, especially if X and Y are themselves interesting natural properties. Another framework would be that "X is more common in A-texts than B-texts," and it is this framework that we use in our prototype. To illustrate this, we have built a simple version of this conjecture generator ("conjecturator") using standard Java technology, much of it drawn from the JGAAP project (Juola, 2007). The Moby Thesaurus II lists more than 30,000 different synonym sets : for example, the word group "raft" (as in "a raft of money") includes words/phrases such as "barge," "boat," "pile," "pot," and "quite a little." The word group "take back" includes terms like "abjure," apologize," "renege," "disown," and "nullify." We have also collected eight separate (English) translations of the Bible ranging from the Authorized (King James) Version to Revised Standard Version and the Bible in Basic English. Our program selects one synonym set and two Bible versions (at random), then counts every appearance of each word token listed in the synonym set. Our conjectures are therefore of the form: "Words in appear at least 50% more frequently in Bible translation than in one." Our prototype strips punctuation and case differences, but does not perform morphological analysis or even word sense disambiguation. Despite this limitation, simple word-counting reveals that the word group "take back" appears approximately twice as often in the RSV as it does in Young's Literal Translation. We have similarly found that the word group "rhythmical" occurs substantially more often in the KJV than in Darby's Translation. We have at this writing no explanation, but offer them (along with many other findings) to interested Biblical scholars as a potentially unexplored facet of the differences among different translations. (A list of conjectures will be available both electronically and at the conference -- about 40% of conjectures appear to be valid, a percentage we find surprising.) Testing these conjectures has been relatively easy (if time-consuming); We simply attach the program to a large database (in this case, of Bibles) and allow it to sample from the database until it has either confirmed or rejected the hypothesis to its satisfaction. (For example, there does not appear to be a significant difference in the word group "unauthorized" between the RSV and the American Standard Version.) We can extend such a program even to help solve the "how do you read a million books" problem, since the program could not only do the bulk of the initial reading to see if the hypothesis is true in the first place, but would automatically generate a reading list for scholars interested in following up on the conjecture. (At a second per book, a computer could analyze all million texts and deliver a list of how each work fared vs. the conjecture in less than two weeks. By contrast, a human closely reading one book per day would require 3000 years to read a million books.) Even our prototype system can examine many more categories and hypotheses than even the most avid and interested human reader -- in its first 24 hours alone, it found more than 200 possibly interesting differences between Bible versions. As a further extension, we have extended the Conjecturator to include multiple documents and a more robust form of statistical analysis. Using a collection of more than 100 Victorian novels (courtesy of David Hoover, NYU), we now observe mean word usage within a group such as bildungsromanen or gothic novels, compute variance and t-statistics and accept a conjecture if the computed p-value is sufficiently low (in either direction). Results of this further experiment will also be presented. Although our prototype is limited to text analysis, the possibility of automatic conjecture generation may extend further. A large and rich database of GIS and/or census information may be able to support, for example, conjectures of the form " is more common in than ." An example of such a conjecture would be a relationship previously unimagined between the number of veterinarians and Methodist churches in coastal counties. What are the benefits of such a program? This conjecture generator can deliver a set of (partially) validated observations about easily observable, superficial properties of the texts in the library (or points in the database more generally defined). By construction, all published conjectures are more or less guaranteed to describe something true, at least about the library. These partial truths, to humanists interested in the study of the library, may represent insights that they have not considered and a the particular hypothesis under study. Indeed, the scholars may lack the time to familiarize themselves with every volume in the library, and may not even be "digital" enough to understand the computer analysis, but who may be interested enough to see out the new material that they now know is there. At the simplest possible level, 1000 validated conjectures are 1000 topics for student projects, research papers, or Ph.D. theses, a partial solution to the "I need to do a term paper but don't know what I want to do it on" question that plagues all supervisors. More generally, however, this program would also allow humanists to concentrate their efforts on what is generally the most interesting and rewarding part of humanities research; the search for an explanatory theory of human behavior. By giving scholars a list of statements that are probably true, they can concentrate their efforts on producing statements and theories that are genuinely meaningful.