Friday 25 May 2018

Plagiarism software hardly finds significant overlap between Darwin (1859) and Matthew (1831)

[Update 26.5.2018: Now with control assay (blind sample) at the end, in order to render the conclusion valid.]

Some folks incessantly claim that Charles Darwin (1859: On the Origin of Species) plagiarized Patrick Matthew (1831: On Naval Timber and Arboriculture). They tout similarities in usage in combination with findings that merely show that Victorian naturalists and publishers were connected by fewer than six degrees of separation. The latter is only to be expected, given that science was a much smaller endeavor back then. Mere connections leading from Matthew to Darwin via acquaintance or citation, along x degrees of separation, do not prove conscious plagiarism by Darwin. Here, I check both books with plagiarism-detection software in order to test the claim that similarities in usage prove plagiarism.

The software
WCopyfind is a program that compares text files. In order to compare vintage books, like Darwin's On the Origin of Species (1859) and Matthew's On Naval Timber and Arboriculture (1831), you only need to download them as plain text from sites like archive.org or Project Gutenberg and save them in a format that the program can process (e.g., .docx). The program is very lean, fast and easy to use. The download takes a mere second and getting started a mere half an hour. Just click on the link above and try it yourself.

The interface looks like this.


In the image shown above, the parameters are set to report matching strings with a length of four words or more ("Shortest Phrase to Match: 4 Words"). The parameter "Most Imperfections to Allow: 2" means that, here, the program will bridge up to two non-matching words within a phrase, thus allowing for some editing. The parameter "Minimum % of Matching Words: 80%" also allows for editing, in that only 80% of the words in two passages or phrases need to be identical for them still to be reported. "Fewest Matches to Report: 1 Words" actually means that even a single matching 4-word phrase (not "1 words") will be reported. This is the lowest bar, and one match can be expected by mere chance in books of such length.
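For readers who want to see what a "shortest phrase to match" setting amounts to, here is a minimal Python sketch of perfect phrase matching. It is an illustration only, not WCopyfind's actual algorithm: it handles neither the bridging of imperfections nor the 80% threshold, and the function names (words, shared_phrases) are made up for this example. The two snippets are the fragments quoted in the basic check further down.

```python
import re


def words(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def shared_phrases(text_a, text_b, min_words=4):
    """Return every run of exactly min_words words that occurs in both texts."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + min_words])
                for i in range(len(tokens) - min_words + 1)}
    return ngrams(words(text_a)) & ngrams(words(text_b))


# Fragments as quoted in this post (not the full sentences from the books).
darwin = "We have also seen in the second chapter"
matthew = "we have also seen in the moss of Balgowan"

for phrase in sorted(shared_phrases(darwin, matthew)):
    print(" ".join(phrase))
```

Raising min_words from 4 to 6 leaves only the single six-word run that the two fragments share, which is why a higher setting reports fewer hits.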

Basic check
All the above parameter settings are as recommended by the maker of WCopyfind, except the string-length to be matched. Here, the maker recommends a string-length of 6 matching words for reporting. With this default setting, however, the program reports 0% overlap between Darwin's and Matthew's books.


There must be some rounding involved in this reported 0% overlap, because some hits can still be found in viewing both texts side-by-side. But these are entirely fortuitous and would occur in any two texts of a certain length. They signify no plagiarism as the immediate context of the matching phrases shows.

For example, such a search yields matches of the following kind: "we have also seen in the," which Darwin continues with "second chapter" and Matthew with "moss of Balgowan."


Other matches require some more context in order to appreciate how fortuitous and trivial the match is:
"in many cases it is most difficult to" (Darwin 1859) vs

"in cases where it is difficult to" (Matthew 1831).
The green words are the ones that have been bridged and the red words constitute the 6-word match. However, the immediate context of this match is completely different in both books:
"Although in many cases it is most difficult to conjecture by what transitions an organ could have arrived at its present state yet, considering that the proportion of living and known forms to the extinct and unknown is very small, I have been astonished how rarely an organ can be named, towards which no transitional grade is known to lead." (Darwin 1859)

"Forests of Ficus sylvestris are sometimes destroyed by insects under the bark, in cases where it is difficult to decide whether external circumstances, such as a dry warm season, has been promotive of the increase of the insect itself, or has induced some disorder in the plant, rendering the juices more suitable aliment to the worm." (Matthew 1831)
One actually needs to reduce the string-length to 4 words in order to get to the point where the program reports the phrase that Mike Sutton, criminologist at Nottingham Trent University, takes to signify that Darwin plagiarized Matthew:
"this process of natural selection" (Darwin 1859) vs

"this natural process of selection" (Matthew 1831).
The immediate contexts:
"Therefore, I can see no difficulty, more especially under changing conditions of life, in the continued preservation of individuals with fuller and fuller flank-membranes, each modification being useful, each being propagated, until by the accumulated effects of this process of natural selection, a perfect so-called flying squirrel was produced." (Darwin 1859, p. 181)

"Man's interference, by preventing this natural process of selection among plants, independent of the wider range of circumstances to which he introduces them, has increased the difference in varieties, particularly in the more domesticated kinds and even in man himself, the greater uniformity, and more general vigour among savage tribes, is referrible to nearly similar selecting law the weaker individual sinking under the ill treatment of the stronger, or under the common hardship." (Matthew 1831, p. 308)
It is true, both contexts are evolutionary, but does that prove plagiarism by Darwin of Matthew? The perfect matches amount to 2% of Darwin's book and 4% of Matthew's, most of it the trivial stuff that is clearly coincidental. And allowing for the bridging of words and only 80% matching words in phrases did not change this for the overall matches.
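The two percentages differ, presumably, because the same matched words are divided by two different book lengths. The Python sketch below illustrates that arithmetic under simplifying assumptions: it counts only perfect runs of four words, it is not how WCopyfind computes its figures, and the function name percent_matched and the two toy texts are invented for this example.

```python
import re


def words(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def percent_matched(text, other, min_words=4):
    """Percentage of the words in `text` that lie inside a run of
    min_words words which also occurs in `other`."""
    tokens = words(text)
    other_tokens = words(other)
    other_ngrams = {tuple(other_tokens[i:i + min_words])
                    for i in range(len(other_tokens) - min_words + 1)}
    covered = [False] * len(tokens)
    for i in range(len(tokens) - min_words + 1):
        if tuple(tokens[i:i + min_words]) in other_ngrams:
            for j in range(i, i + min_words):
                covered[j] = True
    return 100.0 * sum(covered) / len(tokens) if tokens else 0.0


# Toy texts (invented): the same four matched words weigh more in the shorter one.
long_text = "in many cases it is difficult to say what happened here today"
short_text = "it is difficult to decide"

print(round(percent_matched(long_text, short_text), 1))
print(round(percent_matched(short_text, long_text), 1))
```

The shared run "it is difficult to" covers 4 of 12 words in the longer text but 4 of 5 words in the shorter one, so the shorter text always shows the higher percentage.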


Control assay (blind sample)
The problem with purely random word-salads is that they are not grammatical. That is, two word-salads will show fewer matches, by pure chance, than two independent texts, because grammar, rules, as well as an era's fashion of speech force language into similar strings of words at times. A blank sample or control assay must, therefore, compare two texts that are from the same time and culture and whose authors did not imitate each other. I assume (with no expertise in English literature on my part)* that Oscar Wilde's (1854-1900) tastes were sufficiently different from those of Amanda Minnie Douglas (1831-1916), author of juvenile fiction like the Little Girl... and Helen Grant... series, and that they did not copy each other. I compared their novels The Picture of Dorian Gray (1890) and A Little Girl in Old Boston (1898) using the same parameter settings as above.

-----
* I assumed that a homoerotic book by an openly homosexual author would have little in common with the girly romance novel of a spinster.
-----


Here, the perfect matches amount to 3% in Douglas's book and 4% in Wilde's and, respectively, the overall matches amount to 3% and 5%. This is slightly higher than in the comparison of Darwin's Origin with Matthew's Naval Timber (see above), suggesting that there is no more reason to assume that Darwin plagiarized Matthew than to assume that spinster Douglas plagiarized scandalous Wilde. Surely, the 2% and 4% overlap between Darwin's and Matthew's books, respectively, is within the range of what is to be expected from any pair of books of roughly the same age and culture.

A note of caution to those who are fond of relying on algorithms rather than reading and thinking themselves: the program missed a four-word shuffle, because it found an overlapping perfect three-word match first. That is, the program overlooked the following match:
"but it seems somehow" (Douglas 1898, p. 11) vs

"but somehow it seems" (Wilde 1890, p. 14).
Instead, Wilde's sentence is matched with another of Douglas's, because of a three-word match that required no bridging:
"Now, it seems to me I never could learn French." (Douglas 1898, p. 99)
"It is a silly habit, I daresay, but somehow it seems to bring a great deal of romance into one's life." (Wilde 1890, p. 14)
Obviously, neither the perfect three-word match nor the four-word-shuffled match has any significance concerning plagiarism. Apparently, four-word-shuffled matches do occur by mere coincidence at times. One should, therefore, take the occurrence of such a match between Darwin (1859) and Matthew (1831) with a grain of salt.
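A shuffled match of this kind can be found by comparing windows of four words after sorting the words inside each window. The following Python sketch does just that. It is a stand-in technique chosen for illustration, not a description of what WCopyfind does internally, and the function name shuffled_matches is hypothetical. Only the quoted phrase from Douglas is used, since her full sentence is not reproduced here.

```python
import re


def words(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def shuffled_matches(text_a, text_b, n=4):
    """Return pairs of n-word windows that contain the same words
    in a different order."""
    def windows(tokens):
        table = {}
        for i in range(len(tokens) - n + 1):
            window = tuple(tokens[i:i + n])
            # Windows with the same words share the same sorted key.
            table.setdefault(tuple(sorted(window)), set()).add(window)
        return table

    a, b = windows(words(text_a)), windows(words(text_b))
    pairs = []
    for key in sorted(a.keys() & b.keys()):
        for wa in sorted(a[key]):
            for wb in sorted(b[key]):
                if wa != wb:  # identical order is a perfect match, not a shuffle
                    pairs.append((" ".join(wa), " ".join(wb)))
    return pairs


douglas = "but it seems somehow"
wilde = ("It is a silly habit, I daresay, but somehow it seems to bring "
         "a great deal of romance into one's life.")

print(shuffled_matches(douglas, wilde))
```

Because the comparison ignores word order inside a window, it also flags shuffles that mean nothing, which is exactly the caveat raised above.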

Some readers might think that the string "natural process of selection" is charged with meaning in a way that "but somehow it seems" is not and that, therefore, the former string carries biological significance in a way that the latter does not. However, the first use of the string "natural process of selection" in a publication (as far as the digitized record goes) was by Sigmund Spaeth (1829. The Encyclopedia Americana, Vol. 19, pp. 636-639). This encyclopedia entry, however, carries no significance concerning natural history. It is about music appreciation and contains the four words in a context that has nothing to do with biology:
"Program music will generally be found easier to grasp than absolute music, and this is again a natural process of selection." (Spaeth 1829, p. 638)