WCopyfind is an easy-to-use program that compares text files. To compare vintage books, like Darwin's On the Origin of Species (1859) and Matthew's On Naval Timber and Arboriculture (1831), you only need to download them as plain text from sites like archive.org or Project Gutenberg and save them in a format that the program can process (e.g., .docx). The program is lean and fast. The download takes a mere second and getting started a mere half hour. Just click on the link above and try it yourself.
The interface looks like this.
In the image shown above, the parameters are set to report matching strings with a length of four words or more ("Shortest Phrase to Match: 4 Words"). The parameter "Most Imperfections to Allow: 2" means that the program will bridge up to two non-matching words within a phrase, thus allowing for some editing. The parameter "Minimum % of Matching Words: 80%" also allows for editing, in that two passages will still be reported if only 80% of their words are identical. "Fewest Matches to Report: 1 Words" actually means that even a single matching 4-word phrase (not "1 words") will be reported. This is the lowest bar, and one match can be expected by mere chance in books of such length.
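To make the "Shortest Phrase to Match" parameter more tangible, here is a minimal Python sketch of exact phrase matching. It is not WCopyfind's actual algorithm: the function names (`tokenize`, `ngrams`, `shared_phrases`) are my own illustrations, and the sketch ignores the "imperfections" and "minimum %" parameters altogether. It only shows how raising the shortest phrase length thins out the matches.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(words, n):
    """Return the set of all n-word sequences (as tuples) in a word list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_phrases(text_a, text_b, shortest_phrase=4):
    """Return the n-word phrases that occur verbatim in both texts.

    Simplification: exact matches only, no bridging of non-matching words.
    """
    a = ngrams(tokenize(text_a), shortest_phrase)
    b = ngrams(tokenize(text_b), shortest_phrase)
    return a & b

# Two short snippets that share an opening run of six words.
darwin = "we have also seen in the second chapter"
matthew = "we have also seen in the moss of Balgowan"

for n in (4, 6, 7):
    hits = shared_phrases(darwin, matthew, shortest_phrase=n)
    print(n, sorted(" ".join(p) for p in hits))
```

With a shortest phrase of 4 words the two snippets share three overlapping phrases, with 6 words only one, and with 7 words none.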
All the above parameter settings are as recommended by the maker of WCopyfind, except the string length to be matched. Here, the maker recommends a string length of 6 matching words for reporting. With this default setting, however, the program reports 0% overlap between Darwin's and Matthew's books.
There must be some rounding involved in this reported 0% overlap, because some hits can still be found when viewing both texts side by side. But these are entirely fortuitous and would occur in any two texts of a certain length. They do not signify plagiarism, as the immediate context of the matching phrases shows.
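A quick back-of-the-envelope calculation shows how a handful of hits can round down to 0%. The figures below are hypothetical and chosen only for illustration (two 6-word matches in a book of 150,000 words), and the assumption that the program rounds to whole percent is mine, not something taken from the WCopyfind documentation.

```python
def percent_overlap(matched_words, total_words):
    """Share of a text's words that sit inside matched phrases, in percent."""
    return 100.0 * matched_words / total_words

# Hypothetical figures, for illustration only.
matched = 2 * 6      # two matches of six words each
total = 150_000      # assumed length of the book in words

exact = percent_overlap(matched, total)
print(f"exact: {exact:.4f}%  rounded to whole percent: {round(exact)}%")
```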
For example, such a search yields matches of the following kind: "we have also seen in the," which Darwin continues with "second chapter" and Matthew with "moss of Balgowan."
Other matches require some more context before the fortuitous triviality of the match becomes apparent:
"in many cases it is most difficult to" (Darwin 1859) vs
"in cases where it is difficult to" (Matthew 1831).
The green words are the ones that have been bridged and the red words constitute the 6-word match. However, the immediate context of this match is completely different in both books:
"Although in many cases it is most difficult to conjecture by what transitions an organ could have arrived at its present state yet, considering that the proportion of living and known forms to the extinct and unknown is very small, I have been astonished how rarely an organ can be named, towards which no transitional grade is known to lead." (Darwin 1859)
"Forests of Ficus sylvestris are sometimes destroyed by insects under the bark, in cases where it is difficult to decide whether external circumstances, such as a dry warm season, has been promotive of the increase of the insect itself, or has induced some disorder in the plant, rendering the juices more suitable aliment to the worm." (Matthew 1831)
One actually needs to reduce the string length to 4 words before the program reports the phrase that Mike Sutton, a criminologist at Nottingham Trent University, takes to signify that Darwin plagiarized Matthew:
"this process of natural selection" (Darwin 1859) vs
"this natural process of selection" (Matthew 1831).
The immediate contexts:
"Therefore, I can see no difficulty, more especially under changing conditions of life, in the continued preservation of individuals with fuller and fuller flank-membranes, each modification being useful, each being propagated, until by the accumulated effects of this process of natural selection, a perfect so-called flying squirrel was produced." (Darwin 1859)
"Man's interference, by preventing this natural process of selection among plants, independent of the wider range of circumstances to which he introduces them, has increased the difference in varieties, particularly in the more domesticated kinds and even in man himself, the greater uniformity, and more general vigour among savage tribes, is referrible to nearly similar selecting law the weaker individual sinking under the ill treatment of the stronger, or under the common hardship." (Matthew 1831)
It is true that both contexts are evolutionary, but does that prove that Darwin plagiarized Matthew? The perfect matches amount to 2% of Darwin's book and 4% of Matthew's, most of them trivial and clearly coincidental. Allowing for the bridging of words and for only 80% matching words in phrases did not change these overall figures.
The problem with purely random word-salads is that they are not grammatical. That is, two word-salads will show fewer matches, by pure chance, than two independent texts, because the rules of grammar, as well as an era's fashion of speech, force language into similar strings of words at times. A blank sample or control assay must, therefore, compare two texts that are from the same time and culture, whose authors did not imitate each other. Assuming (with no expertise in English literature on my part*) that Oscar Wilde's tastes were sufficiently different from Henry James's, and that they did not copy each other, I compared their novels The Picture of Dorian Gray and Daisy Miller using the same parameter settings as above.
* Please tweet a comment (button left), if there's reason to choose some other texts as control assay.
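The word-salad argument can be tried out with a small Python sketch. It is an illustration under my own assumptions, not something WCopyfind does: the helper names are hypothetical, matching is exact (no bridging), and the "salad" is simply each text's own words in random order, so that vocabulary stays the same and only grammar is destroyed.

```python
import random
import re

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(words, n):
    """Return the set of all n-word sequences (as tuples) in a word list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def word_salad(text, seed=0):
    """Return the words of a text in random order (same words, no grammar)."""
    words = tokenize(text)
    random.Random(seed).shuffle(words)  # seeded, so the run is repeatable
    return words

def count_shared(words_a, words_b, n=4):
    """Count the distinct n-word phrases that occur in both word lists."""
    return len(ngrams(words_a, n) & ngrams(words_b, n))

def compare_with_salad(text_a, text_b, n=4, seed=0):
    """Return (matches between the real texts, matches between their salads)."""
    real = count_shared(tokenize(text_a), tokenize(text_b), n)
    salad = count_shared(word_salad(text_a, seed), word_salad(text_b, seed + 1), n)
    return real, salad

# Tiny demonstration; real use would pass the full plain text of two books.
a = "we have also seen in the second chapter"
b = "we have also seen in the moss of Balgowan"
print(compare_with_salad(a, b, n=4))
```

Fed with two whole books, the first number is the count for the real texts and the second the count for their shuffled versions; the gap between the two is the part of the overlap that grammar and period style alone can be expected to produce.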