What is similarity? How do we categorize "similar" stuff? Seems pretty obvious, right? We see similar things, and they are properly categorized because, well, they're just similar! That seems correct until you look at the fine print and try to see the intricate cognitive details involved. Our minds have concepts, and these concepts sing and dance according to the tune of the moment. Perhaps the brain's motto should be: our criteria for classification and categorization are simple: there are no criteria.
Take, for instance, the question "What is similarity?", and consider a text document. Let's think like a computer: if the Google search engine found two documents with thousands of words and a mere comma of difference between them, should Google classify the pages as similar? If you think so, look at it through a human's (or, perhaps better, a lawyer's) eyes.
Here's a story from the New York Times:
The Comma That Costs 1 Million Dollars (Canadian)
OTTAWA, Oct. 24 — If there is a moral to the story about a contract dispute between Canadian companies, this is it: Pay attention in grammar class.
The dispute between Rogers Communications of Toronto, Canada’s largest cable television provider, and a telephone company in Atlantic Canada, Bell Aliant, is over the phone company’s attempt to cancel a contract governing Rogers’ use of telephone poles. But the argument turns on a single comma in the 14-page contract. The answer is worth 1 million Canadian dollars ($888,000).
Citing the “rules of punctuation,” Canada’s telecommunications regulator recently ruled that the comma allowed Bell Aliant to end its five-year agreement with Rogers at any time with notice.
Rogers argues that pole contracts run for five years and automatically renew for another five years, unless a telephone company cancels the agreement before the start of the final 12 months.
The dispute is over this sentence: “This agreement shall be effective from the date it is made and shall continue in force for a period of five (5) years from the date it is made, and thereafter for successive five (5) year terms, unless and until terminated by one year prior notice in writing by either party.”
Look at that last comma. How long should the contract last? Without the comma, it's pretty clear, right? It must last at least a full 5 years. It is beside the point whether the
Let's now get back to Google's way of looking at things. There are two 14-page documents; one has a single comma that the other lacks. Should Google classify them as "similar"? It seems obvious that it should: to Google's eyes, these are 99.9999% similar. After all, under what circumstances should the algorithms in a search engine perceive the semantic dangers that lie within a single comma, given thousands and thousands of documents whose words and paragraphs otherwise match exactly?
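To make this concrete, here is a minimal Python sketch that scores the disputed sentence against a copy with the last comma removed. It is only an illustration: the two measures (a character-level ratio from Python's standard difflib module and a word-set Jaccard overlap) are stand-ins I picked for the occasion, not a description of how Google or any real search engine works. The names char_similarity and word_jaccard are hypothetical, and WITHOUT_COMMA is a variant made up for the comparison.

```python
import difflib
import re

# The disputed sentence, as quoted in the story above.
WITH_COMMA = (
    "This agreement shall be effective from the date it is made and shall "
    "continue in force for a period of five (5) years from the date it is "
    "made, and thereafter for successive five (5) year terms, unless and "
    "until terminated by one year prior notice in writing by either party."
)

# Hypothetical variant: the same sentence without the comma after "terms".
WITHOUT_COMMA = WITH_COMMA.replace("terms, unless", "terms unless")


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], computed by difflib."""
    # autojunk=False switches off difflib's heuristic for long sequences,
    # so every character takes part in the comparison.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def word_jaccard(a: str, b: str) -> float:
    """Overlap of the two word sets; punctuation is thrown away entirely."""
    words_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    print("character-level:", char_similarity(WITH_COMMA, WITHOUT_COMMA))
    print("word-set Jaccard:", word_jaccard(WITH_COMMA, WITHOUT_COMMA))
```

Since the word-based measure discards punctuation before comparing, it cannot see the comma at all, and the character-based one registers it as a single character out of a few hundred. Stretch that over 14 pages and the difference shrinks even further. Neither number says anything about what the comma does to the meaning.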