Michael Eriksson's Blog

A Swede in Germany

Search-engines missing the point

with 4 comments

Search mechanisms on the Internet are another ([1], [2]) common source of “missing the point”, be they global search-engines or site-internal:

When a user searches, it is not to find hits—but to find relevant hits. However, most* search-engines, etc., appear to focus on maximizing the number of hits and to consider relevance a secondary criterion—often combined with an attitude of “we know better than the user what he wanted to search for”.

*In my experience over the last few years: I have, obviously, only used a minority of the world's searches (and avoid Google for reasons of privacy) and things were a bit better in the past.

Wikipedia is a good example: Almost every time I search for something obscure, I am met with a message of “Showing results for [less obscure word]. Search instead for [original search word].”—utterly unacceptable! If I search for X, Wikipedia should non-negotiably show the results for X. The first assumption should always be that the user knows what he wants. If there is reason to expect a mistake, e.g. if someone searches for “reciever”, then it is acceptable, often beneficial, to also make a suggestion “Did you want to search for ‘receiver’?”—but the hits for “reciever” should be shown by default. Presuming to override the search punishes those who actually know what they do and do so correctly, while assisting those who do not…* (However, in the specific example used, some room for display of both can be available as per excursion below. In many or most other cases this is not so, because the actual word used is correctly spelled, but just happens to not have an article on Wikipedia.)

*I stress that I do not claim perfection in this regard: I often make mistakes that involve turning two letters around, hitting return before typing the last letter, and similar. However, I have no objections to paying the price of a second search when I actually have made a mistake; having to pay that price when I did not make one is what annoys me.
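The behavior asked for above can be put as a minimal sketch. Everything here (the `SearchResult` type, the `search` function, the toy index, and the corrections table) is hypothetical and invented for illustration; it is not how Wikipedia implements its search. The point is only that the query is executed as typed and the correction travels along as a suggestion.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SearchResult:
    query: str                 # the query that was actually executed
    hits: List[str]            # hits for exactly that query
    suggestion: Optional[str]  # optional "Did you want to search for ...?"


def search(query: str,
           index: Dict[str, List[str]],
           corrections: Dict[str, str]) -> SearchResult:
    """Run the query as typed; offer a correction only as a side note."""
    hits = index.get(query, [])
    suggestion = corrections.get(query)
    return SearchResult(query=query, hits=hits, suggestion=suggestion)


# Toy data, invented for illustration.
INDEX = {
    "receiver": ["Receiver (radio)", "Receiver (law)"],
    "reciever": ["Common misspellings"],
}
CORRECTIONS = {"reciever": "receiver"}

result = search("reciever", INDEX, CORRECTIONS)
```

Here `result.hits` holds the hits for "reciever" and `result.suggestion` holds "receiver"; a user who did make a mistake pays with a second search, and a user who did not pays nothing.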

My current main search-tool, duckduckgo*, is truly horrible in this regard. It does normally show hits for what I searched for, but very often in a form so diluted as to make the results useless. This includes an ever recurring “Not many results contain [specific search term]. Search only for [the original search terms]?”, which amounts to “if this specific search term is included, we do not find many hits, so we have chosen to consider it secondary”. However, the terms falling victim to this are usually those that I very, very deliberately included in order to ensure that the relevance of the hits was high enough! In effect, the better my original choice of terms, the stronger they filter, the more valuable they would have made the resulting hits—the less likely they are to be taken at face-value by duckduckgo. Even a single relevant hit is better than a hundred irrelevant hits! This misbehavior is especially annoying when the space of potential hits has to be reduced by several types of criteria, each essential for the hits to be relevant.**

*A good choice in terms of claimed philosophy with regard to e.g. anonymity, and often the default with e.g. the Tor Browser. However, due to the poor search results, I will almost certainly look for a replacement in the near future. Notably, duckduckgo is yet another tool that has grown worse over time. This lately includes ever more “paid hits”.

**Consider e.g. searching for information on installing a certain piece of software, which requires at least three types of information: (1) Installation instructions are different for different platforms, implying that the platform is needed, preferably fairly specifically. Even with Linux there are often differences from distribution to distribution, and the instructions for Windows or MacOS are highly unlikely to be helpful. (2) These instructions are different for different pieces of software, implying that the current software is needed. (3) The fact that an installation is concerned (and not e.g. general product information or information on trouble-shooting a run-time problem) is needed…

However, even when this dreaded message does not appear, the actual use of the search terms appears to be fairly arbitrary and hits are very often of low quality. This to the point that I suspect that duckduckgo internally uses an “or”* search and then delivers the hits based on some ranking** where “and”*** is just a secondary criterion. The result is that I often have to repeatedly manipulate my query using additional instructions,**** which can waste quite a lot of time and be very frustrating, when it occurs repeatedly in a short span of time. It is far better to be honest, deliver the few relevant hits, and suggest a less stringent search, than to pretend that an ocean of relevant hits were found—but which actually are an ocean of irrelevant hits.

*A query like “a b c” should obviously per default be interpreted as “give me hits that match ‘a’ and ‘b’ and ‘c’ and nothing else”—not as “give me hits that match ‘a’ or ‘b’ or ‘c’ or otherwise only partially match my criteria”. The key to good searches is cutting out the irrelevant, not grabbing anything even remotely plausible looking. (The one case for “or”, as a default, that once could have been made, is long outdated. Cf. excursion.)

**Using a ranking is not a problem, but could even be seen as a necessity. (Another problem with Wikipedia is the weak or absent ranking of hits.) However, this ranking must have relevance as the most important criterion. If this is ignored in favor of criteria like popularity, a very popular page that deals extensively with “a” (but not “b” and “c”) might be ranked over an unpopular page that actually deals with all three.

***More strictly speaking, the textual relevance for the search terms.

****Specifically, the use of “+” to force the use of a given search term and quotation marks to ensure that a certain term is taken literally, e.g. “a "b" c” (but see excursion below).
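The difference between the interpretations discussed in these footnotes can be shown with a small sketch. The corpus, the popularity numbers, and the three function names are invented for illustration, and the "or plus popularity" variant merely models the suspicion voiced above; nothing here is known to reflect duckduckgo's internals.

```python
from typing import Dict, List, Set, Tuple

# Toy corpus, invented for illustration: page -> (words on the page, popularity).
PAGES: Dict[str, Tuple[Set[str], int]] = {
    "popular-a-only": ({"a", "x", "y"}, 1000),
    "obscure-a-b-c": ({"a", "b", "c"}, 3),
    "medium-a-b": ({"a", "b", "z"}, 50),
}


def search_and(terms: List[str]) -> List[str]:
    """Default asked for in the text: only pages that contain every term."""
    wanted = set(terms)
    matches = [name for name, (words, _) in PAGES.items() if wanted <= words]
    return sorted(matches, key=lambda name: -PAGES[name][1])


def search_or_by_popularity(terms: List[str]) -> List[str]:
    """Suspected behavior: any term matches, popularity decides the order."""
    wanted = set(terms)
    matches = [name for name, (words, _) in PAGES.items() if wanted & words]
    return sorted(matches, key=lambda name: -PAGES[name][1])


def search_or_by_relevance(terms: List[str]) -> List[str]:
    """An "or" search that ranks by number of matched terms first."""
    wanted = set(terms)
    matches = [name for name, (words, _) in PAGES.items() if wanted & words]
    return sorted(
        matches,
        key=lambda name: (-len(wanted & PAGES[name][0]), -PAGES[name][1]),
    )
```

For the query terms `["a", "b", "c"]`, `search_and` returns only the obscure page that deals with all three, `search_or_by_popularity` puts the popular page about "a" alone on top, and `search_or_by_relevance` at least ranks the full match first.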

Excursion on synonyms and similar:
An acceptable, normally highly beneficial, exception to the use of the user's literal query is the application of synonyms and similar fuzziness. For instance, if a search for “horse” does not include pages that use “horses”, “equine[s]”, and whatnot (but fail to use “horse”), there would be a considerable additional burden on the user, including the need for repeated searches or searches that make heavy use of “or” instructions. In the minority of cases where the specific, literal, search-term is required, the user still has the option of being explicit* about this. Some degree of such fuzziness has been a quasi-standard since the late 1990s or the early 2000s.

*Typically, through the use of “"”.
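As a rough sketch of such fuzziness with a literal override: the synonym table, the corpus, and the function names below are made up for illustration, and only single-word terms are handled (a real engine would also need multi-word quoted phrases).

```python
from typing import Dict, List, Set

# Toy synonym table and corpus, invented for illustration.
SYNONYMS: Dict[str, Set[str]] = {
    "horse": {"horse", "horses", "equine", "equines"},
}
PAGES: Dict[str, Set[str]] = {
    "page-1": {"horse", "saddle"},
    "page-2": {"horses", "stable"},
    "page-3": {"equine", "vet"},
    "page-4": {"cow", "barn"},
}


def alternatives(term: str) -> Set[str]:
    """Words that may satisfy a term; a "quoted" term matches only itself."""
    if len(term) >= 2 and term.startswith('"') and term.endswith('"'):
        return {term[1:-1]}
    return SYNONYMS.get(term, {term})


def search(query: str) -> List[str]:
    """Every term must be satisfied, by itself or by one of its synonyms."""
    groups = [alternatives(term) for term in query.split()]
    return sorted(
        name for name, words in PAGES.items()
        if all(group & words for group in groups)
    )
```

With this toy data, `search('horse')` finds pages 1 to 3, while `search('"horse"')` finds only page 1.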

However, even this can be taken too far. For instance, I recall once searching for information on User-Mode Linux: Knowing that using the typical abbreviation, “UML”, might give me many irrelevant hits on Unified Modeling Language, I very, very deliberately spelled the phrase out—and found hits dominated by … Unified Modeling Language! Apparently, the search-engine had reasoned that “Hmm, ‘User-Mode Linux’ is the same as ‘UML’, ‘UML’ is the same as ‘Unified Modeling Language’; ergo, I should show hits for ‘Unified Modeling Language’!”, despite the two having nothing to do with each other.
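The reasoning attributed to the search-engine above is speculation, but the failure mode is easy to reproduce in a sketch: expanding equivalences one step is harmless, while following them transitively drags in unrelated terms. The table and both function names are hypothetical.

```python
from typing import Dict, Set

# Toy equivalence table, invented for illustration.
EQUIVALENT: Dict[str, Set[str]] = {
    "User-Mode Linux": {"UML"},
    "Unified Modeling Language": {"UML"},
    "UML": {"User-Mode Linux", "Unified Modeling Language"},
}


def expand_once(term: str) -> Set[str]:
    """The term plus its direct equivalents only."""
    return {term} | EQUIVALENT.get(term, set())


def expand_transitively(term: str) -> Set[str]:
    """Follow equivalents of equivalents until nothing new turns up."""
    seen: Set[str] = set()
    todo = [term]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(EQUIVALENT.get(current, set()))
    return seen
```

Here `expand_once("User-Mode Linux")` yields the phrase and “UML”, whereas `expand_transitively("User-Mode Linux")` also yields “Unified Modeling Language”.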

A particular problem with too much fuzziness is that countering it with a literal search will be an all-or-nothing deal (for the search-term in question): For instance, if an over-eager search-engine includes results for “address Doctor John Smith” for the search “address Professor John Smith”, this can be corrected by searching for “address "Professor John Smith"”. However, this will also exclude references to “Pr. John Smith” and “Prof. John Smith”. Here, the attempt to combat too much fuzziness requires throwing the baby out with the bath water.

Remark on quotes:
Note that there are three types of quotes used in this text, regular double (“/”), regular single (‘/’), and the-ones-on-the-keyboard ("/").


Written by michaeleriksson

September 29, 2018 at 12:03 pm

4 Responses


  1. Just had a rude awakening with duckduckgo. Searched for ‘purdue dean of students office’. Literally no hits. Tried with google: first hit, with multiple subsections.

    Bruce Peary Solomon

    September 29, 2018 at 5:33 pm

  2. […] because there the fancy quotes are simply not equivalent. Indeed, this was specifically in a text ([1]) where I needed to use three types of quotation marks to discuss search syntax in a reasonable […]

  3. […] Excursion on missing the point: I have written on the strongly overlapping issue of “missing the point” repeatedly in the past, e.g. in [2], [3], [4]. […]

  4. […] on searching: I have repeatedly written about poor search result (most notably in [2] and [3]). Trying to find some of my own old texts by using the W-rdpr-ss search, for mention above, has […]
