First-of-its-kind Transparent Multi-lingual Stemming in PatSeer
Stemming is helpful in patent search, but it can be a black box: the user is often unsure whether a stem is being applied correctly. If a database stems words at index time, false positives can also appear in the results.
The most important thing to keep in mind is that you should apply stemming only to words you are sure about and _never_ apply a blanket “stem all” to your search query. Of course, for this to be possible the database must allow stemming to be specified on a per-term basis. To understand why, take a very simple example. The most commonly used stemming algorithm for English is the Porter stemming algorithm, which is usually the stemmer used by search databases. If you search for “Coronary stent insertion” and apply a stem-all, what actually gets searched is coronari* stent* insert*, because the stem of the word “Coronary” is “coronari”. So you actually miss records containing the word coronary!
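To make the “coronari” problem concrete, here is a minimal Python sketch. It implements only step 1c of the original Porter algorithm (a final y becomes i when the rest of the word contains a vowel), which is the rule responsible for this particular stem. It is not a full stemmer, and it says nothing about how any particular database is implemented. The function names and the use of `fnmatchcase` to imitate a truncated search term are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

VOWELS = "aeiou"


def is_vowel(word: str, i: int) -> bool:
    """Porter's definition: a, e, i, o, u, plus 'y' when it follows a consonant."""
    if word[i] in VOWELS:
        return True
    return word[i] == "y" and i > 0 and not is_vowel(word, i - 1)


def step_1c(word: str) -> str:
    """Step 1c only (illustrative): final 'y' -> 'i' if the rest contains a vowel."""
    stem = word[:-1]
    if word.endswith("y") and any(is_vowel(stem, i) for i in range(len(stem))):
        return stem + "i"
    return word


# Treat the stem as a truncated search term, as in the example above.
pattern = step_1c("coronary") + "*"
hits = [w for w in ["coronary", "coronaries"] if fnmatchcase(w, pattern)]
print(pattern, hits)  # coronari* ['coronaries']
```

The truncated stem matches “coronaries” but not “coronary” itself, which is exactly the kind of silent miss a blanket stem-all can cause.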
In this example you may be able to tell that something is wrong simply from the low result count, but in many other cases it will not be so easy. What if you had combined “coronary stent insertion” with an IPC class corresponding to stents?
There are many such cases in Information Retrieval where stemming can be useful and risky at the same time. Since professional patent searchers need to know exactly what is being searched and what is not, we have adopted a unique transparent stemming approach in PatSeer that lets you apply stems on a per-word basis and immediately see exactly what will be searched.
The stemming character PatSeer uses is #; simply add it to the end of the word you want stemmed. This works when searching patent content in six different languages, and language-specific stemming rules are applied in each case. We call it transparent because as soon as you add # to the end of a term, PatSeer shows you what will be searched, and you can decide whether to keep stemming applied to that term. For example, when you type scheduling# it shows schedul*, so you can judge whether that works for you.
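The per-term idea can be sketched in a few lines of Python. This is not PatSeer's implementation: `preview_query`, `lookup_stem` and the `EXAMPLE_STEMS` table are hypothetical names, and the table is a stand-in for a real language-specific stemmer, filled only with the stems quoted in this article.

```python
from typing import Callable, Dict

# Stand-in for a real stemmer (assumption): stems copied from the article's examples.
EXAMPLE_STEMS: Dict[str, str] = {
    "scheduling": "schedul",
    "coronary": "coronari",
    "insertion": "insert",
}


def lookup_stem(word: str) -> str:
    """Return the example stem if known, otherwise the lowercased word unchanged."""
    return EXAMPLE_STEMS.get(word.lower(), word.lower())


def preview_query(query: str,
                  stem: Callable[[str], str] = lookup_stem,
                  marker: str = "#") -> str:
    """Expand only the terms that end with the marker; leave all others as typed."""
    expanded = []
    for term in query.split():
        if term.endswith(marker) and len(term) > len(marker):
            expanded.append(stem(term[:-len(marker)]) + "*")
        else:
            expanded.append(term)
    return " ".join(expanded)


print(preview_query("scheduling#"))                # schedul*
print(preview_query("coronary stent insertion#"))  # coronary stent insert*
```

Because only marked terms are rewritten, “coronary” in the second query is searched exactly as typed, while “insertion#” is previewed as insert* before the search runs.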
Transparent stemming strikes the right balance between the pros and cons of stemming. Try it out and let us know your feedback!