If you want to clean up assignee names that exist in a specific report, the Auto Cleanup feature lets you merge or group together (normalize) similar or synonymous Assignee, Inventor, Current Assignee, or Attorney names that appear in your report.
You can use the built-in cleanup tools to work with the data in the project and save the result as a thesaurus.
Select the relevant field (Assignee, Inventor, Current Assignee, or Attorney) from the cleanup drop-down menu and click View. Click Start Over to view a fresh list of names in the project. All the names from the project appear in the right window, and the cleanup tools appear on the left.
There are multiple ways to clean up these fields:
- Cleanup using smart fuzzy match
- Cleanup using regular expression match
- Cleanup using thesaurus match
- Cleanup using merging different names
Whichever method you use, the cleanup procedure produces groups, each with a Group Head. These groups are shown in the right table, with all grouped names appearing as child nodes of the Group Head.
Cleanup using Smart Fuzzy Matching
Fuzzy matching groups together similar names (i.e., names that share a high percentage of overlap). You need to specify the percentage of overlap (the default range is 65% to 85%). You can edit the range and then click Create Groups to run the fuzzy match algorithm. The groups that were created are then displayed. When you click Show Groups, you see only the groups that have been created.
Note that the lower the percentage, the higher the chance that unrelated names are merged into a single group. Conversely, if the percentage is very high, names that should have been merged are left separate. A balance is needed, and 80-85% is usually recommended for cleaning up assignee names.
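The effect of the overlap percentage can be illustrated with a short sketch. It is not the product's algorithm, which is not documented here. It assumes Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure, and the helper names `similarity` and `fuzzy_groups` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (case-insensitive); a stand-in measure."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_groups(names, threshold):
    """Greedily group names whose similarity to a group's first name meets the threshold.

    Names are processed alphabetically, so the first name of each group
    (the illustrative "Group Head") is the alphabetically first member.
    """
    groups = []
    for name in sorted(names):
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group was close enough: start a new one.
            groups.append([name])
    return groups


names = ["ABB Ltd", "ABB Ltd.", "Siemens AG", "Siemens A.G."]

for threshold in (0, 85, 100):
    print(threshold, fuzzy_groups(names, threshold))
```

With this sample data, a threshold of 0 puts every name in one group, 85 yields one group per company, and 100 leaves all four names separate, which mirrors the trade-off described above.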
Manually improving and correcting created groups
Once you have created the first set of groups, you have the option of improving them manually and then saving them to a thesaurus file (for reuse later) before finally saving them to the report.
As shown in the figure above, the right table shows the groups that were created. Since the fuzzy matching algorithm can produce incorrect groups beyond a certain point, you have several options to quickly review and touch up the groupings. These options are provided as drag-and-drop and right-click actions within the table. The following actions are possible:
- Creating a new group: You can drag any member name other than a Group Head and drop it on another member to make the former a child node of the latter. For instance, in the figure above you can drag “ABB Technology AB” to “ABB Ltd” and the former will be grouped under ABB Ltd.
- Adding members to existing groups: Simply drag the member to an existing group and that member gets added as a child node to the group. Group heads cannot be dragged.
- Renaming a member: Any member, whether an independent name or a Group Head, can be renamed. To do this, right-click on the member and select Rename. Once you rename the member, the original name becomes a child node of the new name, which means that patents belonging to the original name will now be grouped under the new name.
- Remove a member from a Group: If a member has been incorrectly placed in a group, right-click on that member and select “Remove from Group” to set the member as an independent name. You can also drag the child member to another group to move it from one group to the other.
- Replace the Group Head with a Group member: Since the Group Head is chosen alphabetically, in some cases you may want to replace it with another name present as a member. To do this, right-click on the member and select “Set Head Node”.
- Breaking a Group: If a group is incorrectly formed, you can right-click on the Group Head and select “Break Group” to dissolve the group and set the Group Head and all its members as independent names.
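To make the effect of these actions concrete, the sketch below models groups as a mapping from each top-level name to its child names. The `Groups` class and its method names are hypothetical and only mirror the menu actions described above; they are not part of the product. The sample name "ABB Inc" is invented for illustration.

```python
class Groups:
    """Hypothetical in-memory model: each top-level name maps to its child names."""

    def __init__(self, names):
        # Every name starts as an independent name with no children.
        self.children = {name: [] for name in names}

    def _detach(self, member):
        # Remove the member from whichever group currently holds it, if any.
        for kids in self.children.values():
            if member in kids:
                kids.remove(member)

    def add_member(self, member, head):
        """Drag a name onto a top-level name (creates or extends a group)."""
        if self.children.get(member):
            raise ValueError("Group Heads cannot be dragged")
        self._detach(member)
        self.children.pop(member, None)
        self.children[head].append(member)

    def rename(self, old, new):
        """Rename a top-level name; the original becomes a child of the new name.

        Assumption: existing children stay grouped under the new name.
        """
        kids = self.children.pop(old)
        self.children[new] = [old] + kids

    def remove_from_group(self, member):
        """'Remove from Group': the member becomes an independent name."""
        self._detach(member)
        self.children[member] = []

    def set_head(self, member):
        """'Set Head Node': promote a child and demote the old Group Head."""
        for head, kids in list(self.children.items()):
            if member in kids:
                kids.remove(member)
                del self.children[head]
                self.children[member] = [head] + kids
                return
        raise KeyError(member)

    def break_group(self, head):
        """'Break Group': the head and all its members become independent."""
        for member in self.children[head]:
            self.children[member] = []
        self.children[head] = []


g = Groups(["ABB Ltd", "ABB Technology AB", "ABB Inc"])
g.add_member("ABB Technology AB", "ABB Ltd")
g.add_member("ABB Inc", "ABB Ltd")
g.set_head("ABB Inc")
print(g.children)
```

After these calls, "ABB Inc" is the Group Head and the other two names are its child nodes.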
Once you have prepared the final set of groups, you can save them to a thesaurus using the Save to Thesaurus button. The thesaurus can be applied to the project.
Cleanup using regular expressions
Regular expressions are a syntax for pattern matching (see the Wikipedia article on regular expressions: http://en.wikipedia.org/wiki/Regular_expression). If you understand the syntax, you can use this tool to match patterns and then act on the names that match. You can match portions of the uncleaned labels (Assignee/Inventor/Current Assignee/Attorney) and then Cut, Replace, or Insert at the matched portion. For example, while cleaning up Assignee names you may want to cut occurrences of Co, Ltd, Corp, Inc, BV, SA, Pte, Pty, or even pin codes occurring at the end of Assignee names. To do this, match these strings using a suitable regular expression pattern and then select the Cut option.
When using Regex, there are 3 options:
Replace: Enter a regular expression pattern and then enter the word that should replace the matched text. The software checks each label for the pattern and, wherever it occurs, replaces it with the word you provided. The result is shown in the list on the right, with the original label linked under the new label as the Group Head. If you leave the replacement word blank, the matched text is cut from the labels.
For example, if you want to replace Technology appearing at the end of any assignee name with Tech, enter TECHNOLOGY$ in the ‘what’ text box and TECH in the ‘with’ text box.
Cut: Enter a regular expression pattern and all matching text within the labels will be cut. If you want a pattern to be cut only from the end of the label, enter a $ after it; if you want it to be cut only from the beginning of the label, enter a ^ before it. For example:
- Corp$: All occurrences of “Corp” at the end of labels will be removed
- ^The: All occurrences of “The” at the beginning of labels will be removed
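The Replace and Cut behavior can be tried out with Python's standard `re` module. This is only a sketch: the product's regex flavor and case handling are not documented here, so the case-insensitive matching and the trimming of leftover spaces are assumptions, and the helper names `replace` and `cut` and the `SUFFIXES` pattern are hypothetical.

```python
import re


def replace(label: str, pattern: str, word: str = "") -> str:
    """Replace every match of pattern in label with word.

    An empty word cuts the match. Matching is case-insensitive and
    leftover leading/trailing spaces are trimmed (both are assumptions).
    """
    return re.sub(pattern, word, label, flags=re.IGNORECASE).strip()


def cut(label: str, pattern: str) -> str:
    """Cut is a Replace with a blank replacement word."""
    return replace(label, pattern, "")


# One possible pattern for a trailing company suffix with an optional period.
SUFFIXES = r"\s+(?:Co|Ltd|Corp|Inc|BV|SA|Pte|Pty)\.?$"

print(replace("ACME TECHNOLOGY", r"TECHNOLOGY$", "TECH"))  # ACME TECH
print(cut("Acme Corp", r"Corp$"))                          # Acme
print(cut("The Acme Company", r"^The"))                    # Acme Company
print(cut("Acme Inc.", SUFFIXES))                          # Acme
```

Note that the `$` anchor means only the last suffix is removed in one pass, so a label ending in two suffixes would need the Cut applied twice.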
Insert: Enter the word you want to insert into the labels, then enter a pattern to be matched and the location (Before or After the match) at which to insert the word when a match occurs. At position refers to the exact position within the pattern where the word will be inserted.
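The Before and After locations can be sketched as follows, again using Python's `re` module as a stand-in. The `insert` helper is hypothetical, it handles only the first match in each label, and it does not model the At position option, whose exact semantics are not described here.

```python
import re


def insert(label: str, word: str, pattern: str, where: str = "after") -> str:
    """Insert word before or after the first match of pattern in label.

    Labels that do not match the pattern are returned unchanged.
    """
    match = re.search(pattern, label)
    if match is None:
        return label
    pos = match.start() if where == "before" else match.end()
    return label[:pos] + word + label[pos:]


print(insert("Texas Univ", "The ", r"^Texas", "before"))  # The Texas Univ
print(insert("Acme", " Corp", r"Acme$", "after"))         # Acme Corp
print(insert("Siemens", " Corp", r"Acme$", "after"))      # Siemens
```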
Cleanup using thesaurus match
If you have an existing thesaurus, you can use it to clean up the data. To do this, select Cleanup using Thesaurus match, provide the thesaurus file, and click Create Groups. While applying the thesaurus you can specify a match percentage to allow for variations between the new data and the names in the thesaurus; it is advisable, however, to keep the match percentage at 85-90% or higher. Select stemming to allow for variations of the words in your thesaurus. Click Save Groups to save the groups to the portfolio.
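The role of the match percentage can be illustrated with a small sketch. The thesaurus file format is not described here, so the dictionary layout below (Group Head mapped to its known variants) is an assumption, `difflib.SequenceMatcher` is a stand-in for the product's matcher, stemming is not modeled, and `apply_thesaurus` is a hypothetical helper.

```python
from difflib import SequenceMatcher

# Assumed layout: Group Head -> known variants of that name.
thesaurus = {
    "International Business Machines": ["IBM", "IBM Corp", "Intl Business Machines"],
}


def apply_thesaurus(name, thesaurus, min_percent=90):
    """Return the Group Head whose entry best matches name, or name itself
    when no entry reaches min_percent similarity."""
    best_head, best_score = None, 0.0
    for head, variants in thesaurus.items():
        for candidate in [head, *variants]:
            score = 100 * SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
            if score > best_score:
                best_head, best_score = head, score
    return best_head if best_score >= min_percent else name


print(apply_thesaurus("IBM Corp.", thesaurus))       # close variant of "IBM Corp"
print(apply_thesaurus("Siemens AG", thesaurus))      # no close entry, left as-is
print(apply_thesaurus("IBM Corp.", thesaurus, 100))  # exact matches only
```

At 90%, the trailing period in "IBM Corp." is tolerated and the name maps to its Group Head; at 100%, only exact matches are accepted and the name is left unchanged.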
Cleanup using merging different names
This is a simple tool in which you define the exact name under which you want to merge all matching names.
- Contains: Enter the text to be matched and the group name under which the matching names should be merged
- Starts With: For example, to merge all names that start with University or Univ into Univ
- Ends With
- OR: Enter multiple terms separated by OR
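The merge rules above can be sketched as simple string tests. This is an illustration under assumptions: matching is treated as case-insensitive, OR is read as "match any of the listed terms", and the `merge_names` helper and its mode names are hypothetical.

```python
def merge_names(names, mode, terms, group_name):
    """Return {group_name: [names matching any of the OR-separated terms]}.

    mode is one of "contains", "starts_with", or "ends_with".
    """
    tests = {
        "contains": lambda name, term: term in name,
        "starts_with": lambda name, term: name.startswith(term),
        "ends_with": lambda name, term: name.endswith(term),
    }
    check = tests[mode]
    # Split on the literal word OR and compare case-insensitively.
    wanted = [term.strip().lower() for term in terms.split(" OR ")]
    matched = [n for n in names if any(check(n.lower(), t) for t in wanted)]
    return {group_name: matched}


names = ["University of Texas", "Univ California", "Texas Instruments"]

print(merge_names(names, "starts_with", "University OR Univ", "Univ"))
print(merge_names(names, "contains", "Texas", "Texas"))
print(merge_names(names, "ends_with", "Instruments", "TI"))
```

The first call groups the two university names under Univ and leaves "Texas Instruments" untouched.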
Apply Thesaurus to all records in project
You can apply a saved thesaurus to custom fields. This is useful if you already have a predefined set of rules and would like to apply them in your analysis.
You can apply a thesaurus to a specific record or to all records within a project. Clicking on the custom field name (highlighted in the above snapshot) launches an interface in which you select the thesaurus type (Assignee/Inventor/Attorney), the name of the thesaurus, and the field to which you want to apply it. If you have selected Assignee as the thesaurus type, you can additionally choose whether to apply the thesaurus to the Normalized Assignee or the Current Assignee.
When you select Current Record, the thesaurus is applied only to the custom field of the selected record.
When you select All Records, there are 2 options:
- Apply the thesaurus only to records that do not have any assigned values
- Overwrite the predefined custom data of all records
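The difference between the two All Records options can be sketched as follows. The record layout, the field names `assignee` and `custom`, the flat `lookup` table, and the `apply_to_records` helper are all hypothetical and serve only to illustrate the two behaviors.

```python
def apply_to_records(records, lookup, source="assignee", target="custom", overwrite=False):
    """Return copies of records with the normalized name written to the target field.

    With overwrite=False, records that already have a value in the target
    field are left unchanged; with overwrite=True, every record is updated.
    """
    result = []
    for record in records:
        updated = dict(record)  # copy so the input records are not modified
        if overwrite or not updated.get(target):
            raw = updated[source]
            # Names missing from the lookup keep their original form.
            updated[target] = lookup.get(raw, raw)
        result.append(updated)
    return result


lookup = {
    "IBM Corp": "International Business Machines",
    "IBM": "International Business Machines",
}
records = [
    {"assignee": "IBM Corp", "custom": ""},
    {"assignee": "IBM", "custom": "Big Blue"},
]

print(apply_to_records(records, lookup))                  # fills only the empty field
print(apply_to_records(records, lookup, overwrite=True))  # replaces "Big Blue" too
```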