How does it work?
The process of generating a taxonomy from raw text consists of the following steps:
- Reduce inflected or derived words to their root form. After this step, the raw input text is called the normalized text. Example:
Nike Women’s shirt red. -> Nike Women shirt red.
- Mark every word in the raw string with its part of speech:
Nike - generic name, Women’s - adjective, shirt - substantive, red - adjective
- Match the marked words from the normalized text with the ones from the user-predefined taxonomy.
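The three steps can be sketched in a few lines of Python. This is only an illustration: the lookup tables LEMMAS and POS_TAGS stand in for a real lemmatizer and part-of-speech tagger, the function names are hypothetical, and the case-insensitive comparison is an assumption rather than documented behavior.

```python
# Toy lookup tables standing in for a real lemmatizer and POS tagger (assumptions).
LEMMAS = {"Women's": "Women", "Women’s": "Women"}
POS_TAGS = {"nike": "name", "women": "adj", "shirt": "subst", "red": "adj"}


def normalize(raw_text):
    """Step 1: drop stray punctuation and reduce each word to its root form."""
    words = []
    for token in raw_text.split():
        token = token.strip(".,")
        if token:
            words.append(LEMMAS.get(token, token))
    return words


def tag(words):
    """Step 2: attach a part-of-speech tag to every word (unknown words become names)."""
    return [(word, POS_TAGS.get(word.lower(), "name")) for word in words]


def match(tagged, taxonomy):
    """Step 3: per column, keep words with the right tag and look them up in the rows."""
    result = {}
    for column, (pos, values) in taxonomy.items():
        candidates = [word for word, word_tag in tagged if pos == "all" or word_tag == pos]
        # Case-insensitive comparison is an assumption made for this sketch.
        hits = [word for word in candidates if word.lower() in values]
        result[column] = hits[0] if hits else None
    return result


# Hypothetical taxonomy: column name -> (part-of-speech filter, allowed row values).
taxonomy = {
    "brand": ("all", {"nike", "adidas"}),
    "product": ("subst", {"shirt", "shoe"}),
    "color": ("adj", {"red", "blue"}),
}

normalized = normalize("Nike Women’s shirt red .")
print(normalized)
print(match(tag(normalized), taxonomy))
```

Each column only sees the words that carry its part-of-speech tag, which is why "Women" (an adjective) is considered for the color column but rejected because it is not among that column's rows.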
Predefined taxonomy
The predefined taxonomy is simply a collection of Excel columns. Each column must have a header that describes the properties of the values in its rows. The syntax is as follows:
*<column_name>:<part_of_speech>
* - an asterisk at the beginning of the column definition tells the algorithm to propose a value from the input when none of the rows in that column can be matched.
<column_name> - the name of the column.
<part_of_speech> - a part-of-speech tag that filters the words of the normalized text by their part of speech. Only the words that pass the filter are used in the matching process.
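To make the header syntax concrete, here is a minimal parsing sketch in Python. The names ColumnHeader and parse_header are hypothetical, and the sketch assumes that the leading asterisk is optional and that the first colon separates the column name from the part-of-speech tag.

```python
from typing import NamedTuple


class ColumnHeader(NamedTuple):
    name: str               # <column_name>
    part_of_speech: str     # <part_of_speech>
    propose_fallback: bool  # True when the header starts with an asterisk


def parse_header(header: str) -> ColumnHeader:
    """Split a header of the form [*]<column_name>:<part_of_speech>."""
    text = header.strip()
    propose_fallback = text.startswith("*")
    if propose_fallback:
        text = text[1:]
    # Split on the first colon only (assumption: column names contain no colon).
    name, separator, pos = text.partition(":")
    if not separator or not name.strip() or not pos.strip():
        raise ValueError(f"expected [*]<column_name>:<part_of_speech>, got {header!r}")
    return ColumnHeader(name.strip(), pos.strip(), propose_fallback)


print(parse_header("*color:adj"))
print(parse_header("brand:all"))
```

Keeping the asterisk as a separate boolean lets the matching step decide later whether to propose a value from the input when no row matches.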
Possible values for <part_of_speech> are:
- subst - substantives
- adj - adjectives
- verb - verbs
- all - all words; use it for generic names that are not listed in the dictionary, such as a company name.
- rgx - a regular expression in .NET format. Values are matched according to the provided patterns. If a pattern contains a named group with the same name as <column_name>, only the text matched by that group is returned.
- rgxbin - the same as rgx, except that it returns “YES” when the pattern matches and “NO” otherwise.
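The difference between rgx and rgxbin can be illustrated with a short Python sketch. It is not the tool's implementation: the function names are hypothetical, and returning the whole match when no group is named after the column is an assumption. Note also that Python writes named groups as (?P<name>...), whereas .NET uses (?<name>...).

```python
import re


def match_rgx(pattern, column_name, text):
    """rgx: return the matched text, or only the group named after the column."""
    found = re.search(pattern, text)
    if found is None:
        return None
    if column_name in found.re.groupindex:
        # A group with the same name as the column exists: return only that group.
        return found.group(column_name)
    # Assumption: without such a group, the whole match is returned.
    return found.group(0)


def match_rgxbin(pattern, text):
    """rgxbin: report only whether the pattern matched."""
    return "YES" if re.search(pattern, text) else "NO"


text = "Nike Women shirt red size 42"
print(match_rgx(r"size (?P<size>\d+)", "size", text))  # 42
print(match_rgx(r"size \d+", "size", text))            # size 42
print(match_rgxbin(r"\bred\b", text))                  # YES
print(match_rgxbin(r"\bblue\b", text))                 # NO
```

A named group is useful when the pattern needs surrounding context (here the word "size") to find the value but only the value itself should end up in the column.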