I have five columns that identify an article and the categories with which it is associated. An example of the data is below:
article_id  category_id  subcategory_id  2nd_category_id  2nd_subcategory_id
94          C02          M1001
96          C06
98          C06
101         C03          M1001
108         C01          M1001
110         C01          M1001
111         C03          M1003           C02              M1001
114         C01                          C02
115         C01          M1001           C01              M1002
From the above presentation, it appears that an article can be assigned to four categories. In reality, it is assigned to one or two categories, each with an optional subcategory. There are six parent categories, each with up to four subcategories, and approximately 11,000 entries (i.e., rows/articles) in the file. Unfortunately, the subcategory codes are not globally unique. For example, category C01 is "Trees" and category C02 is "Fruits", but C01 subcategory M1001 is "Evergreens", while C02 subcategory M1001 is "Apples". Note that an article can be assigned to the same category twice if at least one of the assignments is coupled with a subcategory; in the example above, article 115 is assigned to C01 twice.
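To make the non-unique codes concrete, here is a minimal sketch in Python (not a spreadsheet formula) of a lookup keyed by the pair of category and subcategory rather than by the subcategory code alone. The names `CATEGORY_NAMES`, `SUBCATEGORY_NAMES` and `subcategory_name` are only illustrative, and the tables contain just the codes that appear in the examples in this post.

```python
# Parent category names taken from the examples in this post.
# Categories not shown in the examples are left out rather than guessed.
CATEGORY_NAMES = {"C01": "Trees", "C02": "Fruits", "C03": "Plants", "C06": "Seeds"}

# Subcategory names keyed by (category, subcategory), because a code
# such as M1001 means something different under each parent category.
SUBCATEGORY_NAMES = {
    ("C01", "M1001"): "Evergreens",
    ("C01", "M1002"): "Deciduous",
    ("C02", "M1001"): "Apples",
    ("C03", "M1001"): "Shrubs",
    ("C03", "M1003"): "Climbers",
}


def subcategory_name(category_id, subcategory_id):
    """Return the subcategory name, or "" if the pair is blank or unknown."""
    return SUBCATEGORY_NAMES.get((category_id, subcategory_id), "")


print(subcategory_name("C01", "M1001"))
print(subcategory_name("C02", "M1001"))
```

In a spreadsheet, the same idea would presumably mean looking up a combined key (for example, the category code joined to the subcategory code) instead of the subcategory code by itself.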
What I need to do is create a formula that will aggregate these into a single, comma-separated field value that lists the identified categories and subcategories.
How could I achieve this? I guess there are three parts to this:
1. Find+Replace on the category_id and 2nd_category_id columns to replace each parent category code with the appropriate name. Thus, C01 becomes Trees.
2. Use a formula of some sort to replace the subcategory_id values with their names, dependent on the value in category_id. Repeat for 2nd_subcategory_id, dependent on 2nd_category_id. If no subcategory exists, leave the value blank.
3. Use another formula to copy the values into the new column, eliminating duplicate entries where possible. For example, an article may be assigned to C01 (parent category inherited from subcategory), M1001 (subcategory) and C01 again (2nd category). Copied naively, the value in the new column would be "Trees, Evergreens, Trees". There is no need for duplicate entries of "Trees", so only "Trees, Evergreens" needs to appear in the new column.
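The three parts above can be sketched as follows. This is Python rather than a spreadsheet formula, and only a rough illustration under some assumptions: the function name `categories_field` is made up, the lookup tables hold only the codes seen in the examples here, an unknown code falls back to the raw code, and duplicates are removed by comparing names while keeping the first occurrence.

```python
# Assumption: only the codes that appear in the examples are mapped.
CATEGORY_NAMES = {"C01": "Trees", "C02": "Fruits", "C03": "Plants", "C06": "Seeds"}

# Keyed by (category, subcategory) because subcategory codes are not unique.
SUBCATEGORY_NAMES = {
    ("C01", "M1001"): "Evergreens",
    ("C01", "M1002"): "Deciduous",
    ("C02", "M1001"): "Apples",
    ("C03", "M1001"): "Shrubs",
    ("C03", "M1003"): "Climbers",
}


def categories_field(category, subcategory="", category2="", subcategory2=""):
    """Build the comma-separated names for one row, without duplicates."""
    names = []
    for cat, sub in ((category, subcategory), (category2, subcategory2)):
        if not cat:
            continue  # no (second) category on this row
        # Part 1: parent code -> name (assumption: fall back to the raw code).
        candidates = [CATEGORY_NAMES.get(cat, cat)]
        # Part 2: subcategory name depends on its parent category.
        if sub:
            candidates.append(SUBCATEGORY_NAMES.get((cat, sub), sub))
        # Part 3: keep the first occurrence of each name, in order.
        for name in candidates:
            if name not in names:
                names.append(name)
    return ", ".join(names)


print(categories_field("C01", "M1001", "C01", "M1002"))
print(categories_field("C01", "M1001", "C01"))
```

With roughly 11,000 rows, the same function could be applied to each row after exporting the sheet to CSV. Whether a pure formula version is practical is still the open part of my question.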
Perhaps I am over-complicating things and there is a very easy way to achieve this. Perhaps not. Any pointers?
An example of what I am trying to create is below:
article  category  subcategory  category2  subcat2  categories
94       C02       M1001                            Fruits, Apples
96       C06                                        Seeds
98       C06                                        Seeds
101      C03       M1001                            Plants, Shrubs
108      C01       M1001                            Trees, Evergreens
110      C01       M1001                            Trees, Evergreens
111      C03       M1003        C02        M1001    Plants, Climbers, Fruits, Apples
112      C06                                        Seeds
113      C01                                        Trees
114      C01                    C02                 Trees, Fruits
115      C01       M1001        C01        M1002    Trees, Evergreens, Deciduous