I have five columns, which hold an article's ID and the categories with which the article is associated. An example of the data is shown below:

article_id   category_id   subcategory_id   2nd_category_id   2nd_subcategory_id

94           C02           M1001        
96           C06
98           C06
101          C03           M1001        
108          C01           M1001        
110          C01           M1001        
111          C03           M1003            C02               M1001
114          C01                            C02
115          C01           M1001            C01               M1002

From the above presentation, it appears that an article can be assigned to four categories. In reality, it is assigned to one or two categories, each with an optional subcategory. There are six parent categories, and each category can have up to four subcategories. There are approximately 11,000 entries (i.e., rows/articles) in the file. Unfortunately, the subcategory codes are not globally unique. For example, category C01 is "Trees" and category C02 is "Fruits", but C01 subcategory M1001 is "Evergreens", while C02 subcategory M1001 is "Apples". Note that an article can be assigned to the same category twice if at least one of the assignments is coupled with a subcategory: in the example above, article 115 is assigned to C01 twice.

What I need to do is create a formula that will aggregate these into a single, comma-separated field value that lists the identified categories and subcategories.

How could I achieve this? I guess there are three parts to this:

  1. Find+Replace on category_id column and 2nd_category_id column to replace the parent category value with the appropriate string name. Thus, C01 becomes Trees.

  2. Use a formula, of some sort, to replace the subcategory_id values with their names, dependent on the string value in category_id. Repeat for 2nd_subcategory_id. If no subcategory_id exists, then leave the value blank.

  3. Use another formula to copy the values into the new column, eliminating duplicate entries where possible. For example, an article may be assigned to C01 (parent category inherited from subcategory), M1001 (subcategory), and C01 (2nd category). Copied as-is, the new column would read "Trees, Evergreens, Trees". There is no need for the duplicate "Trees", so the new column should contain only "Trees, Evergreens".

Perhaps I am over-complicating things and there is a very easy way to achieve this. Perhaps not. Any pointers?

An example of what I am trying to create is shown below:

article   category   subcategory   category2   subcat2   categories
94        C02        M1001                               Fruits, Apples
96        C06                                            Seeds
98        C06                                            Seeds
101       C03        M1001                               Plants, Shrubs
108       C01        M1001                               Trees, Evergreens
110       C01        M1001                               Trees, Evergreens
111       C03        M1003         C02         M1001     Plants, Climbers, Fruits, Apples
112       C06                                            Seeds
113       C01                                            Trees
114       C01                      C02                   Trees, Fruits
115       C01        M1001         C01         M1002     Trees, Evergreens, Deciduous
  • "I am over-complicating things", at least your question is very long... Please add a column to your sample data where you show desired output, that would help a lot. Commented Dec 1, 2015 at 17:03
  • Welcome to Super User. We are not a script writing service. We expect users to tell us what they have tried so far (including any scripts they are using) and where they're stuck so that we can help with specific problems. Questions that only ask for scripts are too broad and are likely to be put on hold or closed. Please read How do I ask a good question?.
    – DavidPostill
    Commented Dec 1, 2015 at 17:17
  • Mate, I have added a second example which would reflect the expected result.
    – Grant G
    Commented Dec 1, 2015 at 17:22
  • David, what I have tried so far would not work. I completed my step 1 (find and replace); steps 2 and 3 are where I am stuck, as I am unsure which formulas are available to do this (a 'string compare', as it were).
    – Grant G
    Commented Dec 1, 2015 at 17:23

1 Answer

I’ll give you some pieces of the answer:

  1. Create two lookup tables somewhere in your Excel workbook (possibly on a different sheet):

    C01   Trees
    C02   Fruits
    C03   Plants
     ⋮     ︙ 
    C06   Seeds
    

    and

    C01_M1001   Evergreens
    C01_M1002   Deciduous
        ⋮         ︙ 
    C02_M1001   Apples
        ⋮         ︙ 
    C03_M1001   Shrubs
    C03_M1003   Climbers
        ⋮         ︙ 
    
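  2. Set cells W2:Z2 (or any four unused helper columns) to something like the following, where columns B through E hold category_id, subcategory_id, 2nd_category_id and 2nd_subcategory_id. Prefixing each subcategory with its parent category makes the key unique, and Y2 is left blank when the second category merely repeats the first: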

    • W2=B2
    • X2=IF(C2="", "", B2 & "_" & C2)
    • Y2=IF(D2=B2, "", D2)
    • Z2=IF(E2="", "", D2 & "_" & E2)
  3. Now change the above to translate them into category/subcategory names using the lookup tables.  I won’t explain the details of this because they are exhaustively covered in both Excel documentation and Super User answers.

  4. See "Generate a comma-separated list of cell contents, excluding blanks" for ways to produce your categories list.
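
As a minimal sketch of how steps 3 and 4 might fit together, assume the two lookup tables sit on a sheet named Lookup (a hypothetical name), with the category table in A1:B6 and the subcategory table in D1:E24 (six categories with up to four subcategories each). The output cell AA2 is likewise just an assumed free column; adjust all of these to your own layout. The step 2 cells, with the lookup wrapped around each key, could then look something like this:

```
W2:  =IF(B2="", "", VLOOKUP(B2, Lookup!$A$1:$B$6, 2, FALSE))
X2:  =IF(C2="", "", VLOOKUP(B2 & "_" & C2, Lookup!$D$1:$E$24, 2, FALSE))
Y2:  =IF(OR(D2="", D2=B2), "", VLOOKUP(D2, Lookup!$A$1:$B$6, 2, FALSE))
Z2:  =IF(E2="", "", VLOOKUP(D2 & "_" & E2, Lookup!$D$1:$E$24, 2, FALSE))

AA2: =TEXTJOIN(", ", TRUE, W2:Z2)

AA2 (alternative for versions without TEXTJOIN):
     =MID(IF(W2="","",", "&W2) & IF(X2="","",", "&X2) & IF(Y2="","",", "&Y2) & IF(Z2="","",", "&Z2), 3, 1000)
```

The FALSE argument makes VLOOKUP match exactly, so a code missing from the lookup tables shows up as #N/A rather than a wrong name. TEXTJOIN with TRUE as its second argument skips the blank helper cells, but it is only available in Excel 2019 and Microsoft 365; the MID variant prepends ", " to every non-blank piece and then drops the first two characters. For article 115 (C01, M1001, C01, M1002), Y2 stays blank because the second category repeats the first, so the result should read Trees, Evergreens, Deciduous.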
