1

The general question entails sorting a large Excel 2007 list to find entries that match a smaller subset list.

I have a couple of ideas on how to approach the problem, but I lack the technical sophistication to implement those ideas. I will outline my specific use-case requirement to make the question more clear.

Specific Example:

I have a master list of company names that I manage for my sales territory (approximately 1000 customer accounts.) Every week my company publishes a list of all transacted business across every sales territory in the U.S. (mine and hundreds of other territories.) This transaction log is 10,000+ lines so scanning by eye to find transactions associated with my accounts is nearly impossible.

My current inadequate solution is to highlight my account list in yellow, copy that highlighted list, then paste that highlighted list at the bottom of the weekly transaction log, then sort A-Z, then scroll through manually to the highlighted items. If the transaction log contains one of my accounts, the transaction log entry will be directly above or below the highlighted entry that I inserted. This method is effective but extremely time consuming.

I know how to eliminate duplicates in Excel. Is there a way to eliminate everything BUT duplicates? This would make visually scanning the list easier.

Another problem remains: data inconsistency has limited the usefulness of simple macros, filters, and the "find duplicates" button. Names in the transaction log are often spelled slightly differently from those on my master list.

Ex: Acme Widget Company, Inc.; Acme Widget Inc; Acme Widget
Ex: United States Hand-ball Organization; U.S. Handball Org; U S Handball; USHO

I know there are some third-party apps that can use fuzzy logic to match non-exact entries. However, I cannot run plug-ins on my enterprise machine. (Unless there is a very compelling case...)

Is there a macro that could 'normalize' the transaction log by eliminating spaces and punctuation? Is there a macro that can match the first X number of characters (more characters = higher accuracy, but greater chance of missing a near-duplicate entry...)? Is there a macro that can output or filter the resulting 'match' list?

If those tasks are too complicated, I have a much simpler idea. After merging my highlighted account list into the transaction log, it would be nice to be able to hide all other transaction log lines that are less than 5 lines above or below my highlighted items. This would allow some flexibility for non-standard spellings, but greatly simplify the task of visual inspection through the list.

Any input on how to implement these ideas - or completely different approaches - would be greatly appreciated. I think the general answer to this question will be valuable to others beyond the narrow use-case that I have described.

Thanks!

2
  • @Chris, this post has too many questions. In the future, you should ask each one separately.
    – hyperslug
    Commented Aug 27, 2009 at 5:17
  • In example 2 (US handball), even removing whitespace + special chars and matching first X chars would produce low confidence matches: "US" is not close to "United". But a decent macro might be able to pull out some matches. Not sure if that would reduce your workload or still leave too much manual process. Again, you should break this up or indicate in which area would a solution save you the most time: remove non-duplicates? fuzzy matching? hide lines outside +- 5?
    – hyperslug
    Commented Aug 27, 2009 at 5:32

4 Answers

1

There are definitely too many questions to be answered here (as hyperslug comments). I have a very similar situation and found that for finding dupes I just had to do it manually, since there was too much variety to encode.

All the macros you suggest can be written. If you decide which one will be most effective, ask for it as a separate question and we'll do what we can. The last one is simple to implement and will save you scroll-time. I would create that macro, and then after the dupes are hidden just click and drag the 'standard' entry over the others.
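To make the logic of that last idea concrete, here is a minimal sketch in Python rather than VBA, purely as an illustration that could be ported to a macro. The function name `rows_to_keep` and the `window` parameter are hypothetical; the input is assumed to be one True/False flag per row of the merged, sorted list, with True marking a highlighted master-list entry.

```python
def rows_to_keep(is_highlighted, window=5):
    """Return the indexes of rows lying within `window` rows of a highlighted row.

    Every other row would be hidden. `is_highlighted` holds one boolean
    per row of the merged and sorted list (an assumed input format).
    """
    keep = set()
    last = len(is_highlighted) - 1
    for i, flag in enumerate(is_highlighted):
        if flag:
            first_visible = max(0, i - window)
            last_visible = min(last, i + window)
            keep.update(range(first_visible, last_visible + 1))
    return sorted(keep)


# Example: 20 rows, only row 10 is highlighted, keep 2 rows either side.
flags = [False] * 20
flags[10] = True
print(rows_to_keep(flags, window=2))  # [8, 9, 10, 11, 12]
```

A macro version would loop over the rows in the same way and hide every row whose index is not in the returned list.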

1

I would use Excel's MATCH function to get the data you need, instead of copying and sorting.

Let's say your master list is in a named range called Master and the company name in the transaction log is in column D. Somewhere on the transaction's row, enter the following formula: =IF(ISNA(MATCH(D1,Master,0)),0,1) and copy it to all the rows in the transaction table. This formula will result in 1 if the company name matches and 0 otherwise.

This will only match exact names. To catch all the possible versions, add the alternate names to the Master range. (With a match type of 0, MATCH does an exact lookup and does not need the range to be sorted, although sorting can make the list easier to maintain.)
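For readers who want to see the same flagging logic outside Excel, here is a rough Python sketch. The sample names and the function name `flag_matches` are made up for illustration. Because MATCH ignores letter case when comparing text, the sketch lowercases both sides.

```python
def flag_matches(transaction_names, master_names):
    """Return 1 for each transaction name found in the master list, else 0.

    Mirrors =IF(ISNA(MATCH(D1,Master,0)),0,1): exact match only,
    ignoring letter case.
    """
    master = {name.lower() for name in master_names}
    return [1 if name.lower() in master else 0 for name in transaction_names]


# Hypothetical sample data.
master_list = ["Acme Widget Inc", "U.S. Handball Org"]
log_names = ["acme widget inc", "Globex Corp", "U.S. Handball Org", "Acme Widget"]
print(flag_matches(log_names, master_list))  # [1, 0, 1, 0]
```

Note that "Acme Widget" is not flagged, which is exactly why the alternate spellings have to be added to the master list.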

0

I agree with the approach of adding alternative spellings to your master list (you might have a second column to tell you which one is your preferred format for mailing etc., and which is just there to match the company data). You might have some success using successive SUBSTITUTE functions to generate an alternate version of the names, e.g.

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(LOWER(A1)," inc",""),".","")," ","")...

So each substitution replaces any instance of the selected text with the replacement, which is nothing in our case here. From my experience of similar fuzzy matching between names from disparate systems, you may have to drop things like inc, corp, plc etc. to get matches. While you can use SUBSTITUTE for this, you could get some odd results, with things like "Income Corporation" becoming "omeoration", so it may be safer to strip the term only when it appears at the end of the name, with this sort of thing:

=IF(RIGHT(LOWER(A1),4)="corp",LEFT(LOWER(A1),LEN(A1)-4),LOWER(A1))

Do the substitute for spaces last.
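The order of operations described above (lowercase, strip punctuation, trim trailing company terms, remove spaces last) can be sketched in Python as follows. This is an illustration of the idea and not a drop-in Excel solution: the `SUFFIXES` list and the function name `normalize` are assumptions, and the sketch trims whole trailing words instead of checking the last four characters as the formula does.

```python
import string

# Hypothetical list of trailing terms to drop; extend it to suit your data.
SUFFIXES = {"inc", "corp", "plc", "co", "company", "corporation", "incorporated"}


def normalize(name):
    """Build a comparison key: lowercase, no punctuation, no trailing suffix, no spaces."""
    lowered = name.lower()
    no_punct = "".join(ch for ch in lowered if ch not in string.punctuation)
    words = no_punct.split()
    # Trim suffix words only from the end, and never remove the last remaining word.
    while len(words) > 1 and words[-1] in SUFFIXES:
        words.pop()
    # Spaces are removed last, as recommended above.
    return "".join(words)


print(normalize("Acme Widget Company, Inc."))  # acmewidget
print(normalize("Acme Widget Inc"))            # acmewidget
print(normalize("Income Corporation"))         # income
```

Trimming only whole trailing words avoids the "Income" problem, but abbreviations such as "U.S." versus "United States" still need to be listed as explicit variants.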

You could use MATCH or COUNTIF with similar results to give a column showing which transactions match up to your list.

An alternative would be to use your master list as the criteria range for an advanced filter, which would let you very easily take a copy of the transaction list entries that match your customer names and place this filtered copy elsewhere (e.g. off to one side, or on another sheet). Just as with the above, you would still need to add variants where they are too distant from your original name.
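As a rough picture of what that filtered copy contains, here is a small Python sketch. The row layout, the `variants` mapping from each known spelling to a preferred name, and the function name `filter_copy` are all hypothetical, and the comparison is a simple case-insensitive exact match, which simplifies how Excel's advanced filter treats text criteria.

```python
def filter_copy(rows, variants, name_field="company"):
    """Return copies of the rows whose company name is a known variant.

    `variants` maps each known spelling (in lowercase) to the preferred name,
    which is added to the copied row under the key "preferred".
    """
    matched = []
    for row in rows:
        key = row[name_field].lower()
        if key in variants:
            copied = dict(row)  # copy, so the original log is left untouched
            copied["preferred"] = variants[key]
            matched.append(copied)
    return matched


# Hypothetical sample data.
variants = {
    "acme widget inc": "Acme Widget Company, Inc.",
    "acme widget": "Acme Widget Company, Inc.",
}
log = [
    {"company": "Acme Widget", "amount": 100},
    {"company": "Globex Corp", "amount": 250},
    {"company": "ACME WIDGET INC", "amount": 75},
]
for row in filter_copy(log, variants):
    print(row["preferred"], row["amount"])
```

Keeping the preferred name alongside each variant corresponds to the second-column idea mentioned earlier in this answer.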

0

Just wondered if you'd tried using a Pivot Table. I wrangle a lot of data using PT's and they help me look at problems in multiple ways very quickly and with complete data integrity.

Highlight all your data and select Insert > PivotTable. You'll now be able to review your data in lots of interactive ways that will allow you to narrow down any pesky double entries, misspellings, etc. You can then sort using custom sorts etc. as well as A-Z.
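One thing a pivot table on the company-name column gives you is a sorted list of distinct names with a count for each, which puts near-identical spellings next to each other. A minimal Python sketch of that summary, with made-up sample names and a hypothetical function name `name_counts`:

```python
from collections import Counter


def name_counts(names):
    """Return (name, count) pairs sorted alphabetically, like a pivot on the name column."""
    return sorted(Counter(names).items())


# Hypothetical sample data.
sample = ["Acme Widget", "Acme Widget Inc", "Acme Widget", "Globex Corp"]
for name, count in name_counts(sample):
    print(name, count)
```

Here "Acme Widget" and "Acme Widget Inc" end up on adjacent lines, which makes the variant easy to spot.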
