1

We receive product data from vendors on a regular basis to be incorporated into our catalog. The data looks like this:

[
 {
  id: 123,
  collection: Spring,
  name: New Beginnings,
  size: 8,
  price: 29.99,
  ...
 },
 ...
 {
  id: 456,
  collection: SUMER,
  name: The Escape,
  size: 6,
  price: 49.99,
  ...
 }
]

There will often be typos, so before writing to the database, we want to make the correction from "SUMER" to "Summer".

The problem with simply making the change on our end in the database is that next time we receive the list of products, the typo will be there again and it will overwrite the corrected entry.

What is an elegant and efficient way to handle this kind of problem on our end? Getting the vendors to correct the typos on their end is not feasible. Flagging the corrected row in the database to avoid writing to it is also not feasible because we may receive new/updated fields in a row (e.g. price) and we have to have the latest changes.

1
  • What about flagging the corrected field, instead of flagging the whole row? It may even be appropriate not just to flag the specific field to prevent updates, but also to store the value that was corrected (so that a fundamentally changed value is not automatically modified, but the same old spelling mistake is automatically corrected).
    – Steve
    Commented Jun 5, 2020 at 21:25
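
A minimal sketch of the field-level idea from this comment, assuming records arrive as Python dicts with a stable `id`. The `OVERRIDES` table and the `apply_overrides` function are hypothetical names for illustration, not part of any existing system. Each override remembers the vendor value that was corrected, so the same old typo is fixed again on every import, while a genuinely new vendor value passes through untouched.

```python
# Hypothetical override store, maintained by hand:
# (record id, field) -> (vendor value seen when we corrected it, our corrected value)
OVERRIDES = {
    (456, "collection"): ("SUMER", "Summer"),
}

def apply_overrides(record, overrides=OVERRIDES):
    """Return a copy of the record with field-level corrections re-applied."""
    fixed = dict(record)
    for field, value in record.items():
        entry = overrides.get((record.get("id"), field))
        if entry is None:
            continue  # no correction recorded for this field
        seen_value, corrected = entry
        if value == seen_value:
            # Same old typo as before: re-apply our correction.
            fixed[field] = corrected
        # Otherwise the vendor sent a different value: keep it as received.
    return fixed
```

All other fields, such as an updated price, are copied through unchanged, so the row still receives the latest vendor data.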

2 Answers

3

The solution to this is quite simple: once you have identified the typos manually, automate the correction in some program or script and make it part of the "Transform" step in your ETL pipeline.

Think of what you would do manually: let's say you have learned that the collection field can contain names of seasons like Summer, Winter, Autumn, Spring, and you have seen that Summer is sometimes written as "SUMER". Then you would check this "collection" field for exactly this typo and correct it. Or you have learned that this typo only occurs in specific records, which can be identified by their specific ID, which the vendor does not change. Put exactly this logic into a simple replacement tool, written in your favorite scripting language, and you can repeat the same typo correction automatically in each ETL cycle.

Of course, for certain kinds of typos, you may have to identify additional constraints to avoid replacing correct data with wrong data, but that is in no way different from a manual typo correction.
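
As a rough illustration, such a replacement step could look like the following Python sketch. It assumes the vendor feed has already been parsed into a list of dicts. The names `FIELD_CORRECTIONS`, `RECORD_CORRECTIONS`, `transform` and `transform_all` are hypothetical, and only the "SUMER" entry comes from the question.

```python
# Hypothetical correction table, maintained by hand:
# (field, wrong value) -> corrected value.
FIELD_CORRECTIONS = {
    ("collection", "SUMER"): "Summer",
}

# Hypothetical per-record corrections for typos that occur only in specific
# records, keyed by the vendor's stable ID:
# (id, field, wrong value) -> corrected value. Empty here.
RECORD_CORRECTIONS = {}

def transform(record):
    """Return a copy of one vendor record with known typos corrected."""
    fixed = dict(record)
    for field, value in record.items():
        if not isinstance(value, str):
            continue  # this sketch only corrects string fields
        record_key = (record.get("id"), field, value)
        if record_key in RECORD_CORRECTIONS:
            fixed[field] = RECORD_CORRECTIONS[record_key]
        elif (field, value) in FIELD_CORRECTIONS:
            fixed[field] = FIELD_CORRECTIONS[(field, value)]
    return fixed

def transform_all(records):
    """Apply the corrections to every record before the load step."""
    return [transform(r) for r in records]
```

Because the corrections are exact matches on field and value, a value that is not in the table is passed through unchanged, and the tables can grow as new typos are identified.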

1

The answer given by Doc Brown shows some good ways to deal with this in your ETL process, but I would question the assumption that getting the vendors to correct their typos is not feasible.

What about typos in names? If the vendor uses "The Baginnings" for a women's handbag, how would you know whether that's a typo or a wordplay?

What about prices or item sizes? Some kinds of mistakes might be identified by consistency checks (for example, the same size given for two items, with only the prices differing).

You need a process for notifying the vendor about inconsistent data and asking for correction. You may not get an answer, but you will have documentation that you tried to get a resolution with them, which makes your position much more robust in case you "fix" data but the vendor later claims that you did it wrong.
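
One possible reading of such a consistency check, sketched in Python under the assumption that records are dicts with scalar values: group records that are identical in every field except `id` and `price`, and report groups with more than one distinct price. The function name `find_price_conflicts` and the choice of ignored fields are hypothetical. The result is meant as input for a report to the vendor, not as an automatic fix.

```python
from collections import defaultdict

def find_price_conflicts(records, ignore=("id", "price")):
    """Return groups of records that match on all non-ignored fields
    but carry different prices."""
    groups = defaultdict(list)
    for record in records:
        # Build a hashable key from every field that is not ignored.
        key = tuple(sorted((k, v) for k, v in record.items() if k not in ignore))
        groups[key].append(record)

    conflicts = []
    for group in groups.values():
        prices = {r.get("price") for r in group}
        if len(prices) > 1:
            conflicts.append(group)
    return conflicts
```

Each returned group can be logged together with the date it was reported, which gives you the documentation trail mentioned above.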

1
  • In general, I think it is a good idea not to throw overboard the possibility of notifying a vendor about typos, and to at least give it a try. But there are definitely cases where the OP could be perfectly right, where it is a hopeless waste of time. There are vendors who don't have a process for getting such corrections from their recipients, or maybe they have one but it takes years. The problem can get really hard when there are several vendors involved, as the OP wrote.
    – Doc Brown
    Commented Jun 6, 2020 at 7:36
