5

I have a bunch of strings that look like this:

mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmedpayer_id=SGA8X3TX9HCVYtax=0.00address_street=155 5th ave sepayment_date=16:08:28 Nov 15, 2010 PSTpayment_status=Completedcharset=windows-1252address_zip=98045first_name=jackobmc_fee=1.08address_country_code=USaddress_name=john martinnotify_version=3.0custom=ff1ca5asdf7d9fa80cf93e6b300dd7f063e1payer_status=unverifiedbusiness=gold-me@hotmail.comaddress_country=United Statesaddress_city=north bendquantity=1verify_sign=AZussRXZRkuk7frhfirfxxTkj0BDJGA2dJF3eF263eEsjLixS.xRxCzfaYLpayer_email=me@gmail.comtxn_id=4DU53818WJ271531Mpayment_type=instantlast_name=Martinaddress_state=WAreceiver_email=cravbill@hotmail.compayment_fee=1.08receiver_id=QG8JPB4RZJGG4txn_type=web_acceptitem_name=Some item of consequenceSpecifiemc_currency=USDitem_number=G10W151residence_country=UShandling_amount=0.00transaction_subject=ff1ca57d9fad80cf93e6b300dd7f063e1payment_gross=22.99shipping=0.00

What is the best way to parse this? You'd figure the people who created it would have put some kind of break in it...

Anyhow, any help would be greatly appreciated.

Edit:

I appreciate everyone's post. I was wondering if I could do something like this:

  1. Create a list of tags. ex. mc_gross=, first_name=, ...
  2. Do a replace in the string: thestring.Replace("first_name", "\r\nfirst_name"). I'm thinking this will give me the breaks I need to parse it further.

What do you think?
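
A minimal sketch of that idea, assuming the list of tags is already known. BreakInserter and InsertBreaks are hypothetical names, and the short input string is a made-up fragment in the same shape as the data above. Replacing on "key=" rather than the bare key reduces accidental hits inside values, but it would still insert a stray break if one "key=" happened to be the tail end of another.

```csharp
using System;

public static class BreakInserter
{
    // Hypothetical helper: puts a line break in front of every known "key=" tag.
    public static string InsertBreaks(string raw, string[] keys)
    {
        string result = raw;
        foreach (string key in keys)
        {
            // Match "key=" instead of "key" so text inside a value is less likely to be hit.
            result = result.Replace(key + "=", "\r\n" + key + "=");
        }
        // The first tag also received a break in front of it; drop that one.
        return result.TrimStart('\r', '\n');
    }

    public static void Main()
    {
        string raw = "mc_gross=22.99invoice=ff1ca57dtax=0.00first_name=jackob";
        string[] keys = { "mc_gross", "invoice", "tax", "first_name" };

        string[] lines = InsertBreaks(raw, keys).Split(new[] { "\r\n" }, StringSplitOptions.None);
        foreach (string line in lines)
        {
            // Split on the first "=" only, so a "=" inside a value is kept.
            string[] parts = line.Split(new[] { '=' }, 2);
            Console.WriteLine(parts[0] + " -> " + parts[1]);
        }
    }
}
```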

8
  • 1
    Wow. What were they thinking?
    – BoltClock
    Commented May 9, 2011 at 4:51
  • 2
    Check with the people that created this, there must be something wrong. Are you sure there is not a CR/LF between each key/value pair? Commented May 9, 2011 at 4:52
  • 3
    so this is a list of name/value pairs, but there aren't any separators between the pairs??? Do you have the option of going back to the people that gave you this and asking: 1) if they can put in a delimiter and 2) what they were smoking when they created this?
    – DXM
    Commented May 9, 2011 at 4:53
  • I don't see any way you can parse this, as there is no delimiter between one field's data and the next field name. But if you HAD to parse these as is, the only suggestion I could make would be to use a dictionary to find word boundaries and work back from the "=" to figure out where the field names start, since all the field names seem to have "_". Commented May 9, 2011 at 4:53
  • 3
    Wow, this would be no simple task to parse. You need to go find out who wrote this and start cracking skulls. Commented May 9, 2011 at 4:53

4 Answers

3

Unless this is fixed width (which I highly doubt), you are going to need a list of the keywords that indicate a field. Store them somewhere (SQL, XML, CSV, etc.; it doesn't really matter where) and then use them to parse the file. Hopefully the fields always come in the same order and no tags are left out. If so, use IndexOf to locate each tag and Substring to extract the value, which runs from just after the equals sign following your tag to the beginning of the next tag in line. That gives you the value that corresponds to each tag.

So, for example, if we take just the first part, mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmed, our tags would be mc_gross, invoice, protection_eligibility, and address_status. We would start with mc_gross=, locate it in the string, and take a Substring beginning just after it. For the length, we would go until we found our next tag, invoice. The Substring line would be complicated, but it should do the job. Loop through each tag. When you get to the last tag, you need to go to the end of the string instead of to another tag.
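
Here is a rough sketch of that loop, under the same assumptions as above: every tag is present and the tags arrive in a known order. TagWalker and Parse are made-up names for illustration, not part of any library.

```csharp
using System;
using System.Collections.Generic;

public static class TagWalker
{
    // Hypothetical helper: assumes every tag in orderedTags appears in raw, in this order.
    public static Dictionary<string, string> Parse(string raw, string[] orderedTags)
    {
        var result = new Dictionary<string, string>();
        int searchFrom = 0;
        for (int i = 0; i < orderedTags.Length; i++)
        {
            string marker = orderedTags[i] + "=";
            int tagStart = raw.IndexOf(marker, searchFrom, StringComparison.Ordinal);
            if (tagStart < 0)
                throw new FormatException("Tag not found: " + orderedTags[i]);
            int valueStart = tagStart + marker.Length;

            // The value runs up to the next tag, or to the end of the string for the last tag.
            int valueEnd = raw.Length;
            if (i + 1 < orderedTags.Length)
            {
                int next = raw.IndexOf(orderedTags[i + 1] + "=", valueStart, StringComparison.Ordinal);
                if (next < 0)
                    throw new FormatException("Tag not found: " + orderedTags[i + 1]);
                valueEnd = next;
            }

            result[orderedTags[i]] = raw.Substring(valueStart, valueEnd - valueStart);
            searchFrom = valueEnd;
        }
        return result;
    }

    public static void Main()
    {
        string raw = "mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmed";
        string[] tags = { "mc_gross", "invoice", "protection_eligibility", "address_status" };
        foreach (KeyValuePair<string, string> pair in Parse(raw, tags))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

If a tag can be missing or the order can change, this throws rather than guessing, which is probably what you want for payment data.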

2
  • this is what gave me the idea to insert breaks into the string, not sure if it works yet though.
    – ErocM
    Commented May 9, 2011 at 12:00
  • This gave me a great start. I'm able to break it into individual lines and then split on the '='. Thanks!
    – ErocM
    Commented May 9, 2011 at 16:23
3

As others have stated, unless you can get the original data to include line breaks in the appropriate areas then the next best thing is to get the list of key names.

I assume that the 60K other lines have the same key names as the one sample line you provided? If so, if someone can't provide you the list, then manually (not programmatically) identifying the key names yourself seems to be the only way.

I tried it myself. It did not seem too bad to do (a few minutes at most) but probably still needs someone knowledgeable to confirm that the key list is correct.

Once you have the list, then you can split by the keys and then recombine them into a new list:

string rawData =
    "mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmedpayer_id=SGA8X3TX9HCVYtax=0.00address_street=155 5th ave sepayment_date=16:08:28 Nov 15, 2010 PSTpayment_status=Completedcharset=windows-1252address_zip=98045first_name=jackobmc_fee=1.08address_country_code=USaddress_name=john martinnotify_version=3.0custom=ff1ca5asdf7d9fa80cf93e6b300dd7f063e1payer_status=unverifiedbusiness=gold-me@hotmail.comaddress_country=United Statesaddress_city=north bendquantity=1verify_sign=AZussRXZRkuk7frhfirfxxTkj0BDJGA2dJF3eF263eEsjLixS.xRxCzfaYLpayer_email=me@gmail.comtxn_id=4DU53818WJ271531Mpayment_type=instantlast_name=Martinaddress_state=WAreceiver_email=cravbill@hotmail.compayment_fee=1.08receiver_id=QG8JPB4RZJGG4txn_type=web_acceptitem_name=Some item of consequenceSpecifiemc_currency=USDitem_number=G10W151residence_country=UShandling_amount=0.00transaction_subject=ff1ca57d9fad80cf93e6b300dd7f063e1payment_gross=22.99shipping=0.00";

string[] keys = {
                    "mc_gross", "invoice", "protection_eligibility", "address_status", "payer_id", "tax",
                    "address_street", "payment_date", "payment_status", "charset", "address_zip",
                    "first_name", "mc_fee", "address_country_code", "address_name", "notify_version",
                    "custom", "payer_status", "business", "address_country", "address_city", "quantity",
                    "verify_sign", "payer_email", "txn_id", "payment_type", "last_name", "address_state",
                    "receiver_email", "payment_fee", "receiver_id", "txn_type", "item_name",
                    "mc_currency", "item_number", "residence_country", "handling_amount",
                    "transaction_subject", "payment_gross", "shipping"
                };

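// Requires: using System; using System.Collections.Generic; using System.Linq;
// Splitting on the key names leaves each value with its leading "=" still attached.
// RemoveEmptyEntries drops the empty string in front of the first key.
string[] values = rawData.Split(keys, StringSplitOptions.RemoveEmptyEntries);

// Pair keys and values by position; this assumes the keys array lists the keys
// in the same order as they appear in the data.
IEnumerable<string> parsedList = keys.Zip(values, (key, value) => key + value);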

foreach (string item in parsedList)
{
    Console.WriteLine(item);
}

This will output the data in this format:

mc_gross=22.99
invoice=ff1ca57d9fa80cf93e6b300dd7f063e1
protection_eligibility=Ineligible
address_status=confirmed
payer_id=SGA8X3TX9HCVY
tax=0.00
address_street=155 5th ave se
payment_date=16:08:28 Nov 15, 2010 PST
payment_status=Completed
charset=windows-1252
address_zip=98045
first_name=jackob
mc_fee=1.08
address_country_code=US
address_name=john martin
notify_version=3.0
custom=ff1ca5asdf7d9fa80cf93e6b300dd7f063e1
payer_status=unverified
business=gold-me@hotmail.com
address_country=United States
address_city=north bend
quantity=1
verify_sign=AZussRXZRkuk7frhfirfxxTkj0BDJGA2dJF3eF263eEsjLixS.xRxCzfaYL
payer_email=me@gmail.com
txn_id=4DU53818WJ271531M
payment_type=instant
last_name=Martin
address_state=WA
receiver_email=cravbill@hotmail.com
payment_fee=1.08
receiver_id=QG8JPB4RZJGG4
txn_type=web_accept
item_name=Some item of consequenceSpecifie
mc_currency=USD
item_number=G10W151
residence_country=US
handling_amount=0.00
transaction_subject=ff1ca57d9fad80cf93e6b300dd7f063e1
payment_gross=22.99
shipping=0.00    

You can further parse the list by splitting each item by the equal sign ("=") or replace the original data string with one that now contains the missing line breaks:

string newData = parsedList.Aggregate((data, next) => data + Environment.NewLine + next);
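
For the first option, splitting each item on the equal sign, a lookup table is probably the most convenient result. The following is only a sketch: PairParser and ParsePairs are hypothetical names, and it assumes each key occurs once per record (ToDictionary throws on a duplicate key).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class PairParser
{
    // Hypothetical helper: turns "key=value" strings into a lookup table.
    public static Dictionary<string, string> ParsePairs(IEnumerable<string> items)
    {
        return items
            .Select(item => item.Split(new[] { '=' }, 2)) // at most 2 parts, so a "=" inside a value survives
            .Where(parts => parts.Length == 2)            // skip anything that has no "=" at all
            .ToDictionary(parts => parts[0], parts => parts[1]);
    }

    public static void Main()
    {
        string[] items = { "mc_gross=22.99", "payment_status=Completed", "address_city=north bend" };
        Dictionary<string, string> fields = ParsePairs(items);

        Console.WriteLine(fields["payment_status"]);

        // Values are still strings; convert them as needed.
        decimal gross = decimal.Parse(fields["mc_gross"], CultureInfo.InvariantCulture);
        Console.WriteLine(gross);
    }
}
```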
1
  • 1
    Same concept on what I did but yours is more elegant. Thanks for the tip!
    – ErocM
    Commented May 20, 2011 at 2:15
2

Look into System.Text.RegularExpressions; regular expressions can be very helpful.

But an easy way to do it would be to use the Split method of the string class.

string head = "mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmedpayer_id=SGA8X3TX9HCVYtax=0.00address_street=155 5th ave sepayment_date=16:08:28 Nov 15, 2010 PSTpayment_status=Completedcharset=windows-1252address_zip=98045first_name=jackobmc_fee=1.08address_country_code=USaddress_name=john martinnotify_version=3.0custom=ff1ca5asdf7d9fa80cf93e6b300dd7f063e1payer_status=unverifiedbusiness=gold-me@hotmail.comaddress_country=United Statesaddress_city=north bendquantity=1verify_sign=AZussRXZRkuk7frhfirfxxTkj0BDJGA2dJF3eF263eEsjLixS.xRxCzfaYLpayer_email=me@gmail.comtxn_id=4DU53818WJ271531Mpayment_type=instantlast_name=Martinaddress_state=WAreceiver_email=cravbill@hotmail.compayment_fee=1.08receiver_id=QG8JPB4RZJGG4txn_type=web_acceptitem_name=Some item of consequenceSpecifiemc_currency=USDitem_number=G10W151residence_country=UShandling_amount=0.00transaction_subject=ff1ca57d9fad80cf93e6b300dd7f063e1payment_gross=22.99shipping=0.00";

// Separators to split on; extend this array with the rest of the key names.
string[] splitStrings = new string[2];
splitStrings[0] = "mc_gross";
splitStrings[1] = "invoice";
string[] headArray = head.Split(splitStrings, StringSplitOptions.RemoveEmptyEntries);

You get the idea, it breaks everything into parts.
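
As for the regular-expression route mentioned at the top, one possible shape is sketched below. It still assumes you know the key names up front. RegexFieldParser and Parse are made-up names, and the sample input is a shortened fragment of the data. The pattern matches a key, the "=", and then the shortest value that is followed by another "key=" or the end of the input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexFieldParser
{
    // Hypothetical helper: extracts key/value pairs using a pattern built from the known keys.
    public static Dictionary<string, string> Parse(string raw, IEnumerable<string> keys)
    {
        // Longest keys first, so "address_country_code" is tried before "address_country".
        string alternation = string.Join("|",
            keys.OrderByDescending(k => k.Length).Select(k => Regex.Escape(k)));

        // key, "=", then the shortest value followed by another "key=" or the end of input.
        string pattern = "(?<key>" + alternation + ")=(?<value>.*?)(?=(?:" + alternation + ")=|$)";

        var result = new Dictionary<string, string>();
        foreach (Match m in Regex.Matches(raw, pattern, RegexOptions.Singleline))
        {
            result[m.Groups["key"].Value] = m.Groups["value"].Value;
        }
        return result;
    }

    public static void Main()
    {
        string raw = "address_country_code=USaddress_name=john martinaddress_country=United Statestax=0.00";
        string[] keys = { "address_country", "address_country_code", "address_name", "tax" };
        foreach (KeyValuePair<string, string> pair in Parse(raw, keys))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

Unlike a positional approach, this does not depend on the order of the fields, but a value that happens to contain "some_key=" would still be cut short.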

1
  • 3
    But it's no good if there's no well defined pattern to the string. Commented May 9, 2011 at 4:58
-1

Equal signs are a very good indicator. For the text between the equal signs, I'd suggest using some lexical tool with a type-inferencing engine.
