Example: An application generates a large text log file A with many different messages. It generates a similarly large log file B when it does not function correctly.

I want to see which messages in file B are essentially new, i.e. to filter out everything that also appears in A.

A trivial prototype is:

  1. sort | uniq each file
  2. Concatenate the two results
  3. sort | uniq -c
  4. grep -v '^ *2 ' (uniq -c pads the count with leading spaces, so a bare "^2" never matches)

This produces a symmetric difference and is inconvenient. How can it be done better (including a non-symmetric difference and preserving the order of messages in B)?
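
As a rough starting point, the non-symmetric, order-preserving difference can be sketched with a single awk pass. The function name new_lines is made up for this illustration, and the sketch assumes lines are compared for exact equality, so it only becomes useful once volatile fields such as timestamps have been removed.

```shell
# new_lines A B: print the lines of B that never occur in A,
# keeping B's original order (and B's own duplicates).
# Hypothetical helper name; exact string comparison only.
new_lines() {
    awk '
        # Lines coming from the first file are only remembered.
        FILENAME == ARGV[1] { seen[$0] = 1; next }
        # Lines from the second file are printed if A never had them.
        !($0 in seen)
    ' "$1" "$2"
}
```

Called as new_lines A B, it reads A once into memory, so it suits cases where the set of distinct lines in A fits in RAM.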

The program should first analyse A and learn which messages are common, then analyse B and show which messages need attention.

Ideally it should automatically disregard timestamps, line numbers and other volatile things.
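
A hand-written approximation of "disregard volatile things" is to mask them with sed before comparing. The sketch below assumes the log layout of the example that follows (a leading h:mm:ss.mmm timestamp and 0x... buffer addresses). The function name normalize and the <HEX> placeholder are invented for illustration, and sed -E is assumed to be available (it is in GNU and BSD sed).

```shell
# normalize: filter stdin to stdout, masking fields assumed to be volatile.
# The patterns are guesses based on the sample log, not a general solution.
normalize() {
    sed -E \
        -e 's/^[0-9]+:[0-9]{2}:[0-9]{2}\.[0-9]+[[:space:]]+//' \
        -e 's/0x[0-9A-Fa-f]+/<HEX>/g' \
        -e 's/[[:space:]]+$//'
    # 1st expression: drop the leading timestamp and the indentation after it
    # 2nd expression: replace every hex address with a fixed placeholder
    # 3rd expression: drop trailing whitespace
}
```

For example, normalize < A | sort | uniq -c lists each message type in A with its frequency.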

Example. A:

0:00:00.234  Received buffer 0x324234
0:00:00.237     Processeed buffer 0x324234
0:00:00.238     Send buffer 0x324255
0:00:03.334  Received buffer 0x324255
0:00:03.337     Processeed buffer 0x324255
0:00:03.339     Send buffer 0x324255
0:00:05.171  Received buffer 0x32421A
0:00:05.173     Processeed buffer 0x32421A
0:00:05.178     Send buffer 0x32421A

B:

0:00:00.134  Received buffer 0x324111
0:00:00.137     Processeed buffer 0x324111
0:00:00.138     Send buffer 0x324111
0:00:03.334  Received buffer 0x324222
0:00:03.337     Processeed buffer 0x324222
0:00:03.338     Error processing buffer 0x324222 
0:00:03.339     Send buffer 0x3242222
0:00:05.271  Received buffer 0x3242FA
0:00:05.273     Processeed buffer 0x3242FA
0:00:05.278     Send buffer 0x3242FA
0:00:07.280     Send buffer 0x3242FA failed

Expected output:

0:00:03.338     Error processing buffer 0x324222 
0:00:07.280     Send buffer 0x3242FA failed

One way of solving it could be something like this:

  1. Split each line into logical units. For example, 0:00:00.134 Received buffer 0x324111 yields: 0:00:00.134, Received, buffer, 0x324111, 324111, Received buffer, \d:\d\d:\d\d\.\d\d\d, \d+:\d+:\d+\.\d+, 0x[0-9A-F]{6}, ... It should find individual words, simple patterns in numbers and common layouts (e.g. "a date, then text, then a number, then text, then end of line"), and also handle combinations of the above. As this is not an easy task, user assistance should be supported (but not required): regexes with explicit rules such as "disregard that", "make this the main factor", "don't split into parts", "treat as a date/number", "take care of the order/quantity of such messages".
  2. Find recurring units and "categorize" lines, filtering out things that are too volatile, like timestamps, addresses or line numbers.
  3. Analyse the second file and find lines that have new logical units (one-time or recurring), or anything else that would "amaze" a system that has got used to the first file.
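
A very crude version of these three steps can be sketched in awk without any user-supplied regexes. It relies on one heuristic that is purely an assumption of this sketch: any whitespace-separated token containing a digit is volatile. Each line is reduced to a "shape" in which such tokens become "#", the shapes seen in A are learned, and lines of B whose shape is unknown are printed unchanged. The name new_by_shape is hypothetical.

```shell
# new_by_shape A B: print the original lines of B whose "shape" never
# occurs in A. Assumption: a token containing a digit is volatile.
new_by_shape() {
    awk '
        # Split on whitespace, mask digit-bearing tokens, rejoin with spaces.
        function shape(line,    n, i, tok, out) {
            n = split(line, tok)
            out = ""
            for (i = 1; i <= n; i++) {
                if (tok[i] ~ /[0-9]/) tok[i] = "#"
                out = (i == 1) ? tok[i] : (out " " tok[i])
            }
            return out
        }
        # Step 2: learn the shapes that occur in the first file.
        FILENAME == ARGV[1] { known[shape($0)] = 1; next }
        # Step 3: report lines of the second file with an unseen shape.
        !(shape($0) in known)
    ' "$1" "$2"
}
```

On the sample logs above, new_by_shape A B should print exactly the "Error processing buffer" and "Send buffer ... failed" lines. The heuristic is easy to fool (a message such as "retry 3 times" and "retry 5 times" collapse into one shape, which may or may not be wanted), and it does not look at the order or quantity of messages at all.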

Example of doing some bit of this manually:

$ cat A | head -n 1
0:00:00.234  Received buffer 0x324234

$ cat A | egrep -v "Received buffer" | head -n 1
0:00:00.237     Processeed buffer 0x324234

$ cat A | egrep -v "Received buffer|Processeed buffer" | head -n 1
0:00:00.238     Send buffer 0x324255

$ cat A | egrep -v "Received buffer|Processeed buffer|Send buffer" | head -n 1

$ cat B | egrep -v "Received buffer|Processeed buffer|Send buffer"
0:00:03.338     Error processing buffer 0x324222 
0:00:07.280     Send buffer 0x3242FA failed

This is tedious (there are a lot of message types), and I can accidentally include a pattern that is too broad. It also can't handle complicated things like relationships between messages.

I know that this is AI-related. Maybe tools for this already exist?

  • Correct me if I'm wrong, but aren't these types of questions supposed to be posted on SO?
    – Avis
    Commented Jul 28, 2010 at 3:52
  • @Avis, I think questions like "what program to use to do ..." belong on SU. Of course such a program would also be useful for users of SO and SF.
    – Vi.
    Commented Jul 28, 2010 at 9:40
  • @Vi: You seem to be heading towards a programming solution rather than trying to find a hypothetical appropriate program. Asking for design tips on SO seems appropriate.
    Commented Jul 28, 2010 at 20:20
  • @Gilles, My questions often start with "what program should I use to do this and this" and end with a script that I (or another user) have written. I don't know whether such things should be migrated to SO. /* Going to ask at Meta */
    – Vi.
    Commented Jul 29, 2010 at 8:07

3 Answers


diff (and its various options) will show you differences both ways, and preserve message order. It will not, however, remove duplicate differences (for that you can apply uniq afterwards) or deal with varying order. Is that good enough?

  • No. Files are not that similar. There are a lot of things like timestamps, different order and quantity of similar messages.
    – Vi.
    Commented Jul 27, 2010 at 18:43

Use diff (in normal output mode, i.e., no -c or -u). The new lines will be prefixed with >.

diff A B | sed -ne 's/^> //p'

If the logs contain time stamps, you'll have to strip them off first.
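
One possible way to do the stripping, assuming the timestamps look like the h:mm:ss.mmm prefix in the question's sample, is to write timestamp-free copies to temporary files and diff those. The function name diff_new_lines is invented here, and mktemp and sed -E are assumed to be available.

```shell
# diff_new_lines A B: diff the two logs with leading timestamps removed
# and print only the lines that diff reports as added in B.
# Hypothetical helper; the timestamp pattern is a guess from the sample.
diff_new_lines() {
    a_tmp=$(mktemp) && b_tmp=$(mktemp) || return 1
    ts='s/^[0-9]+:[0-9]{2}:[0-9]{2}\.[0-9]+[[:space:]]+//'
    sed -E "$ts" "$1" > "$a_tmp"
    sed -E "$ts" "$2" > "$b_tmp"
    # In normal output mode, lines only in the second file start with "> ".
    diff "$a_tmp" "$b_tmp" | sed -n 's/^> //p'
    rm -f "$a_tmp" "$b_tmp"
}
```

The printed lines no longer carry their timestamps, and anything else that varies between runs (such as buffer addresses) still counts as a difference.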

Sometimes it's nicer to see the new/changed bits in context, with highlighting of the difference and navigation between differing chunks. Emacs has a nice interface for this (Tools | Compare menu, M-x ediff-files). There are also many standalone tools (often with “diff” or “compare” in their name).

Incidentally, if you weren't interested in the order of the lines, then sorting both files followed by comm would be easier and nicer than the process you give in your question.
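
A minimal sketch of that sort-then-comm approach follows. The name only_in_b is made up, mktemp is assumed to exist, and LC_ALL=C is set so that sort and comm agree on the ordering.

```shell
# only_in_b A B: print each distinct line that occurs in B but not in A.
# Output is in sorted order; the original order of B is lost.
only_in_b() {
    a_sorted=$(mktemp) && b_sorted=$(mktemp) || return 1
    LC_ALL=C sort -u "$1" > "$a_sorted"
    LC_ALL=C sort -u "$2" > "$b_sorted"
    # -1 hides lines only in A, -3 hides lines common to both.
    LC_ALL=C comm -13 "$a_sorted" "$b_sorted"
    rm -f "$a_sorted" "$b_sorted"
}
```

Like diff, this compares whole lines exactly, so timestamps and other varying fields have to be removed beforehand.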

  • Will diff really work if the files have no large common parts and are only similar, e.g. contain the same message types (except for some additional messages in B)? The files are really big and there are a great many message types; the application itself has timers, measures its own performance and depends on a server (which also has a timer). P.S. It is a gstreamer verbose debug log.
    – Vi.
    Commented Jul 27, 2010 at 18:54
  • @Vi: No, diff needs the common part to be identical except for whitespace. The general problem of comparing such application traces can be very hard. Given the sample output you've now posted, ignoring timestamps looks ok but buffer numbers seem to convey information you wouldn't want to lose. Do A and B contain identical sequences of buffer sends and receives (with differing numbers and success/failure), or does the correspondence break down quickly?
    Commented Jul 27, 2010 at 19:24
  • I need to know what messages are essentially new in B. /* already done the particular task by hand by constructing very long cat A | egrep -b 'msg1|msg2|msg3|...|msgN' line and issuing it on b */. I think it should be something a bit AI-powered, like auto-categoriser or keyword-extractor things.
    – Vi.
    Commented Jul 28, 2010 at 0:59
  • s/egrep -b/egrep -v/
    – Vi.
    Commented Jul 28, 2010 at 1:07

This is a difficult problem, and in a somewhat general form an active research problem. I don't think there now exists a program into which you'd just have to plug in a few regexps.

I'd formulate your problem as trying to compare traces of a networked program. I suspect that people who compare traces of networked or concurrent programs have faced this problem and written their own tools, but I don't have a specific example in mind.
