
I have a text with marker lines like:

aaa
---
bbb
---
ccc

I need to get a text from the last marker (not inclusive) to EOF. In this case it will be

ccc

Is there an elegant way within POSIX.2? Right now I use two runs: first nl and grep to find the last occurrence of the marker together with its line number; then I extract that line number and use sed to print the chunk in question.

The text segments may be quite large, so I'm wary of a buffer-building method: append each line to a buffer, empty the buffer whenever the marker is encountered, so that at EOF the buffer holds the last chunk.
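Concretely, the two-run pipeline I use now looks roughly like this (the file name input.txt is just a stand-in for the example):

```shell
# Two runs, as described: nl numbers the lines, grep keeps the marker
# lines, the last match gives the marker's line number, then sed prints
# from the following line to EOF.
printf 'aaa\n---\nbbb\n---\nccc\n' > input.txt

last=$(nl -ba input.txt | grep -e '---' | tail -n 1 | cut -f 1 | tr -d ' ')
sed -n "$((last + 1)),\$p" input.txt    # prints: ccc
```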

5 Answers

Unless your segments are really huge (as in: you really can't spare that much RAM, presumably because this is a tiny embedded system controlling a large filesystem), a single pass is really the better approach. Not just because it'll be faster, but most importantly because it allows the source to be a stream, from which any data read and not saved is lost. This is really a job for awk, though sed can do it too.

sed -n -e 's/^---$//' -e 't a' \
       -e 'H' -e '$g' -e '$s/^\n//' -e '$p' -e 'b' \
       -e ':a' -e 'h'              # you are not expected to understand this
awk '{if (/^---$/) {chunk=""}      # separator ==> start new chunk
      else {chunk=chunk $0 RS}}    # append line to chunk
     END {printf "%s", chunk}'     # print last chunk (without adding a newline)

If you must use a two-pass approach, determine the line offset of the last separator and print from that. Or determine the byte offset and print from that.

</input/file tail -n +$((1 + $(</input/file         # print from line N+1, where N=…
                               grep -n -e '---' |   # list separator line numbers
                               tail -n 1 |          # take the last one
                               cut -d ':' -f 1) ))  # retain only line number
</input/file tail -n +$(</input/file awk '/^---$/ {n=NR+1} END {print n}')
</input/file tail -c +$(</input/file LC_CTYPE=C awk '
    {pos+=length($0 RS)}        # pos contains the current byte offset in the file
    /^---$/ {last=pos}          # last contains the byte offset after the last separator
    END {print last+1}          # print characters from last (+1 because tail counts from 1)
')

Addendum: If you have more than POSIX, here's a simple one-pass version that relies on a common extension to awk that allows the record separator RS to be a regular expression (POSIX only allows a single character). It's not completely correct: if the file ends with a record separator, it prints the chunk before the last record separator instead of an empty record. The second version using RT avoids that defect, but RT is specific to GNU awk.

awk -vRS='(^|\n)---+($|\n)' 'END{printf $0}'
gawk -vRS='(^|\n)---+($|\n)' 'END{if (RT == "") printf $0}'
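A quick check of the regex-RS variant against the question's sample input (this assumes an awk with the regex-RS extension, e.g. gawk, mawk, or busybox awk):

```shell
# With RS set to the separator pattern, each chunk becomes one record,
# so in the END block $0 is the last record, i.e. everything after the
# last marker.
printf 'aaa\n---\nbbb\n---\nccc\n' |
awk -vRS='(^|\n)---+($|\n)' 'END{printf $0}'    # prints: ccc
```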
  • @Gilles: sed is working fine, but I can't get the awk example to run; it hangs... and I get an error in 3rd example: cut -f ':' -t 1 ... cut: invalid option -- 't'
    – Peter.O
    Commented Mar 31, 2011 at 19:33
  • @fred.bear: I have no idea how that happened — I tested all my snippets, but somehow messed up the post-copy-paste edit on the cut example. I see nothing wrong with the awk example, what version of awk are you using, and what is your test input? Commented Mar 31, 2011 at 19:41
  • ... actually the awk version is working.. it is just taking a very long time on a large file.. the sed version processed the same file in 0.470s .. My test data is very weighted... only two chunks with a lone '---' three lines from the end of 1 million lines...
    – Peter.O
    Commented Mar 31, 2011 at 19:51
  • @Gilles.. (I think I should stop testing at 3AM. I somehow tested all three of the "two pass" awks as a single unit :( ... I've now tested each individually and the second one is very fast at 0.204 seconds ... However, the first "two-pass" awk outputs only: "(standard input)" (the -l seems to be the culprit) ... as for the third "two-pass" awk, it doesn't output anything... but the second "two-pass" is the fastest of all presented methods (POSIX or otherwise :)...
    – Peter.O
    Commented Apr 1, 2011 at 9:42
  • @fred.bear: Fixed, and fixed. My QA is not very good for these short snippets — I typically copy-paste from a command line, format, then notice a bug, and try to fix inline rather than reformat. I'm curious to see if counting characters is more efficient than counting lines (2nd vs 3rd two-pass methods). Commented Apr 1, 2011 at 11:34
lnum=$(($(sed -n '/^---$/=' file | sed '$!d') +1)); sed -n "${lnum},$ p" file 

The first sed outputs line numbers of the "---" lines...
The second sed extracts the last number from the first sed's output...
Add 1 to that number to get the start of your "ccc" block...
The third 'sed' outputs from the start of the "ccc" block to EOF

Update (with amended info re Gilles' methods)

Well, I was wondering how glenn jackman's tac would perform, so I time-tested the three answers (at the time of writing)... The test file(s) each contained 1 million lines (of their own line numbers).
All answers did what was expected...

Here are the times:


Gilles sed (single pass)

# real    0m0.470s
# user    0m0.448s
# sys     0m0.020s

Gilles awk (single pass)

# very slow, but my data had a very large data block which awk needed to cache.

Gilles 'two-pass' (first method)

# real    0m0.048s
# user    0m0.052s
# sys     0m0.008s

Gilles 'two-pass' (second method) ... very fast

# real    0m0.204s
# user    0m0.196s
# sys     0m0.008s

Gilles 'two-pass' (third method)

# real    0m0.774s
# user    0m0.688s
# sys     0m0.012s

Gilles 'gawk' (RT method) ... very fast, but is not POSIX.

# real    0m0.221s
# user    0m0.200s
# sys     0m0.020s

glenn jackman ... very fast, but is not POSIX.

# real    0m0.022s
# user    0m0.000s
# sys     0m0.036s

fred.bear

# real    0m0.464s
# user    0m0.432s
# sys     0m0.052s

Mackie Messer

# real    0m0.856s
# user    0m0.832s
# sys     0m0.028s
  • Out of curiosity, which of my two-pass versions did you test, and what version of awk did you use? Commented Mar 31, 2011 at 23:02
  • @Gilles: I used GNU Awk 3.1.6 (in Ubuntu 10.04 with 4 GB RAM). All tests have 1 million lines in the first "chunk", then a "marker" followed by 2 "data" lines... It took 15.540 seconds to process a smaller file of 100,000 lines, but for the 1,000,000 lines, I'm running it now, and it has been more than 25 minutes so far. It is using one core to 100% ... killing it now... Here are some more incremental tests: lines=100000 (0m16.026s) -- lines=200000 (2m29.990s) -- lines=300000 (5m23.393s) -- lines=400000 (11m9.938s)
    – Peter.O
    Commented Apr 1, 2011 at 5:05
  • Oops.. In my above comment, I missed your "two-pass" awk reference. The above detail is for the "single-pass" awk... The awk version is correct... I've made further comment re the different "two-pass" versions under your answer (and modified the time results above)
    – Peter.O
    Commented Apr 1, 2011 at 9:45

A two-pass strategy seems to be the right thing. Instead of sed I would use awk(1). The two passes could look like this:

$ LINE=`awk '/^---$/{n=NR}END{print n}' file`

to get the line number. And then echo all text starting from that line number with:

$ awk "NR>$LINE" file

This should not require excessive buffering.
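For instance, on the question's sample input (written to a file named file, as in the answer), the two passes give:

```shell
# Pass 1: remember the line number of the last marker.
# Pass 2: print every line after it.
printf 'aaa\n---\nbbb\n---\nccc\n' > file

LINE=$(awk '/^---$/{n=NR} END{print n}' file)   # LINE=4
awk "NR>$LINE" file                             # prints: ccc
```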

  • and then they can be combined: awk -v line=$(awk '/^---$/{n=NR}END{print n}' file) 'NR>line' file Commented Mar 31, 2011 at 20:59
  • Seeing that I've been time testing the other submissions, I've now also tested "glenn jackman's" above snippet. It takes 0.352 seconds (with the same data file mentioned in my answer)... I'm starting to get the message that awk can be faster than I originally thought possible (I thought sed was about as good as it got, but it seems to be a case of "horses for courses")...
    – Peter.O
    Commented Apr 1, 2011 at 9:58
  • Very interesting to see all these scripts benchmarked. Nice work Fred. Commented Apr 1, 2011 at 10:39
  • The fastest solutions use tac and tail which actually read the input file backwards. Now, if only awk could read the input file backwards... Commented Apr 1, 2011 at 10:46

Use "tac" which outputs a file's lines from end to beginning:

tac afile | awk '/---/ {exit} {print}' | tac
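Checked against the question's sample input (tac is in GNU coreutils, not POSIX; afile is the answer's file name):

```shell
# Reverse the file, print lines until the first marker, reverse back.
printf 'aaa\n---\nbbb\n---\nccc\n' > afile

tac afile | awk '/---/ {exit} {print}' | tac    # prints: ccc
```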
  • tac isn't POSIX, it's Linux-specific (it's in GNU coreutils, and in some busybox installations). Commented Mar 31, 2011 at 18:54

You could just use ed:

ed -s infile <<\IN
.t.
1,?---?d
$d
,p
q
IN

How it works: t duplicates the current (.) line - which is always the last line when ed starts (just in case the delimiter is present on the last line), 1,?---?d deletes all lines up to and including the previous match (ed is still on the last line), then $d deletes the (duplicate) last line, ,p prints the text buffer (replace with w to edit the file in place) and finally q quits ed.


If you know there's at least one delimiter in the input (and don't care if it's also printed) then

sed 'H;/---/h;$!d;x' infile

would be shorter.
How it works: it appends every line to the hold buffer, overwrites the hold buffer with the current line whenever it matches the delimiter, deletes every line except the la$t one, and on the last line exchanges buffers (the result is autoprinted).
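A quick check of the hold-buffer one-liner against the question's sample input, using the question's "---" as the delimiter (note that the delimiter line itself is part of the output):

```shell
# The last marker overwrites the hold buffer; everything after it is
# appended, so the final exchange prints the marker plus the last chunk.
printf 'aaa\n---\nbbb\n---\nccc\n' > infile

sed 'H;/---/h;$!d;x' infile    # prints: ---  then  ccc
```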
