
I have several thousand (well-formed) XML files that follow this template:

<?xml version="1.0" ?>
<queries>
  <statement name="foobar">
    <body><![CDATA[
      Several lines
      worth of
      text goes
      in here 
    ]]></body>
  </statement>
  <statement name="whatever">
    [... snip ...]
  </statement>
</queries>

I need to get a list of those statements for which the text content of the body spans more than 10 lines. Short of writing a Python script to do that, is there a simple way to use grep or other standard tools to look into each file and return the statements that span many lines? At the very least, I'd be happy with something that returns a list of filenames that contain at least one such statement.

2 Answers

2

Short of using a real XML library and/or awk/perl/python/ruby, this comes quite close to what you want (if I understood you right) using only common shell commands.

Please note that this is really specific to the XML files shown in the question and should not be taken as a general-purpose XML parser/splitter.

You'll need an output directory for the split files. I used /tmp/out for this example:

mkdir -p /tmp/out 

You'll have to clean /tmp/out before each run; otherwise leftover files from the previous run end up in the results.

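# Each split piece also contains the markup lines of the template (<statement>,
# <body><![CDATA[, ]]></body>, </statement>), so raise the threshold by 4
# if you want to count only the text lines.
cat /path_to_xml_files/*.xml | \
grep -E -v '<\?xml version="1\.0" \?>|<queries>|</queries>' | \
csplit -q -z - '/statement name/' '{*}' --prefix=/tmp/out/splitout- && \
for x in /tmp/out/splitout-* ; do \
[[ $(wc -l < "$x") -gt 10 ]] && \
echo "$x" && \
cat "$x" ; \
done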
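
Step by step:

  1. cat the XML files
  2. Use grep to remove the unwanted lines (the XML declaration and the queries tags)
  3. Split the input into multiple files at each line that contains 'statement name', as in your example
  4. Loop over the resulting files
  5. Count the lines of each file and require the count to be more than 10
  6. Print the name of the output file
  7. Print the lines of the output file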

As I said, this is not meant to be a general xml splitter, but should be treated as an example of different shell commands.

Note: a '\' followed by a line break means that the command continues on the next line. This just makes it easier to read.
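
For anyone who wants to check what the split pieces look like before touching /tmp/out, here is a rough Python sketch of the same split-and-count idea. It is only an illustration, not a replacement for the pipeline above: the names WRAPPER, split_statements and long_pieces are made up for this example, and it operates on a list of lines rather than on files.

```python
import re

# Wrapper lines that the grep step drops (the XML declaration and the <queries> tags).
WRAPPER = re.compile(r'<\?xml version="1\.0" \?>|<queries>|</queries>')

def split_statements(lines):
    """Group lines into pieces, starting a new piece at each 'statement name' line."""
    pieces, current = [], []
    for line in lines:
        if WRAPPER.search(line):
            continue
        if "statement name" in line and current:
            pieces.append(current)
            current = []
        current.append(line)
    if current:
        pieces.append(current)
    return pieces

def long_pieces(lines, threshold=10):
    """Return the pieces that have more than `threshold` lines, markup included."""
    return [piece for piece in split_statements(lines) if len(piece) > threshold]
```

With the template from the question, a statement with N lines of text gives a piece of N + 4 lines (the statement and body tags are counted too), so a threshold of 14 would be needed to require more than 10 lines of actual text.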

1

I can only offer a Ruby solution, which needs the nokogiri gem installed. I don't think using grep would be that straightforward here, but maybe somebody has a better solution. The syntax is:

ruby scriptname.rb <directory> <number-of-lines>

So, for example:

ruby find.rb . 10

This will list all .xml documents that

  • contain statements
  • with a CDATA text
  • that's in body
  • which has more than <number-of-lines> lines of text (>, not ≥)

There's no exception handling though.


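require 'nokogiri'

dir, lines = ARGV
# With the template above, the body text starts with the line break after <![CDATA[
# and ends with the indentation before ]]>, which adds 2 lines to the count.
limit = lines.to_i + 2
result = Dir.glob("#{dir}/*.xml").select do |entry|
  doc = Nokogiri::XML(File.read(entry))
  doc.xpath("//statement/body").any? { |b| b.text.lines.count > limit }
end
puts result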
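
For comparison, since the question mentions Python: roughly the same check can be sketched with only the standard library (xml.etree.ElementTree). Treat this as a sketch under assumptions rather than a tested tool. The function name long_statements is made up for this example, and unlike the Ruby script it counts only non-blank lines and prints the statement names as well as the file names.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

def long_statements(xml_text, min_lines):
    """Return the names of statements whose body text has more than min_lines non-blank lines."""
    root = ET.fromstring(xml_text)
    names = []
    for statement in root.iter("statement"):
        body = statement.find("body")
        if body is None or body.text is None:
            continue
        # Count only non-blank lines, so the line break after <![CDATA[ and the
        # indentation before ]]> are not counted.
        count = sum(1 for line in body.text.splitlines() if line.strip())
        if count > min_lines:
            names.append(statement.get("name"))
    return names

# Command-line use (hypothetical script name): python find.py <directory> <number-of-lines>
if __name__ == "__main__" and len(sys.argv) == 3:
    directory, min_lines = sys.argv[1], int(sys.argv[2])
    for path in sorted(Path(directory).glob("*.xml")):
        for name in long_statements(path.read_bytes(), min_lines):
            print(f"{path}: {name}")
```

Like the Ruby version, it has no exception handling, so a malformed file stops the run with a parse error.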
