
I have a directory containing many folders, each with .html files inside. I would like every HTML file converted to a new .txt file named after its parent directory.

Example1/Index.html > Example1.txt

Example2/Index.html > Example2.txt

  • How would you like the tags removed?
    – suspectus
    Commented May 9, 2013 at 7:01
  • Something like this works with individual files using sed, cat file | sed 's/<b>.*</b>//g'
    – z4nb0t
    Commented May 9, 2013 at 7:19
  • @z4nb0t it's generally accepted that using regex to parse HTML will lead to elder gods from before the start of time awakening from their eternal slumber to consume your computer and all of humanity.
    – evilsoup
    Commented May 9, 2013 at 10:50
  • @evilsoup: Haha, that's a great one.
    – mpy
    Commented May 9, 2013 at 11:09

1 Answer

You evidently want to convert some HTML pages into plain text. For that I wouldn't strip the tags with a custom-built solution (e.g. some sed magic), but would use a tool designed for the purpose, such as html2text. From its web page:

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

To address your question of batch renaming:

# -mindepth 1 keeps find from also returning "." itself
find . -mindepth 1 -maxdepth 1 -type d -print0 | while IFS= read -r -d '' dirname
do
  python path/to/html2text/html2text.py "${dirname}/index.html" > "${dirname}/${dirname##*/}.txt"
done

Here the find command lists only the directories located directly in the current directory (i.e. it does not recurse), and the read command in the while condition assigns each name to the variable $dirname. The command between do and done is then executed once per directory; here it converts the file as you requested. As @slhck pointed out, this rather involved construct is needed so that directory names containing whitespace won't break anything.
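
The comments on this answer also mention find's -exec as an alternative to piping into a while loop. A minimal sketch of that variant follows, under some assumptions: CONVERTER is a hypothetical variable standing in for the html2text invocation (it defaults to cat so the loop can be tried without html2text installed), and the demo tree in a temporary directory is made up for illustration.

```shell
# Sketch only: CONVERTER is a hypothetical stand-in for the real html2text call.
# It defaults to cat so the loop can be tried without html2text installed.
CONVERTER=${CONVERTER:-cat}

# Throwaway demo tree, including a directory name that contains a space.
demo=$(mktemp -d)
mkdir "$demo/Example1" "$demo/Example 2" "$demo/empty"
printf '<p>one</p>\n' > "$demo/Example1/index.html"
printf '<p>two</p>\n' > "$demo/Example 2/index.html"

# find passes each directory to the inline script as a separate argument,
# so whitespace in names needs no special handling.
find "$demo" -mindepth 1 -maxdepth 1 -type d -exec sh -c '
  conv=$1; shift
  for dir do
    [ -f "$dir/index.html" ] || continue    # skip directories without index.html
    "$conv" "$dir/index.html" > "$dir/${dir##*/}.txt"
  done
' sh "$CONVERTER" {} +
```

The existence check runs before the redirection, so directories without an index.html do not end up with an empty .txt file.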

[Edit]: Another variant to convert all HTML files under the current directory:

find . -iname "*.html" -print0 | while IFS= read -r -d '' filename
  do python path/to/html2text/html2text.py "${filename}" > "${filename%.*}.txt"
done

-iname searches case-insensitively for *.html.

${filename%.*}.txt strips the extension and appends .txt, i.e. if filename is some/path/index.html, ${filename%.*} is some/path/index and finally ${filename%.*}.txt is some/path/index.txt.
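
The same kind of expansion can also produce the name of the parent directory, which is what the question asks for. A small sketch using only standard POSIX parameter expansion; the variable names stem, dir and parent are purely illustrative:

```shell
filename='some/path/index.html'

stem=${filename%.*}     # remove the shortest trailing ".suffix" -> some/path/index
dir=${filename%/*}      # remove the last path component         -> some/path
parent=${dir##*/}       # keep only the last directory name      -> path

printf '%s\n' "$stem.txt"          # some/path/index.txt
printf '%s\n' "$dir/$parent.txt"   # some/path/path.txt, named after the parent directory
```

Note that with several HTML files in one folder, the second naming scheme would map them all to the same output file.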


If you use the Z shell, you can use a much cleaner for loop that does not break on whitespace:

for i (*(/)) python path/to/html2text/html2text.py "${i}/index.html" > "${i}/${i}.txt"

The trick here is that *(/) does filename generation, but only returns directories (/).

[Edit]: The zsh variant to convert all HTML files under the current directory (the option EXTENDEDGLOB must be set):

for i ((#i)**/*.html) {
   python path/to/html2text/html2text.py "$i" > "${i:r}.txt"
}

(#i) enables case-insensitive globbing and **/ searches recursively, so the pattern returns all HTML files under the current working directory. (If symbolic links should be followed, use three stars, ***/, instead of two.)

If you have more than one command inside the for loop, use curly brackets { ... } (unnecessary here, but they won't hurt).

${i:r} strips the extension (r for remove) from the variable $i.

  • @z4nb0t: Sorry, I forgot the redirection > in my first version. Now it should work as stated. (The error was that html2text expects the encoding of the HTML page as an optional second parameter, but got the name of the txt file instead.)
    – mpy
    Commented May 9, 2013 at 7:56
  • I just noticed some folders have multiple HTML files; is this going to be a problem?
    – z4nb0t
    Commented May 9, 2013 at 7:57
  • @z4nb0t: It's no problem, the command given does exactly what you asked for and takes only index.html in every folder. It's easy to use a second for loop to loop also over all html files in the dirs. But you need to specify how the renaming should be done then.
    – mpy
    Commented May 9, 2013 at 7:59
  • This breaks when file names or paths contain whitespace. You should use the -exec option of find or pipe its output into a while loop (but only with the -print0 option). See: mywiki.wooledge.org/ParsingLs
    – slhck
    Commented May 9, 2013 at 8:25
  • @z4nb0t: zsh is very powerful, but of course you need some training time. If you are interested, I really recommend zsh.sourceforge.net/Guide/zshguide.pdf by Peter Stephenson, the current maintainer of zsh. Don't get frightened by its size, IMHO it is easy to read (at least the first chapters ;)
    – mpy
    Commented May 9, 2013 at 10:03
