Parse multiple HTML to text and rename as parent Directory

Question

A directory contains many folders, each with .html files inside. I would like every HTML file converted to a new .txt file named after its parent directory.

Example1/Index.html > Example1.txt

Example2/Index.html > Example2.txt

Something like this works with individual files using sed: cat file | sed 's/<b>.*<\/b>//g' — z4nb0t, Commented May 9, 2013 at 7:19
@z4nb0t it's generally accepted that using regex to parse HTML will lead to elder gods from before the start of time awakening from their eternal slumber to consume your computer and all of humanity. — evilsoup, Commented May 9, 2013 at 10:50

mpy · Accepted Answer · 2013-05-09 12:07:57Z

You want to convert HTML pages into plain text. For that I wouldn't strip the tags with a custom-built solution (e.g. some sed magic), but use a tool designed for the purpose, such as html2text; from its webpage:

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

To address your question of batch renaming:

# -mindepth 1 skips "." itself, which would otherwise produce a stray "..txt"
find . -mindepth 1 -maxdepth 1 -type d -print0 | while IFS= read -r -d '' dirname
 do python path/to/html2text/html2text.py "${dirname}/index.html" > "${dirname}/${dirname}.txt"
done

Here the find command lists only the directories directly inside the current directory (i.e. not recursively), and the read command (in the while condition) assigns each one in turn to the variable $dirname. The command(s) between do and done are then executed for each directory; here they convert the file as you requested. As pointed out by @slhck, this more elaborate construct is needed so that directory names containing whitespace won't break anything.
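To see that the NUL-separated loop really copes with whitespace, here is a minimal bash sketch that can be run in a throwaway directory. It is an illustration under stated assumptions, not part of the original answer: convert_dirs is a hypothetical helper, cat stands in for html2text, and the check that skips folders without index.html is an addition. It passes -mindepth 1 so that the starting directory itself is skipped, and uses ${dirname##*/} to keep only the last path component for the output name.

```shell
#!/usr/bin/env bash
# Sketch only: convert_dirs is a hypothetical helper, not part of html2text.
# Usage: convert_dirs ROOT CONVERTER [ARGS...]
convert_dirs() {
  local root=$1; shift            # remaining arguments form the converter command
  local dirname
  # NUL separators (-print0 / read -d '') survive any character in a file name
  find "$root" -mindepth 1 -maxdepth 1 -type d -print0 |
  while IFS= read -r -d '' dirname; do
    [ -f "${dirname}/index.html" ] || continue   # assumption: skip folders without index.html
    "$@" "${dirname}/index.html" > "${dirname}/${dirname##*/}.txt"
  done
}

# Demo in a temporary directory, with `cat` standing in for the real converter
demo=$(mktemp -d)
mkdir -p "$demo/Example1" "$demo/My Folder"
printf '<b>one</b>\n' > "$demo/Example1/index.html"
printf '<b>two</b>\n' > "$demo/My Folder/index.html"
convert_dirs "$demo" cat
ls "$demo/My Folder"
```

With the real tool the call would look like convert_dirs . python path/to/html2text/html2text.py, run from the directory that holds the folders.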

[Edit]: Another variant to convert all HTML files under the current directory:

find . -iname "*.html" -print0 | while IFS= read -r -d '' filename
  do python path/to/html2text/html2text.py "${filename}" > "${filename%.*}.txt"
done

-iname matches *.html case-insensitively.

${filename%.*}.txt strips the extension and appends .txt, i.e. if filename is some/path/index.html, ${filename%.*} is some/path/index and finally ${filename%.*}.txt is some/path/index.txt.
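The expansion can be tried on its own at a bash prompt. The short sketch below uses made-up sample paths and also shows two related forms, %% and ##, for comparison:

```shell
#!/usr/bin/env bash
# Suffix/prefix removal with parameter expansion (sample paths are made up)
filename='some/path/index.html'
echo "${filename%.*}.txt"    # % removes the shortest trailing match of ".*"  -> some/path/index.txt
echo "${filename##*/}"       # ## removes the longest leading match of "*/"   -> index.html

dotted='some/path/archive.tar.html'
echo "${dotted%.*}.txt"      # only the last extension goes     -> some/path/archive.tar.txt
echo "${dotted%%.*}.txt"     # %% cuts from the first dot onward -> some/path/archive.txt
```

Because % removes the shortest match, only the final extension is dropped, which is usually what is wanted when renaming.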

If you use the Z shell (zsh), you can use a much cleaner for loop that doesn't break at whitespace:

for i (*(/)) python path/to/html2text/html2text.py "${i}/index.html" > "${i}/${i}.txt"

The trick here is that *(/) does filename generation, but only returns directories (/).

[Edit]: The variant to convert all HTML files under the current directory, also in zsh syntax (the option EXTENDEDGLOB must be set):

for i ((#i)**/*.html) {
   python path/to/html2text/html2text.py "$i" > "${i:r}.txt"
}

(#i) enables case-insensitive globbing and ** searches recursively, so the pattern returns all HTML files under the current working directory. (If symbolic links should be followed, use three stars *** instead of two.)

If you have more than one command inside the for loop, use curly brackets { ... } (unnecessary here, but they won't hurt).

${i:r} strips the extension (r for remove) from the variable $i.
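For readers who stay in bash, a rough counterpart of the recursive zsh loop can be sketched with the globstar and nocaseglob shell options. This assumes bash 4 or newer and is not part of the original answer; convert_tree is a hypothetical helper, and the function body runs in a subshell so the shopt settings do not leak into the calling shell.

```shell
#!/usr/bin/env bash
# Sketch only: convert_tree is a hypothetical helper (requires bash >= 4 for globstar).
# Usage: convert_tree ROOT CONVERTER [ARGS...]
convert_tree() (
  root=$1; shift
  # globstar: ** recurses into subdirectories; nocaseglob: also match .HTML;
  # nullglob: the loop body is skipped when nothing matches
  shopt -s globstar nocaseglob nullglob
  for f in "$root"/**/*.html; do
    "$@" "$f" > "${f%.*}.txt"   # same extension swap as in the find variant
  done
)
```

A call such as convert_tree . python path/to/html2text/html2text.py would then write a .txt file next to every HTML file below the current directory.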

@z4nb0t: Sorry, I forgot the redirection > in my first version. Now it should work as stated. (The error was that html2text expects an optional second parameter giving the encoding of the HTML page, but got the name of the txt file instead.) — mpy, Commented May 9, 2013 at 7:56
I just noticed some folders have multiple HTML files; is this going to be a problem? — z4nb0t, Commented May 9, 2013 at 7:57
@z4nb0t: It's no problem, the command given does exactly what you asked for and takes only index.html in every folder. It's easy to use a second for loop to loop also over all html files in the dirs. But you need to specify how the renaming should be done then. — mpy, Commented May 9, 2013 at 7:59
This breaks when file names or paths contain whitespace. You should use the -exec option of find or pipe its output into a while loop (but only with the -print0 option). See: mywiki.wooledge.org/ParsingLs — slhck, Commented May 9, 2013 at 8:25
@z4nb0t: zsh is very powerful, but of course you need some training time. If you are interested, I really recommend zsh.sourceforge.net/Guide/zshguide.pdf by Peter Stephenson, the current maintainer of zsh. Don't get frightened by its size, IMHO it is easy to read (at least the first chapters ;) — mpy, Commented May 9, 2013 at 10:03
