In a directory there are many folders, each with an .html file inside. I would like every HTML file converted to a new .txt file named after its parent directory.
Example1/Index.html > Example1.txt
Example2/Index.html > Example2.txt
You want to convert some HTML pages into plain text. For that I wouldn't strip the tags with a custom-built solution (e.g. with some sed magic), but would use a tool designed for the purpose, like html2text; from its webpage:
html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).
To address your question of batch renaming:
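# -mindepth 1 keeps find from also returning "." itself
find . -mindepth 1 -maxdepth 1 -type d -print0 | while IFS= read -r -d '' dirname
do
    # ${dirname#./} drops the leading "./" that find adds to each name
    python path/to/html2text/html2text.py "${dirname}/index.html" > "${dirname}/${dirname#./}.txt"
done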
Here the find command lists only the directories located directly in the current directory (i.e. not recursively), and the read command (in the while condition) assigns each one to the variable $dirname. Finally, the command(s) between do and done get executed once per directory; here they convert the files according to your request. As pointed out by @slhck, such a complex command is needed so that directory names containing whitespace won't break anything.
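To see the whitespace handling in action, here is a small sketch that can be run in bash. The temporary tree and the directory names are made up for illustration, and cat is only a stand-in for the html2text call, so the snippet runs without html2text installed.

```shell
# Throwaway tree; the directory names are hypothetical.
tmp=$(mktemp -d)
mkdir "$tmp/Example1" "$tmp/My Example 2"
echo '<p>one</p>' > "$tmp/Example1/index.html"
echo '<p>two</p>' > "$tmp/My Example 2/index.html"
cd "$tmp" || exit 1

# -mindepth 1 keeps find from returning "." itself.
# -print0 and read -d '' pass each name through intact, spaces included.
find . -mindepth 1 -maxdepth 1 -type d -print0 | while IFS= read -r -d '' dirname
do
    name=${dirname#./}    # "./My Example 2" -> "My Example 2"
    # cat is only a stand-in for the html2text call.
    cat "${dirname}/index.html" > "${dirname}/${name}.txt"
done

# List the files that were created.
ls -- */*.txt
```

Each directory should end up with one .txt file named after it, including the directory whose name contains spaces.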
[Edit]: Another variant to convert all HTML files under the current directory:
find . -iname "*.html" -print0 | while IFS= read -r -d '' filename
do python path/to/html2text/html2text.py "${filename}" > "${filename%.*}.txt"
done
-iname searches case-insensitively for *.html. ${filename%.*}.txt strips the extension and appends .txt, i.e. if filename is some/path/index.html, ${filename%.*} is some/path/index and finally ${filename%.*}.txt is some/path/index.txt.
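The expansion can be tried on its own, without touching any files. This is just a sketch; the sample path is made up, and the base variable is only there to show the related ## form.

```shell
# Hypothetical sample path with a space and an upper-case extension.
filename='some/path/My Page.HTML'

# %.* removes the shortest trailing match of ".*", i.e. the extension.
echo "${filename%.*}.txt"    # prints: some/path/My Page.txt

# ##*/ removes the longest leading match of "*/", i.e. the directory part.
base=${filename##*/}
echo "${base%.*}"            # prints: My Page
```

Because % removes only the shortest match, a name such as archive.tar.html would become archive.tar.txt.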
When you use the Z shell, you can use a much cleaner for loop that does not break at whitespace:
for i (*(/)) python path/to/html2text/html2text.py "${i}/index.html" > "${i}/${i}.txt"
The trick here is that *(/) does filename generation, but only returns directories because of the (/) qualifier.
[Edit]: Also in zsh syntax, the variant to convert all HTML files under the current directory (you need the option EXTENDEDGLOB to be set):
for i ((#i)**/*.html) {
python path/to/html2text/html2text.py "$i" > "${i:r}.txt"
}
(#i) uses case-insensitive globbing, and ** searches recursively, hence returning all HTML files under the current working directory. (If symbolic links should be followed, use three stars *** instead of two.)
If you have more than one command inside the for loop, use curly brackets { ... } (unnecessary here, but they won't hurt).
${i:r} strips the extension (r for remove) from the variable $i.
The > redirection was missing in my first version. Now it should work as stated. (The error was that html2text expects the encoding of the HTML page as an optional second parameter, but got the name of the txt file instead.)
-print0 option. See: mywiki.wooledge.org/ParsingLs