Blog to Ebook Conversion (2)

In many cases, your blog contains images that are stored on the site where the blog resides. You need to download all of those images so that they can be included in your ebook.

Typically, the HTML code looks something like this:

<img border="0" src="http://yourFBblog.com/-BQdcDYPg-yI/UvHQo26PFuI/AAAAAAAABGY/cUzndRZG1lw/s1600/001.jpg" height="213" width="320" /></a>

From this code, you generate a wget command such as:

wget http://yourFBblog.com/-BQdcDYPg-yI/UvHQo26PFuI/AAAAAAAABGY/cUzndRZG1lw/s1600/001.jpg

This wget command creates the file “001.jpg” in your current directory, which is fine, but keep in mind that the URLs are unique while the file names are not. If several different URLs share the same file name, such as “001.jpg”, wget automatically writes the files “001.jpg”, “001.jpg.1”, “001.jpg.2”, and so on. This is sometimes inconvenient. The original file names may also be very long, or encoded in a character set that you would rather not handle.

Therefore, I use this short awk program:

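/<img/ {                                # only lines that contain an img tag
 sub(/<img.*src="/, "")                 # drop the tag text in front of the URL
 sub(/\.JPG.*$/     , ".JPG")           # cut everything after the extension
 sub(/\.jpg.*$/     , ".jpg")
 printf("wget -O myfile%03d.jpg %s\n",count, $0)
 count++                                # next image gets the next serial number
}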

This program generates a wget command each time it finds an img tag. Running the generated commands stores the images in serially numbered files, such as “myfile000.jpg”, “myfile001.jpg”, and so on.
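
To see what the generated commands look like without touching the network, here is a small sketch that feeds a few sample lines into a variant of the program. The example.com URLs and the gen_wget function name are made up for illustration. As a variation, this sketch cuts the URL at its closing double quote instead of at the .jpg extension, and it wraps the URL in quotes so the shell does not interpret special characters in it.

```shell
#!/bin/sh
# Sketch: print one wget command per line that contains an img tag.
# gen_wget is a hypothetical helper name, not part of any tool.
gen_wget() {
  awk '
    /<img/ {
      sub(/^.*<img[^>]*src="/, "")   # strip everything before the URL
      sub(/".*$/, "")                # strip everything from the closing quote on
      printf("wget -O myfile%03d.jpg \"%s\"\n", count++, $0)
    }'
}

# Two different URLs that share the file name 001.jpg, plus a non-image line.
cmds=$(printf '%s\n' \
  '<img border="0" src="http://example.com/a/001.jpg" height="213" width="320" /></a>' \
  '<p>some text</p>' \
  '<img src="http://example.com/b/001.jpg" />' | gen_wget)

printf '%s\n' "$cmds"
```

The two images would end up in “myfile000.jpg” and “myfile001.jpg”, even though both are called “001.jpg” on the server. To actually download them, you could save the printed commands to a file and run it with sh.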

BEGIN{
 nrsave_date =-999
 nrsave_title=-999
 img_count=0
}

 /<h2 class='date-header'>/ {nrsave_date=NR}
 /<h3 class='post-title entry-title' itemprop='name'>/ {nrsave_title=NR}
 NR==nrsave_date +2 {print "<h1>" $0 "</h1>"}
 NR==nrsave_title+2 {print "<h2>" $0 "</h2>"}

 /post-body entry-content/, /post-footer/ {
  gsub("&nbsp;", " ")   # turn non-breaking-space entities into plain spaces
  if(/^[[:blank:]]+$/) next
  if(/^<img/) {
   printf("<img src=\"../Images/myfile%03d.jpg\" />\n", img_count)
   img_count++
  }
  if(!/^</) print $0
 }

The awk program above generates an HTML file in which each img src is rewritten as a local reference. The numbering follows the same serial order as the downloaded files. The HTML file, along with the image files, can then be converted into an ebook with an authoring tool such as Sigil.
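
The rewriting step can be tried on a tiny sample, as in the sketch below. The sample markup is invented and much simpler than a real blog page, and rewrite_imgs is a hypothetical helper name. The sketch keeps only the post-body part of the program: img lines become local references, other lines starting with a tag are dropped, and plain text lines are kept.

```shell
#!/bin/sh
# Sketch: rewrite remote img tags inside the post body as local references.
# rewrite_imgs is a hypothetical helper name used only in this example.
rewrite_imgs() {
  awk '
    /post-body entry-content/, /post-footer/ {
      if (/^<img/) {
        printf("<img src=\"../Images/myfile%03d.jpg\" />\n", img_count++)
        next
      }
      if (!/^</ && !/^[ \t]*$/) print   # keep text lines, drop tags and blanks
    }'
}

html=$(printf '%s\n' \
  '<div class="post-body entry-content">' \
  'First paragraph.' \
  '<img border="0" src="http://example.com/a/001.jpg" />' \
  'Second paragraph.' \
  '<div class="post-footer">' | rewrite_imgs)

printf '%s\n' "$html"
```

The output contains the two text lines with the rewritten img tag between them. The “../Images/” path assumes that the image files sit in an Images folder next to the folder holding the HTML file, as in the program above.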