Blog Summary (2)

You may also want to know the date on which each article was published. Looking into the HTML file again, you will see a line like this:

<h2 class='date-header'><span>12/31/2013</span></h2>

So your code, myprog.awk, might be:

BEGIN {FS = "[<>]"}
/<h2 class='date-header'>/ {date=$5}
/post-title entry-title/ {getline;print date,$3}

And you will obtain:

12/31/2013 Taxi Driver
12/22/2013 The Accused
12/15/2013 The Silence of the Lambs
12/08/2013 Nell
12/01/2013 Contact
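If you want to try the program without downloading anything, here is a minimal sketch that runs the same three rules on a two-post stand-in file. The file names sample-archive.html and demo.awk, and the second post's URL, are made up for illustration. Plain awk is used here; gawk should behave the same way for this program.

```shell
# A tiny stand-in for the archive page (hypothetical content modeled on the lines above).
cat > sample-archive.html <<'EOF'
<h2 class='date-header'><span>12/31/2013</span></h2>
<h3 class='post-title entry-title' itemprop='name'>
<a href='http://yourFBblog.com/2013/12/taxi-driver.html'>Taxi Driver</a>
</h3>
<h2 class='date-header'><span>12/22/2013</span></h2>
<h3 class='post-title entry-title' itemprop='name'>
<a href='http://yourFBblog.com/2013/12/the-accused.html'>The Accused</a>
</h3>
EOF

# The same rules as myprog.awk, saved under a different name so nothing is overwritten.
cat > demo.awk <<'EOF'
BEGIN {FS = "[<>]"}
/<h2 class='date-header'>/ {date=$5}
/post-title entry-title/ {getline;print date,$3}
EOF

# Remember the most recent date header, then print it next to each title.
awk -f demo.awk sample-archive.html
```

The date rule only stores the value; nothing is printed until a title line comes along, so several posts on the same day would all get the same date.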

If you wish to make a list spanning many months, first make a file, say getlist.txt, that lists the archive URLs:

http://yourFBblog.com/2014-01-archive.html
...
http://yourFBblog.com/2011-04-archive.html

Then download all the files and process them:

% wget -i getlist.txt
% gawk -f myprog.awk 201*-archive.html | cat -n

The result will be something like this:

  1  1/29/2014 My Latest Post
...
402  4/01/2011 My First Post
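The list file does not have to be typed by hand. Here is a rough sketch that writes one URL per month from April 2011 to January 2014, assuming every month in that range has an archive page that follows the same URL pattern. It overwrites getlist.txt and lists the oldest month first, which makes no difference to wget -i.

```shell
# Write one archive URL per month, 2011-04 through 2014-01 inclusive.
awk 'BEGIN {
    for (y = 2011; y <= 2014; y++)
        for (m = 1; m <= 12; m++) {
            if (y == 2011 && m < 4) continue   # the list begins at April 2011
            if (y == 2014 && m > 1) continue   # the list ends at January 2014
            printf "http://yourFBblog.com/%04d-%02d-archive.html\n", y, m
        }
}' > getlist.txt
```

If a month has no posts, wget will simply report an error for that URL and carry on with the rest.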

Actually, my current version of myprog.awk is:

BEGIN {FS = "[<>]"}
/<h2 class='date-header'>/ {
    date=$5; if(length(date)==9) date="0" date   # pad a one-digit month
    month=substr(date,1,2); day=substr(date,4,2); year=substr(date,7,4)
}
/post-title entry-title/ {getline; printf("%s-%s-%s %s\n",year,month,day,$3)}

This minor modification was needed because the months from January to September are written with one digit (1 to 9), while October to December are written with two digits (10 to 12). The program pads the short form with a leading zero and prints the date as year-month-day, so the output is something like:

2013-12-31 Auld Lang Syne
2014-01-01 DX Pedition to Mars
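The length check works as long as the day always has two digits. If you are not sure about that, one alternative (not what my program does) is to split the date on "/" and let sprintf do the zero padding. The function name isodate and the file name isodate.awk are made up for this sketch, and the three input dates are test values.

```shell
cat > isodate.awk <<'EOF'
# Convert "m/d/yyyy" (month and day with or without a leading zero) to "yyyy-mm-dd".
function isodate(mdy,    part) {
    split(mdy, part, "/")
    return sprintf("%04d-%02d-%02d", part[3], part[1], part[2])
}
# Each input line holds one date in its first whitespace-separated field.
{ print isodate($1) }
EOF

printf '%s\n' 1/01/2014 12/31/2013 9/5/2013 | awk -f isodate.awk
```

This prints 2014-01-01, 2013-12-31 and 2013-09-05, one per line. To use it in myprog.awk you would copy the function in and call isodate($5) in the date-header rule.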

Blog Summary

Sometimes you may wish to get the titles of all the previous articles in your own or someone else’s blog. There should be numerous ways to attain the goal, and here is one approach.

Suppose all the articles in December 2013 are in, say,
http://yourFBblog.com/2013-12-archive.html.

First, get the file with:

% wget http://yourFBblog.com/2013-12-archive.html

Looking into the file, you will see something like this:

<h3 class='post-title entry-title' itemprop='name'>
<a href='http://yourFBblog.com/2013/12/taxi-driver.html'>Taxi Driver</a>
</h3>

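This suggests that you need the line just after the one containing the phrase “post-title entry-title”, and that the title you wish to obtain is the third field of that line when “<” and “>” are used as field separators (the first field is the empty string before the leading “<”). Therefore, what you need is: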

% gawk 'BEGIN {FS = "[<>]"} /post-title entry-title/ {getline; print $3}' 2013-12-archive.html

And you will get something like this:

Taxi Driver
The Accused
The Silence of the Lambs
Nell
Contact
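If you are wondering why the title turns up in $3, it helps to print every field of the link line. This is only a small sketch; the function name show_fields is made up, and plain awk is used, though gawk should split the line the same way.

```shell
# Print each field of the link line, using "<" and ">" as field separators.
show_fields() {
    printf '%s\n' "<a href='http://yourFBblog.com/2013/12/taxi-driver.html'>Taxi Driver</a>" |
    awk 'BEGIN {FS = "[<>]"} {for (i = 1; i <= NF; i++) printf "$%d = [%s]\n", i, $i}'
}

show_fields
```

You will see that $1 is empty, because the line starts with a separator, that $2 holds the tag with its href, and that $3 is the title.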