Web Scraping with Nokogiri

% ruby scrape_test.rb 
"About - Spinor Lab"
"HomeAboutWebSDRIC-7410 Rig ControlSoftware Defined RadioSoftwareAntennaMorse Keys and PaddlesLoTWeQSLLinksSitemapHome (in Japanese)Type: Squid-faced HumanoidsAffiliated with: Second Great and Bountiful Human EmpireHome planet: Ood Sphere, Horsehead Nebula SKCC: #11158\n"

% nokogiri --version
# Nokogiri (1.8.0)

% ruby --version
ruby 2.3.4p301 (2017-03-30 revision 58214) [x86_64-darwin16]
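
require 'nokogiri'
require 'open-uri'

url = 'https://spinorlab.matrix.jp/en/about/'

# Fetch the page and remember the charset reported by the server.
# (On Ruby 3.0 and later, call URI.open(url) instead of open(url).)
charset = nil
html = open(url) do |f|
  charset = f.charset
  f.read
end

# Parse the HTML and collect every <li> element.
doc = Nokogiri::HTML.parse(html, nil, charset)
list = doc.search('li')

p doc.title   # page title
p list.text   # text of all <li> elements, concatenated

# Print the text of each <li> element on its own line.
list.each do |t|
  p t.text
end
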
% ruby scrape_test.rb 
"About - Spinor Lab"
"HomeAboutWebSDRIC-7410 Rig ControlSoftware Defined RadioSoftwareAntennaMorse Keys and PaddlesLoTWeQSLLinksSitemapHome (in Japanese)Type: Squid-faced HumanoidsAffiliated with: Second Great and Bountiful Human EmpireHome planet: Ood Sphere, Horsehead Nebula SKCC: #11158\n"
"Home"
"About"
"WebSDR"
"IC-7410 Rig Control"
"Software Defined Radio"
"Software"
"Antenna"
"Morse Keys and Paddles"
"LoTW"
"eQSL"
"Links"
"Sitemap"
"Home (in Japanese)"
"Type: Squid-faced Humanoids"
"Affiliated with: Second Great and Bountiful Human Empire"
"Home planet: Ood Sphere, Horsehead Nebula"
"SKCC: #11158\n"
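
require 'nokogiri'
require 'open-uri'

# On Ruby 3.0 and later, call URI.open instead of open.
doc = Nokogiri::HTML(open('http://nuttycellist-unknown.blogspot.jp/'))

# Select the date headers and the post title links.
# (In the output below, all dates come first, then all titles.)
doc.css('h2.date-header', 'h3.post-title a').each do |t|
  puts t.text
end
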
% ruby scrape_test2.rb 
8/10/2017
8/05/2017
7/31/2017
7/18/2017
7/09/2017
7/06/2017
Seventy two years have passed
Lorin WA1PGB
It's the season of Indian Lilac
In commemoration of my history
Discussing proper sending way of CW
Kidney beans and carrot rolled with pork
My Father's 13th Anniversary

Broken Links (5)

With the following functions.php, there are no broken links now.

<?php
function my_theme_enqueue_styles() {
    $parent_style = 'twentyseventeen-style';
    wp_enqueue_style( $parent_style, get_template_directory_uri() . '/style.css' );
    wp_enqueue_style( 'child-style',
        get_stylesheet_directory_uri() . '/style.css',
        array( $parent_style )
    );
}
add_action( 'wp_enqueue_scripts', 'my_theme_enqueue_styles' );

function my_update_styles( $styles ) {
    $mtime = filemtime( get_stylesheet_directory() . '/style.css' );
    $styles->default_version = date("Y-m-d-D-H:i:s", $mtime);
}
add_action( 'wp_default_styles', 'my_update_styles' );

remove_action( 'wp_head', 'wp_resource_hints', 2 );

function my_dns_prefetch() {
    echo '<link rel="dns-prefetch" href="//fonts.googleapis.com/css?family=Asap|Julee" />' . "\n";
}
add_action( 'wp_head', 'my_dns_prefetch' );
?>

The generated HTML is:

45 <link rel='stylesheet' id='twentyseventeen-style-css'  href='https://spinorlab.matrix.jp/en/wp-content/themes/twentyseventeen/style.css?ver=2017-08-03-Thu-00:25:48' type='text/css' media='all' />
46 <link rel='stylesheet' id='child-style-css'            href='https://spinorlab.matrix.jp/en/wp-content/themes/twentyseventeen-child/style.css?ver=2017-08-03-Thu-00:25:48' type='text/css' media='all' />
47 <link rel='stylesheet' id='twentyseventeen-fonts-css'  href='https://fonts.googleapis.com/css?family=Libre+Franklin%3A300%2C300i%2C400%2C400i%2C600%2C600i%2C800%2C800i&#038;subset=latin%2Clatin-ext' type='text/css' media='all' />
60 <link rel="dns-prefetch" href="//fonts.googleapis.com/css?family=Asap|Julee" />

Since line 47 already links to fonts.googleapis.com, line 60, which my functions.php adds, is not necessary right now, but I am keeping it for future use.

Broken Links (4)

It seems that I should have used get_stylesheet_directory_uri() rather than get_stylesheet_directory(). The former returns a URL, while the latter returns a path on the server's file system.

function theme_enqueue_styles() {
  wp_enqueue_style( 'parent-style', get_template_directory_uri()   . '/style.css' );
  wp_enqueue_style( 'child-style' , get_stylesheet_directory_uri() . '/style.css', array('parent-style') );
}

The generated HTML is (spaces added for readability):

48 <link rel='stylesheet' id='parent-style-css'           href='https://spinorlab.matrix.jp/en/wp-content/themes/twentyseventeen/      style.css?ver=2017-08-03-Thu-00:25:48' type='text/css' media='all' />
49 <link rel='stylesheet' id='child-style-css'            href='https://spinorlab.matrix.jp/en/wp-content/themes/twentyseventeen-child/style.css?ver=2017-08-03-Thu-00:25:48' type='text/css' media='all' />
51 <link rel='stylesheet' id='twentyseventeen-style-css'  href='https://spinorlab.matrix.jp/en/wp-content/themes/twentyseventeen-child/style.css?ver=2017-08-03-Thu-00:25:48' type='text/css' media='all' />

Lines 49 and 51 load the same file, so one of them is redundant. And giving the same timestamp to two different style files, parent and child, doesn’t look nice, I must admit.

Broken Links (3)

Another broken link comes from line 49:

48 <link rel='stylesheet' id='parent-style-css'  href='https://spinorlab.matrix.jp/en/wp-content/themes/twentyseventeen/style.css?ver=2017-08-03-Thu-00:25:48' type='text/css' media='all' />
49 <link rel='stylesheet' id='child-style-css'  href='https://spinorlab.matrix.jp/en/home/xxxx/www/spinorlab/en/wp-content/themes/twentyseventeen-child/style.css?ver=2017-08-03-Thu-00:25:48' type='text/css' media='all' />
51 <link rel='stylesheet' id='twentyseventeen-style-css'  href='https://spinorlab.matrix.jp/en/wp-content/themes/twentyseventeen-child/style.css?ver=2017-08-03-Thu-00:25:48' type='text/css' media='all' />

Perhaps this relates to the following lines in my functions.php:

function theme_enqueue_styles() {
  wp_enqueue_style( 'parent-style', get_template_directory_uri() . '/style.css' );
  wp_enqueue_style( 'child-style', get_stylesheet_directory() . '/style.css', array('parent-style') );
}

The function get_stylesheet_directory() returns the string “/home/xxx/www/spinorlab/en/wp-content/themes/twentyseventeen-child”, so where does the “https://spinorlab.matrix.jp/en” prefix come from?
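
One plausible explanation for the prefix can be sketched in Ruby, the language of my scraping scripts, so that it runs on its own. The assumption, which I have not checked against the WordPress source, is that any style source that does not look like an absolute URL gets the site's base URL prepended. The function name resolve_style_src and the rule itself are made up for illustration.

```ruby
# Hypothetical model of how a style source might be turned into an href.
# Assumption: anything that is not an absolute or protocol-relative URL
# gets the site base URL prepended as-is.
def resolve_style_src(src, base_url)
  return src if src =~ %r{\A(https?:)?//}
  base_url + src
end

base = 'https://spinorlab.matrix.jp/en'
path = '/home/xxx/www/spinorlab/en/wp-content/themes/twentyseventeen-child'

# A file-system path is not a URL, so it ends up glued onto the base.
puts resolve_style_src(path + '/style.css', base)

# A proper URL is passed through unchanged.
puts resolve_style_src(base + '/wp-content/themes/twentyseventeen-child/style.css', base)
```

Under that assumption, the first call prints a URL of the same shape as the href in line 49, with the server path appended to the site address.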

Broken Links (2)

Here is another broken link. The error comes from the Meta section in the sidebar.

I really don’t need this menu, so I simply deleted the whole section.

Broken Links

This is one of the broken links reported by the W3C Link Checker.

25	<link rel='dns-prefetch' href='//fonts.googleapis.com' />

It seems that an HTTP(S) request to fonts.googleapis.com needs a /css?family=… part, as in the following.

So what should I do?

  • Ignore the broken links, and do nothing?
  • Modify the dns-prefetch line?
  • Or, just delete the dns-prefetch line?
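
When weighing the second option, it may help to remember that a dns-prefetch hint is only about resolving a host name, so the path and query should make no difference to the prefetch itself. The small Ruby sketch below, with a made-up helper name prefetch_host, shows that the bare href and one with a /css?family= part name the same host.

```ruby
require 'uri'

# Hypothetical helper: return the host name of a protocol-relative
# or absolute href.
def prefetch_host(href)
  href = 'https:' + href if href.start_with?('//')
  URI.parse(href).host
end

puts prefetch_host('//fonts.googleapis.com')
puts prefetch_host('//fonts.googleapis.com/css?family=Asap')
```

So changing the href would only change what the link checker requests, not which name gets looked up.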

Link Checker

It is good practice to check whether there are any broken links on your site.
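
As a rough idea of what such a check involves, here is a minimal Ruby sketch using only the standard library. It is not how any particular link checker works: the helper names extract_links and link_ok? are my own, the regular expression is a crude stand-in for a real HTML parser, and the sample HTML is made up from lines that appear on my pages.

```ruby
require 'uri'
require 'net/http'

# Hypothetical helper: collect href/src attribute values from an HTML string
# and resolve them against the URL of the page they came from.
def extract_links(html, page_url)
  html.scan(/(?:href|src)=["']([^"']+)["']/i).flatten.map do |ref|
    URI.join(page_url, ref).to_s
  end.uniq
end

# Hypothetical helper: true when a HEAD request returns a status below 400.
# Any network or parsing error counts as a broken link.
def link_ok?(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri).code.to_i < 400
  end
rescue StandardError
  false
end

html = <<~HTML
  <link rel='dns-prefetch' href='//fonts.googleapis.com' />
  <a href="/en/about/">About</a>
HTML

links = extract_links(html, 'https://spinorlab.matrix.jp/en/')
puts links
# To check them over the network:
#   links.each { |l| puts "#{l} #{link_ok?(l) ? 'ok' : 'BROKEN'}" }
```

The network part is left commented out so that the sketch runs offline; only the extraction step is exercised.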

Since I corrected my .htaccess file, there should be no broken links now.

Oops! Something is still wrong.

URL Rewriting

URL rewriting can be a source of suffering unless you are very careful in applying the rules.

For the last few days, only the front page of the English version of my site has been accessible, due to an error in its .htaccess file.

% diff ./en/.htaccess ./ja/.htaccess 
21c21
< RewriteBase /spinorlab.matrix.jp/en/
---
> RewriteBase /ja/
25c25
< RewriteRule . /spinorlab.matrix.jp/en/index.php [L]
---
> RewriteRule . /ja/index.php [L]

I suppose the error was introduced by the plugin I was trying to use to move my English site to HTTPS.
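
In the diff, the English file has the host name inside the rewrite paths, where the Japanese file has only /ja/. A mistake like this could be caught with a small script. The Ruby sketch below is only a heuristic under my own assumption that a dot in the first path segment means a host name has leaked into the path; the function name suspicious_rewrite_lines is made up.

```ruby
# Hypothetical check: flag RewriteBase/RewriteRule lines in which some path
# has a first segment that looks like a host name.
# Assumption: a dot in the first path segment (e.g. /spinorlab.matrix.jp/en/)
# is a sign of trouble; a dot later on (e.g. /ja/index.php) is fine.
def suspicious_rewrite_lines(htaccess_text)
  htaccess_text.each_line.select do |line|
    next false unless line =~ /\A\s*Rewrite(Base|Rule)\b/
    line.split.any? { |token| token =~ %r{\A/[^/]*\.[^/]*/} }
  end.map(&:strip)
end

en = <<~HTACCESS
  RewriteBase /spinorlab.matrix.jp/en/
  RewriteRule . /spinorlab.matrix.jp/en/index.php [L]
HTACCESS

ja = <<~HTACCESS
  RewriteBase /ja/
  RewriteRule . /ja/index.php [L]
HTACCESS

puts suspicious_rewrite_lines(en)        # both English lines are flagged
puts suspicious_rewrite_lines(ja).empty? # nothing is flagged for Japanese
```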