Description
As I mentioned in #47 (comment), the newsletter parser gets the book title from the URL behind the cover image.
packtpub-crawler/script/packtpub.py
Line 101 in e604cc1
urlWithTitle = div_target.select('div.promo-landing-book-picture a')[0]['href']
This works fine when the link on the landing page points to the main book page, as was the case here: https://www.packtpub.com/packt/free-ebook/amazon-web-services-free
<a href="/networking-and-servers/mastering-aws-development">
<img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/3632EN_Mastering AWS Development.jpg" class="bookimage" />
</a>
but yields unexpected results when this href points to something else, for example a cover image, as here: https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2
<a class="fancybox" href="///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg">
<img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg" class="bookimage" />
</a>
The latter will result in the title being built from the image file name instead of the book slug:
packtpub-crawler/script/packtpub.py
Line 102 in e604cc1
title = urlWithTitle.split('/')[-1].replace('-', ' ').title()
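To make the difference concrete, here is a small sketch that applies the same expression as line 102 to the two hrefs quoted above. The helper name `title_from_href` is made up for illustration and is not part of the crawler.

```python
def title_from_href(href):
    # Same expression as packtpub.py line 102, wrapped in a hypothetical helper.
    return href.split('/')[-1].replace('-', ' ').title()


# href taken from the amazon-web-services-free landing page snippet
book_href = '/networking-and-servers/mastering-aws-development'

# href taken from the what-you-need-know-about-angular-2 landing page snippet
image_href = ('///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/'
              'nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg')

print(title_from_href(book_href))   # Mastering Aws Development
print(title_from_href(image_href))  # 5612_Wyntkangular_Ebook_500X617.Jpg
```

Note that `str.title()` also lowercases the rest of each word, so even the working case turns "AWS" into "Aws".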
An alternative would be to use the string inside the h1 tag of the title-bar-title div, as in mkarpiarz@c583d37.
But this doesn't always seem reliable either, since the h1 can hold a promotional heading rather than the book title, e.g.:
<div id="title-bar-title"><h1>Free Amazon Web Services eBook</h1></div>
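One possible way to combine the two sources, as a rough sketch only: keep using the href when it looks like a book slug, and fall back to the h1 text when it looks like an image file. The helper `choose_title` and the `IMAGE_EXTENSIONS` list are assumptions for illustration, not something the crawler currently has, and the fallback inherits the h1 problem shown above.

```python
import os

# Assumed list of extensions that mark the href as an image, extend as needed.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif')


def choose_title(href, h1_text):
    """Hypothetical helper: prefer the book slug, fall back to the <h1> text."""
    slug = href.rstrip('/').split('/')[-1]
    extension = os.path.splitext(slug)[1].lower()
    if slug and extension not in IMAGE_EXTENSIONS:
        # href points at a book page, so derive the title as line 102 does
        return slug.replace('-', ' ').title()
    # href points at an image (or is empty), so use the heading text as-is
    return h1_text.strip()


print(choose_title('/networking-and-servers/mastering-aws-development',
                   'Free Amazon Web Services eBook'))
```

With the AWS landing page values this still returns the slug-based title, so the promotional h1 is only used when the href is unusable.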