The Pirate Bay (TPB) search results to RSS conversion script

Update (2011-06-26):
I’ve updated the post below with a Perl version, which works much better than the original bash version.
Just make sure you have the “libhtml-treebuilder-xpath-perl” package installed.
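
If you are not sure whether the modules are already there, a quick check along these lines may help. It is only a sketch: it asks Perl to load each module the script uses, and the package names in the comments are the Debian/Ubuntu ones.

```shell
#!/bin/sh
# Try to load each module the script uses and report the ones that are missing.
# HTML::TreeBuilder::XPath comes from libhtml-treebuilder-xpath-perl;
# Date::Format and Date::Parse come from libtimedate-perl (Debian/Ubuntu names).
missing=""
for module in HTML::TreeBuilder::XPath Date::Format Date::Parse; do
    if ! perl -M"$module" -e 1 2>/dev/null; then
        missing="$missing $module"
    fi
done
if [ -n "$missing" ]; then
    echo "Missing Perl modules:$missing"
else
    echo "All required Perl modules are installed"
fi
```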

For those of you who are also incredibly annoyed by the fact that The Pirate Bay doesn’t let you create a custom RSS feed from your searches, I’ve got just the script for you:

#!/usr/bin/perl -w
use strict;

use HTML::TreeBuilder::XPath;	# You'll need libhtml-treebuilder-xpath-perl
use Date::Format;
use Date::Parse;
#use Data::Dumper;

my $htdoc=HTML::TreeBuilder::XPath->new();
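if ($#ARGV>=0 && -e $ARGV[0]) {
	$htdoc->parse_file($ARGV[0]);	# A file name was given: parse that file
} else {
	my @in=<STDIN>;	# Otherwise read the whole page from stdin
	$htdoc->parse_content(join('',@in));
}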

## Get the Unix time from the TPB date formats (the caller formats it as rfc-2822)
# $1: String		# Date string in TPB format
# return: integer	# Unix time (0 if the format is not recognised)
sub get_date {
	my $tpb_date=$_[0];
	#print STDERR "DEBUG: ".$tpb_date."\n";
	if (!defined($tpb_date)) {die ('tpb_date given to get_date() is undefined - a date string in a TPB format expected.');}
	my $time=time();	# Current epoch time
	my ($second,$minute,$hour,$day,$month,$year)=localtime();
	$month=$month+1;	# Months start at 0
	$year=$year+1900;	# Years are counted from 1900
	my $fmt='%04d-%02d-%02d %02d:%02d:00';	# Date layout handed to str2time()
	# Replace &nbsp; / 0xA0 with space
	$tpb_date=~s/&nbsp;|\xA0/ /g;
	## Get the item's date string to be converted into a date
	# "M mins ago" - e.g. "7 mins ago"
	if ($tpb_date=~m/^[0-6]?[0-9] mins ago$/i) {	# In case we get "60 mins ago"...
		my ($min)=$tpb_date=~m/^([0-6]?[0-9]) mins ago$/i;
		$time=$time-($min*60);
	# "Today HH:MM" - e.g. "Today 08:24"
	} elsif ($tpb_date=~m/^Today [0-9]{2}:[0-9]{2}$/i) {
		($hour, $minute)=$tpb_date=~m/^Today ([0-9]{2}):([0-9]{2})$/i;
		$time=str2time(sprintf($fmt,$year,$month,$day,$hour,$minute));
	# "Y-day HH:MM" - e.g. "Y-day 16:23"
	} elsif ($tpb_date=~m/^Y-day [0-9]{2}:[0-9]{2}$/i) {
		# Get the time of today at 16:23, then take away 24 hours (in seconds)
		($hour, $minute)=$tpb_date=~m/^Y-day ([0-9]{2}):([0-9]{2})$/i;
		$time=str2time(sprintf($fmt,$year,$month,$day,$hour,$minute))-(24*60*60);
	# "mm-dd HH:MM" - e.g. "11-16 01:06" (the current year is assumed)
	} elsif ($tpb_date=~m/^[01][0-9]-[0-3][0-9] [0-2][0-9]:[0-5][0-9]$/i) {
		($month, $day, $hour, $minute)=$tpb_date=~m/^([01][0-9])-([0-3][0-9]) ([0-2][0-9]):([0-5][0-9])$/i;
		$time=str2time(sprintf($fmt,$year,$month,$day,$hour,$minute));
	# "mm-dd YYYY" - e.g. "08-14 2004" (no time of day given, so midnight is assumed)
	} elsif ($tpb_date=~m/^[01][0-9]-[0-3][0-9] [0-9]{4}$/i) {
		($month, $day, $year)=$tpb_date=~m/^([01][0-9])-([0-3][0-9]) ([0-9]{4})$/i;
		$time=str2time(sprintf($fmt,$year,$month,$day,0,0));
	} else {
		$time=0;	# Return the epoch (1970) so we can see it's an error
	}
	return $time;
}

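{	# An RSS item
	package RSSItem;
	sub new {
		my $class=$_[0];
		return bless ({
			'title'=>undef,			# The item's title
			'link'=>undef,			# The torrent's details page
			'description'=>undef,	# Seeders, leechers and size
			'enclosure_url'=>undef,	# The torrent file to put inside of <enclosure>
			'pubdate'=>undef,		# The parsed date string in rfc-2822
		}, $class);
	}
}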

# Escape the HTML code given to this function
# $1: unescaped HTML
# return: escaped HTML
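sub escapeHTML {
	if (!defined ($_[0])) {return ''};
	my $html=$_[0];
	$html=~s/&/&amp;/g;	# Must come first so the entities added below are not escaped again
	$html=~s/</&lt;/g;
	$html=~s/>/&gt;/g;
	$html=~s/"/&quot;/g;
	$html=~s/\xA0/ /g;
	return $html;
}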

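## Get the XML for the item, between (inc.): <item>....</item>
## Generates something like:
##	<title>Item 1 title</title>
##	<link></link>
##	<guid></guid>
##	<enclosure url='' type='application/x-bittorrent' />
##	<pubDate>Mon, 30 Nov 2009 18:59:38 +1100</pubDate>
##	<description><![CDATA[xx item 1 Description xx]]></description>
# $1: rssItem RSSItem
# return: String - XML item data: <item>....</item>
sub get_item_XML {
	my $rssItem=$_[0];
	if (!$rssItem->isa('RSSItem')) {die('get_item_XML() only accepts a RSSItem');}
	# The item's link doubles as its guid here (an assumption)
	return "<item>
	<title>".escapeHTML($rssItem->{'title'})."</title>
	<link>".escapeHTML($rssItem->{'link'})."</link>
	<guid>".escapeHTML($rssItem->{'link'})."</guid>
	<enclosure url=\"".escapeHTML($rssItem->{'enclosure_url'})."\" type=\"application/x-bittorrent\"/>
	<pubDate>".$rssItem->{'pubdate'}."</pubDate>
	<description><![CDATA[".$rssItem->{'description'}."]]></description>
</item>\n";
}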

# Sample data (the line numbers match the comments in the loop below):
# 2	<td class="vertTh">
# 3		<center>
# 4			<a href="/browse/200" title="More from this category">Video</a><br />
# 5			(<a href="/browse/205" title="More from this category">TV shows</a>)
# 6		</center>
# 7	</td>
# 8	<td>
# 9		<div class="detName"><a href="/torrent/6147748/Hawaii_Five_O_1x03_(Final_de_la_2)_(HDiTunes)_(DVB)_By_Hero" class="detLink" title="Details for Hawaii Five O 1x03 (Final de la 2) (HDiTunes) (DVB) By Hero">Hawaii Five O 1x03 (Final de la 2) (HDiTunes) (DVB) By Hero</a></div>
# 10		<a href="" title="Download this torrent"><img src="" class="dl" alt="Download" /></a><a href="magnet:?xt=urn:btih:1fb79a1bfd5a24d43fb13628a18e3690672311d1&dn=Hawaii+Five+O+1x03+%28Final+de+la+2%29+%28HDiTunes%29+%28DVB%29+By+Hero" title="Download this torrent using magnet"><img src="" alt="Magnet link" /></a><img src="" alt="This torrent has 1 comments." title="This torrent has 1 comments." /><a href="/user/hero14"><img src="" alt="Trusted" title="Trusted" style="width:11px;" border=0 /></a>
# 11		<font class="detDesc">Uploaded 02-03&nbsp;18:19, Size 346.79&nbsp;MiB, ULed by <a class="detDesc" href="/user/hero14/" title="Browse hero14">hero14</a></font>
# 12	</td>
# 13	<td align="right">7</td>
# 14	<td align="right">0</td>

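my $now=time2str('%a, %d %b %Y %H:%M:%S %z',time());	# Used as the feed's publication date (an assumption)
print <<RSS_HEADER;
<?xml version="1.0"?>
<rss version="2.0">
	<channel>
		<description>TPB feed</description>
		<pubDate>$now</pubDate>
RSS_HEADER

my $nodelist;	# tmp
foreach my $itemTR ($htdoc->findnodes('//table[@id="searchResult"]/tr')->get_nodelist()) {
	# If this row is not a result row, break - we've reached the end of the result list and are now at the paging list
	if ($itemTR->findnodes('./td[1][@class="vertTh"]')->size()==0) {
		last;
	}

	my $rssItem=RSSItem->new();
	## Categories (line 2-7) - unused for now
	## Title, Link (line 9) - XPath inferred from the sample data above
	$nodelist=$itemTR->findnodes('./td[2]/div[@class="detName"]/a');
	$rssItem->{'title'}=$nodelist->get_node(1)->string_value();
	$rssItem->{'link'}=$nodelist->get_node(1)->attr('href');
	## Enclosure URI (line 10) - XPath inferred from the sample data above
	$nodelist=$itemTR->findnodes('./td[2]/a[@title="Download this torrent"]');
	$rssItem->{'enclosure_url'}=$nodelist->get_node(1)->attr('href');
	## Description and pubdate (line 11, 13, 14)
	my $filesize;	# Line 11
	my $seeders;	# Line 13
	my $leechers;	# Line 14
	my $tmpStr=$itemTR->findnodes('./td/font')->get_node(1)->string_value();	# e.g: "Uploaded 02-03&nbsp;18:19, Size 346.79&nbsp;MiB, ULed by hero14"

	# Size
	($filesize)=$tmpStr=~m/Size ([^,]+)/i;	# "Size 346.79&nbsp;MiB"
	# Seeds
	$seeders=$itemTR->findnodes('./td[3]')->get_node(1)->string_value();
	# Leechers
	$leechers=$itemTR->findnodes('./td[4]')->get_node(1)->string_value();
	## Description - we now have enough information to generate the description :-)
	$rssItem->{'description'}="S: $seeders | L: $leechers\n\nSize: $filesize";
	## Pubdate - Thu, 17 Mar 2011 09:20:55 +1100
	$rssItem->{'pubdate'}=time2str('%a, %d %b %Y %H:%M:%S %z',get_date($tmpStr=~m/Uploaded ([^,]+)/i));
	#print Dumper $rssItem;
	print get_item_XML($rssItem);
}

print <<RSS_FOOTER;
	</channel>
</rss>
RSS_FOOTER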


To use the script, pipe the search results page into it on stdin, then redirect the output to a file:

wget -O - ''|./ > rss.xml
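
If you run this from cron, a failed download would overwrite rss.xml with an empty feed. A small wrapper along the following lines avoids that. It is only a sketch: the function name update_feed, the SEARCH_URL placeholder and the script name tpb2rss.pl are made up for illustration.

```shell
#!/bin/sh
# update_feed URL SCRIPT OUTPUT
# Download URL, convert the page with SCRIPT (read on stdin) and replace
# OUTPUT only if both the download and the conversion succeeded.
update_feed() {
    url=$1
    script=$2
    out=$3
    page=$(mktemp) || return 1
    feed=$(mktemp) || { rm -f "$page"; return 1; }
    if wget -q -O "$page" "$url" && "$script" < "$page" > "$feed" && [ -s "$feed" ]; then
        mv "$feed" "$out"
        status=0
    else
        rm -f "$feed"    # Keep the previous feed untouched
        status=1
    fi
    rm -f "$page"
    return $status
}

# Example call (the URL and the script name are placeholders):
# update_feed 'SEARCH_URL' ./tpb2rss.pl rss.xml
```

The page is saved to a temporary file first so that wget's exit status can be checked on its own; in a plain pipeline only the exit status of the last command is seen.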

Just in case you don’t already know, you can also use this script with Liferea as a feed source filter; it is set in the subscription properties.

Hope you find it useful!


XBindKeys – Fixing broken shift+tab+(key) shortcut in Xubuntu

For some reason, hotkey / application shortcut support for keyboard combinations such as Ctrl+Shift+(key) broke back in Xubuntu 8.10 (and is still broken in 9.10…).

Needless to say, this issue really annoys me, as I used to assign keys like Ctrl+Shift+# (I have a British keyboard) to control Play/Pause in Banshee for one-handed operation.

Anyhow, I came across an application called "XBindKeys" the other day, and with its help I can use those hotkey combinations again!

To install:

aptitude install xbindkeys xbindkeys-config

Then run xbindkeys-config to set up your hotkeys via a GUI.
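
If you prefer to skip the GUI, the same kind of binding can be written by hand. The following is a rough sketch under a few assumptions: that the configuration lives in ~/.xbindkeysrc, that Banshee accepts --toggle-playing, and that the # key is reported as the keysym numbersign.

```shell
#!/bin/sh
# Append a Ctrl+Shift+# binding for Banshee to the XBindKeys configuration.
# Assumptions: the config file is ~/.xbindkeysrc, Banshee understands
# --toggle-playing, and the # key is reported as the keysym "numbersign".
rc="$HOME/.xbindkeysrc"
cat >> "$rc" <<'EOF'

# Play/Pause in Banshee
"banshee --toggle-playing"
  Control+Shift + numbersign
EOF
```

If the binding does not fire, running xbindkeys -k and pressing the key combination shows the exact lines your keyboard produces, which you can paste in place of the ones above. Start xbindkeys afterwards to activate the binding.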

XBindKeys is capable of more (such as extending hotkey support to mouse buttons); please see XBindKeys’ homepage for more information.