OpenOffice, PDF, Microsoft Office and generic documents linear search script

Another problem, another script. This time, it’s a script that performs a linear search through various document types, without having to install something heavyweight like Google Desktop for Linux.

Why did I write this script, you ask? Well, let’s just say it drove me nuts to stare at a directory full of OpenOffice / PDF / MS Office files, knowing that grep can’t do its thing on them.

Here is the script (please feel free to improve it!):

#!/bin/bash

# Script to look inside each file specified and try and match on the given string

set -u

if (($#<2)); then
	# e.g. finddocs -iE "someText" *
	echo -n 'Usage:
	finddocs [-(grep_flags)] (SearchStr|RegExp) file1 [file2] [file3]...

Examples:
	find -iname "*some_partial_name*" -print0|xargs -0 finddocs -i "some text"
	find -iname "*.txt" -print0|xargs -0 finddocs -iE "xx[0-5]"
	find -print0|xargs -0 finddocs -i "kerberos"
'
	exit 1
fi

rtr=""

usrGrepFlags=""
searchStr=""
mimeType=""

# Get grep flags
if [ "${1:0:1}" == "-" ]; then
	usrGrepFlags="${1:1}"	# Append just the flags
	# Strip all non-compatible options - we are appending the options into grep!
	usrGrepFlags="$(echo "$usrGrepFlags"|sed -e 's/[^iEPv]//g')"	# use "v" with caution!
	shift
fi

# Get search string
searchStr="$1"
shift


## Escape the HTML of the search string - only applicable to documents that store their data as XML (e.g. OpenOffice, MS Office 2007+)
# $1: input string
# return: string - escaped string
function escapeHTML() {
	local searchStr="$1"
	
	local search=(	"&"			"<"		">")	# **** Order matters ("&" must come first) because of the sed replace loop below ****
	local replace=(	"\&amp;"	"\&lt;"	"\&gt;")	# "\&" because a bare "&" in a sed replacement means "the matched text"
	
	local i;
	for ((i=0; i<${#search[@]}; i++)); do
		searchStr=$(echo "$searchStr"|sed -e "s/${search[$i]}/${replace[$i]}/g")
	done
	rtr="$searchStr"
}

## Count the lines of a plain file that match the search string
# $1: filename
# return: int - count of matching lines
function findtxt () {
	rtr=$(grep -c"$usrGrepFlags" "$searchStr" "$1")
	# For documents with unicode (e.g. UTF-16) characters: strip the NUL bytes and retry
	if (( $rtr == 0 )); then	# This takes longer to search - only do if we can't find anything!
		rtr=$(sed -e 's/\x00//g' < "$1"|grep -c"$usrGrepFlags" "$searchStr")
	fi
}

## Grep for file within a zip container - OpenOffice / MS Office 2007+
# $1: filename
# $2: path of file with content inside the zip file
# return: int - count of string occurrences
function zipGrep () {
	# Try unzipping the document's content file and see if the string exist
	# e.g.: #count=$(unzip -pa "$i" content.xml 2>/dev/null|grep -c"$usrGrepFlags" "$searchStr")
	escapeHTML "$searchStr"
	local escapedSearchStr="$rtr"
	count=$(unzip -pa "$1" "$2" 2>/dev/null|grep -c"$usrGrepFlags" "$escapedSearchStr")
	rtr=$count
}

# Search using the rest of the file names
for i in "$@"; do
	# If it's a file
	if [ -f "$i" ]; then
		count=0
		mimeType="$(file -bL0 "$i")"	# Note I'm not using the --mime-type / -i - OpenOffice documents shows up as "application/octet-stream"...
		case "$mimeType" in
			"PDF"*)
				count=$(pdftotext -q "$i" - 2>/dev/null|grep -c"$usrGrepFlags" "$searchStr")
			;;
			"OpenDocument"*)
				zipGrep "$i" content.xml
				count=$rtr
			;;
			"ASCII"*)
				findtxt "$i"
				count=$rtr
			;;
			"Bourne"*)
				findtxt "$i"
				count=$rtr
			;;
			# Some MS document gets reported as CDF...
			"CDF"*)
				findtxt "$i"
				count=$rtr
			;;
			"Microsoft"*)
				findtxt "$i"
				count=$rtr
			;;
			"UTF-8"*)
				findtxt "$i"
				count=$rtr
			;;
			"XML"* | "HTML"*)
				findtxt "$i"
				count=$rtr
			;;
			# Office 2007 formats
			"Zip"*)
				fileext=$(basename "$i"|grep -ioE "\.([a-z]{4,})$"|cut -c 2-|tr "[:upper:]" "[:lower:]")
				# Pick the content file(s) to search according to the extension
				case "$fileext" in
					"docx"*)
						zipGrep "$i" "word/document.xml"
						count=$rtr
					;;
					"pptx"*)
						# pptx contents are in /ppt/slides/slide*.xml
						for slide in $(zipinfo -1 "$i"|grep "ppt/slides/slide[0-9]*.xml"); do
							zipGrep "$i" "$slide"
							count=$rtr
							if (( rtr != 0 )); then break; fi
						done
					;;
					"xlsx"*)
						# Strings
						zipGrep "$i" "xl/sharedStrings.xml"
						count=$rtr
						# Try looking into the individual worksheets' formula
						if (( $count == 0 )); then
							# Formulas
							for worksheet in $(zipinfo -1 "$i"|grep "xl/worksheets/sheet[0-9]*.xml"); do
								zipGrep "$i" "$worksheet"
								count=$rtr
								if (( rtr != 0 )); then break; fi
							done
						fi;
					;;
				esac
			;;
			*)
				continue
			;;
		esac
		if (($count>0)); then
			echo "$i"
		fi
	fi
done
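
The entity escaping that escapeHTML() relies on can be tried on its own. The following is only a minimal sketch: the function name escape_xml is made up for this example, and it prints its result instead of using the "rtr" return variable. It shows the two details that matter: "&" has to be replaced first, and a literal "&" in a sed replacement has to be written as "\&".

```shell
#!/bin/bash

# escape_xml STRING
# Hypothetical helper: prints STRING with &, < and > replaced by XML entities.
escape_xml() {
	# "&" goes first, otherwise the "&" of an already inserted "&lt;" would be
	# escaped a second time. In a sed replacement a bare "&" means "the matched
	# text", so the literal ampersand is written as "\&".
	printf '%s\n' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

escape_xml 'R&D <draft>'
```

The result is what you would grep for inside content.xml or word/document.xml, since those files store such characters as entities.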

The Pirate Bay (TPB) search results to RSS conversion script

Update (2011-06-26):
I’ve updated the post below with the Perl version, which works much better than the original bash version.
Just make sure you’ve got the “libhtml-treebuilder-xpath-perl” package installed.

For those of you who are also incredibly annoyed by the fact that The Pirate Bay doesn’t let you create a custom RSS feed from your searches, I’ve got just the script for you:

#!/usr/bin/perl -w
use strict;

use HTML::TreeBuilder::XPath;	# You'll need libhtml-treebuilder-xpath-perl
use Date::Format;
use Date::Parse;
#use Data::Dumper;

my $htdoc=HTML::TreeBuilder::XPath->new();
if ($#ARGV>=0 && -e $ARGV[0]) {
	$htdoc->parse_file($ARGV[0]);
} else {
	my @in=<STDIN>;
	$htdoc->parse_content(@in);
}




## Get the Unix time from the TPB date formats (the caller formats it as rfc-2822)
# $1: String		# Date string in TPB format
# return: integer	# Unix time
sub get_date {
	my $tpb_date=$_[0];
	#print STDERR "DEBUG: ".$tpb_date."\n";
	if (!defined($tpb_date)) {die ('tpb_date given to get_date() is undefined - a date string in a TPB format expected.');}
	
	my $time=time();	# Current epoch time
	my ($second,$minute,$hour,$day,$month,$year)=localtime();
	$month=$month+1;	# Months starts at 0
	
	# Replace &nbsp; / 0xA0 with space
	$tpb_date=~s/&nbsp;|\xA0/ /g;
	
	## Get the item's date string to be converted into a date
	# "M mins ago" - e.g. "7 mins ago" 
	if ($tpb_date=~m/^[0-6]?[0-9] mins ago$/i) {	# In case we get "60 mins ago"...
		my ($min)=$tpb_date=~m/^([0-6]?[0-9]) mins ago$/i;
		$time=$time-($min*60);
	# "Today HH:MM" - e.g. "Today 08:24"
	} elsif ($tpb_date=~m/^Today [0-9]{2}:[0-9]{2}$/i) {
		($hour, $minute)=$tpb_date=~m/^Today ([0-9]{2}):([0-9]{2})$/i;
		$time=str2time(($year+1900)."-$month-$day"."T"."$hour:$minute:00");
	# "Y-day HH:MM" - e.g. "Y-day 16:23"
	} elsif ($tpb_date=~m/^Y-day [0-9]{2}:[0-9]{2}$/i) {
		# Get the time of today at 16:23, then take away 24 hours (in seconds)
		($hour, $minute)=$tpb_date=~m/^Y-day ([0-9]{2}):([0-9]{2})$/i;
		$time=str2time(($year+1900)."-$month-$day"."T"."$hour:$minute:00");
		#  We now have the time in seconds of today at 16:23, take away 24 hours in seconds
		$time=$time-(24*60*60);
	# "mm-dd HH:MM" - e.g. "11-16 01:06"
	} elsif ($tpb_date=~m/^[01][0-9]-[0-3][0-9] [0-2][0-9]:[0-5][0-9]$/i) {
		($month, $day, $hour, $minute)=$tpb_date=~m/^([01][0-9])-([0-3][0-9]) ([0-2][0-9]):([0-5][0-9])$/i;
		$time=str2time(($year+1900)."-$month-$day"."T"."$hour:$minute:00");
	# "mm-dd YYYY" - e.g. "08-14 2004"
	} elsif ($tpb_date=~m/^[01][0-9]-[0-3][0-9] [0-9]{4}$/i) {
		($month, $day, $year)=$tpb_date=~m/^([01][0-9])-([0-3][0-9]) ([0-9]{4})$/i;
		$time=str2time("$year-$month-$day");
	} else {
		$time=0;	# Return the epoch (1970-01-01) so we can see it's an error
	}
	
	return $time;
}





{	# A RSS Item
	package RSSItem;
	sub new {
		my $class=$_[0];
		bless ({
			'title'=>undef,
			'link'=>undef,
			'guid'=>undef,
			'enclosure_url'=>undef,	# The torrent file to put inside of <enclosure>
			'pubdate'=>undef,		# The parsed date string in rfc-2822
			'description'=>undef
		}, $class);
	}
}



# Escape the HTML code given to this function
# $1: unescaped HTML
# return: escaped HTML
sub escapeHTML {
	if (!defined ($_[0])) {return ''};
	my $html=$_[0];
	$html=~s/&/&amp;/ig;
	$html=~s/</&lt;/ig;
	$html=~s/>/&gt;/ig;
	$html=~s/\n/<br\/>/ig;
	$html=~s/\xA0/ /g;
	return $html;
}



## Get the XML for the item, between (inc.): <item>....</item>
## Generates something like:
##<item>
##	<title>Item 1 title</title>
##	<link>http://www.google.com</link>
##	<guid>http://www.google.com</guid>
##	<enclosure url='http://isohunt.com/download/142836069/nightwish.torrent' type='application/x-bittorrent' />
##	<pubDate>Mon, 30 Nov 2009 18:59:38 +1100</pubDate>
##	<description><![CDATA[xx item 1 Description xx]]></description>
##</item>
# $1: rssItem RSSItem
# return: String - XML item data: <item>....</item>
sub get_item_XML {
	my $rssItem=$_[0];
	if (!$rssItem->isa('RSSItem')) {die('get_item_XML() only accepts a RSSItem');}
	return
"<item>
	<title>".escapeHTML($rssItem->{'title'})."</title>
	<link>".escapeHTML($rssItem->{'link'})."</link>
	<guid>".escapeHTML($rssItem->{'guid'})."</guid>
	<enclosure url=\"".escapeHTML($rssItem->{'enclosure_url'})."\" type=\"application/x-bittorrent\"/>
	<pubDate>".escapeHTML($rssItem->{'pubdate'})."</pubDate>
	<description><![CDATA[".escapeHTML($rssItem->{'description'})."]]></description>
</item>\n";
}







=rem
# Sample data (line number is aligned with the loop below):
1<tr>
2	<td class="vertTh">
3		<center>
4			<a href="/browse/200" title="More from this category">Video</a><br />
5			(<a href="/browse/205" title="More from this category">TV shows</a>)
6		</center>
7	</td>
8	<td>
9		<div class="detName"><a href="/torrent/6147748/Hawaii_Five_O_1x03_(Final_de_la_2)_(HDiTunes)_(DVB)_By_Hero" class="detLink" title="Details for Hawaii Five O 1x03 (Final de la 2) (HDiTunes) (DVB) By Hero">Hawaii Five O 1x03 (Final de la 2) (HDiTunes) (DVB) By Hero</a></div>
10		<a href="http://torrents.thepiratebay.org/6147748/Hawaii_Five_O_1x03_(Final_de_la_2)_(HDiTunes)_(DVB)_By_Hero.6147748.TPB.torrent" title="Download this torrent"><img src="http://static.thepiratebay.org/img/dl.gif" class="dl" alt="Download" /></a><a href="magnet:?xt=urn:btih:1fb79a1bfd5a24d43fb13628a18e3690672311d1&dn=Hawaii+Five+O+1x03+%28Final+de+la+2%29+%28HDiTunes%29+%28DVB%29+By+Hero" title="Download this torrent using magnet"><img src="http://static.thepiratebay.org/img/icon-magnet.gif" alt="Magnet link" /></a><img src="http://static.thepiratebay.org/img/icon_comment.gif" alt="This torrent has 1 comments." title="This torrent has 1 comments." /><a href="/user/hero14"><img src="http://static.thepiratebay.org/img/trusted.png" alt="Trusted" title="Trusted" style="width:11px;" border=0 /></a>
11		<font class="detDesc">Uploaded 02-03&nbsp;18:19, Size 346.79&nbsp;MiB, ULed by <a class="detDesc" href="/user/hero14/" title="Browse hero14">hero14</a></font>
12	</td>
13	<td align="right">7</td>
14	<td align="right">0</td>
15</tr>
=cut

my $now=time2str('%a, %d %b %Y %H:%M:%S %z',time());
print <<RSS_HEADER;
<?xml version="1.0"?>
<rss version="2.0">
	<channel>
		<title>TPB</title>
		<link>http://thepiratebay.org</link>
		<description>TPB feed</description>
		<lastBuildDate>$now</lastBuildDate>
RSS_HEADER

my $nodelist;	# tmp
my $tmpStr;
foreach my $itemTR ($htdoc->findnodes('//table[@id="searchResult"]/tr')->get_nodelist()) {
	# If this row is not a result row, break - we've reached the end of the result list and now at the paging list
	if ($itemTR->findnodes('./td[1][@class="vertTh"]')->size()==0) {
		last;
	}

	my $rssItem=RSSItem->new();
	## Categories (line 2-7) - unused for now
	#$nodelist=$itemTR->findnodes('./td/center/a');
	#$nodelist->get_node(1)->string_value()."\n";
	#$nodelist->get_node(2)->string_value()."\n";
	
	## Title, Link (line 9)
	$nodelist=$itemTR->findnodes('./td/div/a');
	$rssItem->{'title'}=$nodelist->get_node(1)->string_value();
	$rssItem->{'link'}='http://thepiratebay.org'.$nodelist->get_node(1)->attr('href');
	$rssItem->{'guid'}=$rssItem->{'link'};
	
	## Enclosure URI (line 10)
	$nodelist=$itemTR->findnodes('./td/a[1]');
	$rssItem->{'enclosure_url'}=$nodelist->get_node(1)->attr('href');
	
	## Description and pubdate (line 11, 13, 14)
	my $filesize;	# Line 11
	my $seeders;	# Line 13
	my $leechers;	# Line 14
	my $tmpStr=$itemTR->findnodes('./td/font')->get_node(1)->string_value();	# e.g: "Uploaded 02-03&nbsp;18:19, Size 346.79&nbsp;MiB, ULed by hero14"

	# Size
	($filesize)=$tmpStr=~m/Size ([^,]+)/i;	# "Size 346.79&nbsp;MiB"
	# Seeds
	$seeders=$itemTR->findnodes('./td')->get_node(3)->string_value();
	# Leechers
	$leechers=$itemTR->findnodes('./td')->get_node(4)->string_value();
	## Description - we now have enough information to generate the description :-)
	$rssItem->{'description'}="S: $seeders | L: $leechers\n\nSize: $filesize";
	
	## Pubdate - Thu, 17 Mar 2011 09:20:55 +1100
	$rssItem->{'pubdate'}=time2str('%a, %d %b %Y %H:%M:%S %z',get_date($tmpStr=~m/Uploaded ([^,]+)/i));
	
	#print Dumper $rssItem;
	print get_item_XML($rssItem);
}

print <<RSS_FOOTER;
	</channel>
</rss>
RSS_FOOTER


$htdoc->delete();

To use the script, simply throw the page in via stdin, then redirect the output to a file:

wget -O - 'http://thepiratebay.org/search/lie.to.me/0/3/0'|./thepiratebay.org.pl > rss.xml
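
If a feed reader polls rss.xml while it is being rewritten, it may see a half-written or empty file. One way around that is to write to a temporary file and only swap it in on success. This is just a sketch under my own assumptions: refresh_feed and fetch_feed are hypothetical names, and mktemp is assumed to accept a template argument (as the GNU version does).

```shell
#!/bin/bash

# refresh_feed OUT_FILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, captures its stdout in a temporary file
# next to OUT_FILE, and replaces OUT_FILE only if COMMAND succeeded and
# produced some output. On failure the old OUT_FILE is left untouched.
refresh_feed() {
	local out_file="$1"
	shift
	local tmp
	tmp="$(mktemp "${out_file}.XXXXXX")" || return 1
	if "$@" > "$tmp" && [ -s "$tmp" ]; then
		mv -f "$tmp" "$out_file"	# note: keeps mktemp's owner-only permissions
	else
		rm -f "$tmp"
		return 1
	fi
}

# Example (commented out because it needs network access and the Perl script).
# Without "set -o pipefail" only the exit status of the last command counts.
# fetch_feed() { wget -q -O - 'http://thepiratebay.org/search/lie.to.me/0/3/0' | ./thepiratebay.org.pl; }
# refresh_feed rss.xml fetch_feed
```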

Just in case you don’t already know, you can use this script with Liferea as a feed source filter, which is set in the subscription properties.

Hope you find it useful!

Returning an array from a Bash function

For those of you who are still learning Bash (including me…), I’m sure one of the things you would have asked yourself is "How on earth do I return an array from a bash function?".

Well I’ve written a small script that will explain this:

(Please see update below)

#!/bin/bash
IFS=$'\n\t'

function fnGo () {
	array=(
		a	s	d	f	
		"gh ij"	"kl mn"
	)
	echo "${array[*]}"
}

# -------- out - String variable --------
out=$(fnGo)
echo "\"out\" isn't an array: ${out[1]} - nothing"

echo $'\n'"\"out\" Works with iteration:"
for item in $out; do
	echo "item:"$'\t'"$item"
done

# -------- out2 - An array --------
out2=($(fnGo))
echo $'\n'"\"out2\" now an array:"
for ((i=0; i<${#out2[*]}; i++)); do
	echo "item $i:"$'\t'"${out2[i]}"
done

echo $'\n'"Though \"out2\" cannot be iterated anymore...:"
for item in $out2; do
	echo "item:"$'\t'"$item"
done


Basically, it’s exactly as you’d do for returning a single value from a function (use echo) – but, you need to make sure you surround the variable with quotes (in the function – echo "${array[*]}"), and receive it as an array – out2=($(fnGo)).


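Note that the plain iterator method (for item in $out) only works on the string, and the addressing method (${out2[i]}) only works on the array. Once out2 is an array, a bare $out2 expands to its first element only, so the last loop prints a single item; run the script and you’ll see what I mean. To iterate over the array itself you have to expand it as "${out2[@]}".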
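
If you want both indexing and iteration from the same variable, one option is to read the function's output into an array and expand it as "${out2[@]}". A minimal sketch, assuming bash 4 or newer (for mapfile) and elements that contain no newlines; it prints one element per line with printf rather than relying on IFS:

```shell
#!/bin/bash

fnGo() {
	local array=(a s d f "gh ij" "kl mn")
	printf '%s\n' "${array[@]}"	# one element per line
}

# Read each output line into one array element (-t drops the trailing newline)
mapfile -t out2 < <(fnGo)

echo "element 4: ${out2[4]}"	# addressing by index

for item in "${out2[@]}"; do	# iterating over every element
	echo "item: $item"
done
```

The quotes around "${out2[@]}" are what keep "gh ij" together as a single item.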

Oh, one more thing (just as a tip for those who don’t already know): pay attention to your IFS variable, which determines how unquoted expansions are split into words! This is especially important if you’re taking in quoted (escaped) command-line parameters that may have a space in them (such as file names). In that case, I normally use "\n".
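
To see what IFS changes, here is a small sketch that counts how many words an unquoted expansion produces under two different IFS values. The helper count_fields and the file names are made up for the example.

```shell
#!/bin/bash

# count_fields IFS_VALUE STRING
# Hypothetical helper: prints how many words STRING splits into when the
# given IFS_VALUE is in effect.
count_fields() {
	local IFS="$1"
	local -a fields
	fields=($2)	# unquoted on purpose, so that IFS splitting happens
	echo "${#fields[@]}"
}

list=$'my file.txt\nother file.pdf'

count_fields $' \t\n' "$list"	# default IFS: the spaces split the names too
count_fields $'\n' "$list"	# newline only: one word per file name
```

With the default IFS the two file names fall apart into four words; with a newline-only IFS they stay as two.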

--- Update (2009-10-03 @ 10:35:27) ---

Ok, having written bash scripts for a little while now, I have found a better way of doing this by using a special "return" variable. The old method runs the function in a subshell via $(...), so anything it echoes is captured instead of shown on the console, and an "exit 1" only leaves the subshell; the following has neither problem:

#!/bin/bash
IFS=$'\n\t'

rtr=""

function fnGo () {
	local array=()	# You can use array="" - it won't make any difference
	echo "fnGo() called"
	array=(
		a	s	d	f	
		"gh ij"	"kl mn"
	)
	rtr=("${array[@]}")	# Quoted so that each element is copied as-is, whatever IFS is
}

# -------- out - String variable --------
fnGo
out="${rtr[*]}"
echo "\"out\" isn't an array: "'${out[1]}'"=\"${out[1]}\" - nothing"

echo $'\n'"\"out\" Works with iteration:"
for item in $out; do
	echo "item:"$'\t'"$item"
done

echo -e "\n"

# -------- out2 - An array --------
fnGo
out2=("${rtr[@]}")
echo -e "\"out2\" now an array:"
for ((i=0; i<${#out2[@]}; i++)); do
	echo "item $i:"$'\t'"${out2[i]}"
done

echo $'\n'"Though \"out2\" cannot be iterated anymore...:"
for item in $out2; do
	echo "item:"$'\t'"$item"
done
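
A different technique from the shared "rtr" variable, for what it's worth: newer bash versions (4.3 and up) have namerefs, which let the caller say which variable should receive the array. This is only a sketch of that alternative, and the names fill_array, dest_ref and result are made up for it.

```shell
#!/bin/bash

# fill_array VARNAME
# Hypothetical example: stores the array in the caller's variable VARNAME.
fill_array() {
	local -n dest_ref="$1"	# dest_ref is now an alias for the caller's variable
	echo "fill_array() called"	# still reaches the console, no subshell involved
	dest_ref=(a s d f "gh ij" "kl mn")
}

declare -a result
fill_array result
echo "${#result[@]} elements, last one: ${result[5]}"
```

One thing to watch: the caller's variable must not have the same name as the nameref itself (dest_ref here), or bash complains about a circular name reference.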

You may also want to check out this article for the difference in the use of the [*] and [@] for enumeration.
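
In short, and as a quick sketch you can run yourself (count_args is a throwaway helper for this example): a quoted "${arr[@]}" expands to one word per element, while a quoted "${arr[*]}" expands to a single word with the elements joined by the first character of IFS.

```shell
#!/bin/bash

count_args() { echo "$#"; }	# prints how many arguments it received

arr=("gh ij" "kl mn")

count_args "${arr[@]}"	# one argument per element
count_args "${arr[*]}"	# all elements joined into a single argument

# The join character is the first character of IFS (set in a subshell here)
(IFS=,; echo "${arr[*]}")
```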