
Thursday, June 18, 2009

Extract or Reap or Download Pictures from HTML Web Pages

I was exploring Python and its HTML parsers when I came across a page that explained SGMLParser and its uses.

Using this module I wrote a simple script that extracts pictures from simple HTML pages. While trying it on many websites, I noticed that some of them did not let my script reap or extract pictures.

I then found that the cause was the User-Agent string in the HTTP request. Some websites do not respond properly to HTTP requests that carry an unknown User-Agent string. So I used urllib's FancyURLopener to change the User-Agent to that of Firefox.
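For readers on Python 3, where FancyURLopener is deprecated, the same idea can be sketched with urllib.request. This is only an illustration and not the script from this post: the helper names build_request and fetch are made up here, and the User-Agent value is simply the Firefox string used in the script below.

```python
import urllib.request

# Browser-like identification string (same value as in the post's script).
USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.6) "
              "Gecko/2009011913 Firefox/3.0.6")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download url and return the raw bytes of the response body."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read()
```

Calling fetch with a page URL should then send the Firefox string instead of the default Python-urllib one; whether a given site accepts it still depends on the site.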

Now that the script works on most websites, I would like to post the code here. Feel free to modify it to make it more robust and reliable, and post a comment or a link to your modified script.

The code is here (or a link for you to download):



# Python 2 script: sgmllib and urlparse are Python 2 modules.
from sgmllib import SGMLParser
import sys
import os
import re
import urllib
from urllib import FancyURLopener
from urlparse import urlparse


class URLLister(SGMLParser):
    """Collects the href value of every <a> tag in the page."""

    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)


class MyURLOpener(FancyURLopener):
    # Pretend to be Firefox; some sites reject unknown User-Agent strings.
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6'


def url_mapper(var):
    """Map an image URL to a local path of the form site/location/file.

    Returns '' when the URL path has no directory/file pair.
    """
    regexp = re.compile('.+?(http://.+&?)', re.I)
    values = urlparse(var)
    # If the query string embeds another URL, treat that one as the real URL.
    if len(values) >= 5 and values[4] != '':
        obj = regexp.search(values[4])
        if obj:
            values = urlparse(obj.group(1))
        else:
            print 'Could not get real url from:' + values[4]
    site = re.sub('[^0-9a-zA-Z_/]', '_', values[1])
    location_file = re.findall('(.+)/(.+)', values[2])
    try:
        if len(location_file[0]) != 2:
            return ''
    except IndexError:
        return ''
    location = location_file[0][0]
    file = location_file[0][1]
    # Drop '../' sequences and sanitise the directory part.
    location = re.sub(r'\.{2,}/', '/', location)
    location = re.sub('[^0-9a-zA-Z_/]', '_', location)
    location = location + '/' + file
    return site + location


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print 'usage: ' + sys.argv[0] + ' <page-url>'
        sys.exit(1)
    url_opener = MyURLOpener()
    parser = URLLister()
    sys.argv[1] = urllib.unquote(sys.argv[1])
    try:
        usock = url_opener.open(sys.argv[1])
    except IOError:
        print 'skipping ' + sys.argv[1]
        sys.exit(0)
    parser.feed(usock.read())
    parser.close()
    usock.close()
    count = 0
    urlfs = urlparse(sys.argv[1])
    print "Url fs: ", urlfs
    # parent is host + directory of the page, used to resolve relative links.
    parent = urlfs[1]
    try:
        parent += urlfs[2]
        parent = re.search('(.+)/.+', parent).group(1)
    except AttributeError:
        # re.search returned None: the path has no file part after the last '/'.
        parent = urlfs[1]
    for img_url in parser.urls:
        # Only links that end in .jpg or .jpeg are fetched.
        if re.search(r'\.jpe?g$', img_url):
            print "looking at: " + img_url
            if not re.match('^http://', img_url):
                img_url = 'http://' + parent + '/' + img_url
            loc = url_mapper(img_url)
            if loc == '':
                print "\turl_mapper returned NULL for " + img_url
                continue
            if os.path.exists(loc):
                continue  # already downloaded
            loc_dir = re.match('(.+)/.+', loc).group(1)
            try:
                print '\ttrying to save ' + img_url
                if not os.path.exists(loc_dir):
                    os.makedirs(loc_dir)
                url_opener.retrieve(img_url, loc)
            except IOError:
                print '\tSkipping saving ' + img_url
                continue
            count += 1
            print "\tImg fetched: ", count
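
Since sgmllib no longer exists in Python 3, here is a rough sketch of just the link-collecting step using html.parser, with urljoin resolving relative links instead of the string concatenation used above. The names JpegLinkCollector and jpeg_links are hypothetical, and this is an alternative illustration rather than a port of the whole script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class JpegLinkCollector(HTMLParser):
    """Collect href values of <a> tags that end in .jpg or .jpeg."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page being parsed
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith((".jpg", ".jpeg")):
                # urljoin turns relative links into absolute URLs.
                self.urls.append(urljoin(self.base_url, value))


def jpeg_links(html, base_url):
    """Return the absolute URLs of all JPEG links found in html."""
    parser = JpegLinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

The returned list could then be fed to a download loop like the one in the script above.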

Thursday, May 28, 2009

Installing andLinux on Windows Vista

After downloading the andLinux installer (KDE version), launch it as Administrator and follow the steps, keeping the default options.

Reboot/Restart machine after installation.

Click on Unblock if the Firewall prompts.

After the reboot, if you click on the Konsole launcher (or any other launcher) in the KDE Menu from the system tray, you are likely to get this error: "could not launch ... could not connect to 192.168.11.150"

This guide explains how to fix that issue so that the launchers work on Windows Vista.

First, go to the "Network and Sharing Center" from the Control Panel. You will see "customize" and "view status" links on the "Unidentified Network". Click on "view status", click on properties, and double-click "Internet Protocol Version 4". That opens a dialog where you can change the IP address and the subnet mask: change the IP to 192.168.10.1 and the subnet mask to 255.255.254.0, then click OK on each dialog to close it.

Secondly, open the "andLinux Console", get sudo access, and edit the file /etc/network/interfaces.
After the edit, the file should have the IP 192.168.10.150 and the mask 255.255.254.0 under eth1.

Thirdly, edit the file /etc/profile so that the IPs in it are set to 192.168.10.150.

Fourthly, navigate to the andLinux installation folder, go to the Xming folder, edit X0.hosts, and add the IPs 192.168.10.150 and 192.168.10.1.

Lastly, make a few registry changes: run regedit and navigate to HKEY_LOCAL_MACHINE\SOFTWARE\andLinux\Launcher
Here you will find two keys, "IP" and "Port"; set them to 192.168.10.150 and 2081 respectively.
(Remember to click on Decimal while editing the Port value.) Then close regedit.

Finally, kill KDE Menu and Xming, restart the andLinux service, and then start Xming and KDE Menu again.
The launchers should now work fine.

If clicking on a launcher still does not start it, open the andLinux Console application and execute:
"/etc/init.d/networking restart". After this, any launcher should work fine.
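
If the launchers still fail, it can help to check two things from the Windows side: that the two addresses used above really fall in one subnet under the 255.255.254.0 mask, and that something answers on the launcher port. The sketch below assumes Python 3 is installed on the Windows host; same_subnet and can_connect are hypothetical helper names, not part of andLinux.

```python
import ipaddress
import socket


def same_subnet(addr_a, addr_b, netmask):
    """Return True if both IPv4 addresses belong to the same network."""
    # strict=False lets a host address (not a network address) define the network.
    network = ipaddress.ip_network(f"{addr_a}/{netmask}", strict=False)
    return ipaddress.ip_address(addr_b) in network


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the values in this guide, same_subnet("192.168.10.1", "192.168.10.150", "255.255.254.0") checks the addressing, and can_connect("192.168.10.150", 2081) probes the port configured in the registry step.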

!Enjoy!

Saturday, May 9, 2009

Parsing and Storing parts or paragraphs of log file

We often need to parse and store parts of a log file. Take an error log, for example, from which you would like to extract all the errors. If each error spans only one line, you could simply use grep.

But if the errors span multiple lines, you need to put a little logic into your script. There are many ways to do this; the script I post here aims to be readable and easy to adapt to other requirements.

Link to the code
Link to a sample input

In the example script I parse an error log (named 'input.txt'). The script stores every line from one that starts with "SQL0204N" up to the next empty line. This way it extracts all such errors, even when an error message spans multiple lines.
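
The linked script is not reproduced here, so the following is only a minimal Python 3 sketch of the same idea, written under the assumptions stated in the paragraph above. The function name extract_blocks is made up, and it accepts any iterable of lines, such as an open file.

```python
def extract_blocks(lines, start_marker="SQL0204N"):
    """Return a list of blocks; each block is the list of lines from a line
    starting with start_marker up to (not including) the next empty line."""
    blocks = []
    current = None  # lines of the block being collected, or None if outside one
    for raw in lines:
        line = raw.rstrip("\n")
        if current is None:
            if line.startswith(start_marker):
                current = [line]
        elif line.strip() == "":
            # An empty line closes the current block.
            blocks.append(current)
            current = None
        else:
            current.append(line)
    if current is not None:
        # The file ended before an empty line was seen.
        blocks.append(current)
    return blocks
```

With the log from this post it could be used as `with open('input.txt') as f: errors = extract_blocks(f)`, after which each entry of errors holds one complete error message.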