Using this package, I wrote a simple script that extracts pictures from simple HTML pages. While trying it on many websites, I noticed that some of them did not let my script extract any pictures.
I then found that the cause was the User-Agent string in the HTTP request: some websites do not respond properly to requests that carry an unknown User-Agent string. I used urllib's FancyURLopener to change the User-Agent to that of Firefox.
Now that the script works on most websites, I would like to post the code here. Feel free to modify it to make it more robust and reliable, and post a comment or a link to your modified script.
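For readers on Python 3, where FancyURLopener is deprecated, the same idea can be sketched with urllib.request.Request. This is only a minimal illustration, not the code used in the script: the function name build_request and the example.com URL are made up for the example, and the User-Agent value is simply a browser-like string.

```python
from urllib.request import Request

# A browser-like User-Agent string; any similar value could be substituted.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.6) "
    "Gecko/2009011913 Firefox/3.0.6"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request object that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# Hypothetical URL, used only to show the header being set (no network access).
request = build_request("http://example.com/gallery.html")
print(request.get_header("User-agent"))
```

Passing the resulting object to urllib.request.urlopen would send the request with that header. Note that Request stores header names capitalized as "User-agent", which is why the lookup uses that spelling.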
This script searches a web page for links to pictures (.jpg or .jpeg) and downloads them for you.
All you need to do is get the URL of the web page whose pictures you want and pass it as
an argument to this script. Here is how to use it:
0. Make sure Python 2 is installed on your system; the script itself is OS independent (Windows/Linux/Solaris ... ;)
1. Download this script and save it as gal_ext.py
2. Open a command prompt (on Windows) or a terminal (on any *nix OS)
3. Type: python gal_ext.py 'http://URL'
4. Note that the 'http://' prefix is mandatory
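Since the 'http://' prefix is mandatory, it can help to check the argument before doing anything else. Here is a small Python 3 sketch of such a check. The helper name has_http_scheme is hypothetical, and accepting https as well is an assumption of this sketch; the script itself only tests for http://.

```python
from urllib.parse import urlparse


def has_http_scheme(url):
    """Return True if the URL has an explicit http/https scheme and a host."""
    parts = urlparse(url)
    # Without a scheme, urlparse puts everything into .path and leaves
    # .scheme and .netloc empty, so both checks fail for 'example.com/a'.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


for candidate in ("http://example.com/gallery.html", "example.com/gallery.html"):
    print(candidate, has_http_scheme(candidate))
```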
With this script you can avoid saving images one by one in a web browser, where right-clicking on every image takes a long time and is annoying. Downloading pictures from the web this way is faster and easier. Since this is a script, it is open to modification and free to use. If you do modify it, please send me your version ;). Thanks
The code is here (or a link for you to download):
from sgmllib import SGMLParser
import sys
import os
import re
import urllib
from urllib import FancyURLopener
from urlparse import urlparse
class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

class MyURLOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6'

def url_mapper(var):
    # Map an image URL to a local path of the form <site>/<directory>/<file>.
    # Returns '' when the URL path has no directory/file pair.
    regexp = re.compile('.+?(http://.+&?)', re.I)
    values = urlparse(var)
    # If the query string embeds another http:// URL, use that as the real image URL.
    if len(values) >= 5 and values[4] != '':
        obj = regexp.search(values[4])
        if obj:
            values = urlparse(obj.group(1))
        else:
            print('Could not get real url from:' + values[4])
    site = re.sub('[^0-9a-zA-Z_/]', '_', values[1])
    location_file = re.findall('(.+)/(.+)', values[2])
    try:
        if len(location_file[0]) != 2:
            return ''
    except IndexError:
        return ''
    location = location_file[0][0]
    file = location_file[0][1]
    # Collapse '../' sequences and replace unsafe characters in the directory part.
    location = re.sub('\.{2,}/', '/', location)
    location = re.sub('[^0-9a-zA-Z_/]', '_', location)
    location = location + '/' + file
    return site + location

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "usage: python gal_ext.py 'http://URL'"
        sys.exit(1)
    url_opener = MyURLOpener()
    parser = URLLister()
    sys.argv[1] = urllib.unquote(sys.argv[1])
    try:
        usock = url_opener.open(sys.argv[1])
    except IOError:
        print 'skipping ' + sys.argv[1]
        sys.exit(0)
    parser.feed(usock.read())
    parser.close()
    usock.close()
    count = 0
    urlfs = urlparse(sys.argv[1])
    print "Url fs: ", urlfs
    # Host plus directory of the page, used as the base for relative links.
    parent = urlfs[1]
    try:
        parent += urlfs[2]
        parent = re.search('(.+)/.+',parent).group(1)
    except (IndexError, AttributeError):
        # re.search found no directory part (it returned None); use the host alone.
        parent = urlfs[1]
    for img_url in parser.urls:
        if re.search('\.jpe?g$', img_url):
            print "looking at: " + img_url
            if not re.match('^http://', img_url):
                img_url = 'http://' + parent + '/' + img_url
            loc = url_mapper(img_url)
            # Skip images that were already downloaded.
            if os.path.exists(loc): continue
            url_opener1 = MyURLOpener()
            retrieve = url_opener1.retrieve
            if loc == '':
                print "\turl_mapper returned NULL for " + img_url
                continue
            loc_dir = re.match('(.+)/.+', loc).group(1)
            try:
                print('\ttrying to save ' + img_url)
                if not os.path.exists(loc_dir):
                    os.makedirs(loc_dir)
                retrieve(img_url, loc)
            except IOError:
                print('\tSkipping saving ' + img_url)
                continue
            count += 1
            print "\tImg fetched: ",count
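The script above targets Python 2: sgmllib was removed in Python 3 and urlparse became urllib.parse. As a rough Python 3 sketch of the link-collecting step only (not a port of the whole script), the standard html.parser and urljoin can be combined as below. The names LinkLister and jpeg_links, the sample HTML and the example.com URLs are all made up for illustration, and using urljoin to resolve relative links is a swapped-in technique, not what the script does.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkLister(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


def jpeg_links(html, page_url):
    """Return absolute URLs of all links in html that end in .jpg or .jpeg."""
    parser = LinkLister()
    parser.feed(html)
    parser.close()
    # urljoin resolves relative links against the page URL and leaves
    # absolute links untouched.
    return [urljoin(page_url, u) for u in parser.urls
            if re.search(r"\.jpe?g$", u)]


# Hypothetical page content and URL, for illustration only.
sample_html = (
    '<a href="pics/a.jpg">a</a>'
    '<a href="/b.jpeg">b</a>'
    '<a href="http://other.example/c.jpg">c</a>'
    '<a href="page.html">not a picture</a>'
)
for link in jpeg_links(sample_html, "http://example.com/gallery/index.html"):
    print(link)
```

Compared with prefixing 'http://' and the parent directory by hand, urljoin also copes with links that start with '/' or contain '../'.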



2 comments:
Could you do this with Perl?
Can you upload the same?
Why do you want this in Perl?
Python is better.