The following code is taken from Core Python Programming and has been debugged by me; it works as-is. The main points are the use of the urllib library and string handling. Run it from the command line or in IDLE.
Drawbacks and possible improvements:
1. It does not normalize the URL the user types in. For example, http://falcon.sinaapp.com/ works, but without the trailing path after the domain (the "/"), e.g. http://falcon.sinaapp.com, it raises an error when creating the directory because the path is then empty (first sketch below).
2. It is single-threaded; multithreading should be considered (second sketch below).
3. self.seen in the Crawler instance stores the links that have already been downloaded; with enough links, I am not sure it will hold up. There is another problem: if the crawl is not finished in one go, say the machine is shut down after fetching half the links and the crawl is restarted later, the earlier work is repeated and the same content is downloaded again. Persisting the record of crawled links is also something to consider (third sketch below).
4. Logging: needless to say, printing everything to the screen is impractical (fourth sketch below).
5. Illegal strings need to be checked when creating directories and files: a string like xbaxecxc2xc3xb6xafxc2xfe@wt4.hltm.cc:6111 is not a legal file name on Windows (fifth sketch below).
6. Link checking is not rigorous; for instance, javascript: and "#" values in href attributes are also treated as links (sixth sketch below).
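
For point 1, a minimal normalization sketch. The helper name normalize is my own and not part of the book's code: it makes sure the URL carries at least a "/" path before it reaches Crawler, so that http://falcon.sinaapp.com becomes http://falcon.sinaapp.com/ and Retriever.filename() never ends up calling makedirs() with an empty directory name.

from urlparse import urlparse, urlunparse

def normalize(url):
    # Ensure the URL has at least a '/' path component.
    scheme, netloc, path, params, query, frag = urlparse(url, 'http')
    if not path:
        path = '/'
    return urlunparse((scheme, netloc, path, params, query, frag))

# e.g. in main():  url = normalize(url)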
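
For point 2, a minimal multithreading sketch, assuming the Retriever class defined below is in the same file. Several worker threads pull URLs from a shared Queue, and a Lock protects the shared seen set. Joining relative links and filtering by domain are omitted for brevity, so this is only a skeleton of the idea, not a drop-in replacement for Crawler.

from Queue import Queue, Empty
from threading import Thread, Lock

todo = Queue()          # URLs waiting to be fetched
seen = set()            # URLs already fetched
seen_lock = Lock()

def worker():
    while True:
        try:
            url = todo.get(timeout=5)   # give up after 5 idle seconds
        except Empty:
            return
        with seen_lock:
            if url in seen:
                continue
            seen.add(url)
        r = Retriever(url)              # Retriever as defined below
        retval = r.download()
        if retval[0] != '*':            # same error convention as download()
            for link in r.parseAndGetLinks():
                todo.put(link)

def crawl(start_url, nthreads=4):
    todo.put(start_url)
    threads = [Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()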
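
For point 3, a minimal persistence sketch; the file name seen.pkl is my own choice. The idea is to load the set of already-downloaded links at start-up and save it after each page, so an interrupted crawl can resume without re-downloading. A set is also a better container than a list here, since membership tests stay fast as the number of links grows.

import os
import cPickle as pickle

SEEN_FILE = 'seen.pkl'

def load_seen():
    # Called once when the Crawler is created.
    if os.path.exists(SEEN_FILE):
        return pickle.load(open(SEEN_FILE, 'rb'))
    return set()

def save_seen(seen):
    # Called after each successful download (or on exit).
    pickle.dump(seen, open(SEEN_FILE, 'wb'))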
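
For point 4, a minimal logging sketch using the standard logging module in place of the print statements; the log file name, level and format are my own choices, and the two calls shown are only examples of what getPage() would emit.

import logging

logging.basicConfig(filename='crawler.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

# Inside getPage(), the print statements would become calls such as:
logging.info('( %d ) URL: %s  FILE: %s', 1, 'http://falcon.sinaapp.com/',
             'falcon.sinaapp.com/index.htm')
logging.info('discarded, not in domain: %s', 'http://example.com/')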
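
For point 5, a minimal sanitizing sketch; the helper name safe_component is my own. It replaces the characters Windows forbids in file names, such as the ":" in @wt4.hltm.cc:6111, and would be applied to each path component before filename() calls makedirs().

import re

_ILLEGAL = re.compile(r'[\\/:*?"<>|]')  # characters Windows rejects in names

def safe_component(name):
    # e.g. 'wt4.hltm.cc:6111' -> 'wt4.hltm.cc_6111'
    return _ILLEGAL.sub('_', name)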
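
For point 6, a minimal filter sketch; the helper name is my own. It drops javascript: pseudo-links and pure "#" fragment anchors before they are queued; mailto: links are rejected by the same scheme check.

from urlparse import urlparse

def is_crawlable(link):
    if link.startswith('#') or link.lower().startswith('javascript:'):
        return False
    scheme = urlparse(link)[0]          # e.g. 'mailto' is filtered out here
    return scheme in ('', 'http', 'https')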
from sys import argv
from os import makedirs, unlink, sep
from os.path import dirname, exists, isdir, splitext
from string import replace, find, lower
from htmllib import HTMLParser
from urllib import urlretrieve
from urlparse import urlparse, urljoin
from formatter import DumbWriter, AbstractFormatter
from cStringIO import StringIO
class Retriever(object):                # download Web pages
    def __init__(self, url):
        self.url = url
        self.file = self.filename(url)

    def filename(self, url, deffile='index.htm'):
        parsedurl = urlparse(url, 'http:', 0)   # parse URL
        path = parsedurl[1] + parsedurl[2]      # netloc + path
        ext = splitext(path)
        if ext[1] == '':                # no file, use default
            if path[-1] == '/':
                path += deffile
            else:
                path += '/' + deffile
        ldir = dirname(path)            # local directory
        if sep != '/':                  # OS-independent path separator
            ldir = replace(ldir, '/', sep)
        if not isdir(ldir):             # create archive dir if necessary
            if exists(ldir): unlink(ldir)
            makedirs(ldir)
        return path

    def download(self):                 # download Web page
        try:
            retval = urlretrieve(self.url, self.file)
        except IOError:
            retval = '*** ERROR: invalid URL "%s"' % self.url
        return retval

    def parseAndGetLinks(self):         # parse HTML, save links
        self.parser = HTMLParser(AbstractFormatter(
            DumbWriter(StringIO())))
        self.parser.feed(open(self.file).read())
        self.parser.close()
        return self.parser.anchorlist
class Crawler(object):                  # manage entire crawling process
    count = 0                           # static downloaded page counter

    def __init__(self, url):
        self.q = [url]
        self.seen = []
        self.dom = urlparse(url)[1]     # domain of the starting URL

    def getPage(self, url):
        r = Retriever(url)
        retval = r.download()
        if retval[0] == '*':            # error situation, do not parse
            print retval, '... skipping parser'
            return
        Crawler.count += 1              # self.__class__.count += 1 may be better
        print '\n(', Crawler.count, ')'
        print 'URL:', url
        print 'FILE:', retval[0]
        self.seen.append(url)

        links = r.parseAndGetLinks()    # get and process links
        for eachLink in links:
            if eachLink[:4] != 'http' and \
                    find(eachLink, '://') == -1:
                eachLink = urljoin(url, eachLink)
            print '*', eachLink,

            if find(lower(eachLink), 'mailto:') != -1:
                print '... discarded, mailto link'
                continue

            if eachLink not in self.seen:
                if find(eachLink, self.dom) == -1:
                    print '... discarded, not in domain'
                else:
                    if eachLink not in self.q:
                        self.q.append(eachLink)
                        print '... new, added to Q'
                    else:
                        print '... discarded, already in Q'
            else:
                print '... discarded, already processed'

    def go(self):                       # process links in queue
        while self.q:
            url = self.q.pop()
            self.getPage(url)
def main():
    if len(argv) > 1:
        url = argv[1]
    else:
        try:
            url = raw_input('Enter starting URL: ')
        except (KeyboardInterrupt, EOFError):
            url = ''
    if not url: return
    robot = Crawler(url)
    robot.go()

if __name__ == '__main__':
    main()