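This post can be ignored now; just read the link above instead. Deleting it is a bit of a hassle, so I am leaving it here for the time being.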

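Scraping static pages

At first I used urllib, but it can only fetch static pages. Hardly any pages are static these days; most of them load their content dynamically through Ajax. If you fetch a dynamic page this way, the page you get back is incomplete.

The source code: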

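# -*- coding: utf-8 -*-
# Download images from a page (Python 2)
# The fetched page is incomplete because part of it is loaded dynamically, so I went on
# to write a selenium-based crawler that fully simulates browser behaviour
# urllib is about the easiest library to pick up for static pages; for a simple crawler
# it is the most convenient thing to write with

import re
import urllib2
import urllib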

# HTTP request headers; they do not matter much here
header = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Connection': 'Keep-Alive',
    'Referer': 'http://www.baidu.com',
    'Pragma': 'no-cache',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
}

# Fetch the page and return its raw content
def openURL(url):
    request = urllib2.Request(url=url, headers=header)
    response = urllib2.urlopen(request)
    content = response.read()
    return content

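# This regex may have problems of its own, but the results were incomplete at first
# because the page markup is loaded dynamically and urllib never loads it
def comp(html):
    rere = r'src="(.*?\.jpg)" bdwater='
    compicture = re.compile(rere)
    pictures = re.findall(compicture, html)
    return pictures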

def main():
    index = 0
    wanna = comp(openURL('http://tieba.baidu.com/p/2166231880'))
    for i in wanna:
        index = index + 1
        # The pic/ directory must already exist
        urllib.urlretrieve(i, 'pic/%d.jpg' % index)
        print("[+] Saved " + str(index) + ".jpg")

if __name__ == '__main__':
    main()
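
For anyone on Python 3, where urllib2 no longer exists, the same static-page idea could be sketched roughly as follows. This is only a sketch under stated assumptions: the names HEADERS, IMG_PATTERN, build_request, fetch_html and find_images are made up for illustration, the utf-8 encoding is a guess, and the pattern is a slightly tightened variant of the one above ([^"] instead of .) so that a match cannot run across several tags.

```python
import re
import urllib.request

# Two of the header values from the listing above, as proper key/value pairs.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
                  "Gecko/20091201 Firefox/3.5.6",
    "Referer": "http://www.baidu.com",
}

# Tightened variant of the original pattern: the URL may not contain a
# double quote, so one match stays inside one src attribute.
IMG_PATTERN = re.compile(r'src="([^"]+?\.jpg)" bdwater=')


def build_request(url):
    """Return a Request carrying HEADERS; nothing is sent yet."""
    return urllib.request.Request(url, headers=HEADERS)


def fetch_html(url, encoding="utf-8"):
    """Fetch a page and decode it; the encoding is an assumption."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read().decode(encoding, errors="replace")


def find_images(html):
    """Return every image URL matched by IMG_PATTERN, in page order."""
    return IMG_PATTERN.findall(html)
```

find_images can be tried on a hand-written HTML string without touching the network; fetch_html is the only part that actually sends a request.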

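Scraping dynamic pages

Then I wrote a second version that uses selenium to simulate the browser's entire visit. It waits until the page has loaded and rendered before running the code that follows.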

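# -*- coding: utf-8 -*-
# Fully mimics browser behaviour: the library imported on the fourth line opens a real browser and drives it

from selenium import webdriver
import urllib
import re

# Saving the page source hit an encoding problem at first; I had run into something similar before
import sys
reload(sys)
sys.setdefaultencoding('utf-8')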

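# Drive the browser; this is more or less the general-purpose way to handle dynamic pages
def getwholesource(url):
    chrome = webdriver.Chrome()
    chrome.get(url)
    # Use JS to scroll the page to the bottom, mainly to trigger the JS events that make
    # Ajax load the content (I picked up a little JS while reading about XSS a couple of days ago)
    js = "var q=document.documentElement.scrollTop=40000"
    chrome.execute_script(js)
    str1 = chrome.page_source
    # Close the browser once the source has been read
    chrome.quit()
    return str1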

# Regex match to find the image URLs
def comp(str1):
    # All the image URLs turned out to have the same length; this regex is a bit lazy, but it does work
    rere = r'http://.{182}\.jpg'
    compicture = re.compile(rere)
    pictures = re.findall(compicture, str1)
    return pictures

# Finally, download the images with urllib.urlretrieve
def main():
    pic = comp(getwholesource('http://tieba.baidu.com/p/2166231880'))
    index = 0
    for i in pic:
        index = index + 1
        # The pic/ directory must already exist
        urllib.urlretrieve(i, 'pic/%d.jpg' % index)
        print("[+] Saved " + str(index) + ".jpg")

if __name__ == '__main__':
    main()
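
The fixed-length regex above only works while every image URL happens to be the same length. A less brittle alternative might look like the sketch below (Python 3). It is not what the script above does: JPG_URL and find_jpg_urls are hypothetical names, and the pattern assumes the URLs sit inside quoted attributes and contain no whitespace, quotes or angle brackets.

```python
import re

# Hypothetical replacement for the fixed-length pattern: an http(s) URL
# made of characters that cannot end an attribute value, ending in .jpg.
JPG_URL = re.compile(r'https?://[^\s"\'<>]+?\.jpg')


def find_jpg_urls(page_source):
    """Return the unique .jpg URLs in page_source, in first-seen order."""
    seen = set()
    urls = []
    for url in JPG_URL.findall(page_source):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Dropping duplicates matters here because the same image URL can show up more than once in a rendered page source, and each duplicate would otherwise be downloaded again under a new number.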

oldshe100
