• Python Crawler Cheat Sheet


    I can hardly remember how long it has been since my last post. With graduation approaching, my thesis and internship have arrived together, so I will probably publish programming articles only rarely from now on and focus instead on network security. The Windows API articles will gradually be removed from this blog; in their place I will post occasional Python pieces on tools such as webdirscan and weblogon_brust, along with articles on binaries and mobile apps. Looking back, I have learned a great deal along the way and met many experts I could learn from, and I still believe that moving forward one solid step at a time is the most reliable approach. I will also keep updating my open-source webdirscan project and release some small information-gathering tools.

    Crawler Environment Setup

    selenium

    Description: automates a real browser (with a GUI)
    Install: pip3 install selenium
    Basic usage:
    import selenium
    from selenium import webdriver
    driver=webdriver.Chrome()
    

    chromedriver

    Description: the WebDriver binary for Google Chrome
    Install: download the build matching your Chrome version from https://chromedriver.chromium.org and put it on your PATH
    Basic usage:
    import selenium
    from selenium import webdriver
    driver=webdriver.Chrome()
    

    phantomjs

    Description: simulates browser access, headless (no GUI)
    Install: download from https://phantomjs.org/download
    Configuration:
    export PATH=${PATH}:/root/phantomjs/bin
    export OPENSSL_CONF=/etc/ssl/
    

    beautifulsoup4

    Description: HTML/XML parsing
    Install: pip3 install beautifulsoup4
    Usage:
    from bs4 import BeautifulSoup
    soup=BeautifulSoup('<html></html>','lxml')
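    A slightly fuller sketch of typical usage (using the stdlib html.parser backend so lxml is not required; the HTML fragment here is made up):

    ```python
    from bs4 import BeautifulSoup

    # A small hypothetical fragment to parse.
    html = '<html><body><p class="title">Hello</p><a href="/link">click</a></body></html>'
    soup = BeautifulSoup(html, 'html.parser')

    title = soup.p.text        # text of the first <p>
    href = soup.a['href']      # attribute access
    print(title, href)         # Hello /link
    ```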
    

    pyquery

    Description: jQuery-like HTML parsing
    Install: pip3 install pyquery
    Usage:
    from pyquery import PyQuery as pq
    doc=pq('<html>HELLO</html>')
    result=doc('html').text()
    print(result)
    

    pymysql

    Description: MySQL client
    Install: pip3 install pymysql
    Usage:
    import pymysql
    conn=pymysql.connect(host='127.0.0.1',user='root',password='root',port=3306,db='mysql')
    cursor=conn.cursor()
    cursor.execute('select * from db')
    cursor.fetchone()
    Fixing the authentication-plugin error (run in MySQL):
    update mysql.user set authentication_string=PASSWORD('root'), plugin='mysql_native_password' where user='root';
    

    pymongo

    Description: MongoDB client
    Install: pip3 install pymongo
    Usage:
    import pymongo
    client=pymongo.MongoClient('localhost')
    db=client['newtestdb']
    db['table'].insert_one({'name':'Bob'})
    db['table'].find_one({'name':'Bob'})
    

    pyredis

    Description: Redis client
    Install: pip3 install redis
    Usage:
    import redis
    r=redis.Redis('localhost',6379)
    r.set('name','Bob')
    r.get('name')
    

    flask (proxy)

    Description: lightweight web framework (e.g. for a proxy-pool API)
    Install: pip3 install flask
    Usage:
    import flask
    

    django

    Description: full-featured web framework
    Install: pip3 install django
    Usage:
    import django
    

    jupyter

    Description: in-browser Markdown editing and interactive code execution
    Install: pip3 install jupyter
    Usage:
    jupyter notebook
    

    ALL

    pip3 install requests selenium beautifulsoup4 pyquery pymysql pymongo redis flask django jupyter
    

    What Is a Crawler?

    A simple request

    import requests
    response=requests.get('https://www.baidu.com')
    #response.encoding='utf-8'
    print(response.text)
    print(response.headers)
    print(response.status_code)
    
    headers={'User-Agent':"**********"}
    response=requests.get('https://www.baidu.com',headers=headers)
    print(response.status_code)
    
    # Download an image
    response=requests.get('https://www.baidu.com/gif.ico')
    print(response.content)
    with open('/var/tmp/1.ico','wb') as f:
    	f.write(response.content)
    
    

    JavaScript (JS) rendering

    Tools: selenium/WebDriver, Splash, PyV8, Ghost.py

    from selenium import webdriver
    driver=webdriver.Chrome()
    driver.get('http://m.weibo.com')
    print(driver.page_source)
    

    The urllib Library

    What is urllib?

    Built-in modules:
    urllib.request      request module
    urllib.error        exceptions
    urllib.parse        URL parsing
    urllib.robotparser  robots.txt parsing
    
    Changes from Python 2:
    Python 2:
    import urllib2
    response=urllib2.urlopen('https://www.baidu.com')
    
    python3:
    import urllib.request
    response=urllib.request.urlopen('https://www.baidu.com')
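    Besides urlopen, the urllib.parse module listed above can be exercised without any network access; a quick sketch of its two most common helpers:

    ```python
    from urllib.parse import urlparse, urlencode

    # Split a URL into its components.
    parts = urlparse('https://www.baidu.com/s?wd=python#top')
    print(parts.scheme, parts.netloc, parts.query)  # https www.baidu.com wd=python

    # Build a query string from a dict (dict order is preserved in Python 3.7+).
    query = urlencode({'wd': 'python', 'page': 2})
    print(query)  # wd=python&page=2
    ```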
    

    The requests Library

    import requests
    
    response=requests.get('https://www.baidu.com')
    print(type(response))
    print(response.status_code)
    print(type(response.text))
    print(response.text)
    print(response.cookies)
    
    import requests
    response=requests.get('http://www.baidu.com?id=1')
    print(response.text)
    
    Parameters:
    import requests
    data={
    'name':'germay',
    'age':22
    }
    response=requests.get('url',params=data)
    print(response.text)
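    To see the URL that params produces without sending anything over the network, the request can be built and prepared but not sent (the URL here is just a placeholder):

    ```python
    import requests

    data = {'name': 'germay', 'age': 22}
    # .prepare() fills in the query string without performing the request.
    req = requests.Request('GET', 'http://example.com/get', params=data).prepare()
    print(req.url)  # http://example.com/get?name=germay&age=22
    ```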
    
    JSON:
    import requests
    import json
    response=requests.get('url')
    print(type(response.text))
    print(response.json())
    print(json.loads(response.text))
    print(type(response.json()))
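    response.json() is effectively a shortcut for json.loads(response.text); the equivalence can be shown on a plain string standing in for a response body:

    ```python
    import json

    text = '{"name": "germay", "age": 22}'  # what response.text might contain
    data = json.loads(text)                 # what response.json() would return
    print(type(data))                       # <class 'dict'>
    print(data['name'], data['age'])        # germay 22
    ```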
    
    Binary content:
    import requests
    response=requests.get('url/img.ico')
    print(response.text)
    print(response.content)
    
    with open('a.ico','wb') as f:
    	f.write(response.content)
    
    headers:
    import requests
    headers={
    'User-Agent':'Mozilla/5.0'
    }
    response=requests.get('URL/explore',headers=headers)
    print(response.text)
    
    POST requests:
    import requests
    
    data={'name':'asd','age':22}
    response=requests.post('http://www',data=data)
    print(response.text)
    
    import requests
    
    data={'name':'asd','age':22}
    headers={
    'User-Agent':'asdasdasd'
    }
    response=requests.post('URL',data=data,headers=headers)
    print(response.json())
    
    Response attributes:
    import requests
    
    response=requests.get('URL')
    print(response.status_code)
    print(response.headers)
    print(response.cookies)
    print(response.url)
    print(response.history)
    
    Status code checks:
    import requests
    response=requests.get('URL')
    exit() if not response.status_code==requests.codes.ok else print('ok')
    #exit() if not response.status_code==200 else print('ok')
    

    Advanced Usage

    File upload:
    import requests
    
    files={'file':open('img.jpg','rb')}
    response=requests.post('URL',files=files)
    print(response.text)
    
    Getting cookies:
    import requests
    
    response=requests.get('URL')
    print(response.cookies)
    for key,value in response.cookies.items():
    	print(key+'='+value)
    
    Session persistence:
    import requests
    s=requests.Session()
    s.get('set cookie URL')
    response=s.get('get cookie url')
    print(response.text)
    
    Certificate verification:
    import requests 
    response=requests.get('URL')
    print(response.status_code)
    
    import requests 
    import urllib3
    urllib3.disable_warnings()
    response=requests.get('URL',verify=False)# skip certificate verification
    print(response.status_code)
    
    import requests 
    response=requests.get('URL',cert=('/path/server.crt','path/key'))
    print(response.status_code)
    
    Proxy settings:
    import requests
    
    proxies={
    'http':'http://127.0.0.1:1080',
    'https':'https://127.0.0.1:1080'
    }
    
    response=requests.get('URL',proxies=proxies)
    print(response.status_code)
    
    
    Install SOCKS support: pip install 'requests[socks]'
    import requests
    
    proxies={
    'http':'socks5://127.0.0.1:1080',
    'https':'socks5://127.0.0.1:1080'
    }
    response=requests.get('URL',proxies=proxies)
    print(response.status_code)
    
    Timeout:
    import requests
    response=requests.get('https://www.baidu.com',timeout=1)
    
    Authentication (HTTP basic auth):
    import requests
    from requests.auth import HTTPBasicAuth
    r=requests.get('http://127.0.0.1:9090',auth=HTTPBasicAuth('user','123'))
    print(r.status_code)
    
    import requests
    r=requests.get('http://127.0.0.1:9090',auth=('user','123'))
    print(r.status_code)
    
    Exception handling:
    import requests
    from requests.exceptions import ReadTimeout,HTTPError,RequestException
    
    try:
    	response=requests.get('URL',timeout=1)
    	print(response.status_code)
    except ReadTimeout:
    	print('...')
    except HTTPError:
    	print('...')
    except RequestException:
    	print('...')
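    ReadTimeout and HTTPError both inherit from RequestException, which is why the broad except clause must come last; this ordering can be checked without any network traffic:

    ```python
    from requests.exceptions import ReadTimeout, HTTPError, RequestException

    # Both specific exceptions are subclasses of the base RequestException.
    print(issubclass(ReadTimeout, RequestException))  # True
    print(issubclass(HTTPError, RequestException))    # True

    # Simulate a timeout to show the first matching handler winning.
    try:
        raise ReadTimeout('simulated timeout')
    except ReadTimeout:
        caught = 'ReadTimeout'
    except RequestException:
        caught = 'RequestException'
    print(caught)  # ReadTimeout
    ```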
    

    re Regular Expressions


    match

    re.match
    Matches a pattern from the beginning of the string; returns None if nothing matches.
    re.match(pattern,string,flags=0)
    
    Ordinary matching:
    import re
    content="Hello 123 4567 World_This is a Regex Demo"
    result=re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}\s\S{2}.*Demo$',content)
    print(result.group())
    print(result.span())
    
    Generic matching:
    import re
    content="Hello 123 4567 World_This is a Regex Demo"
    result=re.match(r'^Hello.*Demo$',content)
    print(result.group())
    print(result.span())
    
    Group matching:
    import re
    content="Hello 123 4567 World_This is a Regex Demo"
    result=re.match(r'^Hello\s(\d+)\s\d{4}\sWorld.*Demo$',content)# (\d+) captures the first number
    print(result.group(1))
    print(result.span())
    
    Greedy matching:
    import re
    content="Hello 123 4567 World_This is a Regex Demo"
    result=re.match(r'^He.*(\d+).*Demo$',content)# greedy .* leaves only one digit for the group
    print(result)
    print(result.group(1))
    print(result.span())
    
    Non-greedy matching:
    import re
    content="Hello 123 4567 World_This is a Regex Demo"
    result=re.match(r'^He.*?(\d+).*Demo$',content)# .*? stops at the first digit run
    print(result)
    print(result.group(1))
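    The greedy/non-greedy difference above is easiest to see side by side on a shorter string:

    ```python
    import re

    content = 'Hello 123 4567 World'
    # Greedy: .* grabs as much as it can, so (\d+) is left with a single digit.
    greedy = re.match(r'^He.*(\d+).*World$', content).group(1)
    # Non-greedy: .*? grabs as little as possible, so (\d+) takes the whole first number.
    lazy = re.match(r'^He.*?(\d+).*World$', content).group(1)
    print(greedy, lazy)  # 7 123
    ```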
    
    Matching with flags (re.S):
    import re
    content='''Hello 123 4567 World_This
    is a Regex Demo'''
    result=re.match(r'^He.*?(\d+).*?Demo$',content,re.S)# re.S lets . match newlines as well
    print(result)
    print(result.group(1))
    
    Escaping:
    import re
    content='price is $5.00'
    result=re.match(r'price is \$5\.00',content)
    print(result)
    print(result.group())
    

    search (returns the first match)

    re.search:
    import re
    content='Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
    result=re.search(r'Hello.*?(\d+).*?Demo',content)
    print(result)
    print(result.group(1))
    

    findall (returns all matches)

    result=re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>',html,re.S)
    print(result)
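    A self-contained version of the same idea on a made-up snippet (the tags and attributes are hypothetical; the html variable above is assumed to hold a page like this):

    ```python
    import re

    html = '''
    <li><a href="/song/1" singer="A">First</a></li>
    <li><a href="/song/2" singer="B">Second</a></li>
    '''
    # One tuple per <li>, one element per capture group.
    result = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
    print(result)  # [('/song/1', 'A', 'First'), ('/song/2', 'B', 'Second')]
    ```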
    

    Sub

    Replaces every match in the string and returns the resulting string.
    import re
    content='Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
    content=re.sub(r'\d+','',content)
    print(content)
    
    import re
    content='Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
    content=re.sub(r'\d+','replace',content)
    print(content)
    
    Keeping the match with a backreference:
    import re
    content='Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
    content=re.sub(r'(\d+)',r'\1 8910',content)
    print(content)
    

    compile

    Compiles a regular expression into a reusable pattern object.
    import re
    content='''Extra strings Hello 1234567 World_This
    is a Regex Demo Extra strings'''
    pattern=re.compile('Hello.*Demo',re.S)
    result=re.search(pattern,content)# search, since 'Hello' is not at the start of the string
    print(result)
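    The payoff of compile is reuse: the same pattern object can be applied to many strings without re-parsing the regex each time. A minimal sketch:

    ```python
    import re

    # Compile once, reuse across many inputs.
    pattern = re.compile(r'\d+')
    lines = ['order 12', 'order 345', 'no digits here']
    hits = [pattern.findall(line) for line in lines]
    print(hits)  # [['12'], ['345'], []]
    ```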
    
    Technology knows no borders.
  • Original article: https://www.cnblogs.com/angels-yaoyao/p/12443595.html