IP Proxy
Proxy IPs are stored in the clawer database, in the smart_proxy_proxyip table; count them (the where clause restricts the count to valid IPs) with:
select count(*) from smart_proxy_proxyip [where is_valid=1];
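The same count can be checked from the Django shell (python manage.py shell); a minimal sketch, assuming the project's database settings point at the clawer database:
from django.db import connection
cursor = connection.cursor()
cursor.execute("select count(*) from smart_proxy_proxyip where is_valid=1")
print cursor.fetchone()[0] # number of currently valid proxy IPs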
1. Crawl proxy IPs
from smart_proxy.cramer_proxy_ip import Cramer
cramer=Cramer()
cramer.run()
2. Poll the proxy IPs (round-robin)
import smart_proxy.round_proxy_ip
smart_proxy.round_proxy_ip.run()
3. Verify the proxy-request mechanism
from smart_proxy.api import Proxy
proxy=Proxy()
# call without any arguments
print proxy.get_proxy()
# specify how many proxies are needed and the region
print proxy.get_proxy(num=5,province='Beijing')
# fetch the invalid IPs
print proxy.get_proxy(is_valid=False)
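A sketch of feeding one of the fetched proxies into the requests library; the structure of the records returned by get_proxy() is an assumption here (adjust the ip/port fields to whatever the method actually returns):
import requests
records = proxy.get_proxy(num=1)
addr = '%s:%s' % (records[0]['ip'], records[0]['port']) # assumed fields
resp = requests.get('http://www.baidu.com', proxies={'http': 'http://' + addr}, timeout=10)
print resp.status_code # 200 means the proxy relayed the request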
4. Verify the IP validity check
import smart_proxy.round_proxy_ip
print smart_proxy.round_proxy_ip.change_valid(10) # the argument is the id of the proxy IP record
Generator
Steps before git pull:
git add clawer/settings_local.py
git add ../confs/dev/run_local.py
git commit -m "save settings_local and run_local"
git pull
Preparation:
1. Edit /confs/dev/run_local.sh and set:
WORKDIR=~/Projects/cr-clawer/clawer
PY=~/Projects/env/bin/python
RQWORKER=~/Projects/env/bin/rqworker
2. Edit cr-clawer/clawer/clawer/settings_local.py and set:
PYTHON="/home/princetechs/Projects/env/bin/python"
3. Start the redis server
Install redis:
yum install epel-release
yum install redis # install redis
service redis start # start the redis service
redis-cli # open the client to check that redis started correctly
exit # leave redis-cli
Updated steps:
redis-server # start the redis server
redis-cli # start the client
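To confirm the server is reachable from Python, a small check (assuming the redis-py package is installed in the virtualenv):
import redis
r = redis.StrictRedis(host='localhost', port=6379, db=0)
print r.ping() # True means redis is up and accepting connections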
4. Start the worker
Run the /confs/dev/run_local.sh script:
./run_local.sh rq
5. Start the mongodb server:
Enter the mongodb directory: cd ~/mongodb
On the first run, create the set and log directories (mkdir set; mkdir log) and create the startup config file: vim mongo.conf
Put the following in mongo.conf:
port=27017
dbpath=set/
logpath=log/mongo.log
logappend=true
Start mongod with:
./bin/mongod -f mongo.conf
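To verify from Python that mongod started correctly, a sketch assuming pymongo is installed and mongod listens on the default port 27017:
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
print client.server_info()['version'] # raises an error if mongod is not reachable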
Before using Python, activate the Python virtual environment each time:
source ~/Projects/env/bin/activate # activate the virtual environment
deactivate # leave the virtual environment
Steps to invoke the generator:
1. Create a new job
cd ~/Projects/cr-clawer/clawer # change into the directory that contains manage.py
python manage.py makemigrations
python manage.py test collector.tests.test_generator.TestMongodb.test_job_save
cd mongodb/bin/
./mongo # enter the mongo client
show dbs
use source
db.getCollectionNames()
db.job.find().pretty() # display the documents in the job collection
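The same inspection can be done from Python with pymongo instead of the mongo shell (a sketch, assuming the default host/port):
from pymongo import MongoClient
db = MongoClient('localhost', 27017)['source']
print db.collection_names()
print db['job'].find_one() # one document from the job collection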
2. Data preprocessing
python manage.py shell
from collector.utils_generator import DataPreprocess
dp = DataPreprocess('job_id_string') # pass the id string of the job created in step 1
1. Enter a URI directly and specify the scheme(s):
schemes=['http']
dp.save(text='http://www.baidu.com',settings={'schemes':schemes})
# check in the mongo client that the record was added correctly
db.getCollectionNames()
db.crawler_task.find().pretty()
2. Upload a Python script plus a cron schedule (code_type=1):
script="""import json
print json.dumps({'uri':"http://www.baidu.com",'sdf':"sdfdf"})"""
cron="*/3 * * * *"
dp.save(script=script,settings={'cron':cron,'code_type':1})
script="""import json
print json.dumps({'uri':"http://www.newbaidu.com"})"""
cron="* * * * *"
dp.save(script=script,settings={'cron':cron,'code_type':1})
crontab.task_generator_install()
# shell script (code_type=2)
script = """#!/bin/bash
echo "{'uri':'http://www.shell.com'}"
"""
cron = "* * * * *"
dp.save(script=script,settings={'cron':cron,'code_type':2})
Import from txt / csv:
file_object=open('/home/princetechs/桌面/文件名') # open the txt/csv file to import (the path here is a placeholder)
all_the_text=file_object.read()
dp.save(text=all_the_text,settings={'schemes':schemes})
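For csv files, a sketch that pulls the URIs out of the first column before saving; the file path and column layout are assumptions, adapt them to the real file:
import csv
uris = []
with open('/home/princetechs/urls.csv') as f: # hypothetical csv path
    for row in csv.reader(f):
        if row:
            uris.append(row[0]) # assumption: the URI sits in the first column
dp.save(text='\n'.join(uris), settings={'schemes': schemes})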
Supplementary note: test each input case from the test plan one by one: manual entry, importing a txt file, importing a csv file, and uploading a Python script.
3. Update the generator scripts
Create a new file named testFilename under /cr-clawer/confs/dev/;
from collector.utils_generator import CrawlerCronTab
filename="/home/princetechs/Projects/cr-clawer/confs/dev/testFilename"
crontab = CrawlerCronTab(filename= filename)
# filename is a string: the path of the file from which crontab entries are read or to which they are saved
# periodically update jobs and the generator scripts
crontab.task_generator_install()
4. Execute the task-dispatch command registered in crontab
crontab.task_generator_run()
5. Insert 4000 jobs with generator scripts:
python manage.py test collector.tests.test_generator.TestPreprocess.insert_4000_jobs_with_generators
Additional note: path of the feature code:
~/Projects/cr-clawer/clawer/collector/utils_generator.py
Input method setup reference:
http://blog.csdn.net/alex_my/article/details/38223449
Downloader: the downloader crawls URIs such as enterprice://重庆/重庆钢铁集团电视台/50010410003471