job1: move rows older than 24 hours to another table
@Override
@Transactional
public void moveItemsTo48h() {
    // Rows older than 24 hours, locked with "for update" for the duration of this transaction
    List<CrawlItem> historyItems = super.baseMapper.getItemsAfter24Hours();
    historyItems.forEach(a -> {
        Long id = a.getId();
        CrawlItem48h item = BeanUtil.toBean(a, CrawlItem48h.class);
        // (mall, goods_source_sn) identifies the item in the 48h table
        QueryWrapper<CrawlItem48h> wrapper = new QueryWrapper<>();
        wrapper.eq("mall", a.getMall());
        wrapper.eq("goods_source_sn", a.getGoodsSourceSn());
        CrawlItem48h exist = crawlItem48hMapper.selectOne(wrapper);
        if (exist == null) {
            crawlItem48hMapper.insert(item);
        } else {
            // Refresh the existing 48h row in place, keeping its own id
            BeanUtil.copyProperties(item, exist, "id");
            crawlItem48hMapper.updateById(exist);
        }
        // Drop the row from crawl_items once it has been copied over
        removeById(id);
    });
}
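Since the analysis at the end hinges on when these locks are released, it may help to spell out what @Transactional buys here. A sketch of the equivalent programmatic transaction, for illustration only and not the original code, assuming a Spring PlatformTransactionManager field named txManager is available:

// Equivalent lock lifetime to the @Transactional method above: the exclusive
// row locks taken by the "for update" select are held until commit(), i.e.
// across the entire copy-and-delete loop.
// (imports: org.springframework.transaction.TransactionStatus,
//  org.springframework.transaction.support.DefaultTransactionDefinition)
TransactionStatus tx = txManager.getTransaction(new DefaultTransactionDefinition());
try {
    List<CrawlItem> historyItems = baseMapper.getItemsAfter24Hours(); // X locks acquired here
    // ... per-row copy to CrawlItem48h and removeById(id), exactly as above ...
    txManager.commit(tx);   // locks released only here
} catch (RuntimeException e) {
    txManager.rollback(tx); // locks are also released on rollback
    throw e;
}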
@Select("select id,goods_source_sn,goods_info_url,source,url_code," +
"thumb_url,zhi_count,buzhi_count,star_count,comments_count,mall,title,emphsis,detail,detail_brief," +
"label,category_text,item_create_time,item_update_time,main_image_url,big_image_urls,small_image_urls," +
"price_text,price,unit_price,actual_buy_link,transfer_link,transfer_result,transfer_remark,coupon_info,taobao_pwd," +
"score,score_minute,keywords,status,remark,creator," +
"creator_id,last_operator,last_operator_id from crawl_items where TIMESTAMPDIFF(HOUR,item_create_time,now()) > 24 for update")
List<CrawlItem> getItemsAfter24Hours();
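One detail worth flagging in this query: wrapping item_create_time in TIMESTAMPDIFF means MySQL cannot use an index on that column, so the locking read scans (and locks) far more rows than it returns. A sketch of an equivalent, index-friendly predicate, assuming item_create_time is (or could be) indexed; the full column list is elided for brevity:

// Same rows, but the bare column lets MySQL range-scan an index on
// item_create_time, so "for update" locks only the matching rows.
@Select("select id, goods_source_sn, goods_info_url /* ...same column list as above... */ " +
        "from crawl_items " +
        "where item_create_time < now() - interval 24 hour for update")
List<CrawlItem> getItemsAfter24Hours();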
job2: re-crawl the data on the first two pages
@Override
@Transactional
public void reCrawl() {
    // URLs of the top-ranked items (two pages of 20), locked with "for update"
    List<String> urllist = crawlItemMapper.getFirstRecrawlItems();
    if (urllist.isEmpty()) {
        return; // nothing to re-crawl; also avoids substring() on an empty string
    }
    // Join the URLs into one comma-separated spider argument
    String urls = String.join(",", urllist);
    Map<String, Object> paramMap = new HashMap<>();
    paramMap.put("project", "smzdmCrawler");
    paramMap.put("spider", "smzdm_single");
    paramMap.put("url", urls);
    String response = HttpUtil.createRequest(Method.POST, "http://42.192.51.99:6801/schedule.json")
            .basicAuth("david_scrapyd", "david_2021")
            .form(paramMap)
            .execute()
            .body();
    log.info(response);
}

@Select("select goods_info_url from crawl_items where status=1 " +
        "order by score_minute desc, id desc limit 0,40 for update")
List<String> getFirstRecrawlItems();
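For what it's worth, scrapyd's schedule.json answers with a small JSON document ({"status": "ok", "jobid": "..."} on success), so the response can be checked instead of only logged. A minimal sketch, assuming hutool's JSON module is on the classpath alongside BeanUtil/HttpUtil:

// Sketch: treat a non-"ok" status from scrapyd as a failure instead of
// silently logging the raw body. Field names follow scrapyd's API docs.
JSONObject json = JSONUtil.parseObj(response);
if ("ok".equals(json.getStr("status"))) {
    log.info("scrapyd accepted job, jobid={}", json.getStr("jobid"));
} else {
    log.error("scrapyd schedule failed: {}", response);
}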
Both queries carry for update, and both methods are annotated @Transactional. In theory job1's SQL takes the locks first and job2's SQL waits until moveItemsTo48h() finishes and releases them; only then does job2 run getFirstRecrawlItems(), at which point the rows deleted by job1 no longer show up in job2's result and are never re-inserted. In production, though, the outcome looks like job2's SQL took the locks first, so the rows deleted in job1 get re-crawled and re-inserted anyway. Unfortunately the production service wasn't logging its SQL at debug level and MySQL's general_log was off, so for now the working assumption has to be that job2 locked first.
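Next time the lock order can be observed directly instead of inferred. On MySQL 8.0+ the performance_schema.data_locks table shows, per transaction, which locks are GRANTED and which are WAITING (on 5.7 the rough equivalents are information_schema.innodb_locks / innodb_lock_waits); general_log can also be switched on temporarily with SET GLOBAL general_log = 'ON'. A throwaway diagnostic mapper method, sketched here assuming MySQL 8.0+:

// Diagnostic only: while both jobs are running, lock_status tells you which
// job's "for update" won the race (GRANTED) and which one is blocked (WAITING).
@Select("select engine_transaction_id, object_name, index_name, " +
        "lock_type, lock_mode, lock_status, lock_data " +
        "from performance_schema.data_locks")
List<Map<String, Object>> getCurrentDataLocks();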