• Data problems caused by two jobs running at the same time


    job1: move rows older than 24 hours to another table

        @Override
        @Transactional
        public void moveItemsTo48h() {
            // Rows older than 24 hours, selected FOR UPDATE (see mapper below)
            List<CrawlItem> historyItems = super.baseMapper.getItemsAfter24Hours();
            historyItems.forEach(a -> {
                Long id = a.getId();
                CrawlItem48h item = BeanUtil.toBean(a, CrawlItem48h.class);

                // Upsert into the 48h table, keyed by (mall, goods_source_sn)
                QueryWrapper<CrawlItem48h> wrapper = new QueryWrapper<>();
                wrapper.eq("mall", a.getMall());
                wrapper.eq("goods_source_sn", a.getGoodsSourceSn());
                CrawlItem48h exist = crawlItem48hMapper.selectOne(wrapper);
                if (exist == null) {
                    crawlItem48hMapper.insert(item);
                } else {
                    // Overwrite the existing row but keep its own id
                    BeanUtil.copyProperties(item, exist, "id");
                    crawlItem48hMapper.updateById(exist);
                }
                // Finally delete the original row from crawl_items
                removeById(id);
            });
        }


    @Select("select id,goods_source_sn,goods_info_url,source,url_code," +
    "thumb_url,zhi_count,buzhi_count,star_count,comments_count,mall,title,emphsis,detail,detail_brief," +
    "label,category_text,item_create_time,item_update_time,main_image_url,big_image_urls,small_image_urls," +
    "price_text,price,unit_price,actual_buy_link,transfer_link,transfer_result,transfer_remark,coupon_info,taobao_pwd," +
    "score,score_minute,keywords,status,remark,creator," +
    "creator_id,last_operator,last_operator_id from crawl_items where TIMESTAMPDIFF(HOUR,item_create_time,now()) > 24 for update")
    List<CrawlItem> getItemsAfter24Hours();
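
    A side note on this query: wrapping item_create_time in TIMESTAMPDIFF means MySQL cannot use an index on that column, so the FOR UPDATE scan can end up locking every row it examines, not just the expired ones. A minimal sketch of an index-friendly rewrite, assuming crawl_items has an index on item_create_time (the boundary shifts slightly, since TIMESTAMPDIFF truncates to whole hours):

        @Select("select id,goods_source_sn,goods_info_url /* ...same column list as above... */ " +
        "from crawl_items where item_create_time < now() - interval 24 hour for update")
        List<CrawlItem> getItemsAfter24Hours();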
     

    job2: re-crawl the data from the first two pages

        @Override
        @Transactional
        public void reCrawl() {
            // URLs of the top 40 items, selected FOR UPDATE (see mapper below)
            List<String> urllist = crawlItemMapper.getFirstRecrawlItems();
            // Join with commas; String.join returns "" for an empty list
            String urls = String.join(",", urllist);
            // Ask scrapyd to schedule a re-crawl of those URLs
            Map<String, Object> paramMap = new HashMap<>();
            paramMap.put("project", "smzdmCrawler");
            paramMap.put("spider", "smzdm_single");
            paramMap.put("url", urls);
            String response = HttpUtil.createRequest(Method.POST, "http://42.192.51.99:6801/schedule.json")
                    .basicAuth("david_scrapyd", "david_2021")
                    .form(paramMap)
                    .execute()
                    .body();
            log.info(response);
        }
    
        @Select("select goods_info_url from crawl_items where status=1 order by score_minute desc, id desc limit 0,40 for update")
        List<String> getFirstRecrawlItems();
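
    Since both mappers end in for update, when debugging this it helps to see at runtime which transaction actually holds or is waiting on row locks for crawl_items. A minimal sketch, assuming MySQL 8.0 (performance_schema.data_locks; on 5.7 the rough equivalent is information_schema.innodb_locks) and a hypothetical mapper method name:

        @Select("select engine_transaction_id, lock_type, lock_mode, lock_status, lock_data " +
        "from performance_schema.data_locks where object_name = 'crawl_items'")
        List<Map<String, Object>> selectCrawlItemLocks();

    Rows with lock_status = 'WAITING' show which job is blocked behind the other.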

    Both of these queries end in for update, and both methods are annotated @Transactional. In theory job1's SQL should take the locks first: job2's SQL would then block until moveItemsTo48h() finished and committed, releasing the locks; only then would job2 run getFirstRecrawlItems(), the rows job1 had deleted would no longer show up, and they would never be re-inserted. In production, though, the behavior looks as if job2's SQL takes the locks first, so the rows deleted in job1 get re-crawled and inserted back. Unfortunately the SQL debug log was not enabled in production and MySQL's general_log was off, so for now the working assumption is that job2 locked first.
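
    If the goal is simply to stop the two jobs from interleaving, one option is to serialize them explicitly instead of relying on which FOR UPDATE scan wins the race. A sketch, not from the original post, with an illustrative lock name and method names, using MySQL named locks:

        @Select("select get_lock('crawl_items_jobs', 60)")
        Integer lockCrawlItemJobs();     // 1 = acquired, 0 = timed out after 60s

        @Select("select release_lock('crawl_items_jobs')")
        Integer unlockCrawlItemJobs();

    Each job would call lockCrawlItemJobs() on entry and unlockCrawlItemJobs() in a finally block. Named locks are per-connection, so both calls must run on the same pooled connection, and the lock is released automatically if that connection dies.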

  • Original post: https://www.cnblogs.com/zjhgx/p/15092376.html