• Problems encountered when first using a generator in Python


    1. I ran into a generator problem not long after I started writing Python programs. Some of Python's features differ from Java's, so I am recording the problem here.

       def parse(self, response):
            self.logger.info("smzdm_jingxuan spider starting")
            self.broswer.get('https://www.smzdm.com/jingxuan/')
            last_height = self.broswer.execute_script("return document.body.scrollHeight")
            print(last_height)
            count = 0
    
            while True:
                if count == 30:
                    break
                self.broswer.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(2)
                new_height = self.broswer.execute_script("return document.body.scrollHeight")
                if new_height == last_height:
                    break
                last_height = new_height
                time.sleep(1.2)
                count = count + 1
    
            source = self.broswer.page_source
            self.logger.info('before parse goods in page')
            try:
                # return self.parse_goods_in_page(source)
                self.parse_goods_in_page(source)
            except Exception as e:
                ...  # (handler omitted)

       def parse_goods_in_page(self, source):
            print('parse_goods_in_page in')
            self.logger.info("parse_goods_in_page in in")
            scrapy_selector = Selector(text=source)
            items_selector = scrapy_selector.xpath('//div[@class="z-feed-content"]')
            self.logger.info("There's a total of " + str(len(items_selector)) + ' links.')
            s = 0
    
            for item_selector in items_selector:
                try:
                    self.logger.info("s=" + str(s))
                    .......
                    .......
                    s = s + 1
                    yield smzdm_item
    
    
                except Exception as e:
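                    # e.__traceback__ is not a str, so concatenating it raises TypeError;
                    # exc_info=True makes the logger append the traceback instead
                    self.logger.info('Reached last iteration #' + str(s), exc_info=True)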
    
            self.broswer.close()
            return

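    The symptom: parse() calls parse_goods_in_page(), but the method never seemed to be entered, and a breakpoint set inside it was never hit. Java has no such feature, so I was confused at first. After looking it up, I found this in the Python Cookbook: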

    “The mere presence of the yield statement in a function turns it into a generator. Unlike a normal function, a generator only runs in response to iteration”
    
    Excerpt From: David Beazley and Brian K. Jones. “Python Cookbook.” Apple Books. 

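    In other words, the yield keyword makes parse_goods_in_page a generator. Calling a generator method directly, as self.parse_goods_in_page(source) does in the code above, does not run the method body; it only returns a generator object. For example, after gen = self.parse_goods_in_page(source) none of the log lines inside the method have been printed. To run the body you have to call next(gen): the code runs up to yield smzdm_item and returns. Calling next(gen) again resumes right after yield smzdm_item and continues through the second loop iteration until it reaches yield smzdm_item again.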
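The behaviour described above can be reproduced without Scrapy. This is a minimal sketch: parse_goods and the trace list are hypothetical stand-ins for parse_goods_in_page and its log calls, not the original code.

```python
trace = []  # records which parts of the generator body have actually run


def parse_goods(names):
    # Hypothetical stand-in for parse_goods_in_page.
    trace.append("body started")
    for name in names:
        trace.append("before yield " + name)
        yield name
        trace.append("after yield " + name)


gen = parse_goods(["a", "b"])   # only creates a generator object
trace_after_call = list(trace)  # snapshot: the body has not run at all
print(trace_after_call)         # []

first = next(gen)               # runs the body up to the first yield
print(first, trace)             # a ['body started', 'before yield a']

second = next(gen)              # resumes right after the first yield
print(second, trace)
```

The empty snapshot taken right after the call is the same effect as the missing log output and the breakpoint that was never hit.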

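    So what if I call parse_goods_in_page and want the whole method to run? Change the line self.parse_goods_in_page(source) to yield from self.parse_goods_in_page(source) (PEP 380, https://www.python.org/dev/peps/pep-0380/) or to return self.parse_goods_in_page(source). Either way the items reach the caller of parse(), and the caller's iteration is what drives the generator to completion.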
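A small sketch of the two fixes next to the original mistake. All function names here are hypothetical, and list() plays the role of the framework that iterates over whatever parse() returns.

```python
def parse_goods(source):
    # Hypothetical stand-in for parse_goods_in_page.
    for word in source.split():
        yield word.upper()


def parse_ignored(source):
    parse_goods(source)             # generator created and discarded; body never runs


def parse_with_return(source):
    return parse_goods(source)      # hands the generator object to the caller


def parse_with_yield_from(source):
    yield from parse_goods(source)  # PEP 380: re-yields every item of the sub-generator


ignored = parse_ignored("a b")
returned = list(parse_with_return("a b"))
delegated = list(parse_with_yield_from("a b"))
print(ignored, returned, delegated)  # None ['A', 'B'] ['A', 'B']
```

One difference worth keeping in mind: with return, the generator body runs only after parse() has already returned, so a try/except wrapped around the return statement will not see exceptions raised while the items are produced. With yield from inside the try block, those exceptions propagate through parse() and can be caught there.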

    2. Pay attention to the following code

       def parse_goods_in_page(self, page_items_selector_list):
      ...
                        for t in create_time_text:
                            if t.strip():
                                item_create_time = t.strip()
                                self.logger.info("create_time = " + item_create_time)
                                smzdm_item['item_create_time'] = item_create_time
                        # Pinduoduo's create_time differs from the others
                        self.logger.info("item_create_time = " + str(item_create_time))
                        if not item_create_time:
                            create_time_text = item_selector.css(".wechat-time::text").get()
                            self.logger.info("pdd create_time_text = " + str(create_time_text))
                            item_create_time = create_time_text.strip()
                            self.logger.info("create_time = " + item_create_time)
    
                 ........
                        s = s + 1
                        yield smzdm_item

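    During development I noticed that when item_create_time was not assigned (the for t in create_time_text loop never assigned it), execution reached if not item_create_time without an error, even though an unassigned variable should not exist, and printing it showed that it really had a value. The method is a generator, so its local variables stay alive between next() calls, and a Python for loop does not open a new scope for each iteration. On a later next() the variable therefore still holds the value from the previous iteration. To avoid this, initialize item_create_time (for example to None) at the start of each loop iteration, before it is read.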
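The stale-value effect can be shown with a much smaller generator. This is a sketch under assumptions: extract_times, extract_times_fixed and the sample rows are hypothetical and only mimic the shape of the loop above.

```python
def extract_times(rows):
    # Hypothetical stand-in: each row is a list of candidate strings.
    for row in rows:
        for text in row:
            if text.strip():
                create_time = text.strip()
        yield create_time       # stale when this row had no non-blank text


def extract_times_fixed(rows):
    for row in rows:
        create_time = None      # reset at the start of every iteration
        for text in row:
            if text.strip():
                create_time = text.strip()
        yield create_time


rows = [["10:30"], [" "], ["11:45"]]
stale = list(extract_times(rows))
fixed = list(extract_times_fixed(rows))
print(stale)  # ['10:30', '10:30', '11:45']
print(fixed)  # ['10:30', None, '11:45']

# If the very first row has no usable text, the variable really does not exist yet.
try:
    list(extract_times([[" "]]))
    first_row_error = None
except UnboundLocalError as exc:
    first_row_error = exc
print(type(first_row_error).__name__)  # UnboundLocalError
```

The same carry-over would happen in an ordinary function with a loop; the generator only makes it easier to miss, because the frame is paused and resumed between items.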

    A coder who loves art
  • Original article: https://www.cnblogs.com/zjhgx/p/13090241.html