1. Not long after I started writing Python, I ran into a problem with generators. Some Python features differ from Java, so I'm writing this one down.

    def parse(self, response):
        self.logger.info("smzdm_jingxuan spider starting")
        self.broswer.get('https://www.smzdm.com/jingxuan/')
        last_height = self.broswer.execute_script("return document.body.scrollHeight")
        print(last_height)
        count = 0
        while True:
            if count == 30:
                break
            self.broswer.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
            new_height = self.broswer.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height
            time.sleep(1.2)
            count = count + 1
        source = self.broswer.page_source
        self.logger.info('before parse goods in page')
        try:
            # return self.parse_goods_in_page(source)
            self.parse_goods_in_page(source)
        except Exception as e:
            ...  # handler body not shown in the original

    def parse_goods_in_page(self, source):
        print('parse_goods_in_page in')
        self.logger.info("parse_goods_in_page in in")
        scrapy_selector = Selector(text=source)
        items_selector = scrapy_selector.xpath('//div[@class="z-feed-content"]')
        self.logger.info('Theres a total of ' + str(len(items_selector)) + ' links.')
        s = 0
        for item_selector in items_selector:
            try:
                self.logger.info("s=" + str(s))
                .......
                .......
                s = s + 1
                yield smzdm_item
            except Exception as e:
                # str() is needed: a traceback object cannot be concatenated to a string
                self.logger.info('Reached last iteration #' + str(e.__traceback__) + str(s))
        self.broswer.close()
        return

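The symptom: parse() calls parse_goods_in_page(), but the call never seems to happen, and even with a breakpoint set the debugger never steps into the method. Java has no such feature, so I was confused at first. After some digging, I found this in the Python Cookbook: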
“The mere presence of the yield statement in a function turns it into a generator. Unlike a normal function, a generator only runs in response to iteration”
Excerpt From: David Beazley and Brian K. Jones. “Python Cookbook.” Apple Books.
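In other words, the yield keyword makes parse_goods_in_page a generator function. Calling a generator function directly, as self.parse_goods_in_page(source) does in the code, does not run its body; it only returns a generator object. With gen = self.parse_goods_in_page(source), you can see that none of the log lines inside are printed. To actually run it you have to call next(gen): execution proceeds to yield smzdm_item and returns that item. Calling next(gen) again resumes right after yield smzdm_item, goes through the loop a second time, and returns at yield smzdm_item again.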
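The behaviour described above can be reproduced without Scrapy. This is only a minimal sketch: `parse_items` and `log` are hypothetical names standing in for parse_goods_in_page and its logger calls.

```python
log = []  # records which parts of the generator body have actually run


def parse_items(items):
    """Hypothetical stand-in for parse_goods_in_page: yields one item at a time."""
    log.append("body started")
    for item in items:
        log.append("about to yield " + item)
        yield item
    log.append("body finished")


gen = parse_items(["a", "b"])  # only creates a generator object; the body has not run
log_after_call = list(log)     # snapshot: nothing has been logged yet

first = next(gen)              # runs the body up to the first yield
log_after_first = list(log)

second = next(gen)             # resumes after the first yield, runs to the second
log_after_second = list(log)
```

Comparing the three snapshots shows that the log stays empty until the first next() call, which matches the "cannot step into the method" symptom.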
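So what if I want the call to parse_goods_in_page to run the whole method? Change the line self.parse_goods_in_page(source) to yield from self.parse_goods_in_page(source) (see PEP 380, https://www.python.org/dev/peps/pep-0380/) or to return self.parse_goods_in_page(source). Either way the items are handed to whoever called parse(), and the generator body runs as that caller iterates over the result. In Scrapy, the framework iterates over whatever the callback returns.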
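A small sketch of the three variants, again outside Scrapy. The function names (`inner`, `parse_broken`, `parse_with_yield_from`, `parse_with_return`) are hypothetical, and `list(...)` plays the role of the caller that iterates over the result.

```python
ran = []  # records every item the inner generator actually processed


def inner(items):
    """Hypothetical stand-in for parse_goods_in_page."""
    for item in items:
        ran.append(item)
        yield item.upper()


def parse_broken(items):
    inner(items)  # creates a generator and throws it away; the body never runs


def parse_with_yield_from(items):
    yield from inner(items)  # parse itself becomes a generator that delegates to inner


def parse_with_return(items):
    return inner(items)  # hands the generator object to the caller unchanged


broken_result = parse_broken(["a", "b"])
ran_after_broken = list(ran)  # still empty: inner's body was never executed

delegated = list(parse_with_yield_from(["a", "b"]))
returned = list(parse_with_return(["a", "b"]))
```

Both fixes produce the same items here. The difference is that with yield from, parse is itself a generator, while with return it is a normal function whose return value happens to be a generator.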
2. Watch out for the following code

    def parse_goods_in_page(self, page_items_selector_list):
        ...
        for t in create_time_text:
            if t.strip():
                item_create_time = t.strip()
                self.logger.info("create_time = " + item_create_time)
                smzdm_item['item_create_time'] = item_create_time
        # Pinduoduo's create_time is different from the others
        self.logger.info("item_create_time = " + str(item_create_time))
        if not item_create_time:
            create_time_text = item_selector.css(".wechat-time::text").get()
            self.logger.info("pdd create_time_text = " + str(create_time_text))
            item_create_time = create_time_text.strip()
            self.logger.info("create_time = " + item_create_time)
        ........
        s = s + 1
        yield smzdm_item

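During development I noticed that when item_create_time was not assigned for the current item (the for t in create_time_text loop never assigned it), reaching if not item_create_time raised no error, even though a variable that was never assigned should not exist. Printing it showed that it really did hold a value. The method is a generator and the yield sits inside the loop over items: its local variables live in the generator's frame, which is preserved between next() calls, so after the first next() the variable still holds the value from the previous item. To avoid this, initialize item_create_time (for example to None) at the start of each iteration, before the for t in create_time_text loop.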
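The stale-value effect and the fix can be sketched as follows. Everything here is hypothetical: each row is a plain list of candidate strings standing in for the selector results, and "fallback" stands in for the second lookup (the .wechat-time branch).

```python
def extract_times_buggy(rows):
    """Yields one creation time per row, without resetting the local variable."""
    for row in rows:
        for t in row:
            if t.strip():
                created = t.strip()
        # BUG: if no candidate matched, `created` still holds the previous row's value
        # (on the very first row it would be unbound and raise UnboundLocalError)
        yield created


def extract_times_fixed(rows):
    """Same loop, but the variable is reset at the start of every iteration."""
    for row in rows:
        created = None  # explicit initialization per item
        for t in row:
            if t.strip():
                created = t.strip()
        if not created:
            created = "fallback"  # placeholder for the alternative lookup
        yield created


rows = [["10:30"], ["  "], ["11:00"]]
buggy = list(extract_times_buggy(rows))
fixed = list(extract_times_fixed(rows))
```

In the buggy version the second row silently reuses "10:30" from the first row, while the fixed version detects the missing value and takes the fallback path.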