Scrapy Learning 10 - Request & Response Objects


    Request URL flow

    Scrapy uses Request and Response objects for crawling web sites.

    Typically, Request objects are generated in spiders and passed through the system until they reach the downloader, which executes the request and returns a Response object that travels back to the spider that issued the request.
    Both the Request and Response classes have subclasses which add functionality not required in the base classes.
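    The flow above can be sketched as a toy model in plain Python. This is not Scrapy's actual engine; the `FakeDownloader`-style function and the page data are invented for illustration only:

```python
from dataclasses import dataclass


@dataclass
class Request:
    url: str
    callback: callable = None


@dataclass
class Response:
    url: str
    body: bytes = b""
    request: Request = None


def parse(response):
    # the callback receives the response tied to the original request
    return "parsed %s (%d bytes)" % (response.url, len(response.body))


def spider_start():
    # the spider yields the initial request with its callback attached
    yield Request("http://example.com/", callback=parse)


def fake_downloader(request):
    # stands in for Scrapy's downloader: executes the request and
    # returns a Response bound back to it
    return Response(request.url, body=b"<html></html>", request=request)


def crawl():
    results = []
    for request in spider_start():
        response = fake_downloader(request)          # downloader executes the request
        results.append(request.callback(response))   # response flows back to the spider
    return results
```

    In real Scrapy the scheduler, middlewares, and engine sit between these steps, but the request-out / response-back shape of the loop is the same.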
     

    The Request object

    """
    This module implements the Request class which is used to represent HTTP
    requests in Scrapy.
    
    See documentation in docs/topics/request-response.rst
    """
    import six
    from w3lib.url import safe_url_string
    
    from scrapy.http.headers import Headers
    from scrapy.utils.python import to_bytes
    from scrapy.utils.trackref import object_ref
    from scrapy.utils.url import escape_ajax
    from scrapy.http.common import obsolete_setter
    
    
    class Request(object_ref):
    
        def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                     cookies=None, meta=None, encoding='utf-8', priority=0,
                     dont_filter=False, errback=None, flags=None):
    
            self._encoding = encoding  # this one has to be set first
            self.method = str(method).upper()
            self._set_url(url)
            self._set_body(body)
            assert isinstance(priority, int), "Request priority not an integer: %r" % priority
            self.priority = priority
    
            if callback is not None and not callable(callback):
                raise TypeError('callback must be a callable, got %s' % type(callback).__name__)
            if errback is not None and not callable(errback):
                raise TypeError('errback must be a callable, got %s' % type(errback).__name__)
            assert callback or not errback, "Cannot use errback without a callback"
            self.callback = callback
            self.errback = errback
    
            self.cookies = cookies or {}
            self.headers = Headers(headers or {}, encoding=encoding)
            self.dont_filter = dont_filter
    
            self._meta = dict(meta) if meta else None
            self.flags = [] if flags is None else list(flags)
    
        @property
        def meta(self):
            if self._meta is None:
                self._meta = {}
            return self._meta
    
        def _get_url(self):
            return self._url
    
        def _set_url(self, url):
            if not isinstance(url, six.string_types):
                raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
    
            s = safe_url_string(url, self.encoding)
            self._url = escape_ajax(s)
    
            if ':' not in self._url:
                raise ValueError('Missing scheme in request url: %s' % self._url)
    
        url = property(_get_url, obsolete_setter(_set_url, 'url'))
    
        def _get_body(self):
            return self._body
    
        def _set_body(self, body):
            if body is None:
                self._body = b''
            else:
                self._body = to_bytes(body, self.encoding)
    
        body = property(_get_body, obsolete_setter(_set_body, 'body'))
    
        @property
        def encoding(self):
            return self._encoding
    
        def __str__(self):
            return "<%s %s>" % (self.method, self.url)
    
        __repr__ = __str__
    
        def copy(self):
            """Return a copy of this Request"""
            return self.replace()
    
        def replace(self, *args, **kwargs):
            """Create a new Request with the same attributes except for those
            given new values.
            """
            for x in ['url', 'method', 'headers', 'body', 'cookies', 'meta',
                      'encoding', 'priority', 'dont_filter', 'callback', 'errback']:
                kwargs.setdefault(x, getattr(self, x))
            cls = kwargs.pop('cls', self.__class__)
            return cls(*args, **kwargs)
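    The `replace()` method above copies every known attribute into `kwargs` unless the caller already supplied an override, then builds a fresh instance. A minimal standalone sketch of the same idiom (a toy `Req` class, not Scrapy's actual `Request`):

```python
class Req:
    def __init__(self, url, method="GET", priority=0):
        self.url = url
        self.method = method
        self.priority = priority

    def replace(self, **kwargs):
        # same idiom as Scrapy's Request.replace(): fill in every attribute
        # from self unless the caller overrides it, then build a new object
        for attr in ("url", "method", "priority"):
            kwargs.setdefault(attr, getattr(self, attr))
        return self.__class__(**kwargs)


original = Req("http://example.com/", priority=5)
clone = original.replace(method="POST")  # only method changes; url and priority carry over
```

    This is also why `copy()` is just `self.replace()` with no arguments: every attribute falls through `setdefault` unchanged.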

    Key parameters explained

    url (string) – the URL of this request
    
    callback (callable) – the function that will be called with the response of this request (once it's downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn't specify a callback, the spider's parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
    
    method (string) – the HTTP method of this request. Defaults to 'GET'.
    
    meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
    
    body (str or unicode) – the request body. If a unicode is passed, then it’s encoded to str using the encoding passed (which defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored will be a str (never unicode or None).
    
    headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). If None is passed as value, the HTTP header will not be sent at all.
    
    cookies (dict or list) – the request cookies. These can be sent in two forms.

    1. Using a dict:

        request_with_cookies = Request(url="http://www.example.com",
                                       cookies={'currency': 'USD', 'country': 'UY'})

    2. Using a list of dicts:

        request_with_cookies = Request(url="http://www.example.com",
                                       cookies=[{'name': 'currency',
                                                 'value': 'USD',
                                                 'domain': 'example.com',
                                                 'path': '/currency'}])
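    The two forms carry the same information at different levels of detail; only the list-of-dicts form allows per-cookie `domain` and `path`. A sketch of how the plain-dict form maps onto the richer list form (the helper name `normalize_cookies` is invented here for illustration, not a Scrapy API):

```python
def normalize_cookies(cookies):
    """Convert the plain-dict cookie form into the list-of-dicts form.

    Illustrative helper only; not part of Scrapy's public API.
    """
    if isinstance(cookies, dict):
        # dict form: each key/value pair becomes a cookie with only
        # name and value set (no per-cookie domain or path)
        return [{"name": name, "value": value} for name, value in cookies.items()]
    # list form: already a list of per-cookie dicts, pass through
    return list(cookies)
```

    Under this mapping, `{'currency': 'USD'}` is equivalent to `[{'name': 'currency', 'value': 'USD'}]`.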

    The Response object

    """
    This module implements the Response class which is used to represent HTTP
    responses in Scrapy.
    
    See documentation in docs/topics/request-response.rst
    """
    from six.moves.urllib.parse import urljoin
    
    from scrapy.http.request import Request
    from scrapy.http.headers import Headers
    from scrapy.link import Link
    from scrapy.utils.trackref import object_ref
    from scrapy.http.common import obsolete_setter
    from scrapy.exceptions import NotSupported
    
    
    class Response(object_ref):
    
        def __init__(self, url, status=200, headers=None, body=b'', flags=None, request=None):
            self.headers = Headers(headers or {})
            self.status = int(status)
            self._set_body(body)
            self._set_url(url)
            self.request = request
            self.flags = [] if flags is None else list(flags)
    
        @property
        def meta(self):
            try:
                return self.request.meta
            except AttributeError:
                raise AttributeError(
                    "Response.meta not available, this response "
                    "is not tied to any request"
                )
    
        def _get_url(self):
            return self._url
    
        def _set_url(self, url):
            if isinstance(url, str):
                self._url = url
            else:
                raise TypeError('%s url must be str, got %s:' % (type(self).__name__,
                    type(url).__name__))
    
        url = property(_get_url, obsolete_setter(_set_url, 'url'))
    
        def _get_body(self):
            return self._body
    
        def _set_body(self, body):
            if body is None:
                self._body = b''
            elif not isinstance(body, bytes):
                raise TypeError(
                    "Response body must be bytes. "
                    "If you want to pass unicode body use TextResponse "
                    "or HtmlResponse.")
            else:
                self._body = body
    
        body = property(_get_body, obsolete_setter(_set_body, 'body'))
    
        def __str__(self):
            return "<%d %s>" % (self.status, self.url)
    
        __repr__ = __str__
    
        def copy(self):
            """Return a copy of this Response"""
            return self.replace()
    
        def replace(self, *args, **kwargs):
            """Create a new Response with the same attributes except for those
            given new values.
            """
            for x in ['url', 'status', 'headers', 'body', 'request', 'flags']:
                kwargs.setdefault(x, getattr(self, x))
            cls = kwargs.pop('cls', self.__class__)
            return cls(*args, **kwargs)
    
        def urljoin(self, url):
            """Join this Response's url with a possible relative url to form an
            absolute interpretation of the latter."""
            return urljoin(self.url, url)
    
        @property
        def text(self):
            """For subclasses of TextResponse, this will return the body
            as text (unicode object in Python 2 and str in Python 3)
            """
            raise AttributeError("Response content isn't text")
    
        def css(self, *a, **kw):
            """Shortcut method implemented only by responses whose content
            is text (subclasses of TextResponse).
            """
            raise NotSupported("Response content isn't text")
    
        def xpath(self, *a, **kw):
            """Shortcut method implemented only by responses whose content
            is text (subclasses of TextResponse).
            """
            raise NotSupported("Response content isn't text")
    
        def follow(self, url, callback=None, method='GET', headers=None, body=None,
                   cookies=None, meta=None, encoding='utf-8', priority=0,
                   dont_filter=False, errback=None):
            # type: (...) -> Request
            """
            Return a :class:`~.Request` instance to follow a link ``url``.
            It accepts the same arguments as ``Request.__init__`` method,
            but ``url`` can be a relative URL or a ``scrapy.link.Link`` object,
            not only an absolute URL.
            
            :class:`~.TextResponse` provides a :meth:`~.TextResponse.follow` 
            method which supports selectors in addition to absolute/relative URLs
            and Link objects.
            """
            if isinstance(url, Link):
                url = url.url
            url = self.urljoin(url)
            return Request(url, callback,
                           method=method,
                           headers=headers,
                           body=body,
                           cookies=cookies,
                           meta=meta,
                           encoding=encoding,
                           priority=priority,
                           dont_filter=dont_filter,
                           errback=errback)
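    `Response.urljoin` and `Response.follow` delegate relative-URL resolution to the standard library's `urljoin` (imported above via `six.moves` for Python 2/3 compatibility), using the response's own URL as the base. The behavior can be checked with the stdlib directly:

```python
from urllib.parse import urljoin  # six.moves.urllib.parse wraps this same function

base = "http://www.example.com/some/page.html"

# relative paths are resolved against the directory of the response URL
assert urljoin(base, "other.html") == "http://www.example.com/some/other.html"
# root-relative paths replace the whole path
assert urljoin(base, "/top.html") == "http://www.example.com/top.html"
# absolute URLs are returned unchanged
assert urljoin(base, "http://other.example/") == "http://other.example/"
```

    This is why `response.follow("next-page.html")` works inside a callback: the relative link is joined against the page it was scraped from before the new Request is built.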

    See the official documentation: https://doc.scrapy.org

Original article: https://www.cnblogs.com/cq146637/p/9069454.html