• gocrawl analysis


    1. gocrawl class structure

     

        // The crawler itself, the master of the whole process
        type Crawler struct {
            Options *Options

            // Internal fields
            logFunc         func(LogFlags, string, ...interface{})
            push            chan *workerResponse
            enqueue         chan interface{}
            stop            chan struct{}
            wg              *sync.WaitGroup
            pushPopRefCount int
            visits          int

            // keep lookups in maps, O(1) access time vs O(n) for slice. The empty struct value
            // is of no use, but this is the smallest type possible - it uses no memory at all.
            visited map[string]struct{}
            hosts   map[string]struct{}
            workers map[string]*worker
        }

        // The Options available to control and customize the crawling process.
        type Options struct {
            UserAgent             string
            RobotUserAgent        string
            MaxVisits             int
            EnqueueChanBuffer     int
            HostBufferFactor      int
            CrawlDelay            time.Duration // Applied per host
            WorkerIdleTTL         time.Duration
            SameHostOnly          bool
            HeadBeforeGet         bool
            URLNormalizationFlags purell.NormalizationFlags
            LogFlags              LogFlags
            Extender              Extender
        }

        // Extension methods required to provide an extender instance.
        type Extender interface {
            // Start, End, Error and Log are not related to a specific URL, so they don't
            // receive a URLContext struct.
            Start(interface{}) interface{}
            End(error)
            Error(*CrawlError)
            Log(LogFlags, LogFlags, string)

            // ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
            // is related to a URLContext (holds a ctx field).
            ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

            // All other extender methods are executed in the context of an URL, and thus
            // receive an URLContext struct as first argument.
            Fetch(*URLContext, string, bool) (*http.Response, error)
            RequestGet(*URLContext, *http.Response) bool
            RequestRobots(*URLContext, string) ([]byte, bool)
            FetchedRobots(*URLContext, *http.Response)
            Filter(*URLContext, bool) bool
            Enqueued(*URLContext)
            Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
            Visited(*URLContext, interface{})
            Disallowed(*URLContext)
        }
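
    The comment on the visited and hosts fields of Crawler describes the usual Go idiom for a set: a map whose values are empty structs. Below is a minimal, standalone sketch of that idiom. The urlSet type and its Add and Has methods are hypothetical names made up for this illustration and are not part of gocrawl. Note that only the value is zero-sized; the map still stores the keys and has its own overhead.

```go
package main

import (
	"fmt"
	"unsafe"
)

// urlSet is a set of strings backed by a map with empty-struct values,
// the same shape as the visited and hosts fields of Crawler.
// (Hypothetical helper type, not part of gocrawl.)
type urlSet map[string]struct{}

// Add records u and reports whether it was newly added.
func (s urlSet) Add(u string) bool {
	if _, ok := s[u]; ok {
		return false
	}
	s[u] = struct{}{}
	return true
}

// Has reports whether u is already in the set.
func (s urlSet) Has(u string) bool {
	_, ok := s[u]
	return ok
}

func main() {
	visited := urlSet{}
	fmt.Println(visited.Add("http://0value.com"))   // true: first time seen
	fmt.Println(visited.Add("http://0value.com"))   // false: duplicate
	fmt.Println(visited.Has("http://example.com"))  // false: never added
	fmt.Println(unsafe.Sizeof(struct{}{}))          // 0: the value takes no space
}
```

    Membership checks and inserts are both single map operations, which is why the crawler can cheaply skip URLs it has already seen.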

    entry point:

        func main() {
            ext := &Ext{&gocrawl.DefaultExtender{}}
            // Set custom options
            opts := gocrawl.NewOptions(ext)
            opts.CrawlDelay = 1 * time.Second
            opts.LogFlags = gocrawl.LogError
            opts.SameHostOnly = false
            opts.MaxVisits = 10

            c := gocrawl.NewCrawlerWithOptions(opts)
            c.Run("http://0value.com")
        }

    main does three things:

    1) get an Extender

    2) create Options with the given Extender

    3) create the Crawler from those Options and call Run

    As the comment on Crawler says, the Crawler controls the whole process; Options supplies the configuration, and the Extender does the real work.
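
    The line ext := &Ext{&gocrawl.DefaultExtender{}} relies on struct embedding: Ext inherits every default hook and overrides only the ones it cares about. Here is a standalone sketch of that pattern. The two-method Extender interface, DefaultExtender, and the crawl function below are simplified stand-ins written for this illustration; they are not gocrawl's real types or signatures.

```go
package main

import "fmt"

// Extender is a hypothetical two-method stand-in for gocrawl's much
// larger Extender interface; the signatures are simplified.
type Extender interface {
	Filter(url string) bool
	Visit(url string) string
}

// DefaultExtender supplies a default behaviour for every hook.
type DefaultExtender struct{}

func (DefaultExtender) Filter(url string) bool  { return true }
func (DefaultExtender) Visit(url string) string { return "default visit: " + url }

// Ext embeds the default and overrides only Visit; Filter is promoted
// from the embedded *DefaultExtender.
type Ext struct {
	*DefaultExtender
}

func (e *Ext) Visit(url string) string { return "custom visit: " + url }

// crawl plays the role of the Crawler: it drives the loop and calls
// the extender's hooks, which do the actual work.
func crawl(ext Extender, urls []string) []string {
	var out []string
	for _, u := range urls {
		if ext.Filter(u) {
			out = append(out, ext.Visit(u))
		}
	}
	return out
}

func main() {
	ext := &Ext{&DefaultExtender{}}
	for _, line := range crawl(ext, []string{"http://0value.com"}) {
		fmt.Println(line) // custom visit: http://0value.com
	}
}
```

    The controlling code only sees the interface, so swapping in a different extender changes what happens at each URL without touching the crawl loop.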

    2. Other key structs

    worker, workerResponse and sync.WaitGroup

        // Communication from worker to the master crawler, about the crawling of a URL
        type workerResponse struct {
            ctx           *URLContext
            visited       bool
            harvestedURLs interface{}
            host          string
            idleDeath     bool
        }

        // The worker is dedicated to fetching and visiting a given host, respecting
        // this host's robots.txt crawling policies.
        type worker struct {
            // Worker identification
            host  string
            index int

            // Communication channels and sync
            push    chan<- *workerResponse
            pop     popChannel
            stop    chan struct{}
            enqueue chan<- interface{}
            wg      *sync.WaitGroup

            // Robots validation
            robotsGroup *robotstxt.Group

            // Logging
            logFunc func(LogFlags, string, ...interface{})

            // Implementation fields
            wait           <-chan time.Time
            lastFetch      *FetchInfo
            lastCrawlDelay time.Duration
            opts           *Options
        }

    For more on sync.WaitGroup, see http://mindfsck.net/example-golang-makes-concurrent-programming-easy-awesome/ and http://soniacodes.wordpress.com/2011/02/28/channels-vs-sync-package/
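
    The worker and Crawler structs share a push channel and a *sync.WaitGroup. The sketch below shows the general shape of that arrangement: one goroutine per host reports to a master over a channel, and a WaitGroup tells the master when all of them have finished. It is an illustration of the pattern only, not gocrawl's actual control flow; the response type and runWorkers function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// response is a hypothetical, trimmed-down analogue of workerResponse.
type response struct {
	host    string
	visited bool
}

// runWorkers starts one goroutine per host. Each goroutine reports on
// the push channel; the sorted list of reporting hosts is returned.
func runWorkers(hosts []string) []string {
	push := make(chan response)
	var wg sync.WaitGroup

	for _, h := range hosts {
		wg.Add(1) // register the worker before its goroutine starts
		go func(host string) {
			defer wg.Done()
			push <- response{host: host, visited: true}
		}(h)
	}

	// Close push once every worker is done, so the range loop below ends.
	go func() {
		wg.Wait()
		close(push)
	}()

	var done []string
	for r := range push {
		if r.visited {
			done = append(done, r.host)
		}
	}
	sort.Strings(done) // goroutine order is not deterministic
	return done
}

func main() {
	fmt.Println(runWorkers([]string{"b.example", "a.example"})) // [a.example b.example]
}
```

    The channel carries the data and the WaitGroup carries the "everyone is finished" signal; keeping the two roles separate lets the master drain results without knowing in advance how many will arrive.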

    3. I will post a full walkthrough of gocrawl's workflow in a few days. (6/20/2014)

  • Original post: https://www.cnblogs.com/harrysun/p/3798438.html