• gocrawl Analysis


    1. gocrawl type structure

     

    // The crawler itself, the master of the whole process
    type Crawler struct {
        Options *Options

        // Internal fields
        logFunc         func(LogFlags, string, ...interface{})
        push            chan *workerResponse
        enqueue         chan interface{}
        stop            chan struct{}
        wg              *sync.WaitGroup
        pushPopRefCount int
        visits          int

        // keep lookups in maps, O(1) access time vs O(n) for slice. The empty struct value
        // is of no use, but this is the smallest type possible - it uses no memory at all.
        visited map[string]struct{}
        hosts   map[string]struct{}
        workers map[string]*worker
    }
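
    The comment above points at a common Go idiom: a map whose values are empty structs acts as a set with no per-value memory cost. A tiny standalone illustration (the URL key is arbitrary):

    package main

    import "fmt"

    func main() {
        // struct{} occupies zero bytes, so the map stores only its keys,
        // giving an O(1)-lookup set (vs O(n) for scanning a slice).
        visited := make(map[string]struct{})
        visited["http://0value.com/"] = struct{}{}

        if _, ok := visited["http://0value.com/"]; ok {
            fmt.Println("already visited")
        }
    }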
    // The Options available to control and customize the crawling process.
    type Options struct {
        UserAgent             string
        RobotUserAgent        string
        MaxVisits             int
        EnqueueChanBuffer     int
        HostBufferFactor      int
        CrawlDelay            time.Duration // Applied per host
        WorkerIdleTTL         time.Duration
        SameHostOnly          bool
        HeadBeforeGet         bool
        URLNormalizationFlags purell.NormalizationFlags
        LogFlags              LogFlags
        Extender              Extender
    }

    // Extension methods required to provide an extender instance.
    type Extender interface {
        // Start, End, Error and Log are not related to a specific URL, so they don't
        // receive a URLContext struct.
        Start(interface{}) interface{}
        End(error)
        Error(*CrawlError)
        Log(LogFlags, LogFlags, string)

        // ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
        // is related to a URLContext (holds a ctx field).
        ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

        // All other extender methods are executed in the context of an URL, and thus
        // receive an URLContext struct as first argument.
        Fetch(*URLContext, string, bool) (*http.Response, error)
        RequestGet(*URLContext, *http.Response) bool
        RequestRobots(*URLContext, string) ([]byte, bool)
        FetchedRobots(*URLContext, *http.Response)
        Filter(*URLContext, bool) bool
        Enqueued(*URLContext)
        Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
        Visited(*URLContext, interface{})
        Disallowed(*URLContext)
    }
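
    The entry point below uses an Ext type that the post never defines. A minimal sketch, following the embedding pattern from gocrawl's own examples: embed *gocrawl.DefaultExtender so every Extender method gets a default implementation, and override only what you need (here, Visit):

    // Ext embeds *gocrawl.DefaultExtender, which provides default
    // implementations of all Extender methods; only Visit is overridden.
    type Ext struct {
        *gocrawl.DefaultExtender
    }

    func (e *Ext) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
        // Inspect the fetched page here, via res.Body or the goquery document.
        // Returning true tells gocrawl to harvest the links found on this page.
        return nil, true
    }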

    entry point:

    func main() {
        ext := &Ext{&gocrawl.DefaultExtender{}}
        // Set custom options
        opts := gocrawl.NewOptions(ext)
        opts.CrawlDelay = 1 * time.Second
        opts.LogFlags = gocrawl.LogError
        opts.SameHostOnly = false
        opts.MaxVisits = 10

        c := gocrawl.NewCrawlerWithOptions(opts)
        c.Run("http://0value.com")
    }
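
    Note that Run blocks until the crawl finishes; with the options above, that means after 10 pages have been visited (MaxVisits) or when there are no URLs left to process.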

    Three steps in main:

    1) get an Extender

    2) create Options with the given Extender

    3) create the gocrawl Crawler and run it

    As the comments say, the Crawler controls the whole process, Options supplies the configuration, and the Extender does the real work.

    2. Other key structs

    worker, workerResponse and sync.WaitGroup

    // Communication from worker to the master crawler, about the crawling of a URL
    type workerResponse struct {
        ctx           *URLContext
        visited       bool
        harvestedURLs interface{}
        host          string
        idleDeath     bool
    }
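
    A simplified sketch of this hand-off, not gocrawl's actual code: each worker reports the outcome of a visit on the shared push channel, and the crawler's central loop consumes the reports. The function name pushPopSketch is hypothetical, and the snippet assumes it lives inside the gocrawl package so the unexported workerResponse type is visible.

    func pushPopSketch() {
        push := make(chan *workerResponse)

        // A worker goroutine reports that it finished visiting a URL on its host.
        go func() {
            push <- &workerResponse{host: "0value.com", visited: true}
        }()

        // The crawler's collect loop receives the report and updates its state.
        res := <-push
        if res.visited {
            // increment the visit counter, check MaxVisits, enqueue harvested URLs...
        }
    }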

    // The worker is dedicated to fetching and visiting a given host, respecting
    // this host's robots.txt crawling policies.
    type worker struct {
        // Worker identification
        host  string
        index int

        // Communication channels and sync
        push    chan<- *workerResponse
        pop     popChannel
        stop    chan struct{}
        enqueue chan<- interface{}
        wg      *sync.WaitGroup

        // Robots validation
        robotsGroup *robotstxt.Group

        // Logging
        logFunc func(LogFlags, string, ...interface{})

        // Implementation fields
        wait           <-chan time.Time
        lastFetch      *FetchInfo
        lastCrawlDelay time.Duration
        opts           *Options
    }
    For more on sync.WaitGroup, see http://mindfsck.net/example-golang-makes-concurrent-programming-easy-awesome/ and http://soniacodes.wordpress.com/2011/02/28/channels-vs-sync-package/, or the short sketch below.
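
    A minimal, runnable WaitGroup example (my own illustration, not gocrawl code). The crawler and its workers share the wg field shown in the structs above, following the same pattern: each goroutine registers with Add, marks itself finished with Done, and the master blocks in Wait.

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 3; i++ {
            wg.Add(1) // register one goroutine before starting it
            go func(id int) {
                defer wg.Done() // signal completion on exit
                fmt.Println("worker", id, "done")
            }(i)
        }
        wg.Wait() // block until every registered goroutine has called Done
    }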

    3. I will give a complete workflow of gocrawl in a few days. (6/20/2014)

  • Original post: https://www.cnblogs.com/harrysun/p/3798438.html