1. gocrawl core structs
// The crawler itself, the master of the whole process
type Crawler struct {
	Options *Options

	// Internal fields
	logFunc         func(LogFlags, string, ...interface{})
	push            chan *workerResponse
	enqueue         chan interface{}
	stop            chan struct{}
	wg              *sync.WaitGroup
	pushPopRefCount int
	visits          int

	// keep lookups in maps, O(1) access time vs O(n) for slice. The empty struct value
	// is of no use, but this is the smallest type possible - it uses no memory at all.
	visited map[string]struct{}
	hosts   map[string]struct{}
	workers map[string]*worker
}
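As an aside, the visited and hosts maps use the standard Go set idiom that the comment describes. A tiny standalone example (not gocrawl code) of how such a set behaves:

package main

import "fmt"

func main() {
	// visited-style set: only the keys matter, the zero-size struct{} value does not
	seen := make(map[string]struct{})
	seen["http://example.com/"] = struct{}{}
	if _, ok := seen["http://example.com/"]; ok {
		fmt.Println("already visited") // O(1) lookup, vs O(n) scanning a slice
	}
}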
// The Options available to control and customize the crawling process.
type Options struct {
	UserAgent             string
	RobotUserAgent        string
	MaxVisits             int
	EnqueueChanBuffer     int
	HostBufferFactor      int
	CrawlDelay            time.Duration // Applied per host
	WorkerIdleTTL         time.Duration
	SameHostOnly          bool
	HeadBeforeGet         bool
	URLNormalizationFlags purell.NormalizationFlags
	LogFlags              LogFlags
	Extender              Extender
}
// Extension methods required to provide an extender instance.
type Extender interface {
	// Start, End, Error and Log are not related to a specific URL, so they don't
	// receive a URLContext struct.
	Start(interface{}) interface{}
	End(error)
	Error(*CrawlError)
	Log(LogFlags, LogFlags, string)

	// ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
	// is related to a URLContext (holds a ctx field).
	ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

	// All other extender methods are executed in the context of an URL, and thus
	// receive an URLContext struct as first argument.
	Fetch(*URLContext, string, bool) (*http.Response, error)
	RequestGet(*URLContext, *http.Response) bool
	RequestRobots(*URLContext, string) ([]byte, bool)
	FetchedRobots(*URLContext, *http.Response)
	Filter(*URLContext, bool) bool
	Enqueued(*URLContext)
	Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
	Visited(*URLContext, interface{})
	Disallowed(*URLContext)
}
entry point:
package main

import (
	"time"

	"github.com/PuerkitoBio/gocrawl"
)

// Ext embeds *gocrawl.DefaultExtender to inherit the default behavior
// of every Extender method.
type Ext struct {
	*gocrawl.DefaultExtender
}

func main() {
	ext := &Ext{&gocrawl.DefaultExtender{}}
	// Set custom options
	opts := gocrawl.NewOptions(ext)
	opts.CrawlDelay = 1 * time.Second
	opts.LogFlags = gocrawl.LogError
	opts.SameHostOnly = false
	opts.MaxVisits = 10

	c := gocrawl.NewCrawlerWithOptions(opts)
	c.Run("http://0value.com")
}
Three steps happen in main:
1) get an Extender
2) create Options with the given Extender
3) create the gocrawl Crawler
As the comments note, the Crawler controls the whole process, Options supplies the configuration, and the Extender does the real work.
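Since the Extender does the real work, customizing a crawl usually means overriding one of its methods. A minimal sketch (assuming the Ext type from the entry point above, plus fmt, net/http, goquery, and gocrawl imports) that overrides Visit to print each visited page:

// Visit overrides DefaultExtender.Visit for our Ext type.
func (e *Ext) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
	// ctx.URL() is the normalized URL of the visited page
	fmt.Printf("visited: %s\n", ctx.URL())
	// Return nil and true: let gocrawl harvest the links on this page itself
	return nil, true
}

Because Ext embeds *gocrawl.DefaultExtender, every Extender method we don't override keeps its default behavior.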
2. Other key structs
worker, workerResponse, and sync.WaitGroup
// Communication from worker to the master crawler, about the crawling of a URL
type workerResponse struct {
	ctx           *URLContext
	visited       bool
	harvestedURLs interface{}
	host          string
	idleDeath     bool
}
// The worker is dedicated to fetching and visiting a given host, respecting
// this host's robots.txt crawling policies.
type worker struct {
	// Worker identification
	host  string
	index int

	// Communication channels and sync
	push    chan<- *workerResponse
	pop     popChannel
	stop    chan struct{}
	enqueue chan<- interface{}
	wg      *sync.WaitGroup

	// Robots validation
	robotsGroup *robotstxt.Group

	// Logging
	logFunc func(LogFlags, string, ...interface{})

	// Implementation fields
	wait           <-chan time.Time
	lastFetch      *FetchInfo
	lastCrawlDelay time.Duration
	opts           *Options
}
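To make the push/pop channel fields concrete, here is a distilled, hypothetical sketch (not gocrawl's actual code) of the topology they imply: each worker pops work from its own channel and pushes a response back to the single channel the master reads.

package main

import "fmt"

// response stands in for gocrawl's workerResponse (illustrative only).
type response struct {
	host    string
	visited bool
}

// worker mimics the shape of gocrawl's per-host worker: it pops URLs
// from its own channel and pushes results to the shared master channel.
func worker(host string, pop <-chan string, push chan<- response) {
	for range pop {
		// fetching and visiting would happen here
		push <- response{host: host, visited: true}
	}
}

func main() {
	push := make(chan response) // shared: all workers -> master
	pop := make(chan string, 1) // per-host work queue
	go worker("example.com", pop, push)
	pop <- "http://example.com/"
	fmt.Println(<-push) // the master collects the response
}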
For more information about sync.WaitGroup, see http://mindfsck.net/example-golang-makes-concurrent-programming-easy-awesome/ and http://soniacodes.wordpress.com/2011/02/28/channels-vs-sync-package/
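As a quick illustration of how the wg fields above are used, a standalone example of the sync.WaitGroup pattern: the master adds one count per worker it spawns and waits until every worker has called Done.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1) // register the worker before it starts
		go func(id int) {
			defer wg.Done() // signal exit, e.g. on idle death
			fmt.Println("worker", id, "done")
		}(i)
	}
	wg.Wait() // the master blocks here until all workers have exited
}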
3. I will walk through the whole gocrawl workflow in a few days. (6/20/2014)