• PuppeteerSharp + AngleSharp crawler in practice: scraping Autohome data


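    This post is based on the DotNetSpider sample.
    DotNetSpider felt too heavyweight for this job; it is a fairly complete crawler framework.
    After comparing the headless browsers listed below, I settled on PuppeteerSharp + AngleSharp for a sample crawler.
    Like that sample, it uses the Autohome page https://store.mall.autohome.com.cn/83106681.html as the scraping target.
    PuppeteerSharp fetches the final page (that is, the page after its JavaScript has run), and AngleSharp parses the resulting HTML document.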

    Headless Browsers

    A list of (almost) all headless web browsers in existence

    A headless browser is a web browser without a graphical user interface that is controlled programmatically. It is used for automation, testing, and other purposes.

    Browser engines

    These browser engines fully render web pages or run JavaScript in a virtual DOM

    Name | About | Supported Languages | License
    Chromium Embedded Framework | CEF is an open source project based on the Google Chromium project. | JavaScript | BSD
    Erik | Headless browser on top of Kanna and WebKit. | Swift | MIT
    jBrowserDriver | A Selenium-compatible headless browser which is written in pure Java. WebKit-based. Works with any of the Selenium Server bindings. | Java | Apache License v2.0
    PhantomJS | [Unmaintained] PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG. | JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP, R (via Selenium) | BSD 3-Clause
    Splash | Splash is a JavaScript rendering service with an HTTP API. It's a lightweight browser with an HTTP API, implemented in Python using Twisted and Qt. | Any | BSD 3-Clause

    Multi drivers

    These libraries can control multiple browser engines (typically using Selenium)

    Name | About | Supported Languages | License
    CasperJS | CasperJS is an open source navigation scripting & testing utility written in JavaScript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). | JavaScript | MIT
    Geb | Geb is a Groovy interface to WebDriver. | Groovy | Apache
    Selenium | Selenium is a suite of tools to automate web browsers across many platforms. | JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP, R | Apache
    Splinter | Splinter is an open source tool for testing web applications using Python. It lets you automate browser actions, such as visiting URLs and interacting with their items. | Python | -
    SST | SST (selenium-simple-test) is a web test framework that uses Python to generate functional browser-based tests. | Python | -
    Watir | The most elegant way to use Selenium WebDriver with Ruby. | Ruby | MIT

    PhantomJS drivers

    These libraries control PhantomJS

    Name | About | Supported Languages | License
    Ghostbuster | Automated browser testing via phantom.js, with all of the pain taken out! That means you get a real browser, with a real DOM, and can do real testing! | JavaScript | Not specified
    jedi-crawler | Lightsabing Node/PhantomJS crawler; scrape dynamic content without the hassle | JavaScript | Not specified
    Lotte | Lotte is a headless, automated testing framework built on top of PhantomJS and inspired by Ghostbuster. | JavaScript | MIT
    phantompy | Phantompy is a headless WebKit engine with a powerful Pythonic API built on top of Qt5 WebKit | Python | LGPL-2.1
    X-RAY | Supports strings, arrays, arrays of objects, nested object structures, selector API, pagination, crawler, concurrency, throttles, delays, timeouts, and pluggable drivers (PhantomJS, HTTP) | JavaScript | MIT
    Horseman | Promise based Node.js module for PhantomJS. Features chainable API, understandable control-flow, support for multiple tabs, and built-in jQuery. | JavaScript | MIT

    Chromium drivers

    These libraries control Chromium

    Name | About | Supported Languages | License
    Awesomium | Chromium-based headless browser engine | C++ | Free/Commercial
    Headless Chromium | Chromium feature activated with the --headless flag, currently available in the nightly build of Chromium, not yet released | C++ | Open source
    Puppeteer | Headless Chrome Node API from the Chrome DevTools team | JavaScript | Apache
    PuppeteerSharp | PuppeteerSharp is a port of the official Headless Chrome Node.js Puppeteer API | C# | MIT
    chrome-remote-interface | Chrome Debugging Protocol interface for Node.js | JavaScript | MIT
    Chromy | Features chainable API, mobile emulation, fundamental API such as JavaScript evaluation. | JavaScript | MIT
    chromedp | A faster, simpler way to drive browsers (Chrome, Edge, Safari, Android, etc) without external dependencies (ie, Selenium, PhantomJS, etc) using the Chrome Debugging Protocol. | Go | MIT
    Chromeless | Chrome automation made simple. Runs locally or headless on AWS Lambda. | JavaScript | MIT

    Webkit drivers

    These drivers control an in-process instance of Webkit

    Name | About | Supported Languages | License
    Browserjet | Runs a custom build of WebKit, controlled by a Node.js interface. | JavaScript | Not specified
    ghost.py | ghost.py is a WebKit web client written in Python. | Python | MIT
    headless_browser | Headless browser based on WebKit written in C++. | C++ | Not specified
    Jabba-Webkit | Jabba's headless WebKit browser for scraping AJAX-powered webpages. | Python | Not specified
    Jasmine-Headless-Webkit | jasmine-headless-webkit uses the QtWebKit widget to run your specs without needing to render a pixel. | Python, JavaScript, Ruby | Free
    Python-Webkit | Python-Webkit is a Python extension to WebKit to add full, complete access to WebKit's DOM | Python | GNU
    Spynner | Programmatic web browsing module with AJAX support for Python | Python | Not specified
    Webloop | Scriptable, headless WebKit with a Go API. | Go | BSD 3-Clause
    wkhtmltopdf, wkhtmltox, wkhtmltoimage | Command line tool rendering HTML into PDF and other image formats. | shell, C | LGPLv3
    WKZombie | Functional headless browser (with JSON support) for iOS using WebKit and hpple/libxml2. | Swift | MIT

    Other drivers

    These libraries control lesser known browsers or OS-provided web libraries

    Name | About | Supported Languages | License
    Nightmare | Nightmare is a high-level browser automation library built as an easier alternative to PhantomJS. It runs on the Electron engine. | JavaScript | MIT
    grope | A RubyCocoa interface to the macOS WebKit Framework | RubyCocoa | MIT
    SlimerJS | SlimerJS is similar to PhantomJS, except that it runs Gecko, the browser engine of Mozilla Firefox, instead of WebKit (and it is not yet truly headless). | JavaScript | Mozilla 2.0
    SpecterJS | A scriptable headless Internet Explorer port of PhantomJS. | JavaScript | MIT
    trifleJS | A headless Internet Explorer browser using the WebBrowser Class with a JavaScript API running on the V8 engine. | JavaScript | MIT

    Fake Browser Engine

    These libraries are typically naive or HTML-only browsers

    Name | About | Supported Languages | License
    AngleSharp | HTML parsing library | C# | MIT
    Guillotine | A headless browser, written in C# | C# | LGPL-3.0
    benv | Stub a browser environment in Node.js and headlessly test your client-side code. | JavaScript | MIT
    browser.rb | Headless Ruby browser on top of Nokogiri and TheRubyRacer | Ruby | Not specified
    BrowserKit | BrowserKit simulates the behavior of a web browser. | PHP | MIT
    DamonJS | Bot navigating URLs and doing tasks. | JavaScript | Apache
    Headless | Headless browser support for fast web acceptance testing in | | MIT
    HeadlessBrowser | A very miniature headless browser, for testing the DOM on Node.js | JavaScript | Not specified
    HtmlUnit | HtmlUnit is a "GUI-Less browser for Java programs". | Java | Apache
    Jaunt | Java Web Scraping & Automation API | Java | Not specified
    JSDom | A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js. | JavaScript | MIT
    MechanicalSoup | A Python library for automating interaction with websites. | Python | MIT
    mechanize | Stateful programmatic web browsing. | Python | BSD 3-Clause, ZPL 2.1
    node-as-browser | Create a browser-like environment within Node.js | JavaScript | MIT
    RoboBrowser | A simple, Pythonic library for browsing the web without a standalone web browser. | Python | BSD 3-Clause
    SimpleBrowser | A flexible and intuitive web browser engine designed for automation tasks. Built on the .NET 4 framework. | C# | BSD 3-Clause
    stanislaw | Naive, mechanize-like HTML parser/form driver. | Python | Not specified
    twill | Twill is a simple language that interacts with basic HTML pages (no JavaScript support). | Python | MIT
    WeasyPrint | WeasyPrint is a visual rendering engine for HTML and CSS that can export to PDF. It aims to support web standards for printing. | Python | BSD 3-Clause
    WWW::Mechanize | Headless browser for Perl with many plugins and extensions, notably Test::WWW::Mechanize for testing | Perl | Perl 5
    X-RAY | Supports strings, arrays, arrays of objects, nested object structures, selector API, pagination, crawler, concurrency, throttles, delays, timeouts, and pluggable drivers (PhantomJS, HTTP) | JavaScript | MIT
    Xidel (Internet Tools) | An XQuery-based CLI web scraper for static X/HTML pages and JSON APIs. | FreePascal, XQuery | GPL-2
    Zombie.js | Zombie.js is a lightweight framework for testing client-side JavaScript code in a simulated environment. No browser required. | JavaScript | MIT

    Runs in a browser

    Name | About | Supported Languages | License
    DalekJS | [Unmaintained; TestCafé is recommended instead] Automated cross browser testing with JavaScript. | JavaScript | MIT
    TestCafé | Automated browser testing for the modern web development stack. | JavaScript | MIT
    Sahi | Sahi is a cross-browser automation/testing tool with the facility to record and playback scripts. | JavaScript, Java, Ruby, PHP | Apache / Commercial
    WatiN | Web Application Testing In .NET | C# | Apache 2.0

    Misc tools

    Name | About | Supported Languages | License
    browser-launcher | Detect and launch browser versions, headlessly or otherwise | JavaScript | MIT

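    In fact, if no data is loaded by JavaScript, AngleSharp alone is enough.
    Once data is loaded by JavaScript, though, a real headless browser component is needed.
    AngleSharp currently runs only simple JavaScript; anything slightly more complex fails. Full JavaScript support is reportedly planned, so stay tuned.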
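
    As a rough illustration of the static-page case, the sketch below parses an HTML string with AngleSharp alone, with no browser involved. `StaticHtmlDemo` and `ExtractTitles` are hypothetical names that do not appear in the original sample, and the selector is borrowed from the Autohome page structure used in this post.

    ```csharp
    using System.Collections.Generic;
    using System.Linq;
    using AngleSharp.Html.Parser;

    public static class StaticHtmlDemo
    {
        // Hypothetical helper (not part of the original sample): returns the text
        // of every "div.carbox-title" inside an "li.carbox" in an HTML string.
        // No JavaScript is executed; only the markup passed in is parsed.
        public static List<string> ExtractTitles(string html)
        {
            var parser = new HtmlParser();
            var document = parser.ParseDocument(html);

            return document
                .QuerySelectorAll("li.carbox div.carbox-title")
                .Select(element => element.TextContent.Trim())
                .ToList();
        }
    }
    ```

    Calling `StaticHtmlDemo.ExtractTitles(...)` on markup that already contains the `li.carbox` items should return their titles; on the live Autohome page the items are injected by JavaScript, which is why the sample falls back to a headless browser.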

    Code

    /*
     * A PuppeteerSharp + AngleSharp crawler console app sample.
     */
    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using AngleSharp;
    using AngleSharp.Dom;
    using AngleSharp.Html.Parser;
    using Newtonsoft.Json;
    using PuppeteerSharp;
    
    namespace CrawlerSamples
    {
        internal class Program
        {
            private const string Url = "https://store.mall.autohome.com.cn/83106681.html";
            private const int ChromiumRevision = BrowserFetcher.DefaultRevision;
    
            private static async Task Main(string[] args)
            {
                // Download the Chromium revision package (takes a while on the first run)
                await new BrowserFetcher().DownloadAsync(ChromiumRevision);
    
                // Fetch the rendered page and parse it with AngleSharp
                await TestAngleSharp();
    
                Console.ReadKey();
            }
    
            private static async Task TestAngleSharp()
            {
                /*
                 * Option 1 (disabled): load the HTML document with AngleSharp itself.
                 * Note: WithJavaScript requires the AngleSharp.Scripting.Javascript NuGet package.
                 * JavaScript support is experimental and does not handle complex scripts.
                 */
                //IConfiguration config = Configuration.Default.WithDefaultLoader().WithCss().WithCookies().WithJavaScript();
                //IBrowsingContext context = BrowsingContext.New(config);
                //IDocument document = await context.OpenAsync(Url);
    
                // Option 2: load the fully rendered HTML with PuppeteerSharp
                var htmlString = await TestPuppeteerSharp();
    
                // Parse the HTML string into a document
                var context = BrowsingContext.New(Configuration.Default);
                var parser = context.GetService<IHtmlParser>();
                var document = parser.ParseDocument(htmlString);
    
                // Select the list of carbox elements
                var carboxList = document.QuerySelectorAll("div.shop-content div.content div.list li.carbox");
    
                var carModelList = new List<CarModel>();
                foreach (var carbox in carboxList)
                {
                    //Parsing and converting to the car model object.
                    var model = CreateModelWithAngleSharp(carbox);
                    carModelList.Add(model);
    
                    // Print the model to the console as JSON
                    var jsonString = JsonConvert.SerializeObject(model);
                    Console.WriteLine(jsonString);
                    Console.WriteLine();
                }
    
                Console.WriteLine("Total count:" + carModelList.Count);
            }
    
            private static async Task<string> TestPuppeteerSharp()
            {
                // Launch Chromium in headless mode
                var launchOptions = new LaunchOptions { Headless = true };
                var browser = await Puppeteer.LaunchAsync(launchOptions);
    
                try
                {
                    // Open a new tab and navigate to the target URL
                    var page = await browser.NewPageAsync();
                    await page.GoToAsync(Url);
    
                    // Return the HTML content of the page as the browser currently holds it
                    return await page.GetContentAsync();
                }
                finally
                {
                    // Closing the browser also closes every open page,
                    // even when navigation or content retrieval throws.
                    await browser.CloseAsync();
                }
            }
    
            private static CarModel CreateModelWithAngleSharp(IParentNode node)
            {
                // QuerySelector returns null when nothing matches, so ?. leaves the
                // property null instead of throwing a NullReferenceException.
                var model = new CarModel
                {
                    Title = node.QuerySelector("a div.carbox-title")?.TextContent,
                    ImageUrl = node.QuerySelector("a div.carbox-carimg img")?.GetAttribute("src"),
                    ProductUrl = node.QuerySelector("a")?.GetAttribute("href"),
                    Tip = node.QuerySelector("a div.carbox-tip")?.TextContent,
                    OrdersNumber = node.QuerySelector("a div.carbox-number span")?.TextContent
                };
    
                return model;
            }
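        }
    
        // Not shown in the original post: a minimal model inferred from the
        // properties assigned in CreateModelWithAngleSharp (all plain strings).
        internal class CarModel
        {
            public string Title { get; set; }
            public string ImageUrl { get; set; }
            public string ProductUrl { get; set; }
            public string Tip { get; set; }
            public string OrdersNumber { get; set; }
        }
    }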
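
    Because the car list on this page is filled in by JavaScript, it may help to wait until the list items exist before reading the page content. The following is only a sketch of that idea, not the original author's code: `RenderedPageDemo`, `GetRenderedHtmlAsync`, and the `readySelector` parameter are hypothetical names, while `WaitForSelectorAsync` is the PuppeteerSharp call that waits for a matching element (it throws if the default timeout elapses first).

    ```csharp
    using System.Threading.Tasks;
    using PuppeteerSharp;

    public static class RenderedPageDemo
    {
        // Hypothetical variant of TestPuppeteerSharp (an assumption, not the
        // original method): waits for readySelector before grabbing the HTML.
        public static async Task<string> GetRenderedHtmlAsync(string url, string readySelector)
        {
            var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
            try
            {
                var page = await browser.NewPageAsync();
                await page.GoToAsync(url);

                // Wait until at least one element matching the selector is in the DOM.
                await page.WaitForSelectorAsync(readySelector);

                return await page.GetContentAsync();
            }
            finally
            {
                // Always shut the browser down, even if the wait times out.
                await browser.CloseAsync();
            }
        }
    }
    ```

    For the page used in this post, a call might look like `await RenderedPageDemo.GetRenderedHtmlAsync(Url, "li.carbox")`, reusing the `li.carbox` selector from the sample; Chromium must already be downloaded or installed for the launch to succeed.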
    

    Result

    Note

    Note that on the first run, this line:

    await new BrowserFetcher().DownloadAsync(ChromiumRevision);
    

    downloads a portable browser package, download-Win64-536395.zip, to your local machine; it unpacks to a Chromium browser. This can take a while.
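
    If a Chrome or Chromium build is already installed locally, one possible way to avoid the download is to point PuppeteerSharp at it through `LaunchOptions.ExecutablePath`. The sketch below assumes the same PuppeteerSharp version as the sample (where `DownloadAsync` takes `BrowserFetcher.DefaultRevision`); `LocalBrowserDemo`, `BuildLaunchOptionsAsync`, and the `localChromePath` parameter are hypothetical names.

    ```csharp
    using System.IO;
    using System.Threading.Tasks;
    using PuppeteerSharp;

    public static class LocalBrowserDemo
    {
        // Hypothetical helper (an assumption, not part of the original sample):
        // prefers an existing local browser and downloads Chromium only as a fallback.
        public static async Task<LaunchOptions> BuildLaunchOptionsAsync(string localChromePath)
        {
            var options = new LaunchOptions { Headless = true };

            if (!string.IsNullOrEmpty(localChromePath) && File.Exists(localChromePath))
            {
                // Use the browser that is already on disk; nothing is downloaded.
                options.ExecutablePath = localChromePath;
            }
            else
            {
                // Fall back to the download step used in the sample.
                await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
            }

            return options;
        }
    }
    ```

    The returned options can then be passed to `Puppeteer.LaunchAsync(options)`. Whether a given local Chrome version works with a given PuppeteerSharp release is not guaranteed, so the downloaded revision remains the safer default.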

    Source

    https://github.com/VAllens/CrawlerSamples

  • Original article: https://www.cnblogs.com/webenh/p/13359924.html