• Headless Chromium


    Print a web page to PDF: chromium --disable-gpu --headless --print-to-pdf https://www.bilibili.com

    The Beginner’s Guide to Chrome Headless

    Chrome Headless is “Chrome without Chrome,” in the words of Chrome developer and engineer Eric Bidelman. It’s the functionality of Chrome, but operated from the computer’s command line.

    That’s what’s meant by a “headless browser,” which makes now a good time to answer:

    What is a headless browser?

    A headless browser is a browser without a graphical user interface. Instead of controlling the browser’s actions via its graphical user interface (GUI), headless browsers are controlled using the command line.

    Don’t worry. All will become clearer as you read on.

    Why use Chrome Headless?

    Chrome Headless is used for crawling (by Google), testing (by developers), and hacking (by hackers). It’s also used by:

    • Search engines, which use it to render pages, generate dynamic content, and index data from single-page web apps.
    • SEO tools, to analyze websites and make suggestions on how to improve them.
    • Monitoring tools, to monitor JavaScript execution times in web apps.
    • Testing tools, to render pages and compare them to previous versions, in order to track changes in the user interface.

    The major advantage of using Headless Chrome is that users can write scripts to run the browser programmatically, doing tasks like scraping, analyzing, or imaging websites rapidly and at scale without having to open the browser’s GUI and click a million things.

    Doing that requires three things: Headless Chrome, DevTools Protocol, and Puppeteer.

    You’ve already met Chrome Headless. The DevTools Protocol is the interface that Chrome DevTools uses to inspect and control the browser; a remote instance of DevTools, open in another browser, uses it to let you see “through the eyes” of Headless Chrome without running the browser’s GUI. And Puppeteer is a Node library that gives developers tools to programmatically control Headless Chrome via the DevTools Protocol.

    Combine all three, and you have a way to script repetitive actions in Headless Chrome and run them quickly at scale.
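
To make the protocol layer less abstract: DevTools Protocol commands are JSON messages, each with a unique id, a method name of the form Domain.method, and optional params. The sketch below only builds such messages; the helper name buildCommand is made up for this illustration, and actually sending the messages (normally over a WebSocket, which libraries like Puppeteer handle for you) is left out.

```javascript
// Minimal sketch of DevTools Protocol message framing.
// buildCommand is a hypothetical helper, not part of any library.
let nextId = 0;

function buildCommand(method, params = {}) {
  nextId += 1; // every command needs its own id so replies can be matched
  return JSON.stringify({ id: nextId, method, params });
}

// Two real protocol methods: Page.navigate and Runtime.evaluate.
const navigate = buildCommand('Page.navigate', { url: 'https://example.com' });
const evaluate = buildCommand('Runtime.evaluate', { expression: 'document.title' });

console.log(navigate);
console.log(evaluate);
```

The browser answers each command with a JSON message carrying the same id, which is how a client pairs responses with requests.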

    How does Headless Chrome compare to other versions of Chrome?

    Chrome releases four standard channels, plus Chromium builds that match Chrome release numbers and the Chrome OS for Chromebooks. Those channels are:

    Chrome Stable

    Chrome Stable is the mainstream release that most users have. Its features are tried and tested, and it hardly ever crashes.

    Chrome Beta

    Chrome Beta is tomorrow’s Stable, and thus isn’t quite as stable. The trade-off is more new features, sooner.

    Chrome Dev

    Chrome Dev is aimed at developers, updated much more frequently and much more prone to crashes. It primarily exists to let developers test their apps on the Chrome of the future and avoid obsolescence.

    Chrome Canary

    Chrome Canary is updated daily and especially prone to crashing and glitching. It’s an early testbed for features and ideas, and it’s the only Chrome channel that runs in its own instance automatically.

    Finally, if you are—in Google’s words—“absolutely crazy,” there’s Chromium Raw, a hastily assembled, wildly unstable look into one of Chrome’s potential futures.

    Headless Chrome isn’t a different channel. It’s a different way to run the same application. Later in this post you’ll find out how to launch both Chrome Stable and Chrome Canary in Headless mode. It’s the absence of a GUI that makes the first impression of difference; functionality is the same, you just have to access it differently.

    Normally, when you launch Chrome, you’ll click on the application icon—either in your dock or Applications folder, or in your Start menu if you’re a Windows user. Chrome opens like any other application, in a window on your desktop that you can make fullscreen if you want.

    You can enter URLs or search terms, navigate to websites, view them, and interact with them. If you want the browser to do different things or display a different website, you use a set of clickable dropdown menus in the application’s GUI or in your OS to do that.

    Chrome is designed to be simple to get started with, and its GUI is easy to get used to.

    If you’re a more advanced user, you can open Chrome’s powerful, flexible DevTools and modify the way websites are displayed and the way they work in your browser, in real time, right in front of you. All that takes place inside the application window, with web pages rendered and displayed, inside Chrome’s GUI.

    All this is true of the other types of Chrome⁠—even Chrome Canary, the unstable, bleeding-edge Chrome version that’s updated daily. Whichever channel or build of Chrome you’re running, this relationship between the application and the user remains the same.

    Headless Chrome is not the same.

    In Headless Chrome, you’re not going to see any of these familiar elements of Chrome. There is no user interface. This means there’s nothing to interact with in the way we’re used to. So a new set of tools is needed to interact with Chrome. It also means that you can easily use Chrome Headless to do things that don’t need a UI or where a UI would actively get in the way, like testing and web scraping.

    Instead, you’re going to start Chrome from the command line. What you’ll see is just text in the Terminal or Command Line window. Chrome will be doing its thing without any of the superstructure that normally shows you, the user, what designers and developers wanted you to see. You’ll see what goes on under the hood instead.

    Let’s get started.

    Getting started with Chrome Headless

    To open Chrome Headless you need to open a Chrome binary in the command line. If you only recognize a couple of words in that sentence, don’t worry. It’s simple and we’re about to walk through it step by step.

    First, open your command line application.

    • For Mac users this is Terminal, which is usually in the Utilities folder in Applications.
    • For Windows users, it’s Command Prompt. You’ll find that by opening Start, going to “Search” or “Run,” typing “cmd” (short for “command”), and hitting Enter.

    Once you have your command line tool open in front of you, it’s time to use it to open Chrome.

    To do that you need to know where Chrome is on your computer—where it really is, not where your computer’s GUI shows you it is.

    In nearly every case if you’re using a Mac, this is what you’ll use:

    /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome

    Windows users should use this filepath:

    C:\Program Files\Google\Chrome\Application

    The problem with this is that if you’re reading this in Chrome—which statistically you are—Chrome won’t open a new browsing session, just a new window in your extant browsing session.

    You need a version of Chrome that runs separately as a different application. (You can also do some stuff with aliases that makes normal Chrome work like this, but that’s a bit complex for a beginner’s guide.)

    Time to download Google Chrome Canary.

    Having downloaded Chrome Canary, we’re going to open it in the command line. Again, for Mac users you want:

    /Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary

    Copy and paste that into Terminal and you should see Canary open a new window.

    Windows users should amend the filepath to lead to Canary in their C drive.
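
The Chrome paths in this guide contain spaces, so a POSIX shell such as the one in Terminal will split them into several arguments unless they are escaped or quoted. As a small sketch of the quoting rule, the helper below (shellQuote is a name chosen for this example) wraps a path in single quotes so it can be pasted as one argument.

```javascript
// Wrap a string in single quotes so a POSIX shell treats it as one argument.
// An embedded single quote is closed, escaped, and reopened: ' becomes '\''
function shellQuote(arg) {
  return "'" + arg.replace(/'/g, "'\\''") + "'";
}

const canary = '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary';

// Prints the path wrapped in single quotes, ready to paste into Terminal.
console.log(shellQuote(canary));
```

Escaping each space with a backslash works just as well; quoting is simply easier to generate programmatically.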

    Now you have opened a Chrome binary. How do we make it headless?

    Shut Canary—just using the normal UI for now—and go back to Terminal/Command Line. Now, reenter the same command you used before but append this to it:

    --headless

    So if you’re a Mac user you’re copy-pasting this:

    /Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --headless

    This is the headless “flag”—not to be confused with Chrome flags, which are internal to Chrome and are experimental features you can enable by going to chrome://flags.

    Windows users should do the same thing in their Command Line tool.

    You’ll see the yellow Chrome Canary symbol appear in your Dock and then immediately disappear. Chrome Canary is now running in Headless mode.

    Now what?

    One thing about a tool with no UI is, it’s tough to interact with—what can we really do with this tool right now?

    Not much.

    But we can use a version of DevTools to manage this headless Chrome instance and do things like test under throttling, emulate devices, check code coverage, and plenty more. Anything you can do from inside Chrome’s DevTools, you can do programmatically in Headless Chrome, automatically and a lot faster.

    You can also do some fast, simple things to get you started.

    Things you can do with Chrome Headless right now

    Now that you’ve learned how to launch and kill Chrome Headless from the command line, there’s a ton you can do with it. Here are a few basics to get you started deriving some actual value from this tool.

    1. Visit a website

    Before you do anything else in Chrome Headless you need to give it something to chew on. Launching the browser in headless mode isn’t enough.

    To visit a website in Chrome Headless, all you have to do is add the URL after the headless flag in the command line.

    Mac users should use this:

    /Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --headless https://usefyi.com

    Again, you’ll see the Canary icon jump up and disappear in the Dock. But that’s all you’ll see. To see more of what’s happening you can screenshot the page in the command line, or use DevTools from the command line.
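
The same launch can be scripted from Node.js. This is a rough sketch, assuming Canary lives at the macOS path used in this guide; headlessArgs and runHeadless are names invented for the example. Because spawn passes the binary path as a single argument, no shell escaping is needed here.

```javascript
const { spawn } = require('child_process');

// macOS path from the walkthrough; adjust it for your own machine.
const CANARY = '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary';

// Assemble the argument list: the headless flag, any extra flags, then the URL.
function headlessArgs(url, extraFlags = []) {
  return ['--headless', ...extraFlags, url];
}

// Start the browser and resolve with its exit code once the process ends.
function runHeadless(binary, url, extraFlags = []) {
  return new Promise((resolve, reject) => {
    const child = spawn(binary, headlessArgs(url, extraFlags), { stdio: 'inherit' });
    child.on('error', reject); // e.g. the binary was not found
    child.on('exit', resolve);
  });
}

// Uncomment on a machine with Canary installed:
// runHeadless(CANARY, 'https://usefyi.com');
console.log(headlessArgs('https://usefyi.com'));
```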

    2. Screenshot

    Screenshots can be done with a flag:

    --headless --disable-gpu --screenshot

    Add that to your command line text:

    /Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --headless --disable-gpu --screenshot https://usefyi.com

    You’ll see a notification in the command line telling you where the image is. By default it will be a file called screenshot.png:

    [0329/141521.683403:INFO:headless_shell.cc(620)] Written to file screenshot.png

    The file is written to the current working directory, which on a Mac is the Home folder if you have just opened Terminal. Be aware that each new screenshot will be named screenshot.png, and will overwrite the last one.

    This is just a screen’s worth of imagery. On a longer page, everything after the first screen will be missing. What if you want a complete web page? Then you should make a PDF. Incidentally, this is one of the easiest and quickest ways to make a non-watermarked PDF of a website, using nothing but your (headless) browser.
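
If a PDF is not what you want, two further switches help with screenshots: --screenshot accepts a file name (--screenshot=file.png), and --window-size=width,height sets the size of the captured area, so a taller window captures more of a long page. The sketch below just assembles those flags; screenshotFlags is a name made up for this example, and the sizes are arbitrary.

```javascript
// Build the flags for a headless screenshot with an explicit viewport size.
// screenshotFlags is a hypothetical helper for illustration only.
function screenshotFlags(file, width, height) {
  const valid = Number.isInteger(width) && Number.isInteger(height) && width > 0 && height > 0;
  if (!valid) {
    throw new RangeError('width and height must be positive integers');
  }
  return [
    '--headless',
    '--disable-gpu',
    `--screenshot=${file}`,             // write to this file instead of screenshot.png
    `--window-size=${width},${height}`, // captured area in pixels
  ];
}

console.log(screenshotFlags('usefyi.png', 1280, 1696).join(' '));
```

Append the page URL after these flags, as in the commands above.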

    3. Create a PDF

    Add this flag to your command line script:

    --print-to-pdf

    If you’re a Mac user that script should now read:

    /Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --headless --disable-gpu --print-to-pdf https://moz.com/

    I’m using Moz’s homepage rather than ours because it’s longer, so the effect is easier to see.

    That will produce a file called output.pdf, which again will be in the current working directory (the Home folder by default if you’re a Mac user who has just opened Terminal).

    [0329/142229.301088:INFO:headless_shell.cc(620)] Written to file output.pdf

    Again, this file will be overwritten every time you do this.
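
One way around the overwriting is to pass a file name, since --print-to-pdf also accepts one (--print-to-pdf=file.pdf). The sketch below derives a unique name from the page's host and a timestamp; pdfName and pdfFlags are hypothetical helpers, and the naming scheme is just one possible choice.

```javascript
// Derive a unique, filesystem-friendly PDF name from a URL and a timestamp.
function pdfName(url, now = new Date()) {
  const host = new URL(url).hostname.replace(/[^a-z0-9.-]/gi, '_');
  // ISO timestamps contain ':' and '.', which are awkward in file names.
  const stamp = now.toISOString().replace(/[:.]/g, '-');
  return `${host}-${stamp}.pdf`;
}

// Flags plus URL for a headless PDF export that does not clobber output.pdf.
function pdfFlags(url, now = new Date()) {
  return ['--headless', '--disable-gpu', `--print-to-pdf=${pdfName(url, now)}`, url];
}

console.log(pdfFlags('https://moz.com/').join(' '));
```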

    4. Use DevTools from the command line

    You can open a remote instance of Chrome DevTools and use it to control your Headless Chrome. Just add this flag to your command line text:

    --remote-debugging-port=9222

    You can use any free port, but if you don’t have much experience with this, stick to 9222 as shown. Your command line script should look like this now:

    /Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --headless --remote-debugging-port=9222 https://usefyi.com

    I recommend quitting and reopening Canary when you do this.

    Chrome’s DevTools will let you know they’re ready to help you:

    DevTools listening on ws://127.0.0.1:9222/devtools/browser/e9deca6c-777b-4615-b313-9b0103cf7566

    Then drop this URL into a new tab on the Chrome you’re actually using to read this:

    http://localhost:9222

    Obviously the numbers have to match—if you used a different port, use those numbers in your URL.
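
If you script the launch, the port can be read straight from that startup line instead of being typed by hand. The sketch below parses the "DevTools listening on" line shown above and builds the matching localhost URL; parseDevToolsLine is a name invented for this example.

```javascript
// Extract the WebSocket URL and port from Chrome's startup message.
// Returns null when the line is not a "DevTools listening on" message.
function parseDevToolsLine(line) {
  const match = /DevTools listening on (ws:\/\/([^:\/\s]+):(\d+)\/\S*)/.exec(line);
  if (!match) return null;
  const [, wsUrl, host, port] = match;
  return {
    wsUrl,
    host,
    port: Number(port),
    httpUrl: `http://localhost:${port}`, // the address to open in another browser
  };
}

const sample =
  'DevTools listening on ws://127.0.0.1:9222/devtools/browser/e9deca6c-777b-4615-b313-9b0103cf7566';

console.log(parseDevToolsLine(sample));
```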

    What if you’re not using Chrome?

    I know the odds are that you’re already using Chrome to read this but in case you’re not, this works fine in any browser. All it does is show you what’s happening as you manage DevTools through the command line. You’ll see a tab marked Headless, with Inspectable WebContents at the top, and your page meta title one line down. That’s a link. Click it and you’re in. You’ll see the page and the code next to it in a remote instance of DevTools.
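
The list of inspectable pages is also available as JSON from the debugging port's /json/list endpoint, which is handy if you want a script rather than a person to pick the tab. The sketch below filters such a list down to ordinary pages; pageTargets is a hypothetical helper, and the sample entries are made-up data shaped like the endpoint's response.

```javascript
// Keep only regular pages from a /json/list response and trim each entry
// to the fields needed to connect to it.
function pageTargets(targets) {
  return targets
    .filter((t) => t.type === 'page')
    .map((t) => ({ title: t.title, url: t.url, ws: t.webSocketDebuggerUrl }));
}

// Made-up sample data; a real list could be fetched (Node 18+) with:
//   const targets = await (await fetch('http://localhost:9222/json/list')).json();
const sampleTargets = [
  { type: 'page', title: 'Example page', url: 'https://usefyi.com/',
    webSocketDebuggerUrl: 'ws://localhost:9222/devtools/page/ABC' },
  { type: 'service_worker', title: 'Example worker', url: 'https://usefyi.com/sw.js',
    webSocketDebuggerUrl: 'ws://localhost:9222/devtools/page/DEF' },
];

console.log(pageTargets(sampleTargets));
```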


    Headless Chromium

    Headless Chromium allows running Chromium in a headless/server environment. Expected use cases include loading web pages, extracting metadata (e.g., the DOM) and generating bitmaps from page contents -- using all the modern web platform features provided by Chromium and Blink.

    There are two ways to use Headless Chromium:

    Usage via the DevTools remote debugging protocol

    1. Start a normal Chrome binary with the --headless command line flag (Linux-only for now):
    $ chrome --headless --remote-debugging-port=9222 https://chromium.org
    

    Currently you'll also need to use --disable-gpu to avoid an error from a missing Mesa library.

    2. Navigate to http://localhost:9222 in another browser to open the DevTools interface or use a tool such as Selenium to drive the headless browser.

    Usage from Node.js

    For example, the chrome-remote-interface Node.js package can be used to extract a page's DOM like this:

    const CDP = require('chrome-remote-interface');
    
    CDP((client) => {
      // Extract used DevTools domains.
      const {Page, Runtime} = client;
    
      // Enable events on domains we are interested in.
      Promise.all([
        Page.enable()
      ]).then(() => {
        return Page.navigate({url: 'https://example.com'});
      });
    
      // Evaluate outerHTML after page has loaded.
      Page.loadEventFired(() => {
        Runtime.evaluate({expression: 'document.body.outerHTML'}).then((result) => {
          console.log(result.result.value);
          client.close();
        });
      });
    }).on('error', (err) => {
      console.error('Cannot connect to browser:', err);
    });
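
One detail the example above glosses over: when the evaluated expression throws, the Runtime.evaluate response carries an exceptionDetails field alongside result. A small sketch of checking for it follows; unwrapEvaluation is a name chosen for this illustration, and the two responses at the bottom are hand-written stand-ins, not real browser output.

```javascript
// Return the evaluated value, or throw if the page-side expression failed.
function unwrapEvaluation(response) {
  if (response.exceptionDetails) {
    throw new Error(`Evaluation failed: ${response.exceptionDetails.text}`);
  }
  return response.result.value;
}

// Hand-written stand-ins for protocol responses.
const ok = { result: { type: 'string', value: '<body>hello</body>' } };
const failed = {
  result: { type: 'object', subtype: 'error' },
  exceptionDetails: { text: 'Uncaught' },
};

console.log(unwrapEvaluation(ok));
try {
  unwrapEvaluation(failed);
} catch (err) {
  console.error(err.message);
}
```

In the example above, this check would replace the direct use of result.result.value inside the then callback.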
    

    Usage as a C++ library

    Headless Chromium can be built as a library for embedding into a C++ application. This approach is otherwise similar to controlling the browser over a DevTools connection, but it provides more customization points, e.g., for networking and mojo services.

    Headless Example is a small sample application which demonstrates the use of the headless C++ API. It loads a web page and outputs the resulting DOM. To run it, first initialize a headless build configuration:

    $ mkdir -p out/Debug
    $ echo 'import("//build/args/headless.gn")' > out/Debug/args.gn
    $ gn gen out/Debug
    

    Then build the example:

    $ ninja -C out/Debug headless_example
    

    After the build completes, the example can be run with the following command:

    $ out/Debug/headless_example https://www.google.com
    

    Headless Shell is a more capable headless application. For instance, it supports remote debugging with the DevTools protocol. To do this, start the application with an argument specifying the debugging port:

    $ ninja -C out/Debug headless_shell
    $ out/Debug/headless_shell --remote-debugging-port=9222 https://youtube.com
    

    Then navigate to http://localhost:9222 with your browser.

    Embedder API

    The embedder API allows developers to integrate the headless library into their application. The API provides default implementations for low level adaptation points such as networking and the run loop.

    The main embedder API classes are:

    • HeadlessBrowser::Options::Builder - Defines the embedding options, e.g.:
      • SetMessagePump - Replaces the default base message pump. See base::MessagePump.
      • SetProxyServer - Configures an HTTP/HTTPS proxy server to be used for accessing the network.

    Client/DevTools API

    The headless client API is used to drive the browser and interact with loaded web pages. Its main classes are:

    • HeadlessBrowser - Represents the global headless browser instance.
    • HeadlessWebContents - Represents a single “tab” within the browser.
    • HeadlessDevToolsClient - Provides a C++ interface for inspecting and controlling a tab. The API functions correspond to DevTools commands. See the client API documentation for more information.

    Resources and Documentation

    Mailing list: headless-dev@chromium.org

    Bug tracker: Internals>Headless

    File a new bug (bit.ly/2pP6SBb)

  • Original article: https://www.cnblogs.com/bigben0123/p/13678264.html