• HTML Parser

    HTML Parser is a Java library used to parse HTML in either a linear or nested fashion.
    Primarily used for transformation or extraction, it features filters, visitors,
    custom tags and easy-to-use JavaBeans. It is a fast, robust and well-tested package.

    Welcome to the homepage of HTMLParser - a super-fast real-time
    parser for real-world HTML. What has attracted most developers to HTMLParser has
    been its simplicity of design, its speed and its ability to handle streaming
    real-world HTML.

    The two fundamental use cases handled by the parser are extraction and
    transformation (the synthesis use case, where HTML pages are created from
    scratch, is better handled by other tools closer to the source of data).
    While prior versions concentrated on data extraction from web pages,
    version 1.4 of the HTMLParser has substantial improvements in the area of
    transforming web pages, with simplified tag creation and editing, and
    verbatim toHtml() method output.

    In general, to use the HTMLParser you will need to be able to write code in
    the Java programming language. Although some example programs are provided
    that may be useful as they stand, it's more than likely you will need (or
    want) to create your own programs or modify the ones provided to match your
    intended application.

    To use the library, you will need to add either htmllexer.jar or
    htmlparser.jar to your classpath when compiling and running. The
    htmllexer.jar provides low-level access to generic string, remark and tag
    nodes on the page in a linear, flat, sequential manner. The htmlparser.jar,
    which includes the classes found in htmllexer.jar, provides access to a page
    as a sequence of nested, differentiated tags containing string, remark and
    other tag nodes. So where the output from calls to the lexer nextNode()
    method might be:

        <html>
        <head>
        <title>
        "Welcome"
        </title>
        </head>
        <body>
        etc...
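
    The flat, sequential output above can be imitated with a few lines of
    plain Java. This is only a toy sketch to make the idea of "linear" nodes
    concrete: the class FlatLexerSketch and its lex method are hypothetical
    names, not part of htmllexer.jar, and the code assumes simple markup with
    no comments, scripts, or '>' characters inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy flat tokenizer; illustrative only, NOT the HTMLParser lexer. */
public class FlatLexerSketch {

    /** Splits markup into tag nodes and text nodes, in document order. */
    public static List<String> lex(String html) {
        List<String> nodes = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            int end;
            if (html.charAt(pos) == '<') {
                // Tag node: runs up to and including the next '>'.
                int close = html.indexOf('>', pos);
                end = (close < 0) ? html.length() : close + 1;
            } else {
                // Text node: runs up to (not including) the next '<'.
                int open = html.indexOf('<', pos);
                end = (open < 0) ? html.length() : open;
            }
            String node = html.substring(pos, end).trim();
            if (!node.isEmpty()) {
                nodes.add(node); // whitespace-only text is skipped
            }
            pos = end;
        }
        return nodes;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Welcome</title></head><body></body></html>";
        // Prints one node per line, with no nesting information at all.
        for (String node : lex(html)) {
            System.out.println(node);
        }
    }
}
```

    Each node is reported on its own, so the caller sees the start tag, the
    text and the end tag as unrelated items and has to track any structure
    itself.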
        

    The output from the parser NodeIterator would
    nest the tags as children of the <html>, <head> and other nodes
    (here represented by indentation):

        <html>
            <head>
                <title>
                    "Welcome"
                    </title>
                </head>
            <body>
                etc...
        

    The parser attempts to balance opening tags with ending tags to present the
    structure of the page, while the lexer simply spits out nodes. If your
    application requires only modest structural knowledge of the page, and is
    primarily concerned with individual, isolated nodes, you should consider
    using the lightweight lexer. But if your application requires knowledge of
    the nested structure of the page, for example when processing tables, you
    will probably want to use the full parser.
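
    The balancing of opening and ending tags described above can be sketched
    with a stack. This is a simplified illustration and not how htmlparser.jar
    is implemented: NestingSketch, TreeNode, nest and render are hypothetical
    names, tags are assumed to have the plain form <name> or </name> without
    attributes, and end tags are consumed rather than stored as nodes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy tag balancer; illustrative only, NOT the HTMLParser implementation. */
public class NestingSketch {

    /** A node with a label (tag or text) and its nested children. */
    static final class TreeNode {
        final String label;
        final List<TreeNode> children = new ArrayList<>();
        TreeNode(String label) { this.label = label; }
    }

    /** Builds a tree from a flat node sequence by matching end tags to open tags. */
    public static TreeNode nest(List<String> flat) {
        TreeNode root = new TreeNode("#document");
        Deque<TreeNode> open = new ArrayDeque<>();
        open.push(root);
        for (String node : flat) {
            if (node.startsWith("</")) {
                // End tag: assumed to look like </name>.
                String startTag = "<" + node.substring(2, node.length() - 1) + ">";
                boolean matched = open.stream().anyMatch(t -> t.label.equals(startTag));
                if (matched) {
                    // Implicitly close any unclosed tags nested inside the match.
                    while (!open.peek().label.equals(startTag)) {
                        open.pop();
                    }
                    open.pop();
                }
                // An end tag with no matching open tag is simply ignored.
            } else {
                TreeNode child = new TreeNode(node);
                open.peek().children.add(child);
                if (node.startsWith("<")) {
                    open.push(child); // start tag: later nodes become its children
                }
            }
        }
        return root;
    }

    /** Appends the tree to out, indenting each level by four spaces. */
    static void render(TreeNode node, int depth, StringBuilder out) {
        for (TreeNode child : node.children) {
            for (int i = 0; i < depth; i++) {
                out.append("    ");
            }
            out.append(child.label).append('\n');
            render(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        List<String> flat = Arrays.asList("<html>", "<head>", "<title>", "Welcome",
                "</title>", "</head>", "<body>", "</body>", "</html>");
        StringBuilder out = new StringBuilder();
        render(nest(flat), 0, out);
        System.out.print(out);
    }
}
```

    The extra bookkeeping (a stack of open tags and a tree of children) is the
    cost of knowing the page structure. An application that only inspects
    isolated nodes can skip it, which is the case for preferring the lexer.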

  • Original article: https://www.cnblogs.com/lexus/p/2388604.html