Qt notes: parsing HTML in C++


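    When I used to write crawlers, I parsed pages in a crude way: I simply looked for patterns in the markup. Later I came across gumbo online, tried it, and found that it works really well, so I am recommending it here.

    The following is reposted from: http://blog.csdn.net/whyistao/article/details/37919581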

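    1. C++ does not seem to have many HTML parsing libraries to choose from. In the end I tried integrating htmlcxx into a Qt project. At first I only wrote INCLUDEPATH += <path> in the .pro file, and it still did not work: INCLUDEPATH only adds a header search path, so the library's own sources are never compiled.
    It turned out to be enough to add htmlcxx's .cc and .h files to SOURCES and HEADERS with +=, like this (each continued line ends with a backslash):
    SOURCES += main.cpp \
            html/utils.cc \
            html/Uri.cc \
            html/ParserSax.cc \
            html/ParserDom.cc \
            html/Node.cc \
            html/Extensions.cc
    HEADERS  += mainwindow.h \
            html/utils.h \
            html/Uri.h \
            html/tree.h \
            html/ParserSax.h \
            html/ParserDom.h \
            html/Node.h \
            html/Extensions.h \
            html/debug.h \
            html/ci_string.h \
            html/wincstring.h \
            html/tld.h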
    
    Reference:   htmlcxx for qt(mingw)      http://blog.chinaunix.net/uid-21525518-id-1824657.html
    
    
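    2. Parsing with gumbo
    Add the .c and .h files to the project the same way as above. Some notes on gumbo's commonly used types:
    GumboOutput    the result of a parse
    gumbo_parse() parses the HTML source and returns a GumboOutput; output->root is the root node.
    GumboOutput* output = gumbo_parse(htmlString.c_str());
    GumboNode* node = output->root;
    gumbo_destroy_output(&kGumboDefaultOptions, output);    // free the tree once you are done with it
    GumboNode    a node
    GumboNode* node;
    Accessing a node's contents:
    node->type                  // the node's type; check it before reading node->v
    node->v.text.text           // the node's text (for text-like nodes)
    node->v.element.children    // the node's list of children (for element nodes)
    GumboVector    a container of nodes
    For example, GumboVector* children = &node->v.element.children; gets the node's list of children (children is a struct member, so take its address)
    static_cast<GumboNode*>(children->data[i])     // the i-th node in the list, for 0 <= i < children->length
    GumboAttribute    an attribute of a node
    GumboAttribute* href;
    if (node->type == GUMBO_NODE_ELEMENT && node->v.element.tag == GUMBO_TAG_A && (href = gumbo_get_attribute(&node->v.element.attributes, "href")))
    {    std::cout << href->value << std::endl;  }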
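
    The access pattern described above (check the type, then read the matching member of v, then cast each entry of the children vector) can be sketched without the library. The types below (DemoNode, DemoVector, DemoElement, DemoText) and the helpers make_text and make_element are hypothetical stand-ins defined locally so the sketch compiles on its own; they are not gumbo's definitions, which live in gumbo.h. The walk collects the text under a node and skips the contents of script elements.

```cpp
#include <string>
#include <vector>

// Local stand-ins, NOT gumbo's real definitions. They keep only the fields
// needed to mirror the access pattern: a type tag, a union `v`, and a
// vector of untyped child pointers.
enum DemoNodeType { DEMO_NODE_ELEMENT, DEMO_NODE_TEXT };
enum DemoTag { DEMO_TAG_P, DEMO_TAG_SCRIPT };

struct DemoVector {
    void** data;          // array of untyped pointers
    unsigned int length;  // number of valid entries in data
};

struct DemoElement {
    DemoVector children;
    DemoTag tag;
};

struct DemoText {
    const char* text;
};

struct DemoNode {
    DemoNodeType type;  // says which member of v is valid
    union {
        DemoElement element;  // valid when type == DEMO_NODE_ELEMENT
        DemoText text;        // valid when type == DEMO_NODE_TEXT
    } v;
};

// Depth-first walk that appends the text of every text node under `node`,
// skipping the contents of script elements.
void collect_text(const DemoNode* node, std::string& out) {
    if (node->type == DEMO_NODE_TEXT) {
        out += node->v.text.text;
        return;
    }
    // From here on the node is an element, so v.element may be read.
    if (node->v.element.tag == DEMO_TAG_SCRIPT) {
        return;
    }
    const DemoVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        collect_text(static_cast<const DemoNode*>(children->data[i]), out);
    }
}

// Hypothetical helpers for building a small tree by hand.
DemoNode make_text(const char* s) {
    DemoNode n{};
    n.type = DEMO_NODE_TEXT;
    n.v.text.text = s;
    return n;
}

DemoNode make_element(DemoTag tag, std::vector<void*>& kids) {
    DemoNode n{};
    n.type = DEMO_NODE_ELEMENT;
    n.v.element.children.data = kids.data();
    n.v.element.children.length = static_cast<unsigned int>(kids.size());
    n.v.element.tag = tag;
    return n;
}
```

    Building a paragraph that holds the text "Hello, ", a script element and the text "world", then calling collect_text on it, yields "Hello, world". With the real library the same walk would read node->type, node->v.text.text and node->v.element.children from a GumboNode instead.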
    
    
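    Node types (as defined by the DOM)
      ELEMENT_NODE: an ordinary element node, e.g. <html>, <p>, <div>, <span>, <img>
      ATTRIBUTE_NODE: an element attribute
      TEXT_NODE: a text node
      CDATA_SECTION_NODE: i.e. <![CDATA[ ]]>
      ENTITY_REFERENCE_NODE: an entity reference, e.g. &amp;
      ENTITY_NODE: an entity, e.g. <!ENTITY copyright "Copyright 2010, impng. All rights reserved">
      PROCESSING_INSTRUCTION_NODE: a processing instruction (PI), e.g. <?xml version="1.0"?>
      COMMENT_NODE: a comment, <!-- -->
      DOCUMENT_NODE: the root node, i.e. document.nodeType
      DOCUMENT_TYPE_NODE: the DTD / document type, <!DOCTYPE >
      DOCUMENT_FRAGMENT_NODE: a document fragment
      NOTATION_NODE: a notation defined in the DTD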
    
    In code, a node's type can be one of the following           (usage:       node->type ==  GUMBO_NODE_ELEMENT )
    typedef enum {
      /** Document node.  v will be a GumboDocument. */
      GUMBO_NODE_DOCUMENT,
      /** Element node.  v will be a GumboElement. */
      GUMBO_NODE_ELEMENT,
      /** Text node.  v will be a GumboText. */
      GUMBO_NODE_TEXT,
      /** CDATA node. v will be a GumboText. */
      GUMBO_NODE_CDATA,
      /** Comment node.  v. will be a GumboText, excluding comment delimiters. */
      GUMBO_NODE_COMMENT,
      /** Text node, where all contents is whitespace.  v will be a GumboText. */
      GUMBO_NODE_WHITESPACE
    } GumboNodeType;
    
    Tag types:                           (usage:    node->v.element.tag != GUMBO_TAG_SCRIPT   )
    typedef enum {
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#the-root-element
      GUMBO_TAG_HTML,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#document-metadata
      GUMBO_TAG_HEAD,
      GUMBO_TAG_TITLE,
      GUMBO_TAG_BASE,
      GUMBO_TAG_LINK,
      GUMBO_TAG_META,
      GUMBO_TAG_STYLE,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/scripting-1.html#scripting-1
      GUMBO_TAG_SCRIPT,
      GUMBO_TAG_NOSCRIPT,
      GUMBO_TAG_TEMPLATE,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/sections.html#sections
      GUMBO_TAG_BODY,
      GUMBO_TAG_ARTICLE,
      GUMBO_TAG_SECTION,
      GUMBO_TAG_NAV,
      GUMBO_TAG_ASIDE,
      GUMBO_TAG_H1,
      GUMBO_TAG_H2,
      GUMBO_TAG_H3,
      GUMBO_TAG_H4,
      GUMBO_TAG_H5,
      GUMBO_TAG_H6,
      GUMBO_TAG_HGROUP,
      GUMBO_TAG_HEADER,
      GUMBO_TAG_FOOTER,
      GUMBO_TAG_ADDRESS,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/grouping-content.html#grouping-content
      GUMBO_TAG_P,
      GUMBO_TAG_HR,
      GUMBO_TAG_PRE,
      GUMBO_TAG_BLOCKQUOTE,
      GUMBO_TAG_OL,
      GUMBO_TAG_UL,
      GUMBO_TAG_LI,
      GUMBO_TAG_DL,
      GUMBO_TAG_DT,
      GUMBO_TAG_DD,
      GUMBO_TAG_FIGURE,
      GUMBO_TAG_FIGCAPTION,
      GUMBO_TAG_MAIN,
      GUMBO_TAG_DIV,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#text-level-semantics
      GUMBO_TAG_A,
      GUMBO_TAG_EM,
      GUMBO_TAG_STRONG,
      GUMBO_TAG_SMALL,
      GUMBO_TAG_S,
      GUMBO_TAG_CITE,
      GUMBO_TAG_Q,
      GUMBO_TAG_DFN,
      GUMBO_TAG_ABBR,
      GUMBO_TAG_DATA,
      GUMBO_TAG_TIME,
      GUMBO_TAG_CODE,
      GUMBO_TAG_VAR,
      GUMBO_TAG_SAMP,
      GUMBO_TAG_KBD,
      GUMBO_TAG_SUB,
      GUMBO_TAG_SUP,
      GUMBO_TAG_I,
      GUMBO_TAG_B,
      GUMBO_TAG_U,
      GUMBO_TAG_MARK,
      GUMBO_TAG_RUBY,
      GUMBO_TAG_RT,
      GUMBO_TAG_RP,
      GUMBO_TAG_BDI,
      GUMBO_TAG_BDO,
      GUMBO_TAG_SPAN,
      GUMBO_TAG_BR,
      GUMBO_TAG_WBR,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/edits.html#edits
      GUMBO_TAG_INS,
      GUMBO_TAG_DEL,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/embedded-content-1.html#embedded-content-1
      GUMBO_TAG_IMAGE,
      GUMBO_TAG_IMG,
      GUMBO_TAG_IFRAME,
      GUMBO_TAG_EMBED,
      GUMBO_TAG_OBJECT,
      GUMBO_TAG_PARAM,
      GUMBO_TAG_VIDEO,
      GUMBO_TAG_AUDIO,
      GUMBO_TAG_SOURCE,
      GUMBO_TAG_TRACK,
      GUMBO_TAG_CANVAS,
      GUMBO_TAG_MAP,
      GUMBO_TAG_AREA,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/the-map-element.html#mathml
      GUMBO_TAG_MATH,
      GUMBO_TAG_MI,
      GUMBO_TAG_MO,
      GUMBO_TAG_MN,
      GUMBO_TAG_MS,
      GUMBO_TAG_MTEXT,
      GUMBO_TAG_MGLYPH,
      GUMBO_TAG_MALIGNMARK,
      GUMBO_TAG_ANNOTATION_XML,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/the-map-element.html#svg-0
      GUMBO_TAG_SVG,
      GUMBO_TAG_FOREIGNOBJECT,
      GUMBO_TAG_DESC,
      // SVG title tags will have GUMBO_TAG_TITLE as with HTML.
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/tabular-data.html#tabular-data
      GUMBO_TAG_TABLE,
      GUMBO_TAG_CAPTION,
      GUMBO_TAG_COLGROUP,
      GUMBO_TAG_COL,
      GUMBO_TAG_TBODY,
      GUMBO_TAG_THEAD,
      GUMBO_TAG_TFOOT,
      GUMBO_TAG_TR,
      GUMBO_TAG_TD,
      GUMBO_TAG_TH,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/forms.html#forms
      GUMBO_TAG_FORM,
      GUMBO_TAG_FIELDSET,
      GUMBO_TAG_LEGEND,
      GUMBO_TAG_LABEL,
      GUMBO_TAG_INPUT,
      GUMBO_TAG_BUTTON,
      GUMBO_TAG_SELECT,
      GUMBO_TAG_DATALIST,
      GUMBO_TAG_OPTGROUP,
      GUMBO_TAG_OPTION,
      GUMBO_TAG_TEXTAREA,
      GUMBO_TAG_KEYGEN,
      GUMBO_TAG_OUTPUT,
      GUMBO_TAG_PROGRESS,
      GUMBO_TAG_METER,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/interactive-elements.html#interactive-elements
      GUMBO_TAG_DETAILS,
      GUMBO_TAG_SUMMARY,
      GUMBO_TAG_MENU,
      GUMBO_TAG_MENUITEM,
      // Non-conforming elements that nonetheless appear in the HTML5 spec.
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/obsolete.html#non-conforming-features
      GUMBO_TAG_APPLET,
      GUMBO_TAG_ACRONYM,
      GUMBO_TAG_BGSOUND,
      GUMBO_TAG_DIR,
      GUMBO_TAG_FRAME,
      GUMBO_TAG_FRAMESET,
      GUMBO_TAG_NOFRAMES,
      GUMBO_TAG_ISINDEX,
      GUMBO_TAG_LISTING,
      GUMBO_TAG_XMP,
      GUMBO_TAG_NEXTID,
      GUMBO_TAG_NOEMBED,
      GUMBO_TAG_PLAINTEXT,
      GUMBO_TAG_RB,
      GUMBO_TAG_STRIKE,
      GUMBO_TAG_BASEFONT,
      GUMBO_TAG_BIG,
      GUMBO_TAG_BLINK,
      GUMBO_TAG_CENTER,
      GUMBO_TAG_FONT,
      GUMBO_TAG_MARQUEE,
      GUMBO_TAG_MULTICOL,
      GUMBO_TAG_NOBR,
      GUMBO_TAG_SPACER,
      GUMBO_TAG_TT,
      // Used for all tags that don't have special handling in HTML.
      GUMBO_TAG_UNKNOWN,
      // A marker value to indicate the end of the enum, for iterating over it.
      // Also used as the terminator for varargs functions that take tags.
      GUMBO_TAG_LAST,
    } GumboTag;
    
    
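    3. While using gumbo I hit the error "RtlWerpReportException failed with status code :-1073741823".
    At first I thought it was a stack overflow, but it turned out that my own code logic was wrong. It is best to follow the usage in the official demo: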
    if (node->v.element.tag == GUMBO_TAG_A &&      (href = gumbo_get_attribute(&node->v.element.attributes, "href"))) 
    {    std::cout << href->value << std::endl;  }
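
    The decimal value in that message is easier to look up in hexadecimal. A minimal sketch, assuming the value is a Windows NTSTATUS code printed as a signed 32-bit integer (status_to_hex is a hypothetical helper, not part of gumbo or Qt):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Reinterpret a signed 32-bit status value as the unsigned hex form in
// which NTSTATUS codes are usually documented.
// Assumption: the number in the error message is an NTSTATUS code.
std::string status_to_hex(std::int32_t status) {
    char buf[11];  // "0x" + 8 hex digits + terminating '\0'
    std::snprintf(buf, sizeof buf, "0x%08X",
                  static_cast<unsigned int>(static_cast<std::uint32_t>(status)));
    return buf;
}
```

    status_to_hex(-1073741823) gives "0xC0000001", which Windows documents as STATUS_UNSUCCESSFUL, a generic failure. A stack overflow would normally be reported as 0xC00000FD (STATUS_STACK_OVERFLOW), which fits the observation above that the stack was not the real cause.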
    
    
    4. Compiling gumbo produced this error:
     error: 'for' loop initial declarations are only allowed in C99 mode
    so these two lines need to be added to the project's .pro file:
    QMAKE_CFLAGS_DEBUG +=  --std=c99
    QMAKE_CFLAGS_RELEASE +=  --std=c99

    Please credit the source when reposting: http://www.cnblogs.com/fnlingnzb-learner/p/5835428.html
