• How to convert HTML code to XML and build a DOM tree


    // Requires the SgmlReader assembly plus the System.IO and System.Xml namespaces.
    using (SgmlReader reader = new SgmlReader())
    {
        reader.DocType = "HTML";
        reader.WhitespaceHandling = WhitespaceHandling.None;
        reader.InputStream = new StringReader("html code");

        // SgmlReader derives from XmlReader, so XmlDocument can load it directly.
        XmlDocument doc = new XmlDocument();
        doc.Load(reader);
        XmlNodeList inputs = doc.GetElementsByTagName("input");
    }
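
    Once the markup is in an XmlDocument, the DOM tree can be walked like any other XML document. The sketch below is only an illustration: DomSketch, GetInputNames and ReaderFor are hypothetical names, and a plain XmlReader over already well-formed markup stands in for SgmlReader so that the example needs nothing beyond System.Xml.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class DomSketch
{
    // Loads any XmlReader into a DOM tree and returns the "name" attribute
    // of every <input> element, in document order.
    public static List<string> GetInputNames(XmlReader reader)
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(reader);

        List<string> names = new List<string>();
        // Tag-name matching in XmlDocument is case-sensitive.
        foreach (XmlNode node in doc.GetElementsByTagName("input"))
        {
            XmlAttribute name = node.Attributes["name"];
            if (name != null) names.Add(name.Value);
        }
        return names;
    }

    // Assumption: stand-in for SgmlReader, a plain XmlReader over
    // markup that is already well-formed.
    public static XmlReader ReaderFor(string wellFormedMarkup)
    {
        return XmlReader.Create(new StringReader(wellFormedMarkup));
    }
}
```

    Because GetInputNames only asks for an XmlReader, passing it a configured SgmlReader instead of the stand-in should work the same way.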

    If the input is an HTML file

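    Parsing HTML files and extracting data with the .NET Framework classes is not the easiest task. You can use many Framework classes (such as StreamReader) to parse a file line by line, but the API offered by XmlReader does not work on HTML "out of the box", because HTML is usually not well-formed. Regular expressions are another option, but if you are not comfortable with them they can be hard going at first.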
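
    To see why a strict XML reader is not enough on its own, the following minimal sketch feeds markup to XmlReader and reports whether it could be read to the end. WellFormedCheck and IsWellFormedXml are hypothetical names chosen for this illustration; only standard System.Xml classes are used.

```csharp
using System.IO;
using System.Xml;

public static class WellFormedCheck
{
    // Returns true when a strict XML reader can read the markup end to end.
    public static bool IsWellFormedXml(string markup)
    {
        // Ignore a DOCTYPE instead of rejecting it (the default is Prohibit).
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Ignore;
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(markup), settings))
            {
                while (reader.Read()) { } // walk every node; throws on the first error
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

    Typical HTML such as <p>one<br>two</p> fails this check because <br> is never closed, while the XHTML form <p>one<br/>two</p> passes.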

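    Microsoft XML guru Chris Lovett recently published a new SGML parser called SgmlReader on http://www.gotdotnet.com. It can parse HTML files and even convert them into a well-formed structure. SgmlReader derives from XmlReader, which means you can parse HTML files the same way you would parse XML files with a class such as XmlTextReader. In this article I show how to use the SgmlReader class to parse an HTML file and generate well-formed HTML, so that you can read data from it with XPath expressions.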

    Creating an SgmlReader instance to parse HTML

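    Before using SgmlReader, download it from gotdotnet.com and place the assembly in your application's bin folder. Once the assembly is available, write the code that reads the HTML you want to parse. The example in this article uses the HttpWebRequest and HttpWebResponse objects to fetch a remote HTML file: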

     

    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
    HttpWebResponse res = (HttpWebResponse)req.GetResponse();
    StreamReader sReader = new StreamReader(res.GetResponseStream());

     

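    Once you have the remote HTML file, you can create an instance of the SgmlReader class. Setting its DocType property to "HTML" tells the reader that it is dealing with an HTML file: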

     

    SgmlReader reader = new SgmlReader();
    reader.DocType = "HTML";

     

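    The response stream of the HTML file is handed to the SgmlReader instance for parsing through its InputStream property. First load the HTML stream into a TextReader object, then assign that TextReader to the InputStream property: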

     

    reader.InputStream = new StringReader(sReader.ReadToEnd());

     

    You can now parse the HTML file by calling SgmlReader's Read() method:

     

    sw = new StringWriter();
    writer = new XmlTextWriter(sw);
    writer.Formatting = Formatting.Indented;
    while (reader.Read())
    {
        if (reader.NodeType != XmlNodeType.Whitespace)
        {
            writer.WriteNode(reader, true);
        }
    }
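
    The copy loop can be tried without SgmlReader by pointing it at any XmlReader. The sketch below is a variant under stated assumptions, not the article's exact loop: CopySketch and CopyIndented are hypothetical names, and because XmlWriter.WriteNode already advances the reader past the node it copies, this version calls Read() only to get started and to step over whitespace nodes, so no sibling is skipped.

```csharp
using System.IO;
using System.Xml;

public static class CopySketch
{
    // Copies every non-whitespace top-level node from the reader into an
    // indented XML string.
    public static string CopyIndented(XmlReader reader)
    {
        using (StringWriter sw = new StringWriter())
        using (XmlTextWriter writer = new XmlTextWriter(sw))
        {
            writer.Formatting = Formatting.Indented;

            reader.Read(); // move from the initial state to the first node
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Whitespace)
                    reader.Read();                  // step over whitespace
                else
                    writer.WriteNode(reader, true); // copies the node and advances the reader
            }

            writer.Flush();
            return sw.ToString();
        }
    }
}
```

    Calling CopySketch.CopyIndented(XmlReader.Create(new StringReader("<html><body><p>hi</p></body></html>"))) returns the same document with each element on its own indented line.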

     

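    Because SgmlReader produces well-formed HTML, you can use XPath expressions to read individual nodes. The following code shows how to load the output generated by SgmlReader into an XPathNavigator and then query the HTML structure with an XPath expression: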

     

    StringBuilder sb = new StringBuilder();
    XPathDocument doc = new XPathDocument(new StringReader(sw.ToString()));
    XPathNavigator nav = doc.CreateNavigator();
    XPathNodeIterator nodes = nav.Select(xpath);
    while (nodes.MoveNext())
    {
        sb.Append(nodes.Current.Value);
    }
    return sb.ToString();
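
    The XPath step works on any well-formed markup, so it can be exercised on its own. In this minimal sketch XPathSketch and SelectValues are hypothetical names, and the input string is assumed to be well-formed already (for example, text produced by the conversion step above).

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XPathSketch
{
    // Runs an XPath query over well-formed markup and joins the string
    // values of the matching nodes with single spaces.
    public static string SelectValues(string wellFormedMarkup, string xpath)
    {
        XPathDocument doc = new XPathDocument(new StringReader(wellFormedMarkup));
        XPathNavigator nav = doc.CreateNavigator();

        List<string> values = new List<string>();
        XPathNodeIterator nodes = nav.Select(xpath);
        while (nodes.MoveNext())
        {
            values.Add(nodes.Current.Value);
        }
        return string.Join(" ", values);
    }
}
```

    For the markup <html><body><a href='x'>One</a><a href='y'>Two</a></body></html>, the query //a yields "One Two" and //a/@href yields "x y".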

     

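    If you are already familiar with the XPath language and with the different XML parsing APIs in the .NET Framework, you will find it easy to parse HTML and read data with the SgmlReader class.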

    Partial code (C#)

    // Requires System, System.IO, System.Net, System.Text, System.Xml,
    // System.Xml.XPath and the SgmlReader assembly.
    private string GetWellFormedHTML(string uri, string xpath)
    {
        StreamReader sReader = null;
        StringWriter sw = null;
        SgmlReader reader = null;
        XmlTextWriter writer = null;
        try
        {
            if (String.IsNullOrEmpty(uri)) uri = "http://www.XMLforASP.NET";
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
            HttpWebResponse res = (HttpWebResponse)req.GetResponse();
            sReader = new StreamReader(res.GetResponseStream());
            reader = new SgmlReader();
            reader.DocType = "HTML";
            reader.InputStream = new StringReader(sReader.ReadToEnd());
            sw = new StringWriter();
            writer = new XmlTextWriter(sw);
            writer.Formatting = Formatting.Indented;
            //writer.WriteStartElement("Test");
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Whitespace)
                {
                    writer.WriteNode(reader, true);
                }
            }
            //writer.WriteEndElement();
            if (xpath == null)
            {
                return sw.ToString();
            }
            else
            {
                // Filter out nodes from the HTML
                StringBuilder sb = new StringBuilder();
                XPathDocument doc = new XPathDocument(new StringReader(sw.ToString()));
                XPathNavigator nav = doc.CreateNavigator();
                XPathNodeIterator nodes = nav.Select(xpath);
                while (nodes.MoveNext())
                {
                    sb.Append(nodes.Current.Value + " ");
                }
                return sb.ToString();
            }
        }
        catch (Exception exp)
        {
            return exp.Message;
        }
        finally
        {
            // Close whatever was opened, on success and on failure alike.
            // Any of these may still be null if an earlier step threw.
            if (writer != null) writer.Close();
            if (reader != null) reader.Close();
            if (sw != null) sw.Close();
            if (sReader != null) sReader.Close();
        }
    }

  • Original article: https://www.cnblogs.com/iwaitu/p/1780669.html