Using HtmlAgilityPack in C#: Parsing the Nodes of a Repeating List with XPath

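HtmlAgilityPack is a superb tool for parsing HTML in C#, and I doubt anyone who has used it would disagree. Its official documentation is sparse, but a quick search on Google or Baidu turns up plenty of articles introducing it, so I won't cover general usage here. The focus today is how to use HtmlAgilityPack in C# to extract node information from a repeating list. The repeated element can be any HTML tag, for example the li items inside a ul, or div elements that share the same class. As the example, we will parse the Baidu search results for "HtmlAgilityPack". First, add the component to the project from the NuGet Package Manager Console: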

Install-Package HtmlAgilityPack

Then import its namespace in the .cs file that uses HtmlAgilityPack:

using HtmlAgilityPack;

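With these two steps done, we can download the page source and parse it. The program prints the title of each search result to the console.

Here is the complete source code of the example: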

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.Net;
using System.IO;

namespace HtmlAgilityDemo
{
  class Program
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Press any key to start...");
      Console.ReadKey();
      LoopListParser();
      Console.ReadKey();
    }

    static void LoopListParser()
    {
      Console.WriteLine("Download page source...");
      var html = GetWebPageContentFromUrl("http://www.baidu.com/s?wd=htmlagilitypack&rsv_spt=1&issp=1&f=8&rsv_bp=0&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=15&rsv_sug4=944&rsv_sug1=14&rsv_sug2=0&inputT=7");
      //Console.WriteLine(html);
      Console.WriteLine("Download completed...");
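      Console.WriteLine("Start parsing page source...");
      Parser(html);
      Console.WriteLine("Parse completed...");
    }

    static void Parser(string html)
    {
      var doc = new HtmlDocument();
      doc.LoadHtml(html);

      // The repeated result blocks are matched by their tpl attribute
      var nodes = doc.DocumentNode.SelectNodes("//div[@tpl='se_com_default']");
      // SelectNodes returns null, not an empty collection, when nothing matches
      if (nodes == null)
      {
        Console.WriteLine("No matching nodes found.");
        return;
      }
      foreach (HtmlNode node in nodes)
      {
        // Get the title of a single list item; ".//h3" searches only inside this node
        var h3 = node.SelectSingleNode(".//h3");
        var title = h3 != null ? h3.InnerText : "";
        Console.WriteLine(title);
      }
    }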

    static string DownloadPageSource(string url)
    {
      var html = "";
      // Unused: downloading the page directly with WebClient produced garbled text
      //var client = new WebClient();
      //client.DownloadStringCompleted += (sender, e) =>
      //{
      //  html=e.Result;
      //};
      //client.DownloadStringAsync(new Uri("http://www.baidu.com/s?wd=htmlagilitypack&rsv_spt=1&issp=1&f=8&rsv_bp=0&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=15&rsv_sug4=944&rsv_sug1=14&rsv_sug2=0&inputT=7"));
      return html;
    }

    static string GetWebPageContentFromUrl(string url, int timeout = 30)
    {
      var myRequest = (HttpWebRequest)WebRequest.Create(url);
      myRequest.Method = "GET";
      myRequest.Timeout = 1000 * timeout; // Timeout is in milliseconds
      myRequest.AllowAutoRedirect = true;
      // Dispose the response even if reading fails; exceptions propagate with their original stack trace
      using (var myResponse = (HttpWebResponse)myRequest.GetResponse())
      {
        // Decode with the charset the server reports; fall back to UTF-8 when none is given
        var charset = myResponse.CharacterSet;
        var encoding = string.IsNullOrEmpty(charset) ? Encoding.UTF8 : Encoding.GetEncoding(charset);
        using (var sr = new StreamReader(myResponse.GetResponseStream(), encoding))
        {
          return sr.ReadToEnd();
        }
      }
    }
  }
}
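
The same loop pattern works for any repeated tag, not just the Baidu result blocks. Here is a minimal sketch that runs against an inline HTML string instead of a downloaded page, so it can be tried without network access. The markup, the id "langs", the class "item" and the class name RepeatingListSketch are all made up for illustration and are not taken from any real page.

```csharp
using System;
using HtmlAgilityPack;

class RepeatingListSketch
{
    static void Main()
    {
        // Hypothetical markup standing in for a downloaded page
        const string html = @"
<ul id='langs'><li>C#</li><li>F#</li></ul>
<div class='item'><h3>First</h3></div>
<div class='item'><p>No heading here</p></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Case 1: repeated li elements inside one ul
        var items = doc.DocumentNode.SelectNodes("//ul[@id='langs']/li");
        if (items != null) // null means the XPath matched nothing
        {
            foreach (var li in items)
            {
                Console.WriteLine(li.InnerText.Trim());
            }
        }

        // Case 2: repeated div elements that share a class
        var blocks = doc.DocumentNode.SelectNodes("//div[@class='item']");
        if (blocks != null)
        {
            foreach (var block in blocks)
            {
                // Not every block is guaranteed to contain an h3
                var h3 = block.SelectSingleNode(".//h3");
                Console.WriteLine(h3 != null ? h3.InnerText.Trim() : "(no title)");
            }
        }
    }
}
```

Note that both the outer SelectNodes call and the inner SelectSingleNode call are checked for null. Real pages change their markup often, so an XPath that worked yesterday may match nothing today.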

This example uses XPath. If you are not familiar with it, see: http://www.w3school.com.cn/xpath/xpath_syntax.asp
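
One XPath detail is worth a closer look: the leading dot in ".//h3". As far as I know, an expression starting with "//" is evaluated from the document root even when you call it on a child node, so inside a loop it keeps returning the first match in the whole page. The sketch below illustrates the difference on made-up markup (the class "r" and the class name RelativeXPathSketch are hypothetical).

```csharp
using System;
using HtmlAgilityPack;

class RelativeXPathSketch
{
    static void Main()
    {
        // Hypothetical markup: two repeated blocks, each with its own h3
        const string html = @"
<div class='r'><h3>One</h3></div>
<div class='r'><h3>Two</h3></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes("//div[@class='r']");
        if (nodes == null)
        {
            return; // nothing matched
        }

        foreach (var node in nodes)
        {
            // ".//h3" is relative: it searches only below the current div
            var relative = node.SelectSingleNode(".//h3");
            // "//h3" is absolute: it searches from the document root
            var absolute = node.SelectSingleNode("//h3");
            Console.WriteLine(relative.InnerText + " / " + absolute.InnerText);
        }
    }
}
```

With the relative form each iteration should see its own heading, while the absolute form should report the first heading of the document on every iteration. That is why the example above uses ".//h3" inside the foreach loop.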
