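Today a friend asked me a question about data scraping, and it so happens that I once wrote a simple scraper for weather forecast data. A free weather service is really hard to find, and I didn't want to pay for one, so why not scrape it? Convenient and simple!
Data scraping mainly involves two areas of knowledge:
1. Regular expressions. I won't go into these; use them enough and you become fluent.
2. Simulated login. It looks hard but is actually simple, as long as you are only scraping public information. Once real login verification is involved, it is no longer simple. Today I am only covering simple scraping, specifically pulling weather forecast data.
I'll just paste the code, because there is really not much to explain. It is purely a matter of knowing or not knowing: once someone tells you, you know.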
// Namespaces this method needs (place the method inside any class):
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// Gets the weather data.
/// </summary>
/// <returns>Array (0: weather; 1: temperature; 2: wind; 3: UV index; 4: air quality)</returns>
public static string[] GetWeather()
{
    string[] weather = new string[5];
    string content = "";
    WebRequest theRequest = WebRequest.Create("http://weather.news.qq.com/inc/ss82.htm");
    try
    {
        // Dispose the response, the stream, and the reader when done.
        using (HttpWebResponse theResponse = (HttpWebResponse)theRequest.GetResponse())
        using (Stream sm = theResponse.GetResponseStream())
        using (StreamReader read = new StreamReader(sm, Encoding.Default))
        {
            content = read.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // On any failure, continue with empty content; every entry stays null.
        content = "";
    }

    // The weather description sits in the cell with the r_tembg5.gif background.
    string pattern = "<td height=\"23\" width=\"117\" background=\"/images/r_tembg5.gif\" align=\"center\">(?<item1>[^<]+)</td>";
    Regex regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    for (Match mcTmp = regex.Match(content); mcTmp.Success; mcTmp = mcTmp.NextMatch())
    {
        // If several cells match, the last one wins.
        weather[0] = mcTmp.Groups["item1"].Value;
    }

    // The remaining cells fill slots 1 to 4 in page order.
    pattern = "height=\"23\" align=\"center\">(?<item1>[^/]+)</td>";
    regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    int k = 1;
    // Stop at the end of the array so extra matches cannot overflow it.
    for (Match mcTmp = regex.Match(content); mcTmp.Success && k < weather.Length; mcTmp = mcTmp.NextMatch(), k++)
    {
        weather[k] = mcTmp.Groups["item1"].Value;
    }
    return weather;
}
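To try the two patterns from GetWeather without touching the network, here is a minimal sketch that runs them against an HTML string passed in by the caller. The class name `WeatherHtmlParser` and the method `Parse` are hypothetical names introduced for illustration, and the `Trim()` calls are an addition the original code does not make.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original post.
public static class WeatherHtmlParser
{
    // Same pattern GetWeather uses for the weather description cell.
    private static readonly Regex ConditionPattern = new Regex(
        "<td height=\"23\" width=\"117\" background=\"/images/r_tembg5.gif\" align=\"center\">(?<item1>[^<]+)</td>",
        RegexOptions.IgnoreCase);

    // Same pattern GetWeather uses for the remaining cells.
    private static readonly Regex DetailPattern = new Regex(
        "height=\"23\" align=\"center\">(?<item1>[^/]+)</td>",
        RegexOptions.IgnoreCase);

    // Returns a 5-element array; slots with no matching cell stay null.
    public static string[] Parse(string html)
    {
        string[] weather = new string[5];

        // Slot 0: the named group "item1" captures the cell text.
        Match condition = ConditionPattern.Match(html);
        if (condition.Success)
        {
            weather[0] = condition.Groups["item1"].Value.Trim();
        }

        // Slots 1 to 4: take detail cells in page order, ignoring extras.
        int k = 1;
        foreach (Match m in DetailPattern.Matches(html))
        {
            if (k >= weather.Length) break;
            weather[k++] = m.Groups["item1"].Value.Trim();
        }
        return weather;
    }
}
```

Separating the parsing from the download this way makes it possible to check the patterns against a saved copy of the page. When the page layout changes, which is the usual way a scraper like this breaks, only the two patterns should need updating.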
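Having read the code above, you should have a rough idea by now. It is only a matter of knowing where the road is; who doesn't know how to walk? This is just a simple application. Data scraping is used in many industries, and a simple example is search: search is data scraping taken to the extreme, although scraping is only a small part of what a crawler does. That's all for now, time to head home!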