• .NET Core console app: a simple scheduled crawler built with a hosted service (IHostedService) and HttpClient


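    The .NET Core hosting system

    The .NET Core hosting system can host long-running services inside a managed process. An ASP.NET Core application is really just such a long-running service: once started, it listens for network requests, that is, it opens a listener. The listener passes each request into the pipeline for processing, and the resulting HTTP response is returned.

    Many scenarios call for hosting a service. The one in this post is an example: periodically fetching the like counts of articles on the Huawei forum.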

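    Crawling the article like counts

    Analysis

    Take this link as an example: https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201308791792470245&fid=23 . Opening it, it is easy to see that the page is built with Angular. Since it is Angular, the front end and back end are separated, so press F12 in the browser and inspect the network requests.
    [Screenshot of the network requests]
    Find the corresponding API request: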

    POST https://developer.huawei.com/consumer/cn/forum/mid/partnerforumservice/v1/open/getTopicDetail? HTTP/1.1
    Host: developer.huawei.com
    Content-Type: application/json
    Content-Length: 33
    
    {"topicId":"0201302923811480141"}
    

    In my testing, Content-Type and Content-Length must have exactly the values shown above, and the same goes for the body: even one extra space makes the request fail.
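    Because the body has to match byte for byte, it can help to check what the serializer emits and how long it is. The following is a minimal sketch, not the post's own code: it assumes System.Text.Json (the post itself uses Newtonsoft.Json), and the RequestBody class and its method names are hypothetical.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class RequestBody
{
    // Serializes the payload without whitespace; System.Text.Json writes
    // compact JSON unless WriteIndented is enabled.
    public static string Build(string topicId) =>
        JsonSerializer.Serialize(new { topicId });

    // Content-Length counts the bytes of the encoded body, not its characters.
    public static int ContentLength(string body) =>
        Encoding.UTF8.GetByteCount(body);
}
```

    For the topic id in the captured request, RequestBody.Build("0201302923811480141") produces {"topicId":"0201302923811480141"}, and RequestBody.ContentLength returns 33 for it, which matches the Content-Length header above.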

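    Requesting the data with HttpClient

    Let's go straight to the code. It uses dependency injection to inject IHttpClientFactory; a typed HttpClient would also work. For details, see the documentation and this post on dudu's blog:
    工厂参观记:.NET Core 中 HttpClientFactory 如何解决 HttpClient 臭名昭著的问题 (A factory tour: how HttpClientFactory in .NET Core solves HttpClient's notorious problems)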

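    // TryReadQueryAsJson and ReadAsAsync are assumed to come from the
    // Microsoft.AspNet.WebApi.Client package (the post does not name its packages);
    // JsonConvert comes from Newtonsoft.Json.
    // Namespaces used: System, System.Net, System.Net.Http, System.Net.Http.Headers,
    // System.Text, System.Threading.Tasks, Newtonsoft.Json.
    private readonly IHttpClientFactory _httpClientFactory;
    
    public async Task<int> Crawl(string link)
    {
        using (var httpClient = _httpClientFactory.CreateClient())
        {
            // -1 signals that the like count could not be read.
            int likeCount = -1;

            // Extract the topic id from the "tid" query parameter of the page link.
            var uri = new Uri(link);
            if (!uri.TryReadQueryAsJson(out var queryParams))
            {
                return likeCount;
            }
            // The null-conditional call avoids an exception when "tid" is missing.
            var topicId = queryParams["tid"]?.ToString();
            if (!string.IsNullOrEmpty(topicId))
            {
                // Formatting.None keeps the body free of extra whitespace.
                var body = JsonConvert.SerializeObject(
                            new { topicId },
                            Formatting.None);
                uri = new Uri(_baseUrl);
                var jsonContentType = "application/json";
    
                var requestMessage = new HttpRequestMessage
                {
                    RequestUri = uri,
                    Headers =
                    {
                        { "Host", uri.Host }
                    },
                    Method = HttpMethod.Post,
                    Content = new StringContent(body)
                };
                // Replace StringContent's default "text/plain; charset=utf-8".
                requestMessage.Content.Headers.ContentType = new MediaTypeHeaderValue(jsonContentType);
                // Content-Length is a byte count, so measure the encoded body.
                requestMessage.Content.Headers.ContentLength = Encoding.UTF8.GetByteCount(body);
                var response = await httpClient.SendAsync(requestMessage);
                if (response.StatusCode == HttpStatusCode.OK)
                {
                    dynamic data = await response.Content.ReadAsAsync<dynamic>();
                    likeCount = data.result.likes;
                }
            }
    
            return likeCount;
        }
    }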
    

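    There is a more concise way to write this, using _httpClient.PostAsJsonAsync(), but since request headers such as Content-Type may need to be customized, it is written the long way for now.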
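    For comparison, the shorter form might look roughly like the sketch below. It assumes the PostAsJsonAsync and ReadFromJsonAsync extensions from System.Net.Http.Json (.NET 5 or later) rather than whichever package the post had in mind, and LikeCountClient, GetLikesAsync, and StubHandler are hypothetical names. StubHandler stands in for the real server so the sketch runs offline. The helper chooses the Content-Type itself ("application/json; charset=utf-8") and decides how the body length is transmitted, so this has not been checked against the strict endpoint described above.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public static class LikeCountClient
{
    // Posts { "topicId": ... } as JSON and reads result.likes from the reply.
    // Returns -1 when the response status is not 200 OK.
    public static async Task<int> GetLikesAsync(HttpClient client, string url, string topicId)
    {
        using var response = await client.PostAsJsonAsync(url, new { topicId });
        if (response.StatusCode != HttpStatusCode.OK) return -1;

        var data = await response.Content.ReadFromJsonAsync<JsonElement>();
        return data.GetProperty("result").GetProperty("likes").GetInt32();
    }
}

// Hypothetical stand-in for the server: records the request's Content-Type
// and answers with a canned payload shaped like result.likes.
public sealed class StubHandler : HttpMessageHandler
{
    public string SeenContentType { get; private set; } = "";

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        SeenContentType = request.Content?.Headers.ContentType?.ToString() ?? "";
        var reply = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("{\"result\":{\"likes\":7}}", Encoding.UTF8, "application/json")
        };
        return Task.FromResult(reply);
    }
}
```

    Passing new HttpClient(new StubHandler()) to GetLikesAsync exercises the call without touching the network, and SeenContentType then shows which Content-Type the helper sent. That header is exactly what the hand-built HttpRequestMessage version keeps under control.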

    Configuring the host

    class Program
    {
        static void Main()
        {
            new HostBuilder()
                .ConfigureServices(services =>
                {
                    services.AddHttpClient();
                    services.AddHostedService<LikeCountCrawler>();
                })
                .Build()
                .Run();
        }
    }
    

    LikeCountCrawler implements the IHostedService interface.

    The IHostedService interface

    public interface IHostedService
    {
        /// <summary>
        /// Triggered when the application host is ready to start the service.
        /// </summary>
        /// <param name="cancellationToken">Indicates that the start process has been aborted.</param>
        Task StartAsync(CancellationToken cancellationToken);
    
        /// <summary>
        /// Triggered when the application host is performing a graceful shutdown.
        /// </summary>
        /// <param name="cancellationToken">Indicates that the shutdown process should no longer be graceful.</param>
        Task StopAsync(CancellationToken cancellationToken);
    }
    

    In the StartAsync method of LikeCountCrawler, a timer is configured and started; each time the timer elapses, the crawl logic runs once.

    private readonly Timer _timer = new Timer();
    private readonly IEnumerable<string> _links = new string[]
    {
        "https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201308791792470245&fid=23",
        "https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201303654965850166&fid=18",
        "https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201294272503450453&fid=24",
        "https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201294189025490019&fid=17"
    };
    private readonly string _baseUrl = "https://developer.huawei.com/consumer/cn/forum/mid/partnerforumservice/v1/open/getTopicDetail";
    ...
    
    public Task StartAsync(CancellationToken cancellationToken)
    {
        _timer.Interval = 5 * 60 * 1000;
        _timer.Elapsed += OnTimer;
        _timer.AutoReset = true;
        _timer.Enabled = true;
        _timer.Start();
        OnTimer(null, null);
        return Task.CompletedTask;
    }
    
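    public async Task Crawl(IEnumerable<string> links)
    {
        // Parallel.ForEach does not await async lambdas: it would return before the
        // requests finish and lose their exceptions. Task.WhenAll (with Select from
        // System.Linq) starts every crawl and awaits them all.
        await Task.WhenAll(links.Select(async link =>
        {
            Console.WriteLine($"Crawling link:{link}, ThreadId:{Thread.CurrentThread.ManagedThreadId}");
            var likeCount = await Crawl(link);
            Console.WriteLine($"Succeed crawling likecount - {likeCount}, ThreadId:{Thread.CurrentThread.ManagedThreadId}");
        }));
    }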
    
    private void OnTimer(object sender, ElapsedEventArgs args)
    {
        _ = Crawl(_links);
    }
    
    ...
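    As an alternative to System.Timers.Timer, the same "run now, then run every interval" behaviour can be sketched as an awaitable loop. This is a different technique from the one used in the post: it assumes .NET 6 or later for PeriodicTimer, and PeriodicRunner and RunAsync are hypothetical names. Because the loop awaits each run before waiting for the next tick, two runs never overlap, and cancelling the token ends the loop cleanly.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PeriodicRunner
{
    // Runs `work` once immediately, then once per `interval`, until the
    // token is cancelled. Returns the number of completed runs.
    public static async Task<int> RunAsync(
        TimeSpan interval,
        Func<CancellationToken, Task> work,
        CancellationToken stoppingToken)
    {
        var runs = 0;
        using var timer = new PeriodicTimer(interval);
        try
        {
            do
            {
                await work(stoppingToken);
                runs++;
            }
            while (await timer.WaitForNextTickAsync(stoppingToken));
        }
        catch (OperationCanceledException)
        {
            // Expected on shutdown: the token was cancelled while waiting or working.
        }
        return runs;
    }
}
```

    Inside a hosted service, a loop like this would typically live in the ExecuteAsync override of BackgroundService, with the stoppingToken the host supplies, for example PeriodicRunner.RunAsync(TimeSpan.FromMinutes(5), _ => Crawl(_links), stoppingToken).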
    

    Output:
    [Screenshot of the console output]

  • Original article: https://www.cnblogs.com/Laggage/p/13381991.html
Copyright © 2020-2023  润新知