• Calling Python from C# (.NET Core)


    Background

    In a web-scraping project, the target site was serving captchas very frequently, so an IP proxy was added;

    However:

    1. The C# HttpClient object does not release its resources promptly, which can exhaust the system's sockets and push memory usage ever higher!

    2. With a single global static HttpClient, only one fixed proxy address can be set at initialization, so the proxy cannot be switched dynamically on our side;

    // A new HttpClient is created for every request to the target site; even though it is
    // Dispose()d in code, the underlying system sockets are not released in time.
    HttpClient httpClient = new HttpClient(new HttpClientHandler
    {
        AutomaticDecompression = DecompressionMethods.GZip,
        UseCookies = true,
        Proxy = new WebProxy(new Uri(GetProxy())) { Credentials = null, UseDefaultCredentials = false },
        // Once the handler is instantiated, the proxy address can no longer be changed.
        UseProxy = true,
        AllowAutoRedirect = true,
        ClientCertificateOptions = ClientCertificateOption.Automatic,
        ServerCertificateCustomValidationCallback = (message, cert, chain, error) => true
    });
    httpClient.Timeout = new TimeSpan(0, 0, 3);
    httpClient.BaseAddress = uri;
    var result = httpClient.GetAsync(uri).Result.Content.ReadAsStringAsync().Result;

    That led to the idea of calling Python's requests library instead.

    Implementation

    IronPython can run Python directly inside Visual Studio, but it does not support third-party libraries, so the command-line invocation approach was chosen instead;
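
    For context, a minimal sketch of the IronPython hosting route that was ruled out, assuming the IronPython NuGet package is referenced; plain Python runs fine this way, but third-party packages such as requests cannot be imported:

    using System;
    using IronPython.Hosting;
    using Microsoft.Scripting.Hosting;

    public static class IronPythonSketch
    {
        public static void Run()
        {
            // Host a Python engine inside the .NET process.
            ScriptEngine engine = Python.CreateEngine();
            ScriptScope scope = engine.CreateScope();

            // Plain Python with no third-party dependencies works:
            engine.Execute("result = 1 + 1", scope);
            Console.WriteLine(scope.GetVariable<int>("result"));

            // engine.Execute("import requests", scope) would fail here,
            // which is why the command-line approach was used instead.
        }
    }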

    1. python request.py

    Print the result of the Python request to stdout:

    import ast
    import sys
    import time
    
    import requests
    
    FAIL_MESSAGE = "失败的请求"  # literally "failed request": marker printed when the request gives up
    
    
    def send_request(**kwargs):
        url = kwargs.get('url')
        if not url:
            raise Exception("无效的url")
        headers = {
            'Referer': url,
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
        }
        headers.update(kwargs.get('headers', dict()))
        timeout = kwargs.get('timeout', 3)
        # Getproxy() is assumed to be defined elsewhere and to return the proxy-pool API URL;
        # "or" avoids calling it when proxy_api is supplied by the caller.
        proxy_api = kwargs.get('proxy_api') or Getproxy()
        # Text marking a captcha page in the target site's response.
        verify_text = kwargs.get('verify_text', '验证码')
        # The proxy-pool API is expected to return JSON like [{"proxy_address": "ip:port"}, ...].
        proxy_address = requests.get(proxy_api).json()[0]['proxy_address']
        # print(f"proxy_address:{proxy_address}")
        time.sleep(0.1)
        try:
            # print(f"headers:{headers}")
            proxies = {"http": "http://" + proxy_address, "https": "http://" + proxy_address}
            response = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
            # A captcha page came back: retry recursively with a fresh proxy address.
            if verify_text in response.text:
                send_request(**kwargs)
            else:
                print(response.text)
        except Exception as exception:
            print(FAIL_MESSAGE, exception)
    
    
    if __name__ == '__main__':
        kwargs = ast.literal_eval(sys.argv[1])
        send_request(**kwargs)
    

    2. Read the output from C#

    private static string RequestByPython(Uri uri)
    {
        var cmdArgs = "{'url':'" + uri + "'}";
        Process process = new Process();
        // Path to the .py script
        string path = Directory.GetCurrentDirectory() + PythonRequestFile;
        // Path to the local Python installation, i.e. .../python.exe
        process.StartInfo.FileName = PythonPath;
        // Invoke the script from the command line, using the agreed argument format
        string sArguments = path;
        sArguments += " " + cmdArgs;
        process.StartInfo.Arguments = sArguments;
        process.StartInfo.UseShellExecute = false;
        process.StartInfo.RedirectStandardOutput = true;
        process.StartInfo.RedirectStandardInput = true;
        process.StartInfo.RedirectStandardError = true;
        process.StartInfo.CreateNoWindow = true;
        process.Start();
        StringBuilder stringBuilder = new StringBuilder();
        StreamReader streamReader = process.StandardOutput;
        while (!streamReader.EndOfStream)
        {
            // ReadLine() drops the newline characters, so the page source comes back as one long line.
            stringBuilder.Append(streamReader.ReadLine());
        }
        process.WaitForExit();
        var result = stringBuilder.ToString();
        return result;
    }
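
    As a usage sketch, this is roughly how the helper could be wired up; PythonPath and PythonRequestFile are hypothetical configuration values (the real ones are not shown in the post), and the URL is only a placeholder:

    // Hypothetical configuration; adjust to the local environment.
    private const string PythonPath = @"C:\Python38\python.exe"; // local python.exe
    private const string PythonRequestFile = @"\request.py";     // script in the working directory

    public static void Demo()
    {
        // Effectively runs: python.exe <workdir>\request.py {'url':'https://example.com/'}
        var html = RequestByPython(new Uri("https://example.com/"));

        // The Python script prints "失败的请求" plus the exception when it gives up.
        if (html.StartsWith("失败的请求"))
        {
            // retry, switch proxies, or log the failure
        }
    }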

    Results

    1. Memory usage now rises and falls and stays within a stable range; perfect!

    2. The page source printed to stdout and read back differs in formatting from the original response (the line breaks are lost), so it either needs to be re-formatted as HTML or matched with an adjusted regex (e.g., lazy matching), as sketched below;
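
    As an illustration, a minimal sketch of pulling a fragment out of the single-line HTML with a lazy quantifier; the <title> pattern is only an example:

    using System.Text.RegularExpressions;

    // 'html' is the string returned by RequestByPython. Since it comes back as one long
    // line, use a lazy quantifier (.*?) together with RegexOptions.Singleline so that '.'
    // also matches any newline characters that survive.
    var match = Regex.Match(html, "<title>(.*?)</title>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    string title = match.Success ? match.Groups[1].Value : string.Empty;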

    Note:

    Whether in Python or C#, HTTPS certificate errors must be ignored when going through the proxy (in C#, via the ServerCertificateCustomValidationCallback shown above; in Python's requests, typically by passing verify=False)!

  • Original post: https://www.cnblogs.com/Zdelta/p/14122327.html