• drillercppwebcrawler


    Who am I:
    My name is Meir Yanovich and I'm a C++/Java developer, mostly doing cross-platform (Unix/Linux/Windows) server infrastructure work in my day job,
    but sometimes I like to experiment in my spare time.
    Also, if you are interested in the Facebook API and ways to interact with it, this project may interest
    you:
    http://code.google.com/p/facebook-cpp-graph-api/
    or if you have young kids:
    http://code.google.com/p/kidsbrowser/


    You can find my online profile here:
    http://il.linkedin.com/in/meiryanovich 
    If you have any cool ideas on how to use this code and you need help, please email me.
    Email: meiry242@gmail.com

    Implementation of a web crawler / spider in C++
    ------------------------------------------------------------------------------------

    A web crawler / spider for web data mining and data aggregation.

    • Uses regular expression rules to collect data.
    • Written in pure C++ (STL) plus a handful of open source libraries.
    • Can follow links within a single domain.
    • Writes output to an XML file with configurable tags.
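
    As a rough illustration of the regex rules and single-domain link following listed above, here is a minimal sketch. It is not Driller's actual code: it assumes std::regex from the standard library in place of PCRE so that it builds without extra dependencies, and the names host_of, extract_links and same_domain_links are hypothetical. Relative links are simply skipped.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return the host of an absolute http(s) URL,
// or an empty string when the URL is relative or uses another scheme.
std::string host_of(const std::string& url) {
    static const std::regex re(R"re(^https?://([^/:?#]+))re", std::regex::icase);
    std::smatch m;
    if (std::regex_search(url, m, re)) return m[1].str();
    return "";
}

// Hypothetical helper: collect the href value of every anchor tag
// using a single regular expression rule.
std::vector<std::string> extract_links(const std::string& html) {
    static const std::regex re(R"re(<a\s[^>]*href\s*=\s*["']([^"']+)["'])re",
                               std::regex::icase);
    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), re), end; it != end; ++it)
        links.push_back((*it)[1].str());
    return links;
}

// Keep only the links whose host equals the start domain
// (exact, case-sensitive comparison in this sketch).
std::vector<std::string> same_domain_links(const std::string& html,
                                           const std::string& domain) {
    // An empty domain would wrongly match every relative link.
    assert(!domain.empty());
    std::vector<std::string> out;
    for (const auto& link : extract_links(html))
        if (host_of(link) == domain) out.push_back(link);
    return out;
}
```

    Calling same_domain_links(page, "example.com") on a downloaded page would give the list of URLs a crawler could visit next without leaving the domain. A real crawler would also need to resolve relative links and remember which pages were already visited.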

     

    I tried to follow the "keep it simple, keep it clean" rule, using ready-made open source C/C++ libraries as much as possible.

    How to build it:
    The application has only been tested on Windows XP 32-bit, although I took care to use only cross-platform libraries
    and not to write OS-dependent code.
    The libraries Driller depends on are:

    • PCRE: for regular expressions.
    • Pthreads: for a cross-platform threads wrapper.
    • Curl + c-ares: for HTTP requests / responses.

    The Driller source code includes Visual Studio Express 2008 solution and project files, and all the libraries are already built in debug mode. All you have to do is configure it and build it.
    This saves you time on configuring and compiling when you want to test the application.
    For more information see *how_to_build_drill*

    How to configure it:
    The Driller web spider doesn't come with a fancy configuration GUI or configuration file.
    All configuration must be done in code: edit it, compile it, run it, and watch the results come in.
    The reason is that I wrote it for my personal use without much time on my hands and didn't plan to open source it. In any case, those features will be added later.
    A step-by-step guide can be found in how_to_configure_drill.
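
    To give a feel for what "configuration in code" with configurable XML tags could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the types ExtractionRule and CrawlConfig, the functions xml_escape and page_to_xml, and the use of std::regex instead of PCRE are all hypothetical. The real settings are described in how_to_configure_drill.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical rule: each value captured by `pattern` is written
// inside an XML element named `tag`.
struct ExtractionRule {
    std::string tag;      // configurable XML tag name
    std::string pattern;  // regular expression with one capture group
};

// Hypothetical configuration that would be edited in code and recompiled.
struct CrawlConfig {
    std::string start_url;              // where the crawl begins (unused in this sketch)
    std::string root_tag;               // XML element wrapping one page's results
    std::vector<ExtractionRule> rules;  // data collection rules
};

// Escape the five characters that are special in XML text.
std::string xml_escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;
        }
    }
    return out;
}

// Apply every rule to one page and render the matches as XML.
std::string page_to_xml(const CrawlConfig& cfg, const std::string& html) {
    std::string xml = "<" + cfg.root_tag + ">\n";
    for (const auto& rule : cfg.rules) {
        const std::regex re(rule.pattern);
        assert(re.mark_count() >= 1);  // each rule needs a capture group
        for (std::sregex_iterator it(html.begin(), html.end(), re), end; it != end; ++it)
            xml += "  <" + rule.tag + ">" + xml_escape((*it)[1].str()) +
                   "</" + rule.tag + ">\n";
    }
    xml += "</" + cfg.root_tag + ">\n";
    return xml;
}
```

    With a config whose root_tag is "page" and a single rule {"title", "<title>([^<]*)</title>"}, each crawled page would produce one page element containing a title element per match. Changing the tag names or patterns means editing the config and rebuilding, which mirrors the workflow described above.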


    If you find this useful, please consider donating.
    All donations will go to charity.

  • Original article: https://www.cnblogs.com/lexus/p/2559700.html