• [Crawler] Scraping Tmall product pages with Go


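    Recently at work I needed to scrape product information from Tmall. The whole requirement works as follows:

    Modify the backend ad exchange platform so that it parses a URL out of the creative material uploaded by Alibaba. The URL has the following format: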

    https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D

    The URL is clearly percent-encoded, so the first step is to decode it. An online decoder is available at:

    http://tool.chinaz.com/Tools/urlencode.aspx

    After decoding we get:

    https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content={"items":[{"images":["https://asearch.alicdn.com/bao/uploaded//i4/22356367/TB2PMQinN6I8KJjy0FgXXXXzVXa_!!0-saturn_solar.jpg"],"itemid":"7664169349","shorttitle":"乒乓球拍 无线专属"}]}

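    What we need from it is "itemid":"7664169349".

    Visiting https://detail.tmall.com/item.htm?id=7664169349 then opens the product detail page.

    That page holds the information we need to scrape. The ad exchange platform puts the parsed ItemId into nsq; the crawler reads the ItemId from nsq, builds the URL, scrapes the key information from the page, and sends it to Kafka. Hive and ES then consume it from Kafka so that it can be queried.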
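The detail-page URL mentioned above can be assembled from a numeric item ID with the standard library alone. The following is a minimal sketch; the helper name `detailURL` is hypothetical and only the URL shape is taken from the article.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// detailURL builds the Tmall detail-page URL for a numeric item ID.
// The helper name is hypothetical; the URL shape comes from the article.
func detailURL(itemID int64) string {
	q := url.Values{}
	q.Set("id", strconv.FormatInt(itemID, 10))
	u := url.URL{
		Scheme:   "https",
		Host:     "detail.tmall.com",
		Path:     "/item.htm",
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(detailURL(7664169349))
}
```

Building the query through `url.Values` keeps the escaping correct if more parameters are ever added; for a single numeric parameter `fmt.Sprintf` with `%d` gives the same string.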

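    Step 1

    The first step is to extract the ItemId. The ad exchange platform gives us the URL to parse, so we decode it in code and pull out the ItemId value. The project is written in Golang, so the example uses Golang; it would be even simpler in Python, and the principle is the same.

    For URL parsing, see:

    https://gobyexample.com/url-parsing

    For JSON serialization and deserialization, see:

    https://www.cnblogs.com/liang1101/p/6741262.html

    Here is my code: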

    package main
    
    import (
        "encoding/json"
        "fmt"
        "net/url"
        "strconv"
    )
    
    // Struct fields must start with an uppercase letter so that encoding/json
    // can populate them; JSON keys such as "itemid" are matched case-insensitively.
    type item struct {
        Images     []string
        ItemId     string
        ShortTitle string
    }
    
    func main() {
        var urlstring string = "https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D"
        // Print the decoded form for inspection only.
        unescape, err := url.QueryUnescape(urlstring)
        if err != nil {
            fmt.Println("err is", err)
            return
        }
        fmt.Println(unescape)
        // Parse the original (still encoded) URL; Query() decodes each value
        // exactly once, which avoids decoding the content twice.
        parse, err := url.Parse(urlstring)
        if err != nil {
            fmt.Println("err is", err)
            return
        }
        fmt.Println(parse.RawQuery)
        query := parse.Query()
        fmt.Println(query)
        content := query.Get("content")
        fmt.Printf("%T, %v\n", content, content)
        m := make(map[string][]item)
        if err := json.Unmarshal([]byte(content), &m); err != nil {
            fmt.Println("err is", err)
            return
        }
        fmt.Println("m:", m)
        if len(m["items"]) == 0 {
            fmt.Println("no items found")
            return
        }
        itemValue := m["items"][0]
        fmt.Println(itemValue.ItemId)
        // Convert to int64
        i, err := strconv.ParseInt(itemValue.ItemId, 10, 64)
        if err != nil {
            fmt.Println("err is", err)
            return
        }
        fmt.Printf("%T, %v\n", i, i)
    }

    Running the program prints the ItemId value we need.
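For use inside a service, the same steps are easier to call when they are wrapped in a function that returns an error instead of printing. The sketch below is one possible shape; `extractItemID` and the `payload` type are hypothetical names, and the sample URL in `main` is a shortened stand-in for the real one.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/url"
	"strconv"
)

// payload mirrors only the part of the "content" JSON that is needed here.
type payload struct {
	Items []struct {
		ItemID string `json:"itemid"`
	} `json:"items"`
}

// extractItemID is a hypothetical helper: it parses the material URL,
// decodes the "content" query parameter and returns the first item's ID.
func extractItemID(rawURL string) (int64, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return 0, fmt.Errorf("parse url: %w", err)
	}
	// Query() percent-decodes the parameter values.
	content := u.Query().Get("content")
	if content == "" {
		return 0, errors.New("missing content parameter")
	}
	var p payload
	if err := json.Unmarshal([]byte(content), &p); err != nil {
		return 0, fmt.Errorf("decode content: %w", err)
	}
	if len(p.Items) == 0 {
		return 0, errors.New("no items in content")
	}
	return strconv.ParseInt(p.Items[0].ItemID, 10, 64)
}

func main() {
	raw := "https://handycam.alicdn.com/slideshow/26/x.mp4?content=" +
		url.QueryEscape(`{"items":[{"itemid":"7664169349","shorttitle":"demo"}]}`)
	id, err := extractItemID(raw)
	fmt.Println(id, err)
}
```

Declaring only the `itemid` field keeps the decoder indifferent to whatever else the material URL carries.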

    Step 2

    The second step is to build the URL and scrape the page content.

    How do we fetch a web page with Golang? Here is a simple demo.

    package main
    import (
        "net/http"
        "io/ioutil"
        "fmt"
    )
    func main(){
        var website string = "http://www.future.org.cn"
        if resp,err := http.Get(website); err == nil{
            defer resp.Body.Close()
            if body, err := ioutil.ReadAll(resp.Body); err == nil {
                fmt.Println("HTML content:", string(body));
            }else{
                fmt.Println("Cannot read from connected http server:", err);
            }
        }else{
            fmt.Println("Cannot connect the server:", err);
        }
    }

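    After fetching the page, however, a problem shows up: the Chinese text is garbled.

    For a fix to the garbled Chinese, see:

    https://gocn.vip/article/364

    Install iconv-go:

    go get github.com/djimenez/iconv-go

    The content can then be transcoded after it has been fetched, for example: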

    func convFromGbk(s string) string {
        gbkConvert, _ := iconv.NewConverter("gbk", "utf-8")
        res, _ := gbkConvert.ConvertString(s)
        return res
    }

    Alternatively, wrap the Reader so that it transcodes while reading:

    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
    rsp, err := j.client.Do(req)
    if err != nil {
        return nil, err
    }
    // Transcode the body to UTF-8 while reading
    utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
    //if body, err := ioutil.ReadAll(utfBody); err == nil {
    //    fmt.Println("HTML content:", string(body))
    //}
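Hard-coding the source encoding works for a single site, but it can help to look at what the server declares before transcoding. The sketch below reads the charset from a Content-Type header value using only the standard library; `charsetOf` and `needsTranscode` are hypothetical helpers, and pages that declare their encoding only in a `<meta>` tag are not covered.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetOf returns the lowercased charset parameter of a Content-Type
// header value, or "" when none is declared. Hypothetical helper name.
func charsetOf(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

// needsTranscode reports whether a body with this declared charset should
// be converted before it is treated as UTF-8.
func needsTranscode(charset string) bool {
	switch charset {
	case "", "utf-8", "utf8":
		return false
	}
	return true
}

func main() {
	for _, ct := range []string{"text/html; charset=GBK", "text/html; charset=utf-8", "text/html"} {
		cs := charsetOf(ct)
		fmt.Printf("%q -> charset=%q transcode=%v\n", ct, cs, needsTranscode(cs))
	}
}
```

In a crawler the input would be `rsp.Header.Get("Content-Type")`, and the returned charset could be passed on as the source encoding of the converter.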

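    The fetched page then has to be parsed; XPath is used here.

    For XPath usage, see:

    http://www.w3school.com.cn/xpath/xpath_axes.asp

    It is simple enough to pick up in one read.

    The fetched content is HTML, so all that remains is to extract the parts you need; in other words, parse the HTML.

    One difficulty remains: the static page we fetched contains most of the information but not the price, because the price is loaded dynamically.

    Inspecting the page's network requests shows that the price comes from a separate request.

    Copying that request's URL into the browser fails with 403. No need to worry: adding a Referer header that points at the product page is enough.

    In code: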

    referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
    req.Header.Set("Referer", referer)
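The effect of the header can be reproduced locally without touching the real site. The sketch below starts a stand-in server with `httptest` that mimics the behaviour described above (403 unless the Referer points at detail.tmall.com); the server rule and the helper names `statusWithReferer` and `demo` are assumptions made for illustration, not the real endpoint's logic.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// statusWithReferer sends a GET to target and returns the status code.
// An empty referer means the header is not sent at all.
func statusWithReferer(client *http.Client, target, referer string) (int, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return 0, err
	}
	if referer != "" {
		req.Header.Set("Referer", referer)
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// demo runs one request without and one with the Referer header against a
// local stand-in server and returns both status codes.
func demo() (without, with int, err error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.HasPrefix(r.Referer(), "https://detail.tmall.com/") {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		fmt.Fprint(w, "{}")
	}))
	defer srv.Close()

	if without, err = statusWithReferer(srv.Client(), srv.URL, ""); err != nil {
		return 0, 0, err
	}
	referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", int64(7664169349))
	with, err = statusWithReferer(srv.Client(), srv.URL, referer)
	return without, with, err
}

func main() {
	fmt.Println(demo()) // prints "403 200 <nil>"
}
```

Checking the status code before reading the body, as `statusWithReferer` allows, also makes a missing header show up as an explicit error rather than as an unparsable response.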

    The response turns out to be JSON.

    Formatted (formatter: http://tool.oschina.net/codeformat/json):

    {
        "defaultModel": {
            "bannerDO": {
                "success": true
            }, 
            "deliveryDO": {
                "areaId": 110100, 
                "deliveryAddress": "浙江金华", 
                "deliverySkuMap": {
                    "6310159781": [
                        {
                            "arrivalNextDay": false, 
                            "arrivalThisDay": false, 
                            "forceMocked": false, 
                            "postage": "快递: 0.00 ", 
                            "postageFree": false, 
                            "skuDeliveryAddress": "浙江金华", 
                            "type": 0
                        }
                    ], 
                    "default": [
                        {
                            "arrivalNextDay": false, 
                            "arrivalThisDay": false, 
                            "forceMocked": false, 
                            "postage": "快递: 0.00 ", 
                            "postageFree": false, 
                            "skuDeliveryAddress": "浙江金华", 
                            "type": 0
                        }
                    ], 
                    "6310159797": [
                        {
                            "arrivalNextDay": false, 
                            "arrivalThisDay": false, 
                            "forceMocked": false, 
                            "postage": "快递: 0.00 ", 
                            "postageFree": false, 
                            "skuDeliveryAddress": "浙江金华", 
                            "type": 0
                        }
                    ], 
                    "3280089025135": [
                        {
                            "arrivalNextDay": false, 
                            "arrivalThisDay": false, 
                            "forceMocked": false, 
                            "postage": "快递: 0.00 ", 
                            "postageFree": false, 
                            "skuDeliveryAddress": "浙江金华", 
                            "type": 0
                        }
                    ], 
                    "3280089025136": [
                        {
                            "arrivalNextDay": false, 
                            "arrivalThisDay": false, 
                            "forceMocked": false, 
                            "postage": "快递: 0.00 ", 
                            "postageFree": false, 
                            "skuDeliveryAddress": "浙江金华", 
                            "type": 0
                        }
                    ]
                }, 
                "destination": "北京市", 
                "success": true
            }, 
            "detailPageTipsDO": {
                "crowdType": 0, 
                "hasCoupon": true, 
                "hideIcons": false, 
                "jhs99": false, 
                "minicartSurprise": 0, 
                "onlyShowOnePrice": false, 
                "priceDisplayType": 4, 
                "primaryPicIcons": [ ], 
                "prime": false, 
                "showCuntaoIcon": false, 
                "showDou11Style": false, 
                "showDou11SugPromPrice": false, 
                "showDou12CornerIcon": false, 
                "showDuo11Stage": 0, 
                "showJuIcon": false, 
                "showMaskedDou11SugPrice": false, 
                "success": true, 
                "trueDuo11Prom": false
            }, 
            "doubleEleven2014": {
                "doubleElevenItem": false, 
                "halfOffItem": false, 
                "showAtmosphere": false, 
                "showRightRecommendedArea": false, 
                "step": 0, 
                "success": true
            }, 
            "extendedData": { }, 
            "extras": { }, 
            "gatewayDO": {
                "changeLocationGateway": {
                    "queryDelivery": true, 
                    "queryProm": false
                }, 
                "success": true, 
                "trade": {
                    "addToBuyNow": { }, 
                    "addToCart": { }
                }
            }, 
            "inventoryDO": {
                "hidden": false, 
                "icTotalQuantity": 225, 
                "skuQuantity": {
                    "3280089025136": {
                        "quantity": 71, 
                        "totalQuantity": 71, 
                        "type": 1
                    }, 
                    "6310159781": {
                        "quantity": 33, 
                        "totalQuantity": 33, 
                        "type": 1
                    }, 
                    "6310159797": {
                        "quantity": 44, 
                        "totalQuantity": 44, 
                        "type": 1
                    }, 
                    "3280089025135": {
                        "quantity": 77, 
                        "totalQuantity": 77, 
                        "type": 1
                    }
                }, 
                "success": true, 
                "totalQuantity": 225, 
                "type": 1
            }, 
            "itemPriceResultDO": {
                "areaId": 110100, 
                "duo11Item": false, 
                "duo11Stage": 0, 
                "extraPromShowRealPrice": false, 
                "halfOffItem": false, 
                "hasDPromotion": false, 
                "hasMobileProm": false, 
                "hasTmallappProm": false, 
                "hiddenNonBuyPrice": false, 
                "hideMeal": false, 
                "priceInfo": {
                    "6310159781": {
                        "areaSold": true, 
                        "onlyShowOnePrice": false, 
                        "price": "178.00", 
                        "promotionList": [
                            {
                                "amountPromLimit": 0, 
                                "amountRestriction": "", 
                                "basePriceType": "IcPrice", 
                                "canBuyCouponNum": 0, 
                                "endTime": 1561651200000, 
                                "extraPromTextType": 0, 
                                "extraPromType": 0, 
                                "limitProm": false, 
                                "postageFree": false, 
                                "price": "75.00", 
                                "promType": "normal", 
                                "start": false, 
                                "startTime": 1546267717000, 
                                "status": 2, 
                                "tfCartSupport": false, 
                                "tmallCartSupport": false, 
                                "type": "火爆促销", 
                                "unLogBrandMember": false, 
                                "unLogShopVip": false, 
                                "unLogTbvip": false
                            }
                        ], 
                        "sortOrder": 0
                    }, 
                    "6310159797": {
                        "areaSold": true, 
                        "onlyShowOnePrice": false, 
                        "price": "178.00", 
                        "promotionList": [
                            {
                                "amountPromLimit": 0, 
                                "amountRestriction": "", 
                                "basePriceType": "IcPrice", 
                                "canBuyCouponNum": 0, 
                                "endTime": 1561651200000, 
                                "extraPromTextType": 0, 
                                "extraPromType": 0, 
                                "limitProm": false, 
                                "postageFree": false, 
                                "price": "75.00", 
                                "promType": "normal", 
                                "start": false, 
                                "startTime": 1546267717000, 
                                "status": 2, 
                                "tfCartSupport": false, 
                                "tmallCartSupport": false, 
                                "type": "火爆促销", 
                                "unLogBrandMember": false, 
                                "unLogShopVip": false, 
                                "unLogTbvip": false
                            }
                        ], 
                        "sortOrder": 0
                    }, 
                    "3280089025135": {
                        "areaSold": true, 
                        "onlyShowOnePrice": false, 
                        "price": "168.00", 
                        "promotionList": [
                            {
                                "amountPromLimit": 0, 
                                "amountRestriction": "", 
                                "basePriceType": "IcPrice", 
                                "canBuyCouponNum": 0, 
                                "endTime": 1561651200000, 
                                "extraPromTextType": 0, 
                                "extraPromType": 0, 
                                "limitProm": false, 
                                "postageFree": false, 
                                "price": "68.00", 
                                "promType": "normal", 
                                "start": false, 
                                "startTime": 1546267717000, 
                                "status": 2, 
                                "tfCartSupport": false, 
                                "tmallCartSupport": false, 
                                "type": "火爆促销", 
                                "unLogBrandMember": false, 
                                "unLogShopVip": false, 
                                "unLogTbvip": false
                            }
                        ], 
                        "sortOrder": 0
                    }, 
                    "3280089025136": {
                        "areaSold": true, 
                        "onlyShowOnePrice": false, 
                        "price": "168.00", 
                        "promotionList": [
                            {
                                "amountPromLimit": 0, 
                                "amountRestriction": "", 
                                "basePriceType": "IcPrice", 
                                "canBuyCouponNum": 0, 
                                "endTime": 1561651200000, 
                                "extraPromTextType": 0, 
                                "extraPromType": 0, 
                                "limitProm": false, 
                                "postageFree": false, 
                                "price": "68.00", 
                                "promType": "normal", 
                                "start": false, 
                                "startTime": 1546267717000, 
                                "status": 2, 
                                "tfCartSupport": false, 
                                "tmallCartSupport": false, 
                                "type": "火爆促销", 
                                "unLogBrandMember": false, 
                                "unLogShopVip": false, 
                                "unLogTbvip": false
                            }
                        ], 
                        "sortOrder": 0
                    }
                }, 
                "queryProm": false, 
                "success": true, 
                "successCall": true, 
                "tmallShopProm": [ ]
            }, 
            "memberRightDO": {
                "activityType": 0, 
                "level": 0, 
                "postageFree": false, 
                "shopMember": false, 
                "success": true, 
                "time": 1, 
                "value": 0.5
            }, 
            "miscDO": {
                "bucketId": 15, 
                "city": "北京", 
                "cityId": 110100, 
                "debug": { }, 
                "hasCoupon": false, 
                "region": "东城区", 
                "regionId": 110101, 
                "rn": "fa015e69c6a4ca4bb559805d670557e7", 
                "smartBannerFlag": "top", 
                "success": true, 
                "supportCartRecommend": false, 
                "systemTime": "1555232632711", 
                "town": "东华门街道", 
                "townId": 110101001
            }, 
            "regionalizedData": {
                "success": true
            }, 
            "sellCountDO": {
                "sellCount": "5", 
                "success": true
            }, 
            "servicePromise": {
                "has3CPromise": false, 
                "servicePromiseList": [
                    {
                        "description": "商品支持正品保障服务", 
                        "displayText": "正品保证", 
                        "icon": "无", 
                        "link": "//www.tmall.com/wow/portal/act/bzj", 
                        "rank": -1
                    }, 
                    {
                        "description": "极速退款是为诚信会员提供的退款退货流程的专享特权,额度是根据每个用户当前的信誉评级情况而定", 
                        "displayText": "极速退款", 
                        "icon": "//img.alicdn.com/bao/album/sys/icon/discount.gif", 
                        "link": "//vip.tmall.com/vip/privilege.htm?spm=3.1000588.0.141.2a0ae8&priv=speed", 
                        "rank": -1
                    }, 
                    {
                        "description": "卖家为您购买的商品投保退货运费险(保单生效以下单显示为准)", 
                        "displayText": "赠运费险", 
                        "icon": "//img.alicdn.com/bao/album/sys/icon/discount.gif", 
                        "link": "//service.tmall.com/support/tmall/knowledge-1121473.htm?spm=0.0.0.0.asbDA1", 
                        "rank": -1
                    }, 
                    {
                        "description": "七天无理由退换", 
                        "displayText": "七天无理由退换", 
                        "icon": "//img.alicdn.com/tps/i3/T1Vyl6FCBlXXaSQP_X-16-16.png", 
                        "link": "//pages.tmall.com/wow/seller/act/seven-day", 
                        "rank": -1
                    }
                ], 
                "show": true, 
                "success": true, 
                "titleInformation": [ ]
            }, 
            "soldAreaDataDO": {
                "currentAreaEnable": true, 
                "success": true, 
                "useNewRegionalSales": true
            }, 
            "tradeResult": {
                "cartEnable": true, 
                "cartType": 2, 
                "miniTmallCartEnable": true, 
                "startTime": 1554812946000, 
                "success": true, 
                "tradeEnable": true
            }, 
            "userInfoDO": {
                "activeStatus": 0, 
                "companyPurchaseUser": false, 
                "loginMember": false, 
                "loginUserType": "buyer", 
                "success": true, 
                "userId": 0
            }
        }, 
        "isSuccess": true
    }
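The JSON above is shown without any wrapper. Since the price URL in the full listing carries `callback=setMdskip`, the raw body presumably arrives as `setMdskip({...})`, and the listing's parsePrice indeed cuts out the text between the parentheses. A minimal sketch of that step follows; `stripJSONP` is a hypothetical helper, and it uses the last closing parenthesis so that a `)` inside a JSON string does not truncate the payload.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// stripJSONP removes a callback wrapper such as setMdskip({...}) and returns
// the JSON text between the first "(" and the last ")".
func stripJSONP(body string) (string, error) {
	left := strings.Index(body, "(")
	right := strings.LastIndex(body, ")")
	if left < 0 || right <= left {
		return "", errors.New("no JSONP wrapper found")
	}
	return strings.TrimSpace(body[left+1 : right]), nil
}

func main() {
	out, err := stripJSONP(`setMdskip({"isSuccess":true,"note":"a (b) c"})`)
	fmt.Println(out, err)
}
```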

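    The JSON is large, and parsing every field would be tedious. We only need the price information, i.e. priceInfo, so we want an XPath-like way to address it; here we use JSONPath.

    Reference: https://github.com/DarrenChanChenChi/jsonpath

    Usage is much like XPath.

    Then just extract the values we need.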
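As an alternative to JSONPath (not the approach this article takes), the standard library can reach the same values if a struct declares only the path `defaultModel.itemPriceResultDO.priceInfo`; `encoding/json` skips every field that is not declared. The type and function names below are hypothetical, and the sample input is a trimmed copy of the response shown earlier.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// priceResponse declares only the path down to priceInfo; all other
// fields of the response are ignored by the decoder.
type priceResponse struct {
	DefaultModel struct {
		ItemPriceResultDO struct {
			PriceInfo map[string]struct {
				Price         string `json:"price"`
				PromotionList []struct {
					Price string `json:"price"`
				} `json:"promotionList"`
			} `json:"priceInfo"`
		} `json:"itemPriceResultDO"`
	} `json:"defaultModel"`
}

// skuPrice holds the list price and, when present, the first promotion price.
type skuPrice struct {
	Price, Promotion float64
	HasPromotion     bool
}

// parsePrices returns one skuPrice per SKU ID found under priceInfo.
func parsePrices(raw []byte) (map[string]skuPrice, error) {
	var r priceResponse
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, err
	}
	out := make(map[string]skuPrice)
	for sku, info := range r.DefaultModel.ItemPriceResultDO.PriceInfo {
		p, err := strconv.ParseFloat(info.Price, 64)
		if err != nil {
			return nil, fmt.Errorf("sku %s: %w", sku, err)
		}
		sp := skuPrice{Price: p}
		if len(info.PromotionList) > 0 {
			if promo, err := strconv.ParseFloat(info.PromotionList[0].Price, 64); err == nil {
				sp.Promotion, sp.HasPromotion = promo, true
			}
		}
		out[sku] = sp
	}
	return out, nil
}

func main() {
	raw := []byte(`{"defaultModel":{"itemPriceResultDO":{"priceInfo":{
		"6310159781":{"price":"178.00","promotionList":[{"price":"75.00"}]},
		"3280089025135":{"price":"168.00","promotionList":[{"price":"68.00"}]}}}}}`)
	prices, err := parsePrices(raw)
	fmt.Println(prices["6310159781"], prices["3280089025135"], err)
}
```

The trade-off is the usual one: a typed struct catches shape changes at decode time, while a path expression is shorter to write when only one or two values are needed.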

    Full code

    common.go:

    package main
    
    import (
        "github.com/djimenez/iconv-go"
        "time"
        "net"
        "net/http"
        "gopkg.in/xmlpath.v2"
        "strings"
        "fmt"
        "math/rand"
    )
    
    type Msg struct{
        AdID int64 `json:"ad_id"`
        SourceID int64 `json:"source_id"`
        Source string `json:"source"`
        ItemID int64 `json:"item_id"`
        URL string `json:"url"`
        UID int64 `json:"uid"`
        DID int64 `json:"did"`
    }
    
    func convFromGbk(s string) string {
        gbkConvert, _ := iconv.NewConverter("gbk", "utf-8")
        res, _ := gbkConvert.ConvertString(s)
        return res
    }
    
    func newHTTPClient() *http.Client {
        client := &http.Client{
            Transport: &http.Transport{
                Dial: func(netw, addr string) (net.Conn, error) {
                    return net.DialTimeout(netw, addr, time.Duration(1500*time.Millisecond))
                },
                MaxIdleConnsPerHost: 200,
            },
            Timeout: time.Duration(1500 * time.Millisecond),
        }
        return client
    }
    
    // parseNode returns only the first non-empty match
    func parseNode(node *xmlpath.Node, xpath string) string {
        path, err := xmlpath.Compile(xpath)
        if err != nil {
            fmt.Println("compile xpath:", err)
            return ""
        }
    
        it := path.Iter(node)
        for it.Next() {
            s := strings.TrimSpace(it.Node().String())
            if len(s) != 0 {
                //return convFromGbk(s)
                return s
            }
        }
        return ""
    }
    
    // parseNodeForAll returns every non-empty match
    func parseNodeForAll(node *xmlpath.Node, xpath string) []string {
        path, err := xmlpath.Compile(xpath)
        if err != nil {
            fmt.Println("compile xpath:", err)
            return nil
        }
    
        it := path.Iter(node)
        elements := []string{}
        for it.Next() {
            s := strings.TrimSpace(it.Node().String())
            if len(s) != 0 {
                //return convFromGbk(s)
                elements = append(elements, s)
            }
        }
        return elements
    }
    
    // percent returns the possibility of pct
    func percent(pct int) bool {
        if pct < 0 || pct > 100 {
            return false
        }
        return pct > rand.Intn(100)
    }

    ali_spider.go:

    package main
    
    import (
        "code.byted.org/gopkg/logs"
        "encoding/json"
        "fmt"
        "github.com/djimenez/iconv-go"
        "github.com/ngaut/logging"
        "github.com/oliveagle/jsonpath"
        "gopkg.in/xmlpath.v2"
        "io/ioutil"
        "math/rand"
        "net/http"
        "strconv"
        "strings"
    )
    
    const itemURLPatternAli = "https://detail.tmall.com/item.htm?id=%d"
    const priceURLPatternAli = "https://mdskip.taobao.com/core/initItemDetail.htm?isUseInventoryCenter=false&cartEnable=true&service3C=false&isApparel=true&isSecKill=false&tmallBuySupport=true&isAreaSell=false&tryBeforeBuy=false&offlineShop=false&itemId=%d&showShopProm=false&isPurchaseMallPage=false&itemGmtModified=1555201252000&isRegionLevel=false&household=false&sellerPreview=false&queryMemberRight=true&addressLevel=2&isForbidBuyItem=false&callback=setMdskip&timestamp=1555210888509&isg=bBQF1SmIvk4dQ8UGBOCNIZNDTp7T7IRAguWjmN99i_5Qy1Y_p8_OlZkxNev6Vj5RsG8p46-P7M29-etfw&isg2=BPPzr6M1qyiTZGdgYB4puOBagvEXdGgbstRSkqWQUpJJpBNGLPrUOlF1XpTvBN_i"
    

    var ualist = []string{
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
    }

    type AliSpider struct {
        client *http.Client
    }
    
    func NewAliSpider() *AliSpider {
        return &AliSpider{
            client: newHTTPClient(),
        }
    }
    
    func (j *AliSpider) loadPage(url string) (*xmlpath.Node, error) {
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
        rsp, err := j.client.Do(req)
        if err != nil {
            return nil, err
        }
        defer rsp.Body.Close()
        // Transcode the body to UTF-8 while reading
        utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
        if err != nil {
            return nil, err
        }
        //if body, err := ioutil.ReadAll(utfBody); err == nil {
        //    fmt.Println("HTML content:", string(body))
        //}
        node, err := xmlpath.ParseHTML(utfBody)
        return node, err
    }
    
    func (j *AliSpider) parsePrice(itemID int64) (map[string]map[string]float64, error) {
        priceURL := fmt.Sprintf(priceURLPatternAli, itemID)
        req, err := http.NewRequest("GET", priceURL, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
        referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
        req.Header.Set("Referer", referer)
        rsp, err := j.client.Do(req)
        if err != nil {
            return nil, err
        }
        defer rsp.Body.Close()
        priceInfoRaw, err := ioutil.ReadAll(rsp.Body)
        if err != nil {
            return nil, err
        }
        priceInfo := string(priceInfoRaw)
        jsonStr := convFromGbk(priceInfo)
        // The body is wrapped in a callback (callback=setMdskip); keep only
        // the text between the parentheses.
        leftIndex := strings.Index(jsonStr, "(") + 1
        rightIndex := strings.LastIndex(jsonStr, ")")
        if rightIndex < leftIndex {
            return nil, fmt.Errorf("unexpected price response for item %d", itemID)
        }
        var json_data interface{}
        if err := json.Unmarshal([]byte(jsonStr[leftIndex:rightIndex]), &json_data); err != nil {
            return nil, err
        }
        skuQuantity, err := jsonpath.JsonPathLookup(json_data, "$.defaultModel.inventoryDO.skuQuantity")
        if err != nil {
            logs.Info("json path is err, err is %v", err)
            return nil, err
        }
        skuQuantityMap := skuQuantity.(map[string]interface{})
        itemPriceResultMap := map[string]map[string]float64{}
        for skuQuantityId := range skuQuantityMap {
            //fmt.Println(key, value)
            jpathPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.price", skuQuantityId)
            jpathPromotionPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.promotionList[0].price", skuQuantityId)
            price, err := jsonpath.JsonPathLookup(json_data, jpathPrice)
            if err != nil {
                logs.Info("jpathPrice is err, err is %v", err)
                continue
            }
            promotionPrice, err := jsonpath.JsonPathLookup(json_data, jpathPromotionPrice)
            if err != nil {
                logs.Info("jpathPromotionPrice is err, err is %v", err)
                continue
            }
            priceStr := price.(string)
            promotionPriceStr := promotionPrice.(string)
            // Allocate a fresh map per SKU; reusing one map would make every
            // SKU report the last values written.
            itemPriceResultDetailMap := map[string]float64{}
            itemPriceResultDetailMap["price"], _ = strconv.ParseFloat(priceStr, 64)
            itemPriceResultDetailMap["promotion_price"], _ = strconv.ParseFloat(promotionPriceStr, 64)
            itemPriceResultMap[skuQuantityId] = itemPriceResultDetailMap
        }
        return itemPriceResultMap, nil
    }
    
    func (j *AliSpider) Parse(msg *Msg) (map[string]interface{}, error) {
        defer func() {
            if r := recover(); r != nil {
                logging.Errorf("parse msg %v, error %v", *msg, r)
                return
            }
        }()
        itemURL := fmt.Sprintf(itemURLPatternAli, msg.ItemID)
        node, err := j.loadPage(itemURL)
        if err != nil {
            return nil, err
        }
        //metricsClient.EmitCounter("jd_spider", 1, "", map[string]string{"step": "parse"})
        name := parseNode(node, "//h1[@data-spm]")
        // Product attributes; each <li> holds a "key: value" pair, for example:
        /**
        产品名称:纽曼    (product name)
        品牌: 纽曼       (brand)
        型号: EX16       (model)
        功能: 睡眠监测 计步 防水    (features)
         */
        details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
        detailsMap := make(map[string]string, len(details))
        for _, detail := range details {
            split := strings.Split(detail, ":")
            if len(split) > 1 {
                detailsMap[split[0]] = strings.TrimSpace(split[1])
            }
        }
    
        shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")
    
        // Shop ratings: description, service, logistics
        shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
        if len(shopinfos) < 3 {
            return nil, fmt.Errorf("invalid html page %s", itemURL)
        }
        describe, _ := strconv.ParseFloat(shopinfos[0], 64)
        service, _ := strconv.ParseFloat(shopinfos[1], 64)
        logistics, _ := strconv.ParseFloat(shopinfos[2], 64)
        // Prices (several SKUs; price is the list price, promotion_price is the promotional price)
        //map[4023134073248:map[price:3299.00 promotion_price:3299.00] 4023134073249:map[price:3299.00 promotion_price:3299.00] 4200326178501:map[promotion_price:3299.00 price:3299.00] 4023134073246:map[price:3299.00 promotion_price:3299.00] 4023134073247:map[price:3299.00 promotion_price:3299.00] 4023134073245:map[price:3299.00 promotion_price:3299.00] 4023134073250:map[price:3299.00 promotion_price:3299.00]]
        itemPriceResultMap, err := j.parsePrice(msg.ItemID)
        if err != nil {
            return nil, err
        }
    
        res := map[string]interface{}{}
        res["source"] = "Ali"
        res["source_id"] = msg.SourceID
        res["id"] = msg.ItemID
        res["ad_id"] = msg.AdID
        res["url"] = itemURL
        res["name"] = name
        res["details"] = detailsMap
        res["shopname"] = shopname
        res["describe"] = describe
        res["service"] = service
        res["logistics"] = logistics
        res["uid"] = msg.UID
        res["did"] = msg.DID
        res["item_price"] = itemPriceResultMap
    
        // Validate a few fields that must be present
        if res["name"] == "" && res["shopname"] == "" {
            return nil, fmt.Errorf("invalid html page %s", itemURL)
        }
        return res, nil
    }

    ali_spider_test.go:

    package main
    
    import (
        "encoding/json"
        "fmt"
        "strconv"
        "strings"
        "testing"
    )
    
    func TestName(t *testing.T) {
        //conf, err := ssconf.LoadSsConfFile(confFile)
        //if err != nil {
        //    panic(err)
        //}
        aliSpider := NewAliSpider()
        //554867117919 585758506034
        var itemId int64 = 7664169349
        itemURL := fmt.Sprintf(itemURLPatternAli, itemId)
        node, err := aliSpider.loadPage(itemURL)
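        if err != nil {
            t.Fatalf("load page: %v", err)
        }
        //fmt.Println(node)
        name := parseNode(node, "//h1[@data-spm]")
        // Product attributes; each <li> holds a "key: value" pair, for example:
        /**
        产品名称:纽曼    (product name)
        品牌: 纽曼       (brand)
        型号: EX16       (model)
        功能: 睡眠监测 计步 防水    (features)
         */
        details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")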
        detailsMap := make(map[string]string, len(details))
        for _, detail := range details {
            split := strings.Split(detail, ":")
            if(len(split) > 1){
                detailsMap[split[0]] = strings.TrimSpace(split[1])
            }
        }
    
        shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")
    
        // Shop ratings: description, service, logistics
        shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
        if len(shopinfos) < 3 {
            t.Fatalf("expected 3 shop ratings, got %d", len(shopinfos))
        }
        describe, _ := strconv.ParseFloat(shopinfos[0], 64)
        service, _ := strconv.ParseFloat(shopinfos[1], 64)
        logistics, _ := strconv.ParseFloat(shopinfos[2], 64)
        // Prices (several SKUs; price is the list price, promotion_price is the promotional price)
        //map[4023134073248:map[price:3299.00 promotion_price:3299.00] 4023134073249:map[price:3299.00 promotion_price:3299.00] 4200326178501:map[promotion_price:3299.00 price:3299.00] 4023134073246:map[price:3299.00 promotion_price:3299.00] 4023134073247:map[price:3299.00 promotion_price:3299.00] 4023134073245:map[price:3299.00 promotion_price:3299.00] 4023134073250:map[price:3299.00 promotion_price:3299.00]]
        itemPriceResultMap, err := aliSpider.parsePrice(itemId)
    
        res := map[string]interface{}{}
        res["source"] = "Ali"
        res["url"] = itemURL
        res["name"] = name
        res["details"] = detailsMap
        res["shopname"] = shopname
        res["describe"] = describe
        res["service"] = service
        res["logistics"] = logistics
        res["item_price"] = itemPriceResultMap
    
        bytes, err := json.Marshal(res)
        if err != nil {
            fmt.Println("error is ", err)
        }
        fmt.Println(string(bytes))
    }

    Output:

    {"describe":4.9,"details":{"上市时间":"2014年冬季","乒乓底板材质":"其他","品牌":"Palio/拍里奥","型号":"TNT-1","层数":"9层","拍柄重量":"头沉柄轻","是否商场同款":"是","系列":"拍里奥TNT-1","货号":"TNT-1","颜色分类":"TNT-1直拍(短柄)1只+赠送:1海绵护边【7木+2碳】 TNT-1横拍(长柄)1只+赠送:1海绵护边【7木+2碳】 新TNT直拍(短柄)1只+赠送:1海绵护边【5木+2碳】 新TNT横拍(长柄)1只+赠送:1海绵护边【5木+2碳】"},"item_price":{"3280089025135":{"price":168,"promotion_price":68},"3280089025136":{"price":168,"promotion_price":68},"6310159781":{"price":168,"promotion_price":68},"6310159797":{"price":168,"promotion_price":68}},"logistics":4.8,"name":"正品 拍里奥乒乓球底板新TNT-1碳素快攻弧圈乒乓球拍底板球拍球板","service":4.8,"shopname":"玺源运动专营店","source":"Ali","url":"https://detail.tmall.com/item.htm?id=7664169349"}
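In the output above all four SKUs report price 168 and promotion_price 68, although the JSON response shown earlier lists 178.00 and 75.00 for SKUs 6310159781 and 6310159797. A result of this kind is what Go produces when one inner map is created once and then stored under every key, because maps are reference types. The sketch below contrasts the two patterns; the function names are hypothetical and the data is a reduced stand-in for the real prices.

```go
package main

import "fmt"

// sharedDetail reuses one inner map for every SKU.
func sharedDetail(prices map[string]float64) map[string]map[string]float64 {
	out := map[string]map[string]float64{}
	detail := map[string]float64{}
	for sku, p := range prices {
		detail["price"] = p
		out[sku] = detail // every key points at the same map
	}
	return out
}

// freshDetail allocates a new inner map per SKU.
func freshDetail(prices map[string]float64) map[string]map[string]float64 {
	out := map[string]map[string]float64{}
	for sku, p := range prices {
		out[sku] = map[string]float64{"price": p}
	}
	return out
}

func main() {
	in := map[string]float64{"6310159781": 178, "3280089025135": 168}
	shared := sharedDetail(in)
	fresh := freshDetail(in)
	fmt.Println(shared["6310159781"]["price"] == shared["3280089025135"]["price"]) // true: aliased
	fmt.Println(fresh["6310159781"]["price"], fresh["3280089025135"]["price"])     // 178 168
}
```

With the shared map, which price every SKU ends up with depends on the map iteration order, so the wrong value can differ from run to run.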
  • Original article: https://www.cnblogs.com/DarrenChan/p/10706019.html