
JD Full-Site Crawler Project Preparation

Since Taobao's pages are wildly inconsistent, I switched to crawling JD instead.
JD isn't exactly simple either, though:
you have to capture and inspect its network requests to find the right endpoints.

Comment-count URL analysis

https://club.jd.com/comment/productCommentSummaries.action?my=pinglun&referenceIds=5225342,5168128,4422256,4730611,7102519,4888355,5962242,4357281,5425401,6076583,7028389,5148299,5456134,6949859,3948454,6383824,4274539,7269873,7061526,4331155,6888608,6023686,6949959,5113099,6076609,6888588,4752515,6031883,26512882412,1378536,5512841,5225346,4335139,5363894,6529483,5029717,4335674,5696028,5834183,6072622,6818156,6031973,4824715,5148309,6736174,7271737,5483596,7275691,6736180,5005927,4752553,5537833,5005929,5020872,6339280,4331185,6043002,4338107,1593516,5148275

The trailing numbers are product ids, which can be extracted from the page.

Demo:

import requests

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}

content=requests.get("https://club.jd.com/comment/productCommentSummaries.action?my=pinglun&referenceIds=5225342,5168128,4422256,4730611,7102519,4888355,5962242,4357281,5425401,6076583,7028389,5148299,5456134,6949859,3948454,6383824,4274539,7269873,7061526,4331155,6888608,6023686,6949959,5113099,6076609,6888588,4752515,6031883,26512882412,1378536,5512841,5225346,4335139,5363894,6529483,5029717,4335674,5696028,5834183,6072622,6818156,6031973,4824715,5148309,6736174,7271737,5483596,7275691,6736180,5005927,4752553,5537833,5005929,5020872,6339280,4331185,6043002,4338107,1593516,5148275",
                   headers=header).json()
for i in content['CommentsCount']:
    print(i['ProductId'])
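
In the real crawler the referenceIds list would come from product ids scraped off a list page rather than being pasted in by hand. A minimal sketch of that batching, assuming the endpoint keeps accepting a comma-separated batch; only ProductId is confirmed above, the CommentCount field is an assumption:

import requests

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}

def comment_counts(product_ids):
    # Join scraped product ids into the referenceIds parameter;
    # the request above batches roughly 60 ids per call
    url = ("https://club.jd.com/comment/productCommentSummaries.action"
           "?my=pinglun&referenceIds=" + ",".join(str(i) for i in product_ids))
    return requests.get(url, headers=header).json()

data = comment_counts([5225342, 5168128, 4422256])  # ids taken from the URL above
for item in data['CommentsCount']:
    # 'CommentCount' is assumed from the endpoint's usual summary fields
    print(item['ProductId'], item.get('CommentCount'))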

Prices for all the products on the page (this is the batch price endpoint; skuIds carries the "J_"-prefixed SKU ids):

import requests

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}

content=requests.get("https://p.3.cn/prices/mgets?ext=11000000&pin=&type=1&area=1_72_2799_0&skuIds=J_1434455450%2CJ_1466272503%2CJ_1466272508%2CJ_1466272514%2CJ_1519116908%2CJ_1519116915%2CJ_1519117739%2CJ_28072638552%2CJ_28072638547%2CJ_13464864437%2CJ_13464864430%2CJ_27724755402%2CJ_27788642881%2CJ_27788642878%2CJ_28003980407%2CJ_28003980421%2CJ_28053456103%2CJ_28053448991%2CJ_27958477784%2CJ_27781523471%2CJ_27781064430%2CJ_10194370150%2CJ_27958099807%2CJ_10193529301%2CJ_27728328381%2CJ_27728328387%2CJ_27576387869%2CJ_27576387872%2CJ_27727779978%2CJ_27727779981%2CJ_27956120834%2CJ_27956120861%2CJ_27956120840%2CJ_27866710401%2CJ_27866703972%2CJ_27866703985%2CJ_27866703977%2CJ_27866703991%2CJ_27847092260%2CJ_27819773265%2CJ_27799405477%2CJ_27933278831%2CJ_27871443148%2CJ_11676644172%2CJ_11676644176%2CJ_11816768869%2CJ_25748218517%2CJ_25748218504%2CJ_25748218524%2CJ_25748218508%2CJ_25748218530%2CJ_27898320183%2CJ_27515649559%2CJ_27515649557%2CJ_11809273727%2CJ_11809273710%2CJ_11809273704%2CJ_11809273711%2CJ_11809273724%2CJ_11809273716%J_26820030852,J_26820030854,J_27818982583,J_27816465779,J_27796973907,J_28072638542,J_13464864431,J_27857534144,J_27636865207,J_27622452488,J_27724748896,J_27788642869,J_27896955987,J_28003980416,J_28053448998,J_27958477776,J_27781523480,J_27781064439,J_10091014168,J_27958089199,J_28027496834,J_27853088484,J_26454632985,J_27896984645,J_26926788611,J_10089684359,J_27896133952,J_27894868874,J_27728328375,J_27576387862,J_27713847701,J_27727779974,J_27956120850,J_27256476147,J_13184248017,J_27791001471,J_27859908239,J_27800807316,J_28004448107,J_27866710408,J_27847092261,J_27654425536,J_27371264993,J_27310773320,J_27819773259,J_27799405469,J_27933278824,J_27871443151,J_11816768875,J_25748207096,J_26783265758,J_26608305917,J_27898320184,J_11885391619,J_12172612568,J_11809273725,J_26820030856,J_27716982142,J_10564246875,J_27275697600&pdbp=0&pdtk=&pdpin=&pduid=1277234422&source=list_pc_front&_=1526460614494",
                   headers=header).json()
print(content)
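
The response is a JSON array with one element per SKU. A sketch of turning it into a sku → price map; the 'id' and 'p' (current price) field names are assumptions based on what this endpoint returned at the time, not something confirmed above:

import requests

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}

def batch_prices(sku_ids):
    # skuIds is a comma-separated list of "J_"-prefixed SKU ids
    url = ("https://p.3.cn/prices/mgets?type=1&area=1_72_2799_0&skuIds="
           + ",".join("J_" + str(s) for s in sku_ids))
    items = requests.get(url, headers=header).json()
    # each element is assumed to look like {"id": "J_1434455450", "p": "2599.00", ...}
    return {item['id'][2:]: item.get('p') for item in items}

print(batch_prices([1434455450, 1466272503]))  # SKU ids taken from the URL above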

Starting-page demo: JD's full category listing

import requests
from lxml import etree

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}
content = requests.get("https://www.jd.com/allSort.aspx", headers=header).text
selector = etree.HTML(content)
lis = []
# Walk each top-level category block, then every third-level link inside it
for i in selector.xpath("/html/body/div[5]/div[2]/div[1]/div[2]/div[1]/div"):
    for j in i.xpath("./div[2]/div[3]/dl/dd/a"):
        url = j.xpath("./@href")[0]
        # Keep only list pages of the form //list.jd.com/list.html?cat=...
        if 'html?cat=' in url:
            lis.append(url)
            print(url)
            print(j.xpath("./text()"))
print(len(lis))

Sample output (excerpt):

#['坡跟鞋']
#//list.jd.com/list.html?cat=11729,11731,12062
#['松糕鞋']
#//list.jd.com/list.html?cat=11729,11731,12063
#['内增高']
#//list.jd.com/list.html?cat=11729,11731,12064
#['防水台']
#594

594 category URLs in total.
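
The hrefs collected above are protocol-relative (they start with //), so before handing them to a crawler they need a scheme, and duplicates are worth dropping. A small follow-up sketch that reuses the lis list from the script above:

# Normalize protocol-relative hrefs like //list.jd.com/list.html?cat=11729,11731,12062
start_urls = []
seen = set()
for url in lis:
    full = 'https:' + url if url.startswith('//') else url
    if full not in seen:
        seen.add(full)
        start_urls.append(full)
print(len(start_urls))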

------------------------------ 5/25 ------------------------------

It turns out JD loads only the first half of each page's products up front; the bottom half is only fetched once you scroll down.

And requesting that AJAX URL directly doesn't work: you have to add a Referer header, where the Referer points to the previous page.

import requests
from lxml import etree
header = {
    'Referer': 'https://search.jd.com/Search?keyword=%E7%94%B5%E8%A7%86%E6%9C%BA32%E5%AF%B8&enc=utf-8&spm=2.1.1'
}
html=requests.get('https://search.jd.com/s_new.php?keyword=%E7%94%B5%E8%A7%86%E6%9C%BA32%E5%AF%B8&enc=utf-8&qrst=1&rt=1&stop=1&spm=2.1.1&vt=2&stock=1&page=2&s=28&scrolling=y&log_id=1527258651.80467&tpl=1_M&show_items=4620979,647948,6930429,1257425,6316018,941189,1242386,4939521,6642769,4026319,2076435,11190449495,7430597,1879521,6255702,2763391,2397940,5051427,1766255,6958126,4902977,4272782,2127675,14458294028,2029437,26517322985,6716598,4622697,4208339,6916461'
                  ,headers=header).content.decode('utf-8')
selector = etree.HTML(html)
count = 0
# '//li' matches every <li> on the page, so non-product rows print as empty lists
for sel in selector.xpath('//li'):
    # //*[@id="plist"]/ul/li[1]/div/div[4]/a/em/text()
    # Lazy-loaded images keep the real URL in data-lazy-img rather than src,
    # so query both attributes
    productUrl = sel.xpath('./div/div[@class="p-img"]/a/img[1]/@src|./div/div[@class="p-img"]/a/img/@data-lazy-img')
    print(productUrl)
    count += 1
print(count)
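
Putting the two halves together: fetch the ordinary Search page first, pull out the SKU ids of the items it already contains, then call s_new.php for the second half with those ids in show_items and the first page's URL as the Referer. A sketch under two assumptions that aren't confirmed above: each product <li> carries a data-sku attribute, and JD numbers the halves internally as pages 2N-1 and 2N for visible page N (the page=2&s=28 request above fits that pattern):

import requests
from lxml import etree

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}

def fetch_full_page(keyword, page):
    # First half: the normal search page (internal page 2N-1)
    first = requests.get('https://search.jd.com/Search',
                         params={'keyword': keyword, 'enc': 'utf-8',
                                 'page': 2 * page - 1},
                         headers=header)
    sel = etree.HTML(first.content.decode('utf-8'))
    # data-sku attributes on the product <li> nodes are an assumption
    skus = sel.xpath('//li[@data-sku]/@data-sku')
    # Second half: the scroll-triggered endpoint refuses requests without
    # a Referer, so point it at the page we just fetched
    second = requests.get('https://search.jd.com/s_new.php',
                          params={'keyword': keyword, 'enc': 'utf-8',
                                  'page': 2 * page, 'scrolling': 'y',
                                  'show_items': ','.join(skus)},
                          headers=dict(header, Referer=first.url))
    return first.content.decode('utf-8'), second.content.decode('utf-8')

first_html, second_html = fetch_full_page('电视机32寸', 1)  # "32-inch TV", the keyword from the demo above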
