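I once wrote a Python script to collect Azure partner information, but later forgot where I put it. Since I looked into the AWS equivalent just yesterday, today I am looking at Azure. The query page is https://appsource.microsoft.com/en-us/marketplace/partner-dir?filter=products%3DAzure. Analysis shows that the real data comes from the JSON endpoint https://main.prod.marketplacepartnerdirectory.azure.com/api/partners?filter=, with the filter conditions appended after it.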

The filter strings follow a pattern like this:

products=Azure;sort=1;pageSize=18;pageOffset=18;onlyThisCountry=true;country=BR;radius=100;locname=Brazil;locationNotRequired=true
products=Azure;sort=1;pageSize=18;pageOffset=36;onlyThisCountry=true;country=BR;radius=100;locname=Brazil;locationNotRequired=true
products=Azure;sort=1;pageSize=18;pageOffset=54;onlyThisCountry=true;country=BR;radius=100;locname=Brazil;locationNotRequired=true
products=Azure;sort=1;pageSize=18;pageOffset=72;onlyThisCountry=true;country=BR;radius=100;locname=Brazil;locationNotRequired=true
products=Azure;sort=1;pageSize=18;pageOffset=90;onlyThisCountry=true;country=BR;radius=100;locname=Brazil;locationNotRequired=true
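  • sort=1 or sort=0: the two values select two different sort orders. I have not yet found how they differ, but in both cases the results are limited to the selected region.
  • radius=100: restricts results to within 100 km of the location; it can be omitted.
  • onlyThisCountry=true;country=BR;radius=100;locname=Brazil: it is best to set both country and locname. For example, if you search for brazil on the page, one of the suggested places is in the United States, so specifying both is more precise.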
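The filter pattern above can be sketched as a small helper that builds the request URL for a given page. This is only an illustration: the function name `build_partner_url` and the zero-based `page` argument are assumptions made here, and `radius` is left out because it is optional.

```python
from urllib.parse import quote, unquote

API_BASE = "https://main.prod.marketplacepartnerdirectory.azure.com/api/partners?filter="
PAGE_SIZE = 18  # page size observed in the filter strings above


def build_partner_url(shortname, country, page, sort=1):
    """Return the API URL for a zero-based page (hypothetical helper)."""
    fields = [
        ("products", "Azure"),
        ("sort", sort),
        ("pageSize", PAGE_SIZE),
        ("pageOffset", page * PAGE_SIZE),  # the offset advances by one page size
        ("onlyThisCountry", "true"),
        ("country", shortname),
        ("locname", country),
        ("locationNotRequired", "true"),
    ]
    params = ";".join(f"{key}={value}" for key, value in fields)
    # The whole filter string is percent-encoded, so ';' and '=' are escaped.
    return API_BASE + quote(params)


url = build_partner_url("BR", "Brazil", page=1)
print(unquote(url))
```

Decoding the URL with `unquote` gives back a readable filter string, which is handy for comparing against what the browser sends.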

With the general pattern understood, the script looks like this:

# code from blog.361way.com
import argparse
import json
import urllib.parse

import requests
import xlsxwriter

PAGE_SIZE = 18  # the API returns 18 partners per request


def crawl(shortname, country):
    num = 0
    while True:
        # Build the filter string, e.g. country=BR;locname=Brazil
        params = ('products=Azure;sort=1;pageSize=' + str(PAGE_SIZE)
                  + ';pageOffset=' + str(num)
                  + ';onlyThisCountry=true;country=' + shortname
                  + ';locname=' + country + ';locationNotRequired=true')
        url = 'https://main.prod.marketplacepartnerdirectory.azure.com/api/partners?filter=' + urllib.parse.quote(params)
        num = num + PAGE_SIZE
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        data = json.loads(response.text)
        pdata = data['matchingPartners']['items']
        for partner in pdata:
            partnerId = partner['partnerId']
            name = partner['name']
            description = partner['description']
            # List fields are joined so that each one fits into a single cell
            product = '\n'.join(partner['product'])
            solutions = '\n'.join(partner['solutions'])
            serviceType = '\n'.join(partner['serviceType'])
            address = str(partner['location']['address'])
            linkedIn = partner['linkedInOrganizationProfile']
            print(partnerId)
            data_arry.append([partnerId, name, country, description, product, solutions, serviceType, address, linkedIn])

        # A page with fewer than 18 items is the last page
        if len(pdata) < PAGE_SIZE:
            break


def main(shortname, country):
    crawl(shortname, country)
    workbook = xlsxwriter.Workbook(country + '.xlsx')
    worksheet = workbook.add_worksheet('Sheet1')
    bold = workbook.add_format({'bold': 1})
    headings = ['PartnerId', 'Name', 'Country', 'Description', 'Product', 'Solutions', 'ServiceType', 'Address', 'LinkedIn']
    worksheet.write_row('A1', headings, bold)
    row = 1
    col = 0
    for linev in data_arry:
        worksheet.write_row(row, col, linev)
        row += 1
    workbook.close()


if __name__ == "__main__":
    data_arry = []
    # Create the ArgumentParser object
    parser = argparse.ArgumentParser(description='Input the country shortname and country name to get the Azure partner information')

    # Add the command-line arguments
    parser.add_argument('shortname', type=str, help='Please input the country shortname, for example: Brazil is BR')
    parser.add_argument('country', type=str, help='Please input the Azure partner country name, for example: Brazil')

    # Parse the command-line arguments
    args = parser.parse_args()

    # Call main with the parsed arguments
    main(args.shortname, args.country)

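The script takes two arguments: the country short code and the country name. For Mexico, for example, the short code is MX and the name is Mexico; passing these two arguments collects the corresponding data. Because each JSON request returns 18 records, the loop exits automatically as soon as a request returns fewer than 18 entries, since that means the last page has been reached.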
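The stop condition can be tried out without touching the network. The sketch below is a minimal illustration, assuming a hypothetical `fetch_page(offset)` callable that returns one page of items as a list; `fake_fetch` and its 40 made-up partners stand in for the real API.

```python
def collect_all(fetch_page, page_size=18):
    """Call fetch_page(offset) repeatedly until a page comes back short."""
    items = []
    offset = 0
    while True:
        page = fetch_page(offset)
        items.extend(page)
        # Fewer items than a full page means this was the last page.
        if len(page) < page_size:
            break
        offset += page_size
    return items


# Stand-in for the real API: 40 fake partners served 18 at a time.
FAKE_PARTNERS = [f"partner-{i}" for i in range(40)]


def fake_fetch(offset, page_size=18):
    return FAKE_PARTNERS[offset:offset + page_size]


result = collect_all(fake_fetch)
print(len(result))
```

One consequence of this rule: when the total is an exact multiple of 18, the loop makes one extra request that returns an empty page before it stops.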

Fetching several countries can be put into a single bash script:

python azure_partner.py MX Mexico
python azure_partner.py CL Chile
python azure_partner.py AR Argentina
python azure_partner.py CO Colombia
python azure_partner.py CR 'Costa Rica'
python azure_partner.py DO 'Dominican Republic'
python azure_partner.py GT Guatemala
python azure_partner.py HN Honduras
python azure_partner.py PA 'Panama Canal, Panama'
python azure_partner.py PE Peru
python azure_partner.py EC Ecuador

Because some country names contain spaces, wrap them in single quotes.
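If the country list grows, the command lines could also be generated rather than typed by hand. The sketch below is one possible approach, not part of the original script: it uses `shlex.join` from the standard library (Python 3.8+), which adds single quotes only where an argument needs them. The helper name `build_commands` and the short country list are illustrative.

```python
import shlex

# A few (short code, country name) pairs taken from the bash script above.
COUNTRIES = [("MX", "Mexico"), ("CR", "Costa Rica"), ("DO", "Dominican Republic")]


def build_commands(countries, script="azure_partner.py"):
    """Return one shell-safe command line per country (hypothetical helper)."""
    return [shlex.join(["python", script, code, name]) for code, name in countries]


for command in build_commands(COUNTRIES):
    print(command)
```

Names without spaces are left unquoted, while names such as Costa Rica come out wrapped in single quotes.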

Note: this code is for technical research only; please do not use it for illegal data collection.