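I once wrote a Python script to collect AWS partner information, but later lost track of where I had put it. With some free time today I wrote it again. Here it is: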

# code from blog.361way.com
import argparse

import requests
import xlsxwriter

API_URL = 'https://api.finder.partners.aws.a2z.com/search'


def increment(start, end, step):
    """
    Build an increasing sequence.

    :param start: first value
    :param end: last value (inclusive)
    :param step: step size
    :return: list of values from start to end
    """
    return list(range(start, end + 1, step))


def crawl(num, country):
    """Fetch one page of up to 10 partners starting at offset num; return the rows."""
    params = {
        'locale': 'en',
        'highlight': 'on',
        'sourceFilter': 'searchPage',
        'size': 10,
        'location': country,
        'from': num,
    }
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    results = response.json()["message"]["results"]

    rows = []
    for item in results:
        source = item['_source']
        rows.append([
            item['_id'],
            source['name'],
            country,
            source['brief_description'],
            source['current_program_status'],
            # NOTE: the original script also reads current_program_status for the
            # Customer_type column, which looks like a copy-paste slip. The intended
            # key is probably 'customer_type', but that has not been checked against
            # the API, so the original behavior is kept here.
            source['current_program_status'],
            source['description'],
            source['website'],
        ])
    return rows


def main(country, number):
    data_arry = []
    # Offsets 0, 10, 20, ... up to the last page that still contains entries.
    for num in increment(0, number - 1, 10):
        data_arry.extend(crawl(num, country))

    workbook = xlsxwriter.Workbook(country + '.xlsx')
    worksheet = workbook.add_worksheet('Sheet1')
    bold = workbook.add_format({'bold': 1})
    headings = ['ID', 'Name', 'Country', 'Brief_description', 'Current_program_status', 'Customer_type', 'Description', 'Website']
    worksheet.write_row('A1', headings, bold)
    # Data rows start at row index 1, right below the heading row.
    for row, linev in enumerate(data_arry, start=1):
        worksheet.write_row(row, 0, linev)
    workbook.close()


if __name__ == "__main__":
    # Create the ArgumentParser object
    parser = argparse.ArgumentParser(description='Give a country name and a number of entries to collect AWS partner information')

    # Add the command-line arguments
    parser.add_argument('country', type=str, help='AWS partner country name')
    parser.add_argument('number', type=int, help='number of partners to crawl')

    # Parse the command-line arguments
    args = parser.parse_args()

    # Call main with the parsed arguments
    main(args.country, args.number)
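
The row-building step in crawl can be tried offline, without calling the API. The sketch below assumes only the response shape the script reads (message, results, and _id plus _source on each result). The sample payload and the helper name rows_from_response are made up for illustration and are not real API output. Using dict.get so that a missing field becomes an empty cell is a choice of this sketch, not something the script does.

```python
# Hypothetical helper: flatten a decoded search response into spreadsheet rows.
# Keys read from each result's '_source', after 'name'.
SOURCE_FIELDS = ['brief_description', 'current_program_status', 'description', 'website']


def rows_from_response(data, country):
    rows = []
    for item in data.get('message', {}).get('results', []):
        source = item.get('_source', {})
        # A missing key yields '' instead of raising KeyError.
        row = [item.get('_id', ''), source.get('name', ''), country]
        row.extend(source.get(field, '') for field in SOURCE_FIELDS)
        rows.append(row)
    return rows


# Made-up sample payload in the assumed shape (not real partner data).
sample = {
    'message': {
        'results': [
            {
                '_id': 'example-id-1',
                '_source': {'name': 'Example Partner', 'website': 'https://example.com'},
            },
        ]
    }
}

rows = rows_from_response(sample, 'Brazil')
print(rows)
```

Each returned row is a plain list, so it can be passed directly to worksheet.write_row as the script does.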

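The script takes two arguments: the country and the number of entries to collect. The entry count is shown on the AWS partner search page. For example, opening https://partners.amazonaws.com/search/partners/?loc=Brazil shows 274 entries, so you can run python aws_partner.py Brazil 274 to fetch the results. The output is saved to Brazil.xlsx, since the file name is built from the country argument.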
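
To see how the entry count turns into requests: the API call in the script asks for 10 entries at a time (size=10) and steps through the results with the from offset. Here is a minimal sketch of that arithmetic, where page_offsets is a hypothetical helper name and not part of the script:

```python
def page_offsets(total, page_size=10):
    """Return the `from` offsets needed to cover `total` entries."""
    return list(range(0, total, page_size))


offsets = page_offsets(274)
# 274 entries at 10 per page: 28 requests, the last one starting at offset 270.
print(len(offsets), offsets[0], offsets[-1])  # 28 0 270
```

The last page holds only the remaining 4 entries, so the total row count still matches the number shown on the search page.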

Note: this code is for technical research only. Please do not use it for unauthorized data collection.