在进行json处理的时候,经常会遇到需要提取不同级别的字段内容并合并成表格,见下图所示。本篇就总结下遇到这种情况下如何进行文本内容的提取。在知乎上看到一篇《骚操作!嵌套 JSON 秒变 Dataframe》似乎和我遇到的情况类似,不过在实际处理的时候又有不同。

json-nested
json-nested

先看我要处理的原始数据:

 1{
 2  "code": 0,
 3  "error": "",
 4  "message": "",
 5  "data": {
 6    "page": 1,
 7    "page_size": 3,
 8    "total": 1026,
 9    "list": [
10      [
11        {
12          "_object_id": "App",
13          "instanceId": "5b655ab36b9c3",
14          "name": "nginx"
15        },
16        {
17          "_object_id": "VIRTUAL_MACHINE",
18          "instanceId": "5b793e7f8f932",
19          "name": "CSSCU_VDU-18"
20        },
21        {
22          "_object_id": "PHYSICAL_MACHINE",
23          "instanceId": "5b471103effff",
24          "name": "IDC-0D12-S-SRV-41"
25        },
26        {
27          "_object_id": "SWITCH",
28          "instanceId": "5b4714cb11685",
29          "name": "IDC-0D11-M-TOR-01"
30        },
31        {
32          "_object_id": "SWITCH",
33          "instanceId": "5b4714ca14f5d",
34          "name": "IDC-0G06-M-EOR-02"
35        }
36      ],
37      [
38        {
39          "_object_id": "App",
40          "instanceId": "5b655ab36b9c3",
41          "name": "nginx"
42        },
43        {
44          "_object_id": "VIRTUAL_MACHINE",
45          "instanceId": "5b793e7f8f932",
46          "name": "CSSCU_VDU-18"
47        },
48        {
49          "_object_id": "PHYSICAL_MACHINE",
50          "instanceId": "5b471103effff",
51          "name": "IDC-0D12-S-SRV-41"
52        },
53        {
54          "_object_id": "SWITCH",
55          "instanceId": "5b4714cb11685",
56          "name": "IDC-0D11-M-TOR-01"
57        },
58        {
59          "_object_id": "SWITCH",
60          "instanceId": "5b4714c98cbd1",
61          "name": "IDC-0F06-M-EOR-01"
62        }
63      ],
64      [
65        {
66          "_object_id": "App",
67          "instanceId": "5b655ab36b9c3",
68          "name": "nginx"
69        },
70        {
71          "_object_id": "VIRTUAL_MACHINE",
72          "instanceId": "5b793e7f8f932",
73          "name": "CSSCU_VDU-18"
74        },
75        {
76          "_object_id": "PHYSICAL_MACHINE",
77          "instanceId": "5b471103effff",
78          "name": "IDC-0D12-S-SRV-41"
79        },
80        {
81          "_object_id": "SWITCH",
82          "instanceId": "5b4714ca966ae",
83          "name": "IDC-0D12-M-TOR-02"
84        },
85        {
86          "_object_id": "SWITCH",
87          "instanceId": "5b4714ca14f5d",
88          "name": "IDC-0G06-M-EOR-02"
89        }
90      ]
91    ]
92  }
93}

上面是一条CMDB路径处理下的三条数据,对应的路径是 APP1(nginx) — 对应的虚拟机—虚拟机所在的物理机 — TOR交换 — EOR交换机。直接通过pandas进行读取截取时,可以获取到如下信息:

1import json
2import pandas as pd
3f = open("vhost.json")
4results = json.load(f)
5print(results["data"]["list"])
6df = pd.DataFrame(results["data"]["list"],columns = ["APP","VM" , "PY", "TOR", "EOR"])

此时处理过以后,虽然可以正常返回对应的五列数据,不过每列的数据里又有json。

pandas-json
pandas-json

参考w3resource 上关于json的处理方法,多次折腾后,发现还是不对。所以就换笨方法,for循环解决。代码如下:

1for col in ["APP","VM" , "PY", "TOR", "EOR"]:
2    df[col] = df.apply(lambda df: df[col]["name"], axis=1)

处理完可以正常返回每个单元格里的json里的name字段。由于返回的这里是有重复值的,比如APP这里都是nginx,所以还会涉及到汇总去重统计,可以直接调用df.nunique()函数搞定。需要拿到每列的去重具体结果,可以使用DataFrame.drop_duplicates方法搞定。