1.Source data_parser_plugin
- 从数据源获取的数据,然后对其解析,提取想要的数据字段
2.支持插件
Plugins |
说明 |
文档衔接 |
json |
解析json数据格式数据 |
|
xml |
使用XMLReader提取解析器获取要迁移的XML数据 |
|
simple_xml |
使用SimpleXML API获取要迁移的XML数据 |
|
soap |
获取用于迁移的SOAP数据 |
|
dom_crawler |
DomCrawler组件处理解析HTML数据 |
|
3. JSON配置结构
- ==备注: 数据源参考下面 #示例数据源格式#==
- 示例数据源参考 #5.1
data_parser_plugin: json,
item_selector: entities // 选择一个数组key
fields:
-
"name":"roles",
"label":"roles",
"selector":"post/name" //获取entites.post.name的值
-
"name":"id",
"label":"roles",
"selector":"post/id"
3.1.字段说明
变量 |
值 |
说明 |
item_selector |
数据源key |
获取entities数组全部数据 |
name |
自定义字段名称 |
|
label |
说明 |
|
selector |
{key}/{key} |
是有可以加/模式后期索引的值, post/name等同于 entites["post"]["name"] |
4. dom_crawler 配置结构
- ==备注: 数据源参考下面==
- 主要用于html数据采集 #5.2
data_parser_plugin: dom_crawler,
item_selector: ".winstyle55195 tr[height=\"30\"]" //选择html范围
fields:
-
"name":"link",
"label":"html",
"selector":"td > a" //选择 td a 里的数据
"attribute":"href" //选择a标签里href属性的值
-
"name":"title",
"label":"text",
"selector":"td > a"
"attribute":"title" //读取 a标签里的属性title值
-
"name":"a_info",
"label":"a_info",
"selector":"td > a"
"attribute": ["title","href"] //读取 a标签里的属性title和href
-
"name":"date",
"label":"date",
"selector":"td > span.timestyle55195"
"attribute":"_text" // 获取timestyle55195里所有的文本数据
4.1.字段说明
变量 |
值 |
说明 |
item_selector |
css标识符 |
获取html结构范围 |
name |
自定义字段名称 |
|
label |
说明 |
|
selector |
css标识符 |
选择具体css要提取的数据 |
attribute |
href,title,_text,src等 |
提取标签里的属性,一次获取多个属性使用:["href","title"],_text标书获取一个纯文本数据 |
5.演示数据源
5.1 JSON数据源
{
"errno": 0,
"ecode": "SUCCEED",
"error": "Succeed.",
"entities": [
{
"post": {
"code": "QUYUXIAOSHOUJINGLI",
"formal": true,
"id": "060be905-4eb2-11e8-877e-00163e051882",
"name": "区域销售经理"
},
"dept": {
"code": "2060006",
"parent": "20600",
"independent": false,
"id": "6cdd2f9a-4651-11e9-af16-00163e051882",
"name": "华东南京营销"
},
"code": "200003020",
"source": "PULL"
},
{
"post": {
"code": "YUANGONG",
"formal": true,
"id": "060a0b5c-4eb2-11e8-877e-00163e051882",
"name": "员工"
},
"dept": {
"code": "2060006",
"parent": "20600",
"independent": false,
"id": "6cdd2f9a-4651-11e9-af16-00163e051882",
"name": "华东南京营销"
},
"code": "201803020",
"source": "PULL"
},
{
"post": {
"code": "FACULTY",
"formal": false,
"id": "9231b576-e0e4-11e5-aac5-00163e0226a1",
"name": "Faculty"
},
"dept": {
"code": "10800",
"parent": "0",
"independent": true,
"id": "05f36f2e-4eb2-11e8-877e-00163e051882",
"name": "营销部"
},
"code": "",
"source": "BUILTIN"
}
]
}
5.2 html数据源
备注:来源 http://www.dgpt.edu.cn/index/tzgg.htm
<table width="100%" class="winstyle55195">
<tbody><tr id="line55195_0" height="30">
<td width="1" nowrap=""><span class="leaderfont55195">· </span></td>
<td width="100%" style="font-size:9pt">
<a class="c55195" href="../info/1009/9946.htm" target="_blank" title="东莞职业技术学院2020届高校(东莞)毕业生春季网络招聘会邀请函">东莞职业技术学院2020届高校(东莞)毕业生春季网络招聘会邀请函
</a>
</td>
<td width="1%" nowrap=""><span class="timestyle55195">2020-03-28 </span></td>
<td width="1%" nowrap=""></td>
</tr>
<tr id="line55195_1" height="30">
<td width="1" nowrap=""><span class="leaderfont55195">· </span></td>
<td width="100%" style="font-size:9pt">
<a class="c55195" href="../info/1009/9855.htm" target="_blank" title="关于对2020年广东省科技创新战略专项资金项目(攀登计划专项)拟推报项目的公示">关于对2020年广东省科技创新战略专项资金项目(攀登计划专项)拟推报项目的公...
</a>
</td>
<td width="1%" nowrap=""><span class="timestyle55195">2020-01-03 </span></td>
<td width="1%" nowrap=""></td>
</tr>
<tr><td colspan="3" align="left">
<table cellpadding="0" cellspacing="0" border="0">
<tbody><tr><td colspan="0"><table cellspacing="0" class="headStyle1h43iuqoza" width="100%" cellpadding="1"><tbody><tr valign="middle"><td nowrap="" align="left" width="1%" id="fanye55195">共38条 1/2 </td><td nowrap="" align="left"><div><span class="PrevDisabled">首页</span><span class="PrevDisabled">上页</span><a href="tzgg/1.htm" class="Next">下页</a><a href="tzgg/1.htm" class="Next">尾页</a> <input align="absmiddle" type="button" class="defaultButtonStyle" id="gotopagebut" name="a55195Find" value="转到" onclick="javascript:a55195_gopage_fun()"><input size="2" align="absmiddle" class="defaultInputStyle" name="a55195GOPAGE" id="a55195GOPAGE" value="" style="margin-left:1px;margin-right:1px">页</div></td></tr></tbody></table>
</td></tr></tbody></table>
</td></tr>
</tbody></table>