1. URL endpoint
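- ==Note: the configuration below is shown as YAML. Every parameter can also be set visually through a form in the back-end API management interface.==
- Demo 1: obtain a token through OAuth2 authorization, then use the token to fetch Infoplus position information
- Demo 2: fetch an HTML page, then extract and transform the relevant information
- The portal already ships with configurations for common third-party data interfaces; the specific parameters can be inspected in the back end
  - Infoplus / oauth2
  - NetEase Mail (网易邮箱) / RSA encryption
  - QQ Mail / oauth2
  - 迪塔维 / http
  - 致远OA / token
  - 希嘉 / oauth2 token
  - HTML scraping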
2. Fetch Infoplus position information via OAuth2 authorization
2.1. Configuration for obtaining the Infoplus token
settings:
  api:
    etl_config:
      source:
        data_fetcher_plugin: guzzle_http
        http:
          method: GET
          timeout: 15
        authentication:
          plugin: oauth2_client
          oauth2:
            grant_type: client_credentials
            urlAccessToken: http://sandbox.qtgl.com.cn/infoplus/oauth2/token
            clientId: demo
            clientSecret: demosecret
            scopes: sys_profile
            timeout: '15'
        plugin: url
        ids: []
        api_id: '3'
      process: []
      destination:
        plugin: api_response_data
      id: api_id_3
      label: 获infoplus的token
      migration_group: API
id: '3'
type: url
status: 1
title: 获infoplus的token
url_alias: "/demo/infoplus/accesstoken"
access_roles:
  - authenticated
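
The `oauth2` block above describes a client-credentials token exchange. As a rough illustration only, the Python sketch below builds (but does not send) the kind of request such a configuration implies. The function name `build_token_request` is hypothetical, and placing the credentials in the form-encoded body is an assumption based on RFC 6749; this document does not state how the portal's `oauth2_client` plugin actually transmits them.

```python
import urllib.parse
import urllib.request

# Values copied from the oauth2 block of the config above.
OAUTH2 = {
    "grant_type": "client_credentials",
    "urlAccessToken": "http://sandbox.qtgl.com.cn/infoplus/oauth2/token",
    "clientId": "demo",
    "clientSecret": "demosecret",
    "scopes": "sys_profile",
}


def build_token_request(cfg):
    """Build (but do not send) a client-credentials token request.

    Assumption: credentials travel in the form-encoded body, which
    RFC 6749 permits; the portal plugin may use HTTP Basic auth instead.
    """
    body = urllib.parse.urlencode({
        "grant_type": cfg["grant_type"],
        "client_id": cfg["clientId"],
        "client_secret": cfg["clientSecret"],
        "scope": cfg["scopes"],
    }).encode("ascii")
    return urllib.request.Request(
        cfg["urlAccessToken"],
        data=body,
        method="POST",  # RFC 6749 requires POST for token requests
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_token_request(OAUTH2)
```

Passing `request` to `urllib.request.urlopen` would perform the exchange; the JSON response would then be expected to carry the `access_token` that section 2.2 reads.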
2.2. Use the token to fetch data from the positions endpoint, then process it. The configuration is as follows
settings:
  api:
    etl_config:
      source:
        urls:
          - http://sandbox.qtgl.com.cn/infoplus/apis/v2/user/[current-user:name]/positions
        data_fetcher_plugin: guzzle_http
        http:
          method: GET
          query:
            access_token: "[api:svc:pm.parser::getApiResponseData(3,access_token)]"
          timeout: 15
        data_parser_plugin: json
        item_selector: entities
        fields:
          - name: roles
            label: roles
            selector: "/post/name"
          - name: dept
            label: dept
            selector: dept/name
        plugin: url
        ids: []
        api_id: '6'
      process:
        roles:
          - plugin: get
            source: roles
        rids:
          - plugin: entity_lookup
            source: roles
            entity_type: user_role
            value_key: label
        dept:
          - plugin: entity_lookup
            source: dept
            entity_type: taxonomy_term
            bundle_key: vid
            bundle: department
            ignore_case: 'true'
            value_key: name
        dept_name:
          - plugin: get
            source: dept
      destination:
        plugin: api_response_data
      id: api_id_6
      label: 获取岗位
      migration_group: API
id: '6'
type: url
status: 1
title: 获取岗位
url_alias: "/demo/infoplus/me/positions"
access_roles:
  - authenticated
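
To make the roles of `item_selector` and the `fields` selectors in the config above more concrete, here is a minimal Python sketch of slash-separated selection over a JSON payload. The helper names `select` and `extract_rows` are hypothetical, the sample payload is invented (the real Infoplus response is not shown in this document), and treating a leading `/` the same as no leading slash is an assumption.

```python
def select(data, path):
    """Follow a slash-separated selector such as "dept/name" through nested dicts."""
    for key in path.strip("/").split("/"):
        if not isinstance(data, dict) or key not in data:
            return None  # selector does not match this item
        data = data[key]
    return data


def extract_rows(payload, item_selector, fields):
    """Return one flat row per item found under item_selector."""
    items = select(payload, item_selector) or []
    return [
        {field["name"]: select(item, field["selector"]) for field in fields}
        for item in items
    ]


# Field definitions copied from the config above.
FIELDS = [
    {"name": "roles", "selector": "/post/name"},
    {"name": "dept", "selector": "dept/name"},
]

# Hypothetical response body, shaped only to fit the selectors.
SAMPLE = {
    "entities": [
        {"post": {"name": "员工"}, "dept": {"name": "华东南京营销"}},
    ]
}

rows = extract_rows(SAMPLE, "entities", FIELDS)
```

Each resulting row is what the `process` section then receives as its source fields (`roles`, `dept`).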
2.3. Final returned data
[
  [
    {
      "roles": "区域销售经理",
      "dept": "16",
      "dept_name": "华东南京营销"
    }
  ],
  [
    {
      "roles": "员工",
      "rids": "yuangong",
      "dept": "16",
      "dept_name": "华东南京营销"
    }
  ],
  [
    {
      "roles": "Faculty",
      "dept_name": "营销部"
    }
  ]
]
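
Note that the response is a list of single-row lists, and that `rids` and `dept` do not appear in every row (presumably when a lookup finds no match, though this document does not say so). A consumer might handle that along the following lines; this is only a sketch, and the helper name `flatten` is hypothetical.

```python
# Response copied from the example above.
RESPONSE = [
    [{"roles": "区域销售经理", "dept": "16", "dept_name": "华东南京营销"}],
    [{"roles": "员工", "rids": "yuangong", "dept": "16", "dept_name": "华东南京营销"}],
    [{"roles": "Faculty", "dept_name": "营销部"}],
]


def flatten(response):
    """Collapse the list-of-lists response into a flat list of rows."""
    return [row for group in response for row in group]


rows = flatten(RESPONSE)
# Optional keys: read them with .get() so absent values become None.
role_ids = [row.get("rids") for row in rows]
# Distinct department IDs, skipping rows that have no "dept" key.
dept_ids = sorted({row["dept"] for row in rows if "dept" in row})
```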
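2.4. Structure reference

| Variable | Value | Description |
| --- | --- | --- |
| source | url | Defines the source plugin. |
| urls | [] | The request URLs. Multiple URLs are supported, for example for paginated requests; all of them must return the same data structure. |
| data_fetcher_plugin | guzzle_http | A plugin that handles HTTP requests. |
| http | {"method" ...} | Defined according to the Guzzle plugin's parameters; see its manual for details. |
| authentication | {"plugin":"oauth2"} | Support for the standard OAuth2 protocol. |
| data_parser_plugin | json | Parser for the response body. `xml` is also listed, and section 3 uses `dom_crawler`. |
| item_selector | {entities} | Selects, based on the structure of the response, the key whose data should be read. Infoplus returns a structure of the form {"entities":{}}, so the data under `entities` is selected. |
| fields | name, selector | name: the name of the data item; selector: which field to read. Levels of a nested array are separated with `/`. |
| ids | {"ids":{"<field_name>":{"type":""}}} | Unique identifier, mainly used to map onto the destination, for example the unique primary key in the data. Multiple values are supported. |
| process | {field_name}:["plugin":{get}],"source":{source_field_name} | A custom field name whose value is produced by transforming the corresponding source field through one or more plugins. See the list of plugins available to process. |
| destination | api_response_data | Emits the result as JSON data. |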
3. Fetch HTML, then extract and transform it
settings:
  api:
    etl_config:
      source:
        urls:
          - http://www.dgpt.edu.cn/index/tzgg.htm
        data_fetcher_plugin: guzzle_http
        http:
          method: GET
          timeout: 15
        data_parser_plugin: dom_crawler
        item_selector: .winstyle55195 tr[height="30"]
        fields:
          - name: link
            label: html
            selector: td > a
            attribute: href
          - name: title
            label: text
            selector: td > a
            attribute: title
          - name: date
            label: date
            selector: td > span.timestyle55195
            attribute: _text
        ids:
          title:
            type: string
        plugin: url
        api_id: '11'
      process:
        title:
          - plugin: get
            source: title
        url_redirection:
          - plugin: get
            source: link
          - plugin: str_replace
            source: link
            search: "../info"
            replace: http://www.dgpt.edu.cn/info
        published_date:
          - plugin: get
            source: date
          - plugin: str_replace
            search: " "
            replace: " 12:00"
          - plugin: callback
            callable: strtotime
        category:
          - plugin: default_value
            default_value: "[13,14]"
      destination:
        default_bundle: news
        plugin: api_response_data
      id: api_id_11
      label: 数据采集html实例
      migration_group: API
id: '11'
type: url
status: 1
title: 数据采集html实例
url_alias: "/portal/api/v2/news/note/"
access_roles:
  - authenticated
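
The `dom_crawler` source above picks rows with a CSS selector and reads either an attribute or the text (`_text`) of a child element. The Python sketch below imitates that extraction with the standard-library `html.parser`, as an illustration rather than a description of the plugin itself. The sample markup is hypothetical (the real page is not reproduced in this document), the class name `NoticeParser` is invented, and the sketch simplifies the selectors: it matches any `tr` with `height="30"` and does not enforce the `.winstyle55195` ancestor or the `>` child combinator.

```python
from html.parser import HTMLParser

# Hypothetical markup shaped to match the selectors in the config above.
SAMPLE_HTML = """
<table class="winstyle55195">
  <tr height="30">
    <td><a href="../info/1009/9946.htm" title="Example notice">Example no...</a></td>
    <td><span class="timestyle55195">2020/03/28 </span></td>
  </tr>
  <tr height="1"><td>spacer row, skipped by the item selector</td></tr>
</table>
"""


class NoticeParser(HTMLParser):
    """Collect link, title and date from each <tr height="30"> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # row currently being filled, if any
        self._in_date = False  # True while inside the date <span>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr" and attrs.get("height") == "30":
            self._row = {"link": None, "title": None, "date": ""}
        elif self._row is not None and tag == "a":
            self._row["link"] = attrs.get("href")    # attribute: href
            self._row["title"] = attrs.get("title")  # attribute: title
        elif self._row is not None and tag == "span":
            classes = (attrs.get("class") or "").split()
            self._in_date = "timestyle55195" in classes

    def handle_data(self, data):
        if self._row is not None and self._in_date:
            self._row["date"] += data                # attribute: _text

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_date = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = NoticeParser()
parser.feed(SAMPLE_HTML)
parser.close()
rows = parser.rows
```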
3.2 Scrape results
[
  [
    {
      "title": "东莞职业技术学院2020届高校(东莞)毕业生春季网络招聘会邀请函",
      "url_redirection": "http://www.dgpt.edu.cn/info/1009/9946.htm",
      "published_date": 1585368000,
      "category": [13, 14]
    }
  ],
  [
    {
      "title": "关于对2020年广东省科技创新战略专项资金项目(攀登计划专项)拟推报项目的公示",
      "url_redirection": "http://www.dgpt.edu.cn/info/1009/9855.htm",
      "published_date": 1578024000,
      "category": [13, 14]
    }
  ],
  [
    {
      "title": "东莞职业技术学院关于设置2019年成人高等教育校外教学点的通告",
      "url_redirection": "http://www.dgpt.edu.cn/info/1009/9823.htm",
      "published_date": 1577073600,
      "category": [13, 14]
    }
  ],
  [
    {
      "title": "东莞职业技术学院2020届高校(东莞)毕业生供需见面会参会企业展位-30日",
      "url_redirection": "http://www.dgpt.edu.cn/info/1009/9578.htm",
      "published_date": 1572494400,
      "category": [13, 14]
    }
  ]
]
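
The first result above can be reproduced by hand from the `process` section of the config, which helps show what each plugin in the chain contributes. The Python sketch below is an approximation under stated assumptions: the extracted `link` and `date` values are hypothetical, `to_timestamp` is a stand-in for PHP's `strtotime` that only handles the assumed `YYYY/MM/DD HH:MM` format, and the UTC+8 time zone is inferred because it is the offset that yields the timestamps shown.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the site runs in UTC+8, which reproduces the timestamps above.
SITE_TZ = timezone(timedelta(hours=8))


def to_timestamp(text):
    """Stand-in for PHP strtotime() on the assumed "YYYY/MM/DD HH:MM" format."""
    parsed = datetime.strptime(text, "%Y/%m/%d %H:%M")
    return int(parsed.replace(tzinfo=SITE_TZ).timestamp())


def run_process(row):
    """Apply the process steps from the config to one extracted row."""
    return {
        # get
        "title": row["title"],
        # get, then str_replace "../info" with the absolute prefix
        "url_redirection": row["link"].replace(
            "../info", "http://www.dgpt.edu.cn/info"
        ),
        # get, then str_replace " " with " 12:00", then callback strtotime
        "published_date": to_timestamp(row["date"].replace(" ", " 12:00")),
        # default_value
        "category": [13, 14],
    }


# Hypothetical extracted row; the link and date formats are assumptions.
row = {
    "title": "东莞职业技术学院2020届高校(东莞)毕业生春季网络招聘会邀请函",
    "link": "../info/1009/9946.htm",
    "date": "2020/03/28 ",
}
result = run_process(row)
```

The `str_replace` step on the date turns the trailing space into a fixed noon time, so every notice gets a deterministic timestamp even though the page only lists a date.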