Skip to content

GitLab

  • Projects
  • Groups
  • Snippets
  • Help
    • Loading...
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in
P portal
  • Project overview
    • Project overview
    • Details
    • Activity
  • Issues 0
    • Issues 0
    • List
    • Boards
    • Labels
    • Service Desk
    • Milestones
  • Merge requests 0
    • Merge requests 0
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
  • Operations
    • Operations
    • Incidents
    • Environments
  • Packages & Registries
    • Packages & Registries
    • Container Registry
  • Analytics
    • Analytics
    • Value Stream
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Members
    • Members
  • Activity
  • Create a new issue
  • Jobs
  • Issue Boards
Collapse sidebar
  • 科探开源
  • portal
  • Wiki
    • 数据管理手册
  • ETL06.URL endpoint 数据管理E

ETL06.URL endpoint 数据管理E · Changes

Page history
Update 数据管理手册/ETL06.URL endpoint 数据管理E authored Dec 21, 2020 by 杜安山's avatar 杜安山
Hide whitespace changes
Inline Side-by-side
Showing with 284 additions and 0 deletions
+284 -0
  • 数据管理手册/ETL06.URL-endpoint-数据管理E.md 数据管理手册/ETL06.URL-endpoint-数据管理E.md +284 -0
  • No files found.
数据管理手册/ETL06.URL-endpoint-数据管理E.md 0 → 100644
View page @ a87f3a41
## 1. URL endpoint
- ==备注:配置使用yml文件说明,所有参数都可以在后台接口管理界面里使用Form可视化配置.==
- 演示1: 通过oauth2 授权获取 token,然后通过token获取infoplus岗位信息
- 演示2: 抓上html,然后提取相关信息,进行转换处理
- 门户已经集成了常用的第三方数据接口配置,可在后台查看具体配置参数
- Infoplus / oauth2
- 网易邮箱 / ras加密
- QQ 邮箱 / oauth2
- 迪塔维 / http
- 致远OA / token
- 希嘉 / oauth2 token
- 抓取Html
### 2. 通过 oauth2 授权获取 infoplus岗位信息
#### 2.1. 获取 infoplus token的配置
```
settings:
api:
etl_config:
source:
data_fetcher_plugin: guzzle_http
http:
method: GET
timeout: 15
authentication:
plugin: oauth2_client
oauth2:
grant_type: client_credentials
urlAccessToken: http://sandbox.qtgl.com.cn/infoplus/oauth2/token
clientId: demo
clientSecret: demosecret
scopes: sys_profile
timeout: '15'
plugin: url
ids: []
api_id: '3'
process: []
destination:
plugin: api_response_data
id: api_id_3
label: 获infoplus的token
migration_group: API
id: '3'
type: url
status: 1
title: 获infoplus的token
url_alias: "/demo/infoplus/accesstoken"
access_roles:
- authenticated
```
#### 2.2. 通过token获取positions接口的信息,然后处理,配置如下
```
settings:
api:
etl_config:
source:
urls:
- http://sandbox.qtgl.com.cn/infoplus/apis/v2/user/[current-user:name]/positions
data_fetcher_plugin: guzzle_http
http:
method: GET
query:
access_token: "[api:svc:pm.parser::getApiResponseData(3,access_token)]"
timeout: 15
data_parser_plugin: json
item_selector: entities
fields:
- name: roles
label: roles
selector: "/post/name"
- name: dept
label: dept
selector: dept/name
plugin: url
ids: []
api_id: '6'
process:
roles:
- plugin: get
source: roles
rids:
- plugin: entity_lookup
source: roles
entity_type: user_role
value_key: label
dept:
- plugin: entity_lookup
source: dept
entity_type: taxonomy_term
bundle_key: vid
bundle: department
ignore_case: 'true'
value_key: name
dept_name:
- plugin: get
source: dept
destination:
plugin: api_response_data
id: api_id_6
label: 获取岗位
migration_group: API
id: '6'
type: url
status: 1
title: 获取岗位
url_alias: "/demo/infoplus/me/positions"
access_roles:
- authenticated
```
#### 2.2. 返回最终数据
```
[
[
{
"roles": "区域销售经理",
"dept": "16",
"dept_name": "华东南京营销"
}
],
[
{
"roles": "员工",
"rids": "yuangong",
"dept": "16",
"dept_name": "华东南京营销"
}
],
[
{
"roles": "Faculty",
"dept_name": "营销部"
}
]
]
```
#### 2.3. 结构说明
变量 | 值 |说明
---|---|---
source | url | 定义source插件
urls | [] | 请求的地址,支持多个,比如用于分页请求,数据源结构必须相同
data_fetcher_plugin | guzzle_http | 一款处理http请求的插件
http|{"method" ...}| 根据guzzle插件参数定义,具体可以[参考手册](https://github.com/guzzle/guzzle)
authentication|{"plugin":"oauth2"}| 标准的oauth2协议支持
data_parser_plugin | json|xml|dom_crawler|simple_xml|soap| 对各种数据类型进行解析
item_selector|{entites}| 根据接口返回的数据结构,选择想要读取的key下面的数据。infoplus返回的是结构是{"entites":{}} 所以我们想取entites下面的数据。
fields|name,selector| name:数据的名称,selector选择哪个字段,多维数组可以使用 / 分开
ids|{"ids":{"<field_name>":{"type":"<string>"}}}|唯一标识符,主要为了映射到目标比如数据里的唯一主键。支持多值
process|{field_name}:["plugin":{get}],"source":{soucre_field_name}|自定义的field name,然后通过各种插件转换source对应的字段. [查询相关process里面提供的plugins](https://note.youdao.com/)
destination|api_response_data |把结果生成json格式数据
### 3.抓取html,然后对其提取、转换
#### 3.1 抓取 www.dgpt.edu.cn/index/tzgg.htm
```
settings:
api:
etl_config:
source:
urls:
- http://www.dgpt.edu.cn/index/tzgg.htm
data_fetcher_plugin: guzzle_http
http:
method: GET
timeout: 15
data_parser_plugin: dom_crawler
item_selector: .winstyle55195 tr[height="30"]
fields:
- name: link
label: html
selector: td > a
attribute: href
- name: title
label: text
selector: td > a
attribute: title
- name: date
label: date
selector: td > span.timestyle55195
attribute: _text
ids:
title:
type: string
plugin: url
api_id: '11'
process:
title:
- plugin: get
source: title
url_redirection:
- plugin: get
source: link
- plugin: str_replace
source: link
search: "../info"
replace: http://www.dgpt.edu.cn/info
published_date:
- plugin: get
source: date
- plugin: str_replace
search: " "
replace: " 12:00"
- plugin: callback
callable: strtotime
category:
- plugin: default_value
default_value: "[13,14]"
destination:
default_bundle: news
plugin: api_response_data
id: api_id_11
label: 数据采集html实例
migration_group: API
id: '11'
type: url
status: 1
title: 数据采集html实例
url_alias: "/portal/api/v2/news/note/"
access_roles:
- authenticated
```
#### 3.2 抓取结果
```
[
[
{
"title":"东莞职业技术学院2020届高校(东莞)毕业生春季网络招聘会邀请函",
"url_redirection":"http://www.dgpt.edu.cn/info/1009/9946.htm",
"published_date":1585368000,
"category":[
13,
14
]
}
],
[
{
"title":"关于对2020年广东省科技创新战略专项资金项目(攀登计划专项)拟推报项目的公示",
"url_redirection":"http://www.dgpt.edu.cn/info/1009/9855.htm",
"published_date":1578024000,
"category":[
13,
14
]
}
],
[
{
"title":"东莞职业技术学院关于设置2019年成人高等教育校外教学点的通告",
"url_redirection":"http://www.dgpt.edu.cn/info/1009/9823.htm",
"published_date":1577073600,
"category":[
13,
14
]
}
],
[
{
"title":"东莞职业技术学院2020届高校(东莞)毕业生供需见面会参会企业展位-30日",
"url_redirection":"http://www.dgpt.edu.cn/info/1009/9578.htm",
"published_date":1572494400,
"category":[
13,
14
]
}
]
]
```
Clone repository
  • Home
  • 数据管理手册
    • 094.Views twig 配置
    • ETL01.数据管理使用手册 V2.0
    • ETL02.Source 数据请求 data_fetcher_plugin
    • ETL03.Source 解析 data_parser_plugin
    • ETL04.Source 认证插件 authentication
    • ETL05.Porcess plugins 明细
    • ETL06.URL endpoint 数据管理E
    • ETL07.Mysql 数据管理
    • ETL08.MSSQL 数据管理
    • ETL09.Oracle 数据管理
    • ETL10.Token 列表
  • 门户V2 API 文档
    • 01.Portal Rest API v2
    • 01.Resource API
    • 02.App API
View All Pages