Skip to content

GitLab

  • Projects
  • Groups
  • Snippets
  • Help
    • Loading...
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in
P portal
  • Project overview
    • Project overview
    • Details
    • Activity
  • Issues 0
    • Issues 0
    • List
    • Boards
    • Labels
    • Service Desk
    • Milestones
  • Merge requests 0
    • Merge requests 0
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
  • Operations
    • Operations
    • Incidents
    • Environments
  • Packages & Registries
    • Packages & Registries
    • Container Registry
  • Analytics
    • Analytics
    • Value Stream
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Members
    • Members
  • Activity
  • Create a new issue
  • Jobs
  • Issue Boards
Collapse sidebar
  • 科探开源
  • portal
  • Wiki
    • 数据管理手册
  • ETL06.URL endpoint 数据管理E

Last edited by asdu Dec 21, 2020
Page history

ETL06.URL endpoint 数据管理E

1. URL endpoint

  • ==备注:配置使用yml文件说明,所有参数都可以在后台接口管理界面里使用Form可视化配置.==
  • 演示1: 通过oauth2 授权获取 token,然后通过token获取infoplus岗位信息
  • 演示2: 抓上html,然后提取相关信息,进行转换处理
  • 门户已经集成了常用的第三方数据接口配置,可在后台查看具体配置参数
    • Infoplus / oauth2
    • 网易邮箱 / ras加密
    • QQ 邮箱 / oauth2
    • 迪塔维 / http
    • 致远OA / token
    • 希嘉 / oauth2 token
    • 抓取Html

2. 通过 oauth2 授权获取 infoplus岗位信息

2.1. 获取 infoplus token的配置

settings:
  api:
    etl_config:
      source:
        data_fetcher_plugin: guzzle_http
        http:
          method: GET
          timeout: 15
        authentication:
          plugin: oauth2_client
          oauth2:
            grant_type: client_credentials
            urlAccessToken: http://sandbox.qtgl.com.cn/infoplus/oauth2/token
            clientId: demo
            clientSecret: demosecret
            scopes: sys_profile
            timeout: '15'
        plugin: url
        ids: []
        api_id: '3'
      process: []
      destination:
        plugin: api_response_data
      id: api_id_3
      label: 获infoplus的token
      migration_group: API
    id: '3'
    type: url
    status: 1
    title: 获infoplus的token
    url_alias: "/demo/infoplus/accesstoken"
    access_roles:
    - authenticated

2.2. 通过token获取positions接口的信息,然后处理,配置如下

settings:
  api:
    etl_config:
      source:
        urls:
        - http://sandbox.qtgl.com.cn/infoplus/apis/v2/user/[current-user:name]/positions
        data_fetcher_plugin: guzzle_http
        http:
          method: GET
          query:
            access_token: "[api:svc:pm.parser::getApiResponseData(3,access_token)]"
          timeout: 15
        data_parser_plugin: json
        item_selector: entities
        fields:
        - name: roles
          label: roles
          selector: "/post/name"
        - name: dept
          label: dept
          selector: dept/name
        plugin: url
        ids: []
        api_id: '6'
      process:
        roles:
        - plugin: get
          source: roles
        rids:
        - plugin: entity_lookup
          source: roles
          entity_type: user_role
          value_key: label
        dept:
        - plugin: entity_lookup
          source: dept
          entity_type: taxonomy_term
          bundle_key: vid
          bundle: department
          ignore_case: 'true'
          value_key: name
        dept_name:
        - plugin: get
          source: dept
      destination:
        plugin: api_response_data
      id: api_id_6
      label: 获取岗位
      migration_group: API
    id: '6'
    type: url
    status: 1
    title: 获取岗位
    url_alias: "/demo/infoplus/me/positions"
    access_roles:
    - authenticated

2.2. 返回最终数据

[
  [
    {
      "roles": "区域销售经理",
      "dept": "16", 
      "dept_name": "华东南京营销"
    }
  ],
  [
    {
      "roles": "员工",
      "rids": "yuangong",
      "dept": "16",
      "dept_name": "华东南京营销"
    }
  ],
  [
    {
      "roles": "Faculty",
      "dept_name": "营销部"
    }
  ]
]

2.3. 结构说明

变量 值 说明
source url 定义source插件
urls [] 请求的地址,支持多个,比如用于分页请求,数据源结构必须相同
data_fetcher_plugin guzzle_http 一款处理http请求的插件
http {"method" ...} 根据guzzle插件参数定义,具体可以参考手册
authentication {"plugin":"oauth2"} 标准的oauth2协议支持
data_parser_plugin json xml
item_selector {entites} 根据接口返回的数据结构,选择想要读取的key下面的数据。infoplus返回的是结构是{"entites":{}} 所以我们想取entites下面的数据。
fields name,selector name:数据的名称,selector选择哪个字段,多维数组可以使用 / 分开
ids {"ids":{"<field_name>":{"type":""}}} 唯一标识符,主要为了映射到目标比如数据里的唯一主键。支持多值
process {field_name}:["plugin":{get}],"source":{soucre_field_name} 自定义的field name,然后通过各种插件转换source对应的字段. 查询相关process里面提供的plugins
destination api_response_data 把结果生成json格式数据

3.抓取html,然后对其提取、转换

3.1 抓取 www.dgpt.edu.cn/index/tzgg.htm

settings:
  api:
    etl_config:
      source:
        urls:
        - http://www.dgpt.edu.cn/index/tzgg.htm
        data_fetcher_plugin: guzzle_http
        http:
          method: GET
          timeout: 15
        data_parser_plugin: dom_crawler
        item_selector: .winstyle55195 tr[height="30"]
        fields:
        - name: link
          label: html
          selector: td > a
          attribute: href
        - name: title
          label: text
          selector: td > a
          attribute: title
        - name: date
          label: date
          selector: td > span.timestyle55195
          attribute: _text
        ids:
          title:
            type: string
        plugin: url
        api_id: '11'
      process:
        title:
        - plugin: get
          source: title
        url_redirection:
        - plugin: get
          source: link
        - plugin: str_replace
          source: link
          search: "../info"
          replace: http://www.dgpt.edu.cn/info
        published_date:
        - plugin: get
          source: date
        - plugin: str_replace
          search: " "
          replace: " 12:00"
        - plugin: callback
          callable: strtotime
        category:
        - plugin: default_value
          default_value: "[13,14]"
      destination:
        default_bundle: news
        plugin: api_response_data
      id: api_id_11
      label: 数据采集html实例
      migration_group: API
    id: '11'
    type: url
    status: 1
    title: 数据采集html实例
    url_alias: "/portal/api/v2/news/note/"
    access_roles:
    - authenticated

3.2 抓取结果

[
  [
    {
      "title":"东莞职业技术学院2020届高校(东莞)毕业生春季网络招聘会邀请函",
      "url_redirection":"http://www.dgpt.edu.cn/info/1009/9946.htm",
      "published_date":1585368000,
      "category":[
        13,
        14
      ]
    }
  ],
  [
    {
      "title":"关于对2020年广东省科技创新战略专项资金项目(攀登计划专项)拟推报项目的公示",
      "url_redirection":"http://www.dgpt.edu.cn/info/1009/9855.htm",
      "published_date":1578024000,
      "category":[
        13,
        14
      ]
    }
  ],
  [
    {
      "title":"东莞职业技术学院关于设置2019年成人高等教育校外教学点的通告",
      "url_redirection":"http://www.dgpt.edu.cn/info/1009/9823.htm",
      "published_date":1577073600,
      "category":[
        13,
        14
      ]
    }
  ],
  [
    {
      "title":"东莞职业技术学院2020届高校(东莞)毕业生供需见面会参会企业展位-30日",
      "url_redirection":"http://www.dgpt.edu.cn/info/1009/9578.htm",
      "published_date":1572494400,
      "category":[
        13,
        14
      ]
    }
  ]
]
Clone repository
  • Home
  • 数据管理手册
    • 094.Views twig 配置
    • ETL01.数据管理使用手册 V2.0
    • ETL02.Source 数据请求 data_fetcher_plugin
    • ETL03.Source 解析 data_parser_plugin
    • ETL04.Source 认证插件 authentication
    • ETL05.Porcess plugins 明细
    • ETL06.URL endpoint 数据管理E
    • ETL07.Mysql 数据管理
    • ETL08.MSSQL 数据管理
    • ETL09.Oracle 数据管理
    • ETL10.Token 列表
  • 门户V2 API 文档
    • 01.Portal Rest API v2
    • 01.Resource API
    • 02.App API
View All Pages