1. (WO2018103540) WEBPAGE CONTENT EXTRACTION METHOD, DEVICE, AND DATA STORAGE MEDIUM

Pub. No.:    WO/2018/103540    International Application No.:    PCT/CN2017/112866
Publication Date: Fri Jun 15 01:59:59 CEST 2018 International Filing Date: Sat Nov 25 00:59:59 CET 2017
IPC: G06F 17/30
Applicants: TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED
腾讯科技(深圳)有限公司
Inventors: ZHAO, Mingxin
赵铭鑫
Title: WEBPAGE CONTENT EXTRACTION METHOD, DEVICE, AND DATA STORAGE MEDIUM
Abstract:
A webpage content extraction method applicable to a network apparatus is provided. The method comprises: determining a plurality of candidate areas in a webpage, wherein each candidate area comprises one or more page elements adjacent to each other in the webpage (S21); extracting, for each of the plurality of candidate areas, extraction values of a plurality of visual features of the candidate area, wherein the visual features are features of the webpage perceivable by the human eye, and the extraction values are the values of those visual features as configured in the data of the webpage (S22); and determining, according to the extraction values of the plurality of visual features, a target area among the plurality of candidate areas that is consistent with an extraction rule, and extracting content information from the target area (S23).
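
The three steps of the abstract (S21 to S23) can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The abstract does not specify any of the following, so all of it is assumed for illustration: the names PageElement, candidate_areas, extract_features, find_target and extract_content, the feature names font_size and top, the use of a maximum to combine feature values within an area, and the representation of an extraction rule as one predicate per feature.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PageElement:
    """Hypothetical page element: its text plus visual feature values
    as configured in the page data (e.g. CSS font size, top offset)."""
    text: str
    style: Dict[str, float]


Area = List[PageElement]
# Assumed rule format: one predicate per visual feature name.
Rule = Dict[str, Callable[[float], bool]]


def candidate_areas(elements: List[PageElement], max_len: int = 2) -> List[Area]:
    """S21: treat every run of 1..max_len adjacent elements as a candidate area."""
    return [elements[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(elements) - n + 1)]


def extract_features(area: Area, names: List[str]) -> Dict[str, float]:
    """S22: read each visual feature's value for the area.
    Assumption: the area's value is the maximum over its elements."""
    return {name: max(e.style.get(name, 0.0) for e in area) for name in names}


def find_target(areas: List[Area], rule: Rule) -> Optional[Area]:
    """S23 (first half): return the first candidate area whose feature
    values all satisfy the extraction rule, or None if none does."""
    for area in areas:
        values = extract_features(area, list(rule))
        if all(check(values[name]) for name, check in rule.items()):
            return area
    return None


def extract_content(elements: List[PageElement], rule: Rule) -> Optional[str]:
    """S23 (second half): extract the text content of the target area."""
    target = find_target(candidate_areas(elements), rule)
    return " ".join(e.text for e in target) if target else None


# Usage: a rule describing a headline as large text near the top of the page.
elements = [
    PageElement("Home | News | Sports", {"font_size": 12, "top": 0}),
    PageElement("Example headline", {"font_size": 28, "top": 40}),
    PageElement("Body text of the article.", {"font_size": 14, "top": 100}),
]
title_rule: Rule = {
    "font_size": lambda v: v >= 24,
    "top": lambda v: v <= 200,
}
print(extract_content(elements, title_rule))
```

Because the rule refers only to visual feature values and not to tag names or element paths, the same rule could in principle be applied to pages with different markup, which appears to be the motivation for using visual features.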