1. (WO2018184588) TEXT DEDUPLICATION METHOD AND DEVICE AND STORAGE MEDIUM

Pub. No.: WO/2018/184588
International Application No.: PCT/CN2018/082107
Publication Date: Fri Oct 12 01:59:59 CEST 2018
International Filing Date: Mon Apr 09 01:59:59 CEST 2018
IPC: G06F 17/27
Applicants: TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED (腾讯科技(深圳)有限公司)
Inventors: XU, Wei (许维); ZHONG, Li (钟黎); WANG, Li (王励); LIU, Lichun (刘黎春)
Title: TEXT DEDUPLICATION METHOD AND DEVICE AND STORAGE MEDIUM
Abstract:
Embodiments of the present application disclose a text deduplication method, a device, and a storage medium. The method comprises: acquiring a text set comprising multiple texts to be deduplicated; for each text to be deduplicated, extracting corresponding substrings from that text; for each substring, determining which texts in the text set contain that same substring, thereby obtaining a text subset corresponding to each substring; performing text deduplication separately on the text subset corresponding to each substring to obtain a deduplicated subset for each substring; and obtaining the resulting deduplicated text set from the deduplicated subsets corresponding to the substrings.
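
The steps in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how substrings are captured, how texts within a subset are compared, or how the per-substring results are combined, so everything specific here is an assumption made for illustration: substrings are taken as every run of `length` consecutive characters, two texts count as duplicates when `difflib.SequenceMatcher` reports a similarity of at least `threshold`, and a text is removed from the final set if it was removed in any subset. The function names `extract_substrings` and `deduplicate` are hypothetical and do not come from the application.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def extract_substrings(text, length=5):
    """Assumed capture rule: every run of `length` consecutive characters."""
    if len(text) <= length:
        return {text}
    return {text[i:i + length] for i in range(len(text) - length + 1)}


def deduplicate(texts, length=5, threshold=0.8):
    """Return the texts that survive per-substring deduplication."""
    # Group texts by shared substring: substring -> indices of texts containing it.
    # Indices are appended in ascending order because texts are visited in order.
    subsets = defaultdict(list)
    for idx, text in enumerate(texts):
        for sub in extract_substrings(text, length):
            subsets[sub].append(idx)

    # Deduplicate each subset on its own. Within a subset, a text is dropped
    # when it is similar enough to an earlier text already kept in that subset.
    dropped = set()
    for indices in subsets.values():
        kept = []
        for idx in indices:
            is_duplicate = any(
                SequenceMatcher(None, texts[idx], texts[k]).ratio() >= threshold
                for k in kept
            )
            if is_duplicate:
                dropped.add(idx)
            else:
                kept.append(idx)

    # Assumed combination rule: keep a text only if no subset dropped it.
    return [text for i, text in enumerate(texts) if i not in dropped]
```

Calling `deduplicate` on a list that contains "the quick brown fox jumps", a near copy ending in "jumped", and an unrelated sentence keeps the first and the unrelated text. The point of grouping by substring is that similarity comparisons only happen between texts that already share some content, rather than between every pair in the full set.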