1. (WO2015047920) TITLE AND BODY EXTRACTION FROM WEB PAGE

Pub. No.: WO/2015/047920
International Application No.: PCT/US2014/056704
Publication Date: Fri Apr 03 01:59:59 CEST 2015
International Filing Date: Tue Sep 23 01:59:59 CEST 2014
IPC: G06F 17/22
G06F 17/27
Applicants: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventors: SONG, Ruihua
GAO, Guangping
ZHANG, Qian
LIU, Ming
NARAYANAN, Raman
GU, Shelley Summer
GOUW, Yanti Aruswati
Title: TITLE AND BODY EXTRACTION FROM WEB PAGE
Abstract:
Technologies are generally provided for extracting a body and a title of an article displayed on a web page. A web page may display content such as advertisements, images, and links in addition to the web page article. A user may choose to view the article in a reader application without the additional content, and the reader application may extract the body and the title from the web page. Title candidates may be selected by identifying meta tags associated with the title and removing website names from the meta tags. Body candidates may be selected by identifying clusters of text nodes based on a font size and depth in a document object model tree for the web page. A best cluster that is most likely the body may be selected, and a corresponding title candidate may be selected as the best title.
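
The two candidate-selection steps in the abstract can be illustrated with a minimal Python sketch. It is not the claimed method. The `TextNode` type, the function names, the separator characters, and the "most total text wins" scoring rule are all hypothetical choices made for illustration. Font size and DOM depth are assumed to be supplied by a renderer, and the abstract does not state how the best cluster is scored or how the title is matched to it, so title matching is omitted.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TextNode:
    """Hypothetical text node; font size and depth come from a renderer."""
    text: str
    font_size: float  # rendered font size, e.g. in pixels
    depth: int        # depth of the node in the DOM tree


def title_candidates(meta_titles, site_name):
    """Remove the website name from each meta-tag title and de-duplicate."""
    candidates = []
    for raw in meta_titles:
        # Assumption: site names are set off by " | ", " - " or " : ".
        parts = re.split(r"\s+[|\-:]\s+", raw)
        kept = [p.strip() for p in parts
                if p.strip().lower() != site_name.lower()]
        cleaned = " - ".join(kept)
        if cleaned and cleaned not in candidates:
            candidates.append(cleaned)
    return candidates


def cluster_text_nodes(nodes):
    """Group text nodes that share the same font size and DOM depth."""
    clusters = defaultdict(list)
    for node in nodes:
        clusters[(node.font_size, node.depth)].append(node)
    return list(clusters.values())


def best_cluster(clusters):
    """Assumed heuristic: the cluster with the most text is the body."""
    return max(clusters, key=lambda c: sum(len(n.text) for n in c))


titles = title_candidates(
    ["Rain Expected Friday | Example News", "Rain Expected Friday"],
    "Example News",
)
nodes = [
    TextNode("Home", 12, 3),
    TextNode("First paragraph of the article.", 16, 7),
    TextNode("Second paragraph.", 16, 7),
    TextNode("Ad", 12, 5),
]
body = best_cluster(cluster_text_nodes(nodes))
```

Here `titles` holds the single cleaned candidate and `body` holds the two 16-pixel nodes at depth 7, since that cluster carries the most text. A real reader application would likely weigh further signals, which the abstract does not enumerate.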