
Proceedings Paper
Webpage text extraction algorithm based on text block density and tag path features
Paper Abstract
Besides the body text, web pages contain a great deal of noise such as advertisements and navigation bars. Accurately extracting the text content of a web page is therefore a key technology for improving the quality of web page analysis. Web pages are highly heterogeneous documents, and different types of pages have different structures, which makes text extraction harder. After extensive analysis, we found a potential correlation between the body text and two features, the tag path and the text block density, so we propose a webpage text extraction method based on the tag path feature and the text block density feature. Weighing the advantages and disadvantages of the two features, we design a fusion strategy to address the low accuracy of web page text extraction. The method requires no training, which improves the efficiency of webpage text extraction. Experimental results on the dataset constructed in this paper show that the classification accuracy of the method reaches 81.11% and the recall reaches 83.15%. The average accuracy across all datasets is 17.7% higher than that of the BDF algorithm and 6.21% higher than that of the CEPF algorithm, which indicates that the method has strong generalization ability.
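The abstract does not give the formulas for either feature or for the fusion strategy, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a simple text block density (block length relative to the longest block), a simple tag path feature (the share of non-link text carried by each tag path), and an equal-weight fusion with a fixed threshold. The names `BlockCollector`, `extract_body_text`, `SAMPLE_HTML`, the weights, and the threshold are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class BlockCollector(HTMLParser):
    """Collects (tag_path, text, inside_link) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # collected text blocks

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not SKIP_TAGS.intersection(self.stack):
            self.blocks.append(("/".join(self.stack), text, "a" in self.stack))


def extract_body_text(html, threshold=0.3):
    """Keep blocks whose fused score reaches the (assumed) threshold."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()

    # Assumed tag path feature: share of non-link text under each tag path.
    path_chars = defaultdict(int)
    for path, text, in_link in collector.blocks:
        if not in_link:
            path_chars[path] += len(text)
    total = sum(path_chars.values()) or 1

    # Assumed density feature: block length relative to the longest block.
    longest = max((len(t) for _, t, _ in collector.blocks), default=1)

    kept = []
    for path, text, in_link in collector.blocks:
        if in_link:
            continue  # simplification: all link text is treated as noise
        density = len(text) / longest
        path_share = path_chars[path] / total
        score = 0.5 * density + 0.5 * path_share  # assumed equal-weight fusion
        if score >= threshold:
            kept.append(text)
    return "\n".join(kept)


SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="main">
<p>Web pages mix body text with navigation bars and advertisements, which makes extraction hard.</p>
<p>Long paragraphs under a shared tag path tend to belong to the article body rather than to noise.</p>
</div>
<div id="ad"><span>Buy now</span></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_body_text(SAMPLE_HTML))
```

On the sample page, the two paragraphs score high on both features and are kept, while the navigation links and the short advertisement block are dropped. Like the method described in the abstract, the sketch needs no training, but its weights and threshold are arbitrary and would have to be tuned on real data.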
Paper Details
Date Published: 23 August 2022
PDF: 6 pages
Proc. SPIE 12330, International Conference on Cyber Security, Artificial Intelligence, and Digital Economy (CSAIDE 2022), 123301G (23 August 2022); doi: 10.1117/12.2646343
Published in SPIE Proceedings Vol. 12330:
International Conference on Cyber Security, Artificial Intelligence, and Digital Economy (CSAIDE 2022)
Yuanchang Zhong, Editor(s)
Author Affiliations
Renjie Wang, Beijing Information Science & Technology Univ. (China)
Yangsen Zhang, Beijing Information Science & Technology Univ. (China)
Zhenyu Hou, Beijing Information Science & Technology Univ. (China)
Jianlong Li, Beijing Information Science & Technology Univ. (China)
Zhenjiang Su, Beijing Information Science & Technology Univ. (China)
Shaohui Xie, Beijing Information Science & Technology Univ. (China)
Zhuofan Huang, Beijing Information Science & Technology Univ. (China)
