Authors
Xinkai Gao, Fengshan Yuan, Jihui Fan
Abstract
In artificial intelligence and machine learning, enterprises can train sufficiently reliable models only when they obtain large amounts of data [1]. Acquiring massive data at low cost has therefore become a key prerequisite for the success of data-intelligence enterprises, and holding such data is an important source of competitive advantage [2]. Enterprises that own massive data increasingly recognize that if the data underlying their advantage is collected by competitors, that advantage is weakened or even lost. As a result, more and more data owners adopt mechanisms that protect the public data in their web applications from being harvested by crawlers [3]. From the data collector's perspective, this paper describes several common anti-crawling mechanisms in detail, using the Scrapy framework and the recruitment website of a well-known internet enterprise as a case study, and then presents techniques to circumvent those mechanisms. With these techniques, all job postings on the enterprise's recruitment website are successfully crawled. The experimental results show that the techniques presented in this paper can effectively bypass the anti-crawling mechanisms of some large websites and thus help collectors obtain massive data.
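The abstract does not enumerate the specific circumvention techniques, so the following is only a minimal Scrapy sketch of two widely used ones, rotating the User-Agent header and throttling the request rate, to illustrate the general approach. The spider name, start URL, and CSS selectors are hypothetical placeholders, not the actual recruitment site or page structure studied in the paper.

```python
import random
import scrapy


# A few desktop User-Agent strings to rotate through; any realistic set of
# browser UAs would serve the same purpose.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that assigns a random User-Agent to each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)


class JobSpider(scrapy.Spider):
    """Hypothetical spider for a paginated job-listing page (placeholder URL and selectors)."""

    name = "jobs"
    start_urls = ["https://jobs.example.com/list?page=1"]  # placeholder, not the real site

    custom_settings = {
        # Slow down and randomize requests so the crawl looks less bot-like.
        "DOWNLOAD_DELAY": 2,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
        "AUTOTHROTTLE_ENABLED": True,
        # Register the User-Agent rotation middleware defined above.
        "DOWNLOADER_MIDDLEWARES": {"__main__.RandomUserAgentMiddleware": 543},
    }

    def parse(self, response):
        # Placeholder selectors; a real site's markup will differ.
        for job in response.css("div.job-item"):
            yield {
                "title": job.css("a.title::text").get(),
                "salary": job.css("span.salary::text").get(),
            }
        # Follow pagination links until no "next" link is found.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    process.crawl(JobSpider)
    process.start()
```

In practice, large sites also rate-limit by IP address and gate listings behind login state, so proxy rotation and cookie or session handling are commonly layered on top of a sketch like this; those steps go beyond what the abstract specifies.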