Shanghai Longfeng case study: the crawler's no-repeat crawl strategy

Back to the topic. To avoid fetching the same page twice, the crawler has to be able to tell whether a page is a repeat, which means it must remember what it has crawled before. Here is a simple example. Suppose you are in my QQ group (9060800) and I post a URL there. You first see the link I sent, and only then click it and open it in a browser to read the actual content. A crawler works the same way: it sees a link first and fetches it afterwards. So how is that recorded? Let's look at a picture:


A crawler does not repeat its crawl? Many beginners may wonder about this. Doesn't a crawler have just two crawling strategies, depth-first and breadth-first? Where does this extra "no-repeat" strategy come from? In the past few days I have heard more than once that people add links to the same page on many different pages in order to make sure it gets indexed. Can that really guarantee anything? Whether a page is indexed does not depend only on whether it was fetched. That is the reason for today's article on the no-repeat crawl strategy: within a certain period of time, the crawler follows this rule. There are of course many other strategies, such as priority crawling and revisit strategies, which I will cover later when there is a chance.

As shown above, assume these are all the links on one web page. When the crawler crawls the page, it discovers all of them. Crawling (reading the page and discovering links) and fetching (downloading the linked pages) happen at the same time: the part that discovers a link tells the other part, so one crawls ahead while the other fetches behind it. Once a page has been fetched, it is recorded and marked. In the figure above, the second and sixth records are the same link. The crawler fetches the second record, and when it reaches the sixth it finds that this page has already been fetched, so it does not fetch it again. But isn't a crawler supposed to grab as much as it can? Why bother checking for repeats?
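The record-and-mark behaviour described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any real search engine is implemented: the function name fetch_without_repeats, the example.com URLs, and the idea of keeping fetched URLs in an in-memory set are all assumptions made for the example. The link list is arranged so that the sixth record repeats the second, as in the description.

```python
def fetch_without_repeats(discovered_links):
    """Fetch each discovered URL at most once, in discovery order.

    Hypothetical helper: a real crawler would download the page where
    this sketch only appends the URL to a list.
    """
    seen = set()      # URLs already fetched and marked
    fetched = []      # URLs actually fetched, in order
    skipped = []      # (record number, url) for repeats that were not fetched
    for record_number, url in enumerate(discovered_links, start=1):
        if url in seen:
            # Already marked as fetched, so do not fetch it again.
            skipped.append((record_number, url))
            continue
        seen.add(url)
        fetched.append(url)
    return fetched, skipped


# Assumed example data: six discovered links, the sixth repeats the second.
links = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
    "https://example.com/d",
    "https://example.com/e",
    "https://example.com/b",
]

fetched, skipped = fetch_without_repeats(links)
print(len(fetched), "pages fetched")
print("skipped repeats:", skipped)
```

A set is used here because checking whether a URL has already been seen takes roughly constant time, so the check stays cheap even when the list of discovered links is long.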


Let's think it through. How many websites are there on the Internet, and how many pages? I, Zhao Yangang, have not actually verified the numbers, but the magnitude must be staggering. Every crawl and every fetch a search engine performs means executing a piece of code or a function, and every execution costs a little in resources. If repeated fetches reach the tens of billions, how much needless work does the crawler do, and how much does that cost the search engine? That cost is money, and reducing cost means reducing expenditure. The no-repeat strategy shows up in other ways too, but this is the most obvious one. Now consider the blocks you often see on a detail page: popular recommendations, related articles, random recommendations, latest articles. How much do they repeat? Are they identical on every page? If they are, you can adjust them appropriately, as long as the user experience of the site itself is not affected. After all, a site is built for its users; a search engine is an important entry point for traffic and an important marketing channel.