Here is claim 1 of the Google patent US20130144858A1 - Scheduling resource crawls:
A method for scheduling resource crawls, the method comprising:
- providing a crawl scheduler;
- receiving a resource crawl request from a user;
- determining a crawl interval for the resource crawl request based on a plurality of factors, including:
  - the popularity of the resource;
  - the frequency of changes to the resource;
  - the bandwidth available to the crawl scheduler;
- scheduling the resource crawl request to be performed at the crawl interval; and
- performing the resource crawl request at the crawl interval.
The method described in Claim 1 allows a user to request that a resource be crawled. The crawl scheduler then determines a crawl interval for the resource based on a number of factors, including the popularity of the resource, the frequency of changes to the resource, and the bandwidth available to the crawl scheduler. The resource is then scheduled to be crawled at the crawl interval.
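The steps of Claim 1 (receive a request, determine an interval, schedule, perform) can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the `CrawlScheduler` class, its `interval_for` callback, and the choice to reschedule a resource after each crawl are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CrawlScheduler:
    """Hypothetical scheduler that queues crawl requests by due time."""

    # Assumed callback: maps a resource URL to a crawl interval in seconds.
    interval_for: Callable[[str], float]
    # Min-heap of (due_time, url) pairs, earliest due time first.
    queue: List[Tuple[float, str]] = field(default_factory=list)

    def request_crawl(self, url: str, now: float) -> float:
        """Receive a request, determine its interval, and schedule it."""
        due = now + self.interval_for(url)
        heapq.heappush(self.queue, (due, url))
        return due

    def run_due(self, now: float, fetch: Callable[[str], None]) -> List[str]:
        """Perform every crawl that is due, then schedule the next one."""
        done = []
        while self.queue and self.queue[0][0] <= now:
            _, url = heapq.heappop(self.queue)
            fetch(url)
            done.append(url)
        # Reschedule only after the loop, so a crawl never runs twice per call.
        for url in done:
            heapq.heappush(self.queue, (now + self.interval_for(url), url))
        return done
```

Passing the interval calculation in as a callback keeps the scheduling mechanics separate from the policy that weighs popularity, change frequency, and bandwidth.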
The main benefit of the method described in Claim 1 is automation. When crawl intervals are set manually, the user has to specify the interval for each resource, which can be time-consuming and error-prone. With the method described in Claim 1, the crawl scheduler determines the interval itself, which makes scheduling more efficient and more consistent.
Here are some of the factors that the crawl scheduler can take into account when determining the crawl interval for a resource:
- Popularity: The more popular a resource is, the more frequently it should be crawled. The reasoning is that popular resources tend to change more often, and more users are interested in their latest changes.
- Frequency of changes: Resources that change often should be crawled more often than resources that rarely change, because the stored copy of a fast-changing resource goes out of date sooner.
- Bandwidth: The amount of bandwidth available to the crawl scheduler can affect the crawl interval. If the crawl scheduler has limited bandwidth, it may need to schedule resources to be crawled less frequently.
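As a rough illustration of how the three factors could be combined into one interval, here is a minimal sketch. The function name `crawl_interval`, the input scales, the formula, and the minimum and maximum bounds are all assumptions chosen for the example; the claim does not specify any of them.

```python
def crawl_interval(popularity: float, changes_per_day: float,
                   bandwidth_fraction: float,
                   min_s: float = 300.0, max_s: float = 7 * 86400.0) -> float:
    """Illustrative heuristic; the formula and bounds are made up.

    popularity: 0.0 (obscure) to 1.0 (very popular)
    changes_per_day: observed average number of changes per day
    bandwidth_fraction: share of crawl bandwidth currently free, in (0, 1]
    """
    if not 0.0 < bandwidth_fraction <= 1.0:
        raise ValueError("bandwidth_fraction must be in (0, 1]")
    # Start from the average time between changes; never-changing
    # resources start at the maximum interval.
    if changes_per_day <= 0:
        base = max_s
    else:
        base = 86400.0 / changes_per_day
    # More popular resources get a shorter interval (up to twice as often).
    base /= 1.0 + popularity
    # Less free bandwidth stretches the interval.
    base /= bandwidth_fraction
    # Keep the result within the configured bounds.
    return min(max(base, min_s), max_s)
```

For example, `crawl_interval(0.0, 1.0, 1.0)` gives one crawl per day, while halving the free bandwidth with `crawl_interval(0.0, 1.0, 0.5)` doubles the interval.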
The crawl scheduler can use a variety of techniques to determine the crawl interval for a resource. These techniques can include:
- Heuristics: Heuristics are rules of thumb that can be used to make decisions. The crawl scheduler can use heuristics to determine the crawl interval for a resource based on factors such as the popularity of the resource and the frequency of changes to the resource.
- Machine learning: Machine learning is a type of artificial intelligence that can be used to learn from data. The crawl scheduler can use machine learning to learn how to determine the crawl interval for resources based on historical data.
The crawl scheduler can use a combination of heuristics and machine learning to determine the crawl interval for a resource. This can help to ensure that resources are crawled frequently enough to meet the needs of users, while also avoiding overloading the crawl scheduler with too many requests.
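One simple way to picture such a combination is to blend a heuristic interval with an estimate derived from past crawl outcomes. The sketch below uses a plain frequency estimate as a stand-in for a trained model, so it is not machine learning in any strong sense; the function names, the even-spacing assumption, and the `weight` parameter are hypothetical.

```python
from typing import Optional, Sequence


def interval_from_history(changed: Sequence[bool],
                          past_interval_s: float) -> Optional[float]:
    """Estimate the seconds between changes from past crawl outcomes.

    changed[i] is True if crawl i found the resource modified. Crawls are
    assumed to have been evenly spaced, past_interval_s seconds apart.
    """
    hits = sum(changed)
    if hits == 0:
        return None  # no observed change, so there is nothing to estimate
    return past_interval_s * len(changed) / hits


def blended_interval(heuristic_s: float, changed: Sequence[bool],
                     past_interval_s: float, weight: float = 0.5) -> float:
    """Weighted average of the heuristic and the history-based estimate."""
    learned = interval_from_history(changed, past_interval_s)
    if learned is None:
        return heuristic_s  # fall back to the heuristic alone
    return (1.0 - weight) * heuristic_s + weight * learned
```

Falling back to the heuristic when there is no usable history means a newly requested resource still gets a sensible interval.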
Claim 2 of the patent US20130144858A1 - Scheduling resource crawls is as follows:
The method of claim 1, wherein the crawl scheduler is configured to:
- adjust the crawl interval for the resource crawl request based on a plurality of factors, including:
  - the number of times the resource has been crawled;
  - the number of errors that have occurred when crawling the resource;
  - the feedback received from users regarding the resource.
The crawl scheduler is configured to adjust the crawl interval for a resource crawl request based on a number of factors, including the number of times the resource has been crawled, the number of errors that have occurred when crawling the resource, and the feedback received from users regarding the resource.
The crawl scheduler can adjust the crawl interval for a resource in either direction. For example, it can increase the crawl interval (crawl less often) if the resource has not changed recently, or if there have been no errors when crawling the resource. It can decrease the crawl interval (crawl more often) if the resource has changed frequently, or if there have been a number of errors when crawling the resource.
The crawl scheduler can also adjust the crawl interval based on feedback received from users. For example, the crawl scheduler can decrease the crawl interval if users have reported that they are not seeing the latest changes to a resource, since that suggests the stored copy is out of date. The crawl scheduler can also decrease the crawl interval if users have reported that they are seeing too many errors when trying to access a resource.
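The adjustments described for Claim 2 could look something like the sketch below. It is one possible reading, not the patent's method: the function name `adjust_interval`, the threshold of five crawls, and the scaling factors are invented for illustration. The way the crawl count is used here (holding off change-based tuning until some history exists) is also an assumption.

```python
def adjust_interval(interval_s: float, crawl_count: int, error_count: int,
                    stale_reports: int, changed_last_crawl: bool,
                    min_s: float = 300.0, max_s: float = 604800.0) -> float:
    """Return a new interval; every threshold and factor is illustrative."""
    factor = 1.0
    # Assumed rule: with little crawl history, skip change-based tuning.
    if crawl_count >= 5:
        # A recent change shortens the interval; no change lengthens it.
        factor *= 0.5 if changed_last_crawl else 1.5
    # Following the description above, crawl errors shorten the interval.
    # A scheduler that backs off on errors would do the opposite.
    if error_count > 0:
        factor *= 0.5
    # Users reporting stale content is a signal to crawl more often.
    if stale_reports > 0:
        factor *= 0.5
    # Keep the result within the configured bounds.
    return min(max(interval_s * factor, min_s), max_s)
```

With these made-up factors, an hourly interval for a stable, error-free resource grows to 90 minutes, while the same resource with a recent change, a crawl error, and a stale-content report drops to 7.5 minutes.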
The crawl scheduler can use a variety of techniques to adjust the crawl interval for a resource. These techniques can include:
- Heuristics: Heuristics are rules of thumb that can be used to make decisions. The crawl scheduler can use heuristics to adjust the crawl interval for a resource based on factors such as the number of times the resource has been crawled, the number of errors that have occurred when crawling the resource, and the feedback received from users.
- Machine learning: Machine learning is a type of artificial intelligence that can be used to learn from data. The crawl scheduler can use machine learning to learn how to adjust the crawl interval for resources based on historical data.
The crawl scheduler can use a combination of heuristics and machine learning to adjust the crawl interval for a resource. This can help to ensure that resources are crawled frequently enough to meet the needs of users, while also avoiding overloading the crawl scheduler with too many requests.