| Modifier and Type | Field and Description |
|---|---|
| protected static org.apache.log4j.Logger | logger |
| protected CrawlController | myController: The controller instance that created this crawler thread. |
| protected int | myId: The id associated with the crawler thread running this instance. |

| Constructor and Description |
|---|
| WebCrawler() |

| Modifier and Type | Method and Description |
|---|---|
| CrawlController | getMyController() |
| int | getMyId(): Gets the id of the current crawler instance. |
| Object | getMyLocalData(): The CrawlController instance that created this crawler instance calls this function just before terminating this crawler thread. |
| Thread | getThread() |
| protected void | handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription): Called once the header of a page has been fetched. |
| void | init(int id, CrawlController crawlController): Initializes the current instance of the crawler. |
| boolean | isNotWaitingForNewURLs() |
| void | onBeforeExit(): Called just before the current crawler instance terminates. |
| protected void | onContentFetchError(WebURL webUrl): Called if the content of a URL could not be fetched. |
| protected void | onParseError(WebURL webUrl): Called if an error occurred while parsing the content. |
| void | onStart(): Called just before this crawler instance starts crawling. |
| void | run() |
| void | setThread(Thread myThread) |
| boolean | shouldVisit(WebURL url): Classes that extend WebCrawler can override this function to tell the crawler whether the given URL should be crawled. |
| void | visit(Page page): Classes that extend WebCrawler can override this function to process the content of the fetched and parsed page. |
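
The summary above says that subclasses override shouldVisit(WebURL url) to decide which URLs are crawled. The following is a minimal, self-contained sketch of the kind of decision logic such an override might delegate to. `UrlFilter`, `allowedPrefix` and the list of skipped extensions are hypothetical names chosen for illustration and are not part of the library. The URL is passed as a plain string because this page does not document how to read the address from a WebURL.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of the crawler library. It holds the decision
// logic that a WebCrawler subclass could call from its shouldVisit override.
final class UrlFilter {
    // Assumed list of static-resource extensions that are not worth crawling.
    private static final Pattern SKIPPED =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|mp3|zip|gz)$");

    private final String allowedPrefix;

    UrlFilter(String allowedPrefix) {
        this.allowedPrefix = allowedPrefix.toLowerCase(Locale.ROOT);
    }

    // Returns true only for URLs under the allowed prefix that do not end
    // in one of the skipped extensions. Comparison is case-insensitive.
    boolean shouldVisit(String href) {
        String normalized = href.toLowerCase(Locale.ROOT);
        return normalized.startsWith(allowedPrefix)
                && !SKIPPED.matcher(normalized).matches();
    }
}
```

Keeping the predicate in a small class of its own means it can be checked without starting a crawl; a subclass would construct one `UrlFilter` and consult it from its override.
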
protected static final org.apache.log4j.Logger logger
protected int myId
protected CrawlController myController

public void init(int id, CrawlController crawlController)
id - the id of this crawler instance
crawlController - the controller that manages this crawling session

public int getMyId()

public CrawlController getMyController()

public void onStart()

public void onBeforeExit()

protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription)
webUrl -
statusCode -
statusDescription -

protected void onContentFetchError(WebURL webUrl)
webUrl -

protected void onParseError(WebURL webUrl)
webUrl -

public Object getMyLocalData()

public boolean shouldVisit(WebURL url)
url - the URL for which we want to know whether it should be included in the crawl or not.

public void visit(Page page)
page - the page object that has just been fetched and parsed.

public Thread getThread()

public void setThread(Thread myThread)

public boolean isNotWaitingForNewURLs()

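
handlePageStatusCode is called once per fetched header, and getMyLocalData is called by the controller just before the crawler thread terminates, so the two can be combined to report per-crawler statistics. The sketch below shows one possible accumulator for that purpose. `StatusTally` and its methods are hypothetical and not part of the library, and the sketch assumes each crawler instance is only updated from its own thread, so no synchronization is used.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-crawler accumulator, NOT part of the crawler library.
// A WebCrawler subclass could call record(statusCode) from its
// handlePageStatusCode override and return snapshot() from getMyLocalData().
final class StatusTally {
    private final Map<Integer, Integer> counts = new TreeMap<>();

    // Increments the counter for one HTTP status code.
    void record(int statusCode) {
        counts.merge(statusCode, 1, Integer::sum);
    }

    // Returns a copy so the caller cannot modify the internal counters.
    Map<Integer, Integer> snapshot() {
        return new TreeMap<>(counts);
    }

    // Counts all recorded responses with a 4xx or 5xx status code.
    int failures() {
        int total = 0;
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            if (entry.getKey() >= 400) {
                total += entry.getValue();
            }
        }
        return total;
    }
}
```

Returning a copy from `snapshot()` is a deliberate choice: the object handed over at termination stays stable even if the crawler instance is reused or inspected later.
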
Copyright © 2013. All Rights Reserved.