A B C D E F G H I L M N O P R S T U V W 

A

add(String) - Method in class edu.uci.ics.crawler4j.robotstxt.RuleSet
 
addAllow(String) - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
 
addDisallow(String) - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
 
addSeed(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Adds a new seed URL.
addSeed(String, int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Adds a new seed URL.
addSeenUrl(String, int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
This function can be called to assign a specific document id to a URL.
addUrlAndDocId(String, int) - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
allows(String) - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
 
allows(WebURL) - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
 

B

BinaryParseData - Class in edu.uci.ics.crawler4j.parser
 
BinaryParseData() - Constructor for class edu.uci.ics.crawler4j.parser.BinaryParseData
 
byteArray2Int(byte[]) - Static method in class edu.uci.ics.crawler4j.util.Util
 
byteArray2Long(byte[]) - Static method in class edu.uci.ics.crawler4j.util.Util
 

C

characters(char[], int, int) - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
close() - Method in class edu.uci.ics.crawler4j.frontier.Counters
 
close() - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
close() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
close() - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
config - Variable in class edu.uci.ics.crawler4j.crawler.Configurable
 
config - Variable in class edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
 
Configurable - Class in edu.uci.ics.crawler4j.crawler
Several core components of crawler4j extend this class to make them configurable.
Configurable(CrawlConfig) - Constructor for class edu.uci.ics.crawler4j.crawler.Configurable
 
connectionManager - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
connectionMonitorThread - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
contains(String) - Method in class edu.uci.ics.crawler4j.url.TLDList
 
containsPrefixOf(String) - Method in class edu.uci.ics.crawler4j.robotstxt.RuleSet
 
contentCharset - Variable in class edu.uci.ics.crawler4j.crawler.Page
The charset of the content.
contentData - Variable in class edu.uci.ics.crawler4j.crawler.Page
The content of this page in binary format.
contentEncoding - Variable in class edu.uci.ics.crawler4j.crawler.Page
The encoding of the content.
contentType - Variable in class edu.uci.ics.crawler4j.crawler.Page
The ContentType of this page.
Counters - Class in edu.uci.ics.crawler4j.frontier
 
Counters(Environment, CrawlConfig) - Constructor for class edu.uci.ics.crawler4j.frontier.Counters
 
counters - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
Counters.ReservedCounterNames - Class in edu.uci.ics.crawler4j.frontier
 
Counters.ReservedCounterNames() - Constructor for class edu.uci.ics.crawler4j.frontier.Counters.ReservedCounterNames
 
counterValues - Variable in class edu.uci.ics.crawler4j.frontier.Counters
 
CrawlConfig - Class in edu.uci.ics.crawler4j.crawler
 
CrawlConfig() - Constructor for class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
CrawlController - Class in edu.uci.ics.crawler4j.crawler
The controller that manages a crawling session.
CrawlController(CrawlConfig, PageFetcher, RobotstxtServer) - Constructor for class edu.uci.ics.crawler4j.crawler.CrawlController
 
crawlersLocalData - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
Once the crawling session finishes, the controller collects the local data of the crawler threads and stores it in this List.
customData - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
The 'customData' object can be used for passing custom crawl-related configurations to different components of the crawler.
CustomFetchStatus - Class in edu.uci.ics.crawler4j.fetcher
 
CustomFetchStatus() - Constructor for class edu.uci.ics.crawler4j.fetcher.CustomFetchStatus
 

D

delete(int) - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
deleteFolder(File) - Static method in class edu.uci.ics.crawler4j.util.IO
 
deleteFolderContents(File) - Static method in class edu.uci.ics.crawler4j.util.IO
 
discardContentIfNotConsumed() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
docIDsDB - Variable in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
docIdServer - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
 
DocIDServer - Class in edu.uci.ics.crawler4j.frontier
 
DocIDServer(Environment, CrawlConfig) - Constructor for class edu.uci.ics.crawler4j.frontier.DocIDServer
 
docIdServer - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 

E

edu.uci.ics.crawler4j.crawler - package edu.uci.ics.crawler4j.crawler
 
edu.uci.ics.crawler4j.fetcher - package edu.uci.ics.crawler4j.fetcher
 
edu.uci.ics.crawler4j.frontier - package edu.uci.ics.crawler4j.frontier
 
edu.uci.ics.crawler4j.parser - package edu.uci.ics.crawler4j.parser
 
edu.uci.ics.crawler4j.robotstxt - package edu.uci.ics.crawler4j.robotstxt
 
edu.uci.ics.crawler4j.url - package edu.uci.ics.crawler4j.url
 
edu.uci.ics.crawler4j.util - package edu.uci.ics.crawler4j.util
 
endElement(String, String, String) - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
entity - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
entryToObject(TupleInput) - Method in class edu.uci.ics.crawler4j.frontier.WebURLTupleBinding
 
env - Variable in class edu.uci.ics.crawler4j.frontier.Counters
 
env - Variable in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
equals(Object) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
ExtractedUrlAnchorPair - Class in edu.uci.ics.crawler4j.parser
 
ExtractedUrlAnchorPair() - Constructor for class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
 

F

FatalTransportError - Static variable in class edu.uci.ics.crawler4j.fetcher.CustomFetchStatus
 
fetchContent(Page) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
fetchedUrl - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
fetchHeader(WebURL) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
fetchResponseHeaders - Variable in class edu.uci.ics.crawler4j.crawler.Page
Headers which were present in the response of the fetch request
finish() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
finished - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
Is the crawling of this session finished?
frontier - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
 
Frontier - Class in edu.uci.ics.crawler4j.frontier
 
Frontier(Environment, CrawlConfig, DocIDServer) - Constructor for class edu.uci.ics.crawler4j.frontier.Frontier
 

G

get(int) - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
getAnchor() - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
 
getAnchor() - Method in class edu.uci.ics.crawler4j.url.WebURL
Returns the anchor string.
getBaseUrl() - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
getBodyText() - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
getCacheSize() - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
getCanonicalURL(String) - Static method in class edu.uci.ics.crawler4j.url.URLCanonicalizer
 
getCanonicalURL(String, String) - Static method in class edu.uci.ics.crawler4j.url.URLCanonicalizer
 
getConfig() - Method in class edu.uci.ics.crawler4j.crawler.Configurable
 
getConnectionTimeout() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getContentCharset() - Method in class edu.uci.ics.crawler4j.crawler.Page
Returns the charset of the content.
getContentData() - Method in class edu.uci.ics.crawler4j.crawler.Page
Returns the content of this page in binary format.
getContentEncoding() - Method in class edu.uci.ics.crawler4j.crawler.Page
Returns the encoding of the content.
getContentType() - Method in class edu.uci.ics.crawler4j.crawler.Page
Returns the ContentType of this page.
getCrawlersLocalData() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Once the crawling session finishes, the controller collects the local data of the crawler threads and stores it in a List.
getCrawlStorageFolder() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getCustomData() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
getDatabaseEntryKey(WebURL) - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
getDepth() - Method in class edu.uci.ics.crawler4j.url.WebURL
Returns the crawl depth at which this Url is first observed.
getDocCount() - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
getDocId(String) - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
Returns the docid of an already seen url.
getDocid() - Method in class edu.uci.ics.crawler4j.url.WebURL
Returns the unique document id assigned to this Url.
getDocIdServer() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
getDomain() - Method in class edu.uci.ics.crawler4j.url.WebURL
Returns the domain of this Url.
getEntity() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
getFetchedUrl() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
getFetchResponseHeaders() - Method in class edu.uci.ics.crawler4j.crawler.Page
Returns headers which were present in the response of the fetch request
getFrontier() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
getHref() - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
 
getHtml() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
getHttpClient() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
getInstance() - Static method in class edu.uci.ics.crawler4j.parser.BinaryParseData
 
getInstance() - Static method in class edu.uci.ics.crawler4j.url.TLDList
 
getLastAccessTime() - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
 
getLength() - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
getMaxConnectionsPerHost() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getMaxDepthOfCrawling() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getMaxDownloadSize() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getMaxOutgoingLinksToFollow() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getMaxPagesToFetch() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getMaxTotalConnections() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getMovedToUrl() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
getMyController() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
 
getMyId() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
Get the id of the current crawler instance
getMyLocalData() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
The CrawlController instance that has created this crawler instance will call this function just before terminating this crawler thread.
getNewDocID(String) - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
getNextURLs(int, List<WebURL>) - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
getNumberOfAssignedPages() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
getNumberOfProcessedPages() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
getOutgoingUrls() - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
getOutgoingUrls() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
getPageFetcher() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
getParentDocid() - Method in class edu.uci.ics.crawler4j.url.WebURL
Returns the unique document id of the parent page.
getParentUrl() - Method in class edu.uci.ics.crawler4j.url.WebURL
Returns the url of the parent page.
getParseData() - Method in class edu.uci.ics.crawler4j.crawler.Page
Returns the parsed data generated for this page by parsers
getPath() - Method in class edu.uci.ics.crawler4j.url.WebURL
Returns the path of this Url.
getPolitenessDelay() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getPriority() - Method in class edu.uci.ics.crawler4j.url.WebURL
Returns the priority for crawling this URL.
getProxyHost() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getProxyPassword() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getProxyPort() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getProxyUsername() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getQueueLength() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
getResponseHeaders() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
getRobotstxtServer() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
getSocketTimeout() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getStatusCode() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
getStatusDescription(int) - Static method in class edu.uci.ics.crawler4j.fetcher.CustomFetchStatus
 
getSubDomain() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
getText() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
getTextContent() - Method in class edu.uci.ics.crawler4j.parser.TextParseData
 
getThread() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
 
getTitle() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
getURL() - Method in class edu.uci.ics.crawler4j.url.WebURL
Returns the Url string
getUserAgentName() - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
getUserAgentString() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getValue(String) - Method in class edu.uci.ics.crawler4j.frontier.Counters
 
getWebURL() - Method in class edu.uci.ics.crawler4j.crawler.Page
 

H

handlePageStatusCode(WebURL, int, String) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called once the header of a page is fetched.
hasBinaryContent(String) - Static method in class edu.uci.ics.crawler4j.util.Util
 
hashCode() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
hasPlainTextContent(String) - Static method in class edu.uci.ics.crawler4j.util.Util
 
host2directivesCache - Variable in class edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
 
HostDirectives - Class in edu.uci.ics.crawler4j.robotstxt
 
HostDirectives() - Constructor for class edu.uci.ics.crawler4j.robotstxt.HostDirectives
 
HtmlContentHandler - Class in edu.uci.ics.crawler4j.parser
 
HtmlContentHandler() - Constructor for class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
HtmlParseData - Class in edu.uci.ics.crawler4j.parser
 
HtmlParseData() - Constructor for class edu.uci.ics.crawler4j.parser.HtmlParseData
 
httpClient - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 

I

IdleConnectionMonitorThread - Class in edu.uci.ics.crawler4j.fetcher
 
IdleConnectionMonitorThread(PoolingClientConnectionManager) - Constructor for class edu.uci.ics.crawler4j.fetcher.IdleConnectionMonitorThread
 
increment(String) - Method in class edu.uci.ics.crawler4j.frontier.Counters
 
increment(String, long) - Method in class edu.uci.ics.crawler4j.frontier.Counters
 
init(int, CrawlController) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
Initializes the current instance of the crawler
inProcessPages - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
InProcessPagesDB - Class in edu.uci.ics.crawler4j.frontier
This class maintains the list of pages which are assigned to crawlers but are not yet processed.
InProcessPagesDB(Environment) - Constructor for class edu.uci.ics.crawler4j.frontier.InProcessPagesDB
 
int2ByteArray(int) - Static method in class edu.uci.ics.crawler4j.util.Util
 
IO - Class in edu.uci.ics.crawler4j.util
 
IO() - Constructor for class edu.uci.ics.crawler4j.util.IO
 
isEnabled() - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
isFinished() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
isFinished - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
isFinished() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
isFollowRedirects() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
isIncludeBinaryContentInCrawling() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
isIncludeHttpsPages() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
isNotWaitingForNewURLs() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
 
isResumableCrawling() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
isSeenBefore(String) - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
isShuttingDown() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 

L

lastDocID - Variable in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
lastFetchTime - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
load(HttpEntity) - Method in class edu.uci.ics.crawler4j.crawler.Page
Loads the content of this page from a fetched HttpEntity.
logger - Static variable in class edu.uci.ics.crawler4j.crawler.WebCrawler
 
logger - Static variable in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
logger - Static variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
logger - Static variable in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
logger - Static variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
logger - Static variable in class edu.uci.ics.crawler4j.parser.Parser
 
long2ByteArray(long) - Static method in class edu.uci.ics.crawler4j.util.Util
 

M

movedToUrl - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
mutex - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
mutex - Variable in class edu.uci.ics.crawler4j.frontier.Counters
 
mutex - Variable in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
mutex - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
mutex - Variable in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
myController - Variable in class edu.uci.ics.crawler4j.crawler.WebCrawler
The controller instance that has created this crawler thread.
myId - Variable in class edu.uci.ics.crawler4j.crawler.WebCrawler
The id associated with the crawler thread running this instance.

N

needsRefetch() - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
 

O

objectToEntry(WebURL, TupleOutput) - Method in class edu.uci.ics.crawler4j.frontier.WebURLTupleBinding
 
onBeforeExit() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called just before the termination of the current crawler instance.
onContentFetchError(WebURL) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called if the content of a url could not be fetched.
onParseError(WebURL) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called if there has been an error in parsing the content.
onStart() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called just before starting the crawl by this crawler instance.

P

Page - Class in edu.uci.ics.crawler4j.crawler
This class contains the data for a fetched and parsed page.
Page(WebURL) - Constructor for class edu.uci.ics.crawler4j.crawler.Page
 
pageFetcher - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
 
PageFetcher - Class in edu.uci.ics.crawler4j.fetcher
 
PageFetcher(CrawlConfig) - Constructor for class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
pageFetcher - Variable in class edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
 
PageFetchResult - Class in edu.uci.ics.crawler4j.fetcher
 
PageFetchResult() - Constructor for class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
PageTooBig - Static variable in class edu.uci.ics.crawler4j.fetcher.CustomFetchStatus
 
parse(Page, String) - Method in class edu.uci.ics.crawler4j.parser.Parser
 
parse(String, String) - Static method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtParser
 
parseData - Variable in class edu.uci.ics.crawler4j.crawler.Page
The parsed data populated by parsers
ParseData - Interface in edu.uci.ics.crawler4j.parser
 
Parser - Class in edu.uci.ics.crawler4j.parser
 
Parser(CrawlConfig) - Constructor for class edu.uci.ics.crawler4j.parser.Parser
 
PROCESSED_PAGES - Static variable in class edu.uci.ics.crawler4j.frontier.Counters.ReservedCounterNames
 
put(WebURL) - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
putIntInByteArray(int, byte[], int) - Static method in class edu.uci.ics.crawler4j.util.Util
 

R

removeURL(WebURL) - Method in class edu.uci.ics.crawler4j.frontier.InProcessPagesDB
 
resolveUrl(String, String) - Static method in class edu.uci.ics.crawler4j.url.UrlResolver
Resolves a given relative URL against a base URL.
responseHeaders - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
resumable - Variable in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
RobotstxtConfig - Class in edu.uci.ics.crawler4j.robotstxt
 
RobotstxtConfig() - Constructor for class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
RobotstxtParser - Class in edu.uci.ics.crawler4j.robotstxt
 
RobotstxtParser() - Constructor for class edu.uci.ics.crawler4j.robotstxt.RobotstxtParser
 
robotstxtServer - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
 
RobotstxtServer - Class in edu.uci.ics.crawler4j.robotstxt
 
RobotstxtServer(RobotstxtConfig, PageFetcher) - Constructor for class edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
 
RuleSet - Class in edu.uci.ics.crawler4j.robotstxt
 
RuleSet() - Constructor for class edu.uci.ics.crawler4j.robotstxt.RuleSet
 
run() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
 
run() - Method in class edu.uci.ics.crawler4j.fetcher.IdleConnectionMonitorThread
 

S

schedule(WebURL) - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
scheduleAll(List<WebURL>) - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
SCHEDULED_PAGES - Static variable in class edu.uci.ics.crawler4j.frontier.Counters.ReservedCounterNames
 
scheduledPages - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
setAnchor(String) - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
 
setAnchor(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setCacheSize(int) - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
setConnectionTimeout(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Connection timeout in milliseconds
setContentCharset(String) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setContentData(byte[]) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setContentEncoding(String) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setContentType(String) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setCrawlStorageFolder(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
The folder which will be used by the crawler for storing the intermediate crawl data.
setCustomData(Object) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
setDepth(short) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setDocid(int) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setDocIdServer(DocIDServer) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
setEnabled(boolean) - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
setEntity(HttpEntity) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
setFetchedUrl(String) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
setFetchResponseHeaders(Header[]) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setFollowRedirects(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Should we follow redirects?
setFrontier(Frontier) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
setHref(String) - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
 
setHtml(String) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
setIncludeBinaryContentInCrawling(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Should we fetch binary content such as images, audio, ...?
setIncludeHttpsPages(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Should we also crawl https pages?
setMaxConnectionsPerHost(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Maximum Connections per host
setMaxDepthOfCrawling(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Maximum depth of crawling. For unlimited depth, this parameter should be set to -1.
setMaxDownloadSize(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Max allowed size of a page.
setMaxOutgoingLinksToFollow(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Max number of outgoing links which are processed from a page
setMaxPagesToFetch(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Maximum number of pages to fetch. For an unlimited number of pages, this parameter should be set to -1.
setMaxTotalConnections(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Maximum total connections
setMovedToUrl(String) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
setOutgoingUrls(List<WebURL>) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
setPageFetcher(PageFetcher) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
setParentDocid(int) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setParentUrl(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setParseData(ParseData) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setPath(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setPolitenessDelay(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Politeness delay in milliseconds (delay between sending two requests to the same host).
setPriority(byte) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setProcessed(WebURL) - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
setProxyHost(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
If the crawler should run behind a proxy, this parameter can be used for specifying the proxy host.
setProxyPassword(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
If the crawler should run behind a proxy and a username/password is needed for proxy authentication, this parameter can be used for specifying the password.
setProxyPort(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
If the crawler should run behind a proxy, this parameter can be used for specifying the proxy port.
setProxyUsername(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
If the crawler should run behind a proxy and a username/password is needed for proxy authentication, this parameter can be used for specifying the username.
setResponseHeaders(Header[]) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
setResumableCrawling(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
If this feature is enabled, you can resume a previously stopped/crashed crawl.
setRobotstxtServer(RobotstxtServer) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
setSocketTimeout(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Socket timeout in milliseconds
setStatusCode(int) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
setText(String) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
setTextContent(String) - Method in class edu.uci.ics.crawler4j.parser.TextParseData
 
setThread(Thread) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
 
setTitle(String) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
setURL(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setUserAgentName(String) - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
setUserAgentString(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
User-agent string that is used for representing your crawler to web servers.
setValue(String, long) - Method in class edu.uci.ics.crawler4j.frontier.Counters
 
setWebURL(WebURL) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
shouldVisit(WebURL) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
Classes that extend WebCrawler can override this function to tell the crawler whether the given URL should be crawled or not.
shutdown() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Sets the current crawling session to 'shutdown'.
shutdown() - Method in class edu.uci.ics.crawler4j.fetcher.IdleConnectionMonitorThread
 
shutDown() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
shuttingDown - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
Is the crawling session set to 'shutdown'?
sleep(int) - Static method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
start(Class<T>, int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Start the crawling session and wait for it to finish.
start(Class<T>, int, boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
startElement(String, String, String, Attributes) - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
startNonBlocking(Class<T>, int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Start the crawling session and return immediately.
statisticsDB - Variable in class edu.uci.ics.crawler4j.frontier.Counters
 
statusCode - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
sync() - Method in class edu.uci.ics.crawler4j.frontier.Counters
 
sync() - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
sync() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
sync() - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 

T

TextParseData - Class in edu.uci.ics.crawler4j.parser
 
TextParseData() - Constructor for class edu.uci.ics.crawler4j.parser.TextParseData
 
TLDList - Class in edu.uci.ics.crawler4j.url
 
toString() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
toString() - Method in class edu.uci.ics.crawler4j.parser.BinaryParseData
 
toString() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
toString() - Method in interface edu.uci.ics.crawler4j.parser.ParseData
 
toString() - Method in class edu.uci.ics.crawler4j.parser.TextParseData
 
toString() - Method in class edu.uci.ics.crawler4j.url.WebURL
 

U

UnknownError - Static variable in class edu.uci.ics.crawler4j.fetcher.CustomFetchStatus
 
url - Variable in class edu.uci.ics.crawler4j.crawler.Page
The URL of this page.
URLCanonicalizer - Class in edu.uci.ics.crawler4j.url
See http://en.wikipedia.org/wiki/URL_normalization for a reference. Note: some parts of the code are adapted from http://stackoverflow.com/a/4057470/405418
URLCanonicalizer() - Constructor for class edu.uci.ics.crawler4j.url.URLCanonicalizer
 
UrlResolver - Class in edu.uci.ics.crawler4j.url
 
UrlResolver() - Constructor for class edu.uci.ics.crawler4j.url.UrlResolver
 
urlsDB - Variable in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
Util - Class in edu.uci.ics.crawler4j.util
 
Util() - Constructor for class edu.uci.ics.crawler4j.util.Util
 

V

validate() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Validates the configs specified by this instance.
visit(Page) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
Classes that extend WebCrawler can override this function to process the content of the fetched and parsed page.

W

waitingList - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
waitingLock - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
 
waitUntilFinish() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Wait until this crawling session finishes.
WebCrawler - Class in edu.uci.ics.crawler4j.crawler
The WebCrawler class is the Runnable class that is executed by each crawler thread.
WebCrawler() - Constructor for class edu.uci.ics.crawler4j.crawler.WebCrawler
 
WebURL - Class in edu.uci.ics.crawler4j.url
 
WebURL() - Constructor for class edu.uci.ics.crawler4j.url.WebURL
 
webURLBinding - Variable in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
WebURLTupleBinding - Class in edu.uci.ics.crawler4j.frontier
 
WebURLTupleBinding() - Constructor for class edu.uci.ics.crawler4j.frontier.WebURLTupleBinding
 
workQueues - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
WorkQueues - Class in edu.uci.ics.crawler4j.frontier
 
WorkQueues(Environment, String, boolean) - Constructor for class edu.uci.ics.crawler4j.frontier.WorkQueues
 
writeBytesToFile(byte[], String) - Static method in class edu.uci.ics.crawler4j.util.IO
 
Copyright © 2013. All Rights Reserved.