- get(int) - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
-
- getAnchor() - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
-
- getAnchor() - Method in class edu.uci.ics.crawler4j.url.WebURL
-
Returns the anchor string.
- getBaseUrl() - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
-
- getBodyText() - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
-
- getCacheSize() - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
-
- getCanonicalURL(String) - Static method in class edu.uci.ics.crawler4j.url.URLCanonicalizer
-
- getCanonicalURL(String, String) - Static method in class edu.uci.ics.crawler4j.url.URLCanonicalizer
-
- getConfig() - Method in class edu.uci.ics.crawler4j.crawler.Configurable
-
- getConnectionTimeout() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getContentCharset() - Method in class edu.uci.ics.crawler4j.crawler.Page
-
Returns the charset of the content.
- getContentData() - Method in class edu.uci.ics.crawler4j.crawler.Page
-
Returns the content of this page in binary format.
- getContentEncoding() - Method in class edu.uci.ics.crawler4j.crawler.Page
-
Returns the encoding of the content.
- getContentType() - Method in class edu.uci.ics.crawler4j.crawler.Page
-
Returns the ContentType of this page.
- getCrawlersLocalData() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
Once the crawling session finishes, the controller collects the local data
of the crawler threads and stores it in a List.
- getCrawlStorageFolder() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getCustomData() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
- getDatabaseEntryKey(WebURL) - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
-
- getDepth() - Method in class edu.uci.ics.crawler4j.url.WebURL
-
Returns the crawl depth at which this URL was first observed.
- getDocCount() - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
-
- getDocId(String) - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
-
Returns the docid of an already-seen URL.
- getDocid() - Method in class edu.uci.ics.crawler4j.url.WebURL
-
Returns the unique document id assigned to this URL.
- getDocIdServer() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
- getDomain() - Method in class edu.uci.ics.crawler4j.url.WebURL
-
Returns the domain of this URL.
- getEntity() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
-
- getFetchedUrl() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
-
- getFetchResponseHeaders() - Method in class edu.uci.ics.crawler4j.crawler.Page
-
Returns the headers that were present in the response to the
fetch request.
- getFrontier() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
- getHref() - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
-
- getHtml() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
-
- getHttpClient() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetcher
-
- getInstance() - Static method in class edu.uci.ics.crawler4j.parser.BinaryParseData
-
- getInstance() - Static method in class edu.uci.ics.crawler4j.url.TLDList
-
- getLastAccessTime() - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
-
- getLength() - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
-
- getMaxConnectionsPerHost() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getMaxDepthOfCrawling() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getMaxDownloadSize() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getMaxOutgoingLinksToFollow() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getMaxPagesToFetch() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getMaxTotalConnections() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getMovedToUrl() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
-
- getMyController() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
-
- getMyId() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
-
Returns the id of the current crawler instance.
- getMyLocalData() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
-
The CrawlController instance that created this crawler instance calls
this method just before terminating this crawler thread.
- getNewDocID(String) - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
-
- getNextURLs(int, List<WebURL>) - Method in class edu.uci.ics.crawler4j.frontier.Frontier
-
- getNumberOfAssignedPages() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
-
- getNumberOfProcessedPages() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
-
- getOutgoingUrls() - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
-
- getOutgoingUrls() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
-
- getPageFetcher() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
- getParentDocid() - Method in class edu.uci.ics.crawler4j.url.WebURL
-
Returns the unique document id of the parent page.
- getParentUrl() - Method in class edu.uci.ics.crawler4j.url.WebURL
-
Returns the URL of the parent page.
- getParseData() - Method in class edu.uci.ics.crawler4j.crawler.Page
-
Returns the parsed data generated for this page by the parsers.
- getPath() - Method in class edu.uci.ics.crawler4j.url.WebURL
-
Returns the path of this URL.
- getPolitenessDelay() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getPriority() - Method in class edu.uci.ics.crawler4j.url.WebURL
-
Returns the priority for crawling this URL.
- getProxyHost() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getProxyPassword() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getProxyPort() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getProxyUsername() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getQueueLength() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
-
- getResponseHeaders() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
-
- getRobotstxtServer() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
- getSocketTimeout() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getStatusCode() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
-
- getStatusDescription(int) - Static method in class edu.uci.ics.crawler4j.fetcher.CustomFetchStatus
-
- getSubDomain() - Method in class edu.uci.ics.crawler4j.url.WebURL
-
- getText() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
-
- getTextContent() - Method in class edu.uci.ics.crawler4j.parser.TextParseData
-
- getThread() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
-
- getTitle() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
-
- getURL() - Method in class edu.uci.ics.crawler4j.url.WebURL
-
Returns the URL string.
- getUserAgentName() - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
-
- getUserAgentString() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
- getValue(String) - Method in class edu.uci.ics.crawler4j.frontier.Counters
-
- getWebURL() - Method in class edu.uci.ics.crawler4j.crawler.Page
-
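The `getMyLocalData()` and `getCrawlersLocalData()` entries above describe a collect-on-finish pattern: each crawler thread keeps its own data, and the controller gathers it into a List once the session ends. The following is a minimal standalone sketch of that pattern using plain Java threads. It is not crawler4j's implementation; `LocalDataSketch`, `Worker`, and `runAndCollect` are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class LocalDataSketch {

    // Hypothetical stand-in for a crawler thread; not a crawler4j class.
    static class Worker implements Runnable {
        private final int id;
        private int pagesSeen; // per-worker statistic, touched only by this worker

        Worker(int id) {
            this.id = id;
        }

        @Override
        public void run() {
            // Pretend that worker N processes N + 1 pages.
            for (int i = 0; i <= id; i++) {
                pagesSeen++;
            }
        }

        // Plays the role of getMyLocalData(): read once, after the work is done.
        Object getMyLocalData() {
            return pagesSeen;
        }
    }

    // Plays the role of the controller: start the workers, wait for them,
    // then collect each worker's local data into a List.
    static List<Object> runAndCollect(int numberOfWorkers) {
        List<Worker> workers = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < numberOfWorkers; i++) {
            Worker worker = new Worker(i);
            Thread thread = new Thread(worker);
            workers.add(worker);
            threads.add(thread);
            thread.start();
        }

        List<Object> localData = new ArrayList<>();
        for (int i = 0; i < numberOfWorkers; i++) {
            try {
                // join() guarantees the worker's writes are visible here.
                threads.get(i).join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for workers", e);
            }
            localData.add(workers.get(i).getMyLocalData());
        }
        return localData;
    }

    public static void main(String[] args) {
        System.out.println(runAndCollect(3)); // prints [1, 2, 3]
    }
}
```

Reading each worker's data only after `join()` avoids any need for locking in the workers themselves, which is one plausible reason to collect the data at the end of the session rather than while it runs.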
- schedule(WebURL) - Method in class edu.uci.ics.crawler4j.frontier.Frontier
-
- scheduleAll(List<WebURL>) - Method in class edu.uci.ics.crawler4j.frontier.Frontier
-
- SCHEDULED_PAGES - Static variable in class edu.uci.ics.crawler4j.frontier.Counters.ReservedCounterNames
-
- scheduledPages - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
-
- setAnchor(String) - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
-
- setAnchor(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
-
- setCacheSize(int) - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
-
- setConnectionTimeout(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
Connection timeout in milliseconds.
- setContentCharset(String) - Method in class edu.uci.ics.crawler4j.crawler.Page
-
- setContentData(byte[]) - Method in class edu.uci.ics.crawler4j.crawler.Page
-
- setContentEncoding(String) - Method in class edu.uci.ics.crawler4j.crawler.Page
-
- setContentType(String) - Method in class edu.uci.ics.crawler4j.crawler.Page
-
- setCrawlStorageFolder(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
The folder used by the crawler to store the intermediate
crawl data.
- setCustomData(Object) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
- setDepth(short) - Method in class edu.uci.ics.crawler4j.url.WebURL
-
- setDocid(int) - Method in class edu.uci.ics.crawler4j.url.WebURL
-
- setDocIdServer(DocIDServer) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
- setEnabled(boolean) - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
-
- setEntity(HttpEntity) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
-
- setFetchedUrl(String) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
-
- setFetchResponseHeaders(Header[]) - Method in class edu.uci.ics.crawler4j.crawler.Page
-
- setFollowRedirects(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
Whether the crawler should follow redirects.
- setFrontier(Frontier) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
- setHref(String) - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
-
- setHtml(String) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
-
- setIncludeBinaryContentInCrawling(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
Whether the crawler should fetch binary content such as images and audio.
- setIncludeHttpsPages(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
Whether the crawler should also crawl https pages.
- setMaxConnectionsPerHost(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
Maximum connections per host.
- setMaxDepthOfCrawling(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
Maximum depth of crawling. For unlimited depth, this parameter should be
set to -1.
- setMaxDownloadSize(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
Maximum allowed size of a page.
- setMaxOutgoingLinksToFollow(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
Maximum number of outgoing links that are processed from a page.
- setMaxPagesToFetch(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
Maximum number of pages to fetch. For an unlimited number of pages, this
parameter should be set to -1.
- setMaxTotalConnections(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
Maximum total connections.
- setMovedToUrl(String) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
-
- setOutgoingUrls(List<WebURL>) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
-
- setPageFetcher(PageFetcher) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
- setParentDocid(int) - Method in class edu.uci.ics.crawler4j.url.WebURL
-
- setParentUrl(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
-
- setParseData(ParseData) - Method in class edu.uci.ics.crawler4j.crawler.Page
-
- setPath(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
-
- setPolitenessDelay(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
Politeness delay in milliseconds (delay between sending two requests to
the same host).
- setPriority(byte) - Method in class edu.uci.ics.crawler4j.url.WebURL
-
- setProcessed(WebURL) - Method in class edu.uci.ics.crawler4j.frontier.Frontier
-
- setProxyHost(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
If the crawler should run behind a proxy, this parameter can be used for
specifying the proxy host.
- setProxyPassword(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
If the crawler should run behind a proxy and a username and password are
needed for proxy authentication, this parameter can be used for specifying
the password.
- setProxyPort(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
If the crawler should run behind a proxy, this parameter can be used for
specifying the proxy port.
- setProxyUsername(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
If the crawler should run behind a proxy and a username and password are
needed for proxy authentication, this parameter can be used for specifying
the username.
- setResponseHeaders(Header[]) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
-
- setResumableCrawling(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
If this feature is enabled, you can resume a previously
stopped or crashed crawl.
- setRobotstxtServer(RobotstxtServer) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
- setSocketTimeout(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
Socket timeout in milliseconds.
- setStatusCode(int) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
-
- setText(String) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
-
- setTextContent(String) - Method in class edu.uci.ics.crawler4j.parser.TextParseData
-
- setThread(Thread) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
-
- setTitle(String) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
-
- setURL(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
-
- setUserAgentName(String) - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
-
- setUserAgentString(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
-
User-agent string that is used to identify your crawler to web
servers.
- setValue(String, long) - Method in class edu.uci.ics.crawler4j.frontier.Counters
-
- setWebURL(WebURL) - Method in class edu.uci.ics.crawler4j.crawler.Page
-
- shouldVisit(WebURL) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
-
Classes that extend WebCrawler can override this method to tell the
crawler whether the given URL should be crawled.
- shutdown() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
Sets the current crawling session to 'shutdown'.
- shutdown() - Method in class edu.uci.ics.crawler4j.fetcher.IdleConnectionMonitorThread
-
- shutDown() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetcher
-
- shuttingDown - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
-
Whether the crawling session is set to 'shutdown'.
- sleep(int) - Static method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
- start(Class<T>, int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
Start the crawling session and wait for it to finish.
- start(Class<T>, int, boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
- startElement(String, String, String, Attributes) - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
-
- startNonBlocking(Class<T>, int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
-
Start the crawling session and return immediately.
- statisticsDB - Variable in class edu.uci.ics.crawler4j.frontier.Counters
-
- statusCode - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
-
- sync() - Method in class edu.uci.ics.crawler4j.frontier.Counters
-
- sync() - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
-
- sync() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
-
- sync() - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
-
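The `shouldVisit(WebURL)` entry above says that subclasses of WebCrawler can override the method to decide whether a URL is crawled. The decision logic in such an override might look like the sketch below. To stay self-contained it works on a plain URL string (the kind of value `WebURL.getURL()` returns) and does not depend on crawler4j. The class name `VisitFilter`, the extension list, and the `http://www.example.com/` prefix are assumptions chosen for illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class VisitFilter {

    // File extensions to skip; the list is illustrative, not exhaustive.
    private static final Pattern SKIPPED_EXTENSIONS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|mp3|zip|gz)$");

    // Hypothetical site prefix; restricts the crawl to a single site.
    private static final String ALLOWED_PREFIX = "http://www.example.com/";

    // Returns true if the URL is on the allowed site and does not point
    // to one of the skipped file types.
    public static boolean shouldVisit(String href) {
        String normalized = href.toLowerCase(Locale.ROOT);
        return !SKIPPED_EXTENSIONS.matcher(normalized).matches()
                && normalized.startsWith(ALLOWED_PREFIX);
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://www.example.com/index.html"));  // true
        System.out.println(shouldVisit("http://www.example.com/logo.PNG"));    // false
        System.out.println(shouldVisit("http://other.example.org/page.html")); // false
    }
}
```

In an actual WebCrawler subclass, the same check would presumably sit inside the overridden `shouldVisit(WebURL)` and be applied to the string returned by `getURL()`.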