Web Crawler example in XQueryDescription of a web crawler example in XQuery.Entire script can be seen here: web crawler script The idea is to crawl through the pages of a website and store a list with external pages and internal pages and check if they work or not. This example uses Zorba's http module for accessing the webpages, and the html module for converting the html to xml. The complete code can be found in the test directory of the html convertor module (link_crawler2.xq2). import module namespace http = "http://www.zorba-xquery.com/modules/http-client"; import module namespace map = "http://zorba.io/modules/store/data-structures/unordered-map"; import module namespace html = "http://www.zorba-xquery.com/modules/converters/html"; import module namespace parse-xml = "http://www.zorba-xquery.com/modules/xml"; import module namespace file = "http://expath.org/ns/file";The internal pages are checked recursively, while the external ones are only checked for existence. The distinction between internal and external links is made by comparing the URI with a global string variable $uri-host. Change this variable to point to your website, or a subdirectory on your website. declare variable $top-uri as xs:string := "http://zorba.io/"; declare variable $uri-host as xs:string := "http://zorba.io"; declare function local:is-internal($x as xs:string) as xs:boolean { starts-with($x, $uri-host) };The crawling starts from the URI pointed by $top-uri.Visited links are stored as nodes in two maps, one for internal pages and one for external pages. The keys are the URIs, and the values are the strings "broken" or "clean", plus error codes if processing failed. The maps are used to avoid parsing the same page twice. declare variable $local:processed-internal-links := xs:QName("processed-internal-links"); declare variable $local:processed-external-links := xs:QName("processed-external-links"); declare %an:sequential function local:create-containers() { map:create($local:processed-internal-links, xs:QName("xs:string")); map:create($local:processed-external-links, xs:QName("xs:string")); }; declare %an:sequential function local:delete-containers(){ for $x in map:available-maps() return map:delete($x); };After parsing an internal page with html module, all the links are extracted and parsed recursively, if they haven't been parsed. The html module uses tidy library, so we use tidy options to setup for converting from html to xml. Some html tags are marked to be ignored in new-inline-tags param, this being a particular case of this website. You can add or remove tags to suit your website needs. The spaces in the url links are trimmed and normalized, and the fragment part is ignored. declare variable $local:tidy-options := <options xmlns="http://www.zorba-xquery.com/modules/converters/html-options" > <tidyParam name="output-xml" value="yes" /> <tidyParam name="doctype" value="omit" /> <tidyParam name="quote-nbsp" value="no" /> <tidyParam name="char-encoding" value="utf8" /> <tidyParam name="newline" value="LF" /> <tidyParam name="tidy-mark" value="no" /> <tidyParam name="new-inline-tags" value="nav header section article footer xqdoc:custom d c options json-param" /> </options>; declare %an:sequential function local:get-real-link($href as xs:string, $start-uri as xs:string) as xs:string? { variable $absuri; try{ $absuri := local:my-substring-before(resolve-uri(fn:normalize-space($href), $start-uri), "#"); } catch * { map:insert($local:processed-external-links, (<FROM>{$start-uri}</FROM>, <MESSAGE>malformed</MESSAGE>, <RESULT>broken</RESULT>), $href); } $absuri }; declare %an:sequential function local:get-out-links-parsed($content as node()*, $uri as xs:string) as xs:string* { distinct-values( for $y in ($content//*:a/string(@href), $content//*:link/string(@href), $content//*:script/string(@src), $content//*:img/string(@src), $content//*:area/string(@href) ) return local:get-real-link($y, $uri)) }; declare %an:sequential function local:map-insert-result($map-name as xs:QName, $url as xs:string, $http-result as item()*) { if(count($http-result) ge 1) then map:insert($map-name, (<STATUS>{fn:string($http-result[1]/@status)}</STATUS>, <MESSAGE>{fn:string($http-result[1]/@message)}</MESSAGE>, <RESULT>{if(local:alive($http-result)) then "Ok" else if(local:is-redirect($http-result)) then "redirect" else "broken" }</RESULT>), $url); else map:insert($map-name, <RESULT>broken</RESULT>, $url); if(local:is-redirect($http-result)) then map:insert($map-name, <REDIRECT>{fn:string($http-result[1]/httpsch:header[@name = "Location"]/@value)}</REDIRECT>, $url); else {} }; declare %an:sequential function local:process-internal-link($x as xs:string, $baseUri as xs:string, $n as xs:integer){ if(not(empty(map:get($local:processed-internal-links, $x)))) then exit returning false(); else {} fn:trace($x, "GET internal link"); map:insert($local:processed-internal-links, <FROM>{$baseUri}</FROM>, $x); variable $http-call:=(); try{ $http-call:=http:send-request(<httpsch:request method="GET" href="{$x}" follow-redirect="false"/>, (), ()); } catch * { } if(local:is-redirect($http-call)) then { local:map-insert-result($local:processed-internal-links, $x, $http-call); try{ $http-call:=http:send-request(<httpsch:request method="GET" href="{$x}"/>, (), ()); } catch * { } } else {} if( not(local:alive($http-call))) then { local:map-insert-result($local:processed-internal-links, $x, $http-call); exit returning ();} else {} if(not (local:get-media-type($http-call[1]) = "text/html")) then { local:map-insert-result($local:processed-internal-links, $x, $http-call); exit returning ();} else {} variable $string-content := string($http-call[2]); variable $content:=(); try{ $content:=html:parse($string-content,$local:tidy-options ); local:map-insert-result($local:processed-internal-links, $x, $http-call); } catch * { map:insert($local:processed-internal-links, (<MESSAGE>{concat("cannot tidy: ", $err:description)}</MESSAGE>, <RESULT>broken</RESULT>), $x); try{ $content:=parse-xml:parse-xml-fragment ($string-content, ""); } catch * { map:insert($local:processed-internal-links, <MESSAGE>{concat("cannot parse: ", $err:description)}</MESSAGE>, $x);} } variable $links :=(); if(empty($content)) then $links:=local:get-out-links-unparsed($string-content, fn:trace($x, "parse with regex, because tidy failed")); else $links:=local:get-out-links-parsed($content, $x); for $l in $links return local:process-link($l, $x, $n+1); };For each parsed link, we store the FROM, STATUS, MESSAGE and RESULT. The RESULT is "Ok" if everything went fine, or "broken" if the page couldn't be retrieved or passed, and in this case MESSAGE contains the error message. The FROM element contains the parent url for that link. Some html pages have errors, and tidy library is very strict with checking errors. When the parsing fails, we fallback to using regex for extracting the links. declare %an:sequential function local:get-out-links-unparsed($content as xs:string, $uri as xs:string) as xs:string*{ distinct-values( let $search := fn:analyze-string($content, "(<|&lt;|<)(((a|link|area).+?href)|((script|img).+?src))=([""'])(.*?)\7") for $other-uri2 in $search//group[@nr=8]/string() return local:get-real-link($other-uri2, $uri) ) };For external links, we just check if they exist, so the http command requests only for HEAD. Some websites return error for HEAD, in this case we revert to use GET. declare %an:sequential function local:process-external-link($x as xs:string, $baseUri as xs:string){ if(not(empty(map:get($local:processed-external-links, $x)))) then exit returning false(); else {} fn:trace($x, "HEAD external link"); map:insert($local:processed-external-links, <FROM>{$baseUri}</FROM>, $x); variable $http-call:=(); try{ $http-call:=http:send-request(<httpsch:request method="GET" href="{$x}"/>, (), ()); if((count($http-call) ge 1) and fn:not($http-call[1]/@status eq 200)) then { if(local:is-redirect($http-call)) then { local:map-insert-result($local:processed-external-links, $x, $http-call); } else {} $http-call:=http:send-request(<httpsch:request method="GET" href="{$x}"/>, (), ()); local:map-insert-result($local:processed-external-links, $x, $http-call); } else {} } catch * { $http-call:=();} local:map-insert-result($local:processed-external-links, $x, $http-call); };After parsing, the results are returned in xml format. declare function local:print-results() as element()* { for $x in map:keys($local:processed-internal-links)/map:attribute/@value/string() return <INTERNAL><LINK>{$x}</LINK><RESULT>{map:get($local:processed-internal-links,$x)}</RESULT></INTERNAL>, for $x in map:keys($local:processed-external-links)/map:attribute/@value/string() return <EXTERNAL><LINK>{$x}</LINK><RESULT>{map:get($local:processed-external-links,$x)}</RESULT></EXTERNAL> };The main program calls the recursive function local:process-link for the $top-uri. (:========================================== ===========================================:) variable $uri:= $top-uri; variable $result; local:create-containers(); local:process-link($uri, "", 1); $result:=local:print-results() ; local:delete-containers(); file:write(fn:resolve-uri("link_crawler_result.xml"), <result>{$result}</result>, <output:serialization-parameters> <output:indent value="yes"/> </output:serialization-parameters>) |