{"id":965,"date":"2014-12-01T06:29:40","date_gmt":"2014-12-01T06:29:40","guid":{"rendered":"http:\/\/invisiblezero.net\/?p=965"},"modified":"2024-03-11T19:31:08","modified_gmt":"2024-03-11T19:31:08","slug":"php-crawl-websites-from-command-line-interface","status":"publish","type":"post","link":"http:\/\/ndthanh.com\/php-crawl-websites-from-command-line-interface\/","title":{"rendered":"PHP – Crawl websites from command line interface"},"content":{"rendered":"
Recently, I wrote a new crawler script to warm caches on some Magento websites. Today I’d like to share it with you, because I wrote it so that it also works with many websites and platforms other than Magento.<\/p>\n
You can see the help content by running the crawler from the command line as shown below. Make sure that there is no sitemap.xml file in the crawler's directory, or pass the -help option on the command line.<\/p>\n
<\/p>\n
\nphp -f iz_crawler.php\nUsage: php -f crawler.php -- [options]\n\n -sitemap &lt;list of files&gt; List of sitemap xml files, delimit by semicolon ; . Default is 'sitemap.xml'\n -website &lt;website&gt; Website url for input. Will be ignored if -sitemap option selected or there is sitemap.xml file in the same directory with this crawler\n -depth &lt;number&gt; Set depth level. Default is 0\n -interval &lt;number&gt; Set scrap interval, measure in second(s). Defalt is 0\n -exclude &lt;extensions&gt; Exclude link extensions like png, css, js, etc... delimit by semicolon ; . Default is &quot;jpg;png;jpeg;pdf;7z;zip;rar;mp3;aac;mp4;apk;bat;tar;swf;iso&quot;\n -verbose Display crawler output. Default is false\n -help This help\n\n Note: sitemap.xml default location is at root, and it will add initial urls for crawler, use -depth to make most use of sitemap.xml\n\n Example : php -f crawler.php -- -website http:\/\/www.google.com -depth 1 -interval 0.5 -verbose -exclude &quot;png;pdf;html&quot;\n<\/pre>\nThe help content explains most of the options, so I will only show what a run looks like here. I placed iz_crawler.php in the root directory of my website and executed this command “php -f iz_crawler.php -- -verbose”:<\/p>\n
\nphp -f iz_crawler.php -- -verbose\nHome<\/a><\/blockquote>