Web Crawler PHP

Nov 2, 2006

I am supposed to construct a page that searches in specific websites to extract information, like those sites from where you can rent a car for example. There is a form in the site where the user selects some fields (for instance departure and drop-off date), then the data are submitted to the other page that searches 2-3 sites and finds which cars are available on those dates.

I wanted to ask if there are ready-made scripts to do that or, if not, for some hints on how to start. I am familiar with PHP forms and data extraction from MySQL databases, but I have no clue how to begin extracting data from other sites.

PHP Based Web Crawler Or Java Based Web Crawler?

I have some doubts about PHP-based web crawlers: can they run like a Java thread-based one? I am asking because in Java a thread can be executed again and again, and I don't think PHP has anything like a thread function. Can you please say which web crawler would be more useful: a PHP-based one or a Java-based one?

PHP Web Crawler

I am working on a PHP web crawler and am having trouble parsing links out of a page. All that happens is that "Array" is printed out. Here is the script:

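$f = fopen("http://www.google.com", "r");
$inputStream = stream_get_contents($f); // read the whole response, not just the first chunk
fclose($f);
// Single-quoted pattern so the inner double quotes need no escaping; "</a>" becomes "<\/a>"
if (preg_match_all('/<a.*? href="(.*?)".*?>(.*?)<\/a>/is', $inputStream, $matches)) {
    foreach ($matches[1] as $i => $href) {
        // strip_tags() expects a string, so apply it per link text rather than to the whole array
        echo $href, " => ", strip_tags($matches[2][$i]), "<br>";
    }
}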

Can someone please help me?

Web Crawler

I have a script that parses out links in a page, now I want to figure out how to follow those links. Here is the script:

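$f = fopen("http://www.theotaku.com", "r");
while (($buf = fgets($f, 1024)) !== false) {
    // $words[0] holds the full matches, $words[1] the URLs, $words[2] the link text
    preg_match_all('/<a.*? href="(.*?)".*?>(.*?)<\/a>/i', $buf, $words);

    // Braces keep the indexing inside the read loop, so every line is processed
    foreach ($words as $group) {
        foreach ($group as $word) {
            $cur_word = strtolower($word);
            print "Indexing: $cur_word<br>";
        }
    }
}
fclose($f);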
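
One possible way to follow the links is to keep a queue of URLs still to fetch and a set of URLs already fetched. The sketch below is only an illustration: the function name `followLinks`, the `$fetch` callable and the `$maxPages` cap are assumptions, not part of the script above, and only absolute http(s) links are queued (relative links would need to be resolved first).

```php
<?php
// Breadth-first crawl. $fetch is any callable that returns a page's HTML
// as a string, or false on failure (hypothetical parameter).
function followLinks(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $queue   = [$startUrl];   // URLs waiting to be fetched
    $visited = [];            // URLs already fetched, used as a set

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;         // never fetch the same URL twice
        }
        $visited[$url] = true;

        $html = $fetch($url);
        if (!is_string($html)) {
            continue;         // fetch failed, nothing to parse
        }

        // Same idea as the regex above: capture the href of every <a> tag.
        preg_match_all('/<a\s[^>]*?href="([^"]*)"/i', $html, $m);
        foreach ($m[1] as $link) {
            // Only absolute http(s) links are queued in this sketch.
            if (preg_match('#^https?://#i', $link) && !isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}
```

For a live crawl, `followLinks("http://www.theotaku.com", 'file_get_contents', 20)` should work where `allow_url_fopen` is enabled. The page cap keeps the crawl from running indefinitely.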

Php Based Crawler

The problem is, I'm trying to make a central portal where the recent posts from all of my friends' blogs can be seen, so that it's easy to see who posted what.

The process needs to be that when I add a URL, the crawler keeps checking it. If a new post is made, it has to appear on my central portal with the title and description.

So is there a way to do this, or is there any script out there that already does this?
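
If the blogs publish RSS 2.0 feeds (an assumption; most blog platforms do, but the post does not say), polling the feed is usually simpler than crawling the HTML. A minimal sketch, where the function name `latestPosts` and the `$limit` parameter are hypothetical:

```php
<?php
// Parse an RSS 2.0 document and return the first $limit posts
// as arrays with title, link and description.
function latestPosts(string $rssXml, int $limit = 5): array
{
    // Collect XML errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $feed = simplexml_load_string($rssXml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($feed === false || !isset($feed->channel->item)) {
        return [];            // not valid XML, or not an RSS 2.0 feed
    }

    $posts = [];
    foreach ($feed->channel->item as $item) {
        $posts[] = [
            'title'       => (string) $item->title,
            'link'        => (string) $item->link,
            'description' => (string) $item->description,
        ];
        if (count($posts) >= $limit) {
            break;
        }
    }
    return $posts;
}
```

The feed XML could be fetched with `file_get_contents()` on a schedule (for example from cron), and the `link` of each post compared with those already stored to decide which posts are new.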

Image Crawler

How do I script an image crawler? I'm developing on Windows with PHP 4. Is it true that images can be manipulated easily only in PHP 5?

Crawler Identifier

I am running a website with specific functions that collect information about users' preferences on that website. But crawlers often come to my site, and my scripts insert records about their visits. Is there a quick and easy way to identify crawlers so I can ignore their visits?
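
A quick heuristic is to look at the User-Agent header, since well-behaved crawlers usually identify themselves there. The sketch below is an assumption-laden illustration: the function name `isLikelyCrawler` and the keyword list are made up for this example, the list is not exhaustive, and a crawler that fakes a browser User-Agent will not be caught.

```php
<?php
// Returns true when the User-Agent looks like an automated client.
function isLikelyCrawler(?string $userAgent): bool
{
    // Assumption: a request with no User-Agent at all is treated as automated.
    if ($userAgent === null || trim($userAgent) === '') {
        return true;
    }
    // Keywords that commonly appear in crawler User-Agent strings.
    return preg_match('/bot|crawl|spider|slurp/i', $userAgent) === 1;
}

// Usage: skip the tracking insert for crawlers.
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? null;
if (!isLikelyCrawler($userAgent)) {
    // ... insert the visit record here ...
}
```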

Keep Crawler In One Domain?

I'm writing a simple PHP crawler, essentially a class which recursively crawls the website by detecting link tags and going deeper. The problem is that I would like to contain it within the domain it's crawling, otherwise it will start following links to other domains and start a never-ending chain reaction. My idea was to write a regex that would extract the bare domain name of the website (e.g. domain.com) and check every link against it. The regex itself would have to be quite long, since I would have to include all TLDs in it.

UPDATE: parse_url is not a solution; it can only give the HOST name, not the DOMAIN name.
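
If the bare domain of the start site is already known (for example because it is passed in by hand), no TLD list is needed: the host from parse_url only has to equal that domain or end with a dot followed by it. A minimal sketch under that assumption; `isInternalLink` and `$baseDomain` are hypothetical names. Deriving the bare domain automatically from an arbitrary host is a different problem that generally needs the Public Suffix List.

```php
<?php
// True when $url points at $baseDomain or one of its subdomains.
function isInternalLink(string $url, string $baseDomain): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false;         // relative or malformed URL: no host to compare
    }
    $host       = strtolower($host);
    $baseDomain = strtolower($baseDomain);

    if ($host === $baseDomain) {
        return true;
    }
    // Require a leading dot so "notdomain.com" does not match "domain.com".
    $suffix = '.' . $baseDomain;
    return substr($host, -strlen($suffix)) === $suffix;
}
```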

Parse A Url For Crawler?

I am writing a small crawler that extracts data from some 5 to 10 sites. While collecting the links, I am getting some URLs like this: ../test/index.html. If it were /test/index.html, we could simply append it to the base URL to get http://www.example.com/test/index.html
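
A relative link such as ../test/index.html has to be resolved against the URL of the page it was found on, not just the site root. The sketch below covers the common cases only (absolute, protocol-relative, root-relative and path-relative links with "." and ".." segments); it is a simplified subset of the RFC 3986 rules, and the function name `resolveUrl` is hypothetical.

```php
<?php
// Resolve $rel against the absolute URL $base of the page it appeared on.
function resolveUrl(string $base, string $rel): string
{
    if ($rel === '') {
        return $base;
    }
    if (is_string(parse_url($rel, PHP_URL_SCHEME))) {
        return $rel;                       // already absolute (http:, mailto:, ...)
    }
    $b = parse_url($base);
    if (substr($rel, 0, 2) === '//') {
        return $b['scheme'] . ':' . $rel;  // protocol-relative link
    }
    $origin   = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Split the reference into its path and its "?query#fragment" tail.
    $cut  = strcspn($rel, '?#');
    $path = substr($rel, 0, $cut);
    $tail = substr($rel, $cut);

    if ($path === '') {
        $path = $basePath;                 // "?page=2" or "#top" keeps the base path
    } elseif ($path[0] !== '/') {
        // Relative path: replace everything after the last "/" of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    return $origin . '/' . ltrim(implode('/', $out), '/') . $tail;
}
```

With the example from the post, `resolveUrl('http://www.example.com/dir/page.html', '../test/index.html')` gives `http://www.example.com/test/index.html`.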

What Is The Working Of Web Crawler?

Will a web crawler crawl the web and create a database of the web, or will it just create a searchable index of the web? If it creates an index, who exactly gathers the data from the web pages and stores it in a database?

PHP: BOT, Web, Crawler, Spider ?

I am looking for how to make one of these, but my searches on Google and other search engines have failed. I have yet to produce the results I need. I'm trying to make a bot that is capable of logging into a site, storing cookies and going between two pages to keep me logged in while I'm not actually on the page.

This is not going to be used for 'cheating' purposes of any kind. I merely want to fake my logged-in time on a site which is actually my school page.

Checking If Referrer Is Web Crawler

I have a book affiliate website. Whenever a visitor clicks on one
of the books, a script adds one to a field in a mysql database and then
takes the visitor to the shopping basket on the book website.

I have noticed that the book links are getting lots of hits. At first, I
was pleased about the potential income this might mean, but then it
occurred to me that many of these hits are web crawlers (this was
confirmed by Webalizer).

Any suggestions of ways of checking if the link is being "clicked" by a
web crawler, so that I can avoid incrementing the field in the SQL database?

I've checked HTTP_REFERER but it seems to be empty for what I assume
are crawled clicks.

Visiting Other Sites (crawler)

Is it possible to visit other sites from PHP code? I need this for a crawler for my search engine.

Website Crawler And Indexer

I am trying to create a crawler and indexer for my site and its search page. What I want to know is: is there an easy way to extract each link from a page, or is it possible to do this with a PHP function? I am doing it this way because I am going to have a crawler that logs all the links within my site, and then an indexer will go along and index each page and its contents.

Make A Simple Crawler?

I have a web page with a bunch of links. I want to write a script which would dump all the data contained in those links in a local file.

Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.

Web Crawler For Competitive Pricing

I am thinking of writing an application that will pseudo-track competing websites to ensure that our prices stay competitive, etc. I looked at possibly using the Google Shopping Search API, but I felt that it could be lacking in flexibility, and not all of our competitors are fully listed or updated regularly. My question is, where is a good place to start with a PHP-based web crawler? I obviously want a crawler that is respectful (even to our competitors), so it will hopefully obey robots.txt and throttle its requests. (To be fair, I think I am even going to host this on a third-party server and have it crawl our websites to show no bias.) I looked around via Google and couldn't find any mature packages, only some poorly written SourceForge scripts that haven't been maintained in over a year, despite being labeled as beta or alpha.

Get Data From Crawler To Site

What is the best way to get data from an external crawler into my database for my site? I work in a LAMP environment. Are web services a good idea? The crawler runs every 15 minutes.

Create A Web Crawler In Python?

I created a web crawler in Python that gets data from specific parts of websites and stores it in a MySQL database, which is later displayed on my website. However, when I display the data on my website, it appears with weird characters like this: "After many years of theft, there�s still more to steal" and "Here�s how to reclaim forests". Notice the question mark in the triangle. When I use the function mb_detect_encoding, it tells me the data is ASCII, yet the default collation is latin1_swedish_ci. When I save the data in the database, though, I override the default and use UTF-8 instead. Please tell me what could be wrong.
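
One common cause of this symptom is a curly apostrophe stored as a single Windows-1252 byte and then sent to a page declared as UTF-8. Whether that is what happened here cannot be told from the post, so the sketch below is only a guess at a repair: it assumes the mbstring extension is available and that any text that is not valid UTF-8 is Windows-1252. The function name `toUtf8` is hypothetical.

```php
<?php
// Return $text as UTF-8, converting it only when it is not valid UTF-8 already.
function toUtf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;         // already valid UTF-8 (this includes plain ASCII)
    }
    // Assumption: invalid UTF-8 coming from the crawler is Windows-1252.
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}
```

It is also worth checking that the database connection itself uses UTF-8. With mysqli that would be `$mysqli->set_charset('utf8mb4')`, where `$mysqli` stands for the open connection object.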

Build A Web Crawler Such As MLBot?

I am looking to build a web crawler such as MLBot. It must recognise robots.txt and the ROBOTS meta tag. That said, when a site such as WordPress shows visitor stats, it lists the crawler (eg MLBot [URL].... So how can I build a crawler that will be listed as HackAliveBot or HackAliveCrawler and will recognise robots.txt and the ROBOTS meta tag?
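
The name a site sees in its stats is whatever the crawler sends in its User-Agent header, and the same name is what robots.txt groups are matched against. Below is a deliberately simplified sketch of a robots.txt check: it only understands User-agent and Disallow lines with plain path prefixes (no Allow rules, no wildcards), and the function name `isAllowedByRobots` is hypothetical. The ROBOTS meta tag is not covered.

```php
<?php
// Decide whether $botName may fetch $path according to $robotsTxt.
function isAllowedByRobots(string $robotsTxt, string $botName, string $path): bool
{
    $groups       = [];      // lower-cased agent name => list of Disallow prefixes
    $current      = [];      // agent names of the group currently being read
    $readingRules = false;   // true once the current group has had a rule line

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));   // drop comments
        if (strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if ($readingRules) {            // a new group starts here
                $current = [];
                $readingRules = false;
            }
            $agent = strtolower($value);
            $current[] = $agent;
            $groups[$agent] = $groups[$agent] ?? [];
        } else {
            $readingRules = true;
            // An empty Disallow value means "nothing is disallowed".
            if ($field === 'disallow' && $value !== '') {
                foreach ($current as $agent) {
                    $groups[$agent][] = $value;
                }
            }
        }
    }

    // Prefer a group naming this bot; otherwise fall back to the "*" group.
    $rules = $groups['*'] ?? [];
    foreach ($groups as $agent => $prefixes) {
        $agent = (string) $agent;
        if ($agent !== '*' && $agent !== '' && stripos($botName, $agent) !== false) {
            $rules = $prefixes;
            break;
        }
    }
    foreach ($rules as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}

// Make PHP's HTTP stream wrapper (file_get_contents, fopen) identify as the bot.
ini_set('user_agent', 'HackAliveBot/1.0');
```

With cURL, the equivalent of the last line is the `CURLOPT_USERAGENT` option.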

Teaching A Crawler To Identify A Blog

I am currently trying to teach a web crawler how to identify blogs,
that is I am trying to determine a fairly inclusive set of criteria
that will help my crawler to identify them.

I have noticed that many Blogs include

div class=blogsomething (a format class conveniently named blog)

xml tags

and/or php code.

I do know that cms(content management system) is used for several
blogs, does anyone else have any suggestions to help me determine

I am aware that any criteria are subjective, especially when
considering sites such as slashdot which has been around longer than

Crawler For Ajax Based Websites?

Maybe this is gonna sound naive and all, but is there something even remotely close to a php crawler for ajax based websites?

Creating A Simple Site Crawler

Basically, for my final year project I am making a webcomic site. But I wanted a feature that tells you when comics hosted on external servers (such as explosm, penny arcade etc) are updated.

I know this can be done, since it's done on [URL]. I've tried looking into the technology needed, and as far as I can see, I need cURL.

I was just wondering if anyone could point me in the right direction to a tutorial or pseudo code, or tell me if this is possible to implement (at least for a few major comics) by a week on Wednesday.

And if anyone is interested, my site is [URL], still heavily under construction.

Simple Crawler To Echo Links?

I wanted to make a simple crawler in PHP that would let me get the links in a web page, echo their URLs, and crawl to other pages to do the same under a certain domain. Would using cURL be necessary here? Also, how would one specify the depth of the crawler? I have this so far:

$dom = new DOMDocument;
foreach( $dom->getElementsByTagName('a') as $node ) {
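
The excerpt above is cut off. As a rough sketch of one way it could continue, depth can be handled by passing a counter that is decremented on each recursive call. The function name `crawlToDepth`, the `$fetch` callable and the `$seen` array are assumptions for this example. Because the recursion is depth-first, a page first reached through a long path may not have its own links explored even if a shorter path to it exists.

```php
<?php
// Echo $url, then follow its same-host links up to $depth more levels.
// $fetch is any callable returning the page HTML as a string, or false.
function crawlToDepth(string $url, int $depth, callable $fetch, array &$seen = []): void
{
    if ($depth < 0 || isset($seen[$url])) {
        return;
    }
    $seen[$url] = true;
    echo $url, "\n";

    if ($depth === 0) {
        return;               // depth exhausted: do not fetch or follow further
    }
    $html = $fetch($url);
    if (!is_string($html) || trim($html) === '') {
        return;               // nothing to parse
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // @ silences warnings about malformed HTML
    $host = parse_url($url, PHP_URL_HOST);

    foreach ($dom->getElementsByTagName('a') as $node) {
        $href = $node->getAttribute('href');
        // Stay on the same host and follow absolute http(s) links only.
        if (preg_match('#^https?://#i', $href)
            && parse_url($href, PHP_URL_HOST) === $host) {
            crawlToDepth($href, $depth - 1, $fetch, $seen);
        }
    }
}
```

cURL is not strictly necessary: passing `'file_get_contents'` as `$fetch` is enough where `allow_url_fopen` is enabled, and a cURL-based function could be swapped in for timeouts or cookies.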

Mp3 Link Crawler For Dynamic Links?

I am writing a crawler that will go around a specific set of websites and crawl all the MP3 links into the database. I don't want to download the files, just crawl the links, index them and be able to search them. Using PHP and how some sites like [URL]....

Crawler Index/crawl Session?

I am new to PHP and want to know: if I store data in a PHP session on a page, will crawlers crawl the data in the session? Will the crawler still crawl the rest of the page?

Site Crawler Died While It's Running?

I wrote a site crawler to get links and images to create a site map, but it gets killed while running! Here is part of my class (not the whole thing):

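class pageCrawler {
    private $links = array();
    private $host;

    public function __construct ( $url ) {
        // Keep running even if the browser that started the script disconnects
        ignore_user_abort ( true );
        // Ask PHP to call callRegisteredShutdown() when the script ends
        register_shutdown_function ( array ( $this, 'callRegisteredShutdown' ) );
        $urlParts = parse_url ( $url ); // assumed: $urlParts was not defined in the posted excerpt
        $this->host = $urlParts [ 'host' ];
    }
    // ... rest of the class (including callRegisteredShutdown) was not posted
}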


That is the general outline of my class. It works, but then it suddenly crashes. I called set_time_limit(0) so it can run forever, but my process doesn't finish, because my shutdown function doesn't execute!

