Web Crawler PHP

Nov 2, 2006

I am supposed to construct a page that searches specific websites to extract information, for example sites where you can rent a car. The site has a form where the user selects some fields (for instance departure and drop-off dates); the data are then submitted to another page that searches 2-3 sites and finds which cars are available on those dates.

I wanted to ask if there are ready-made scripts to do this and, if not, for some hints on how to start. I am familiar with PHP forms and with extracting data from MySQL databases, but when it comes to extracting data from other sites, I have no clue where to begin...
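
Assuming no ready-made script fits, the usual approach is to fetch the other sites yourself and, where a search form is involved, submit it with cURL. A minimal sketch, in which the rental-site URL and form field names are placeholders that would have to be read off the real site's form:

<?php
// Minimal sketch: submit a search form on a remote rental site and read back
// the result page. The URL and field names below are placeholders.
$postFields = array(
    'pickup_date'  => '2006-11-10',
    'dropoff_date' => '2006-11-15',
);

$ch = curl_init('http://www.example-rentals.com/search.php');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$resultHtml = curl_exec($ch);
curl_close($ch);

// The available cars would then be pulled out of $resultHtml, for example with
// regular expressions or DOM parsing, and merged with the other sites' results.
?>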


PHP Based Web Crawler Or Java Based Web Crawler?

Jul 27, 2010

I have some doubts about PHP-based web crawlers: can they run like the Java thread-based ones? I ask because in Java a thread can be executed again and again, and I don't think PHP has anything like a thread function. Can you say which web crawler would be more useful, a PHP-based one or a Java-based one?
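
PHP has no threads in the Java sense, but a crawler can still fetch pages in parallel. A minimal sketch using curl_multi, with placeholder URLs:

<?php
// Minimal sketch: fetch several pages in parallel with curl_multi, the usual
// PHP substitute for one-thread-per-request crawling.
$urls = array(
    "http://www.example.com/",
    "http://www.example.org/",
);

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all transfers until every handle has finished.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    echo $url . ": " . strlen($html) . " bytes fetched<br>";
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>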

View 2 Replies View Related

PHP Web Crawler

Sep 12, 2006

I am working on a PHP web crawler and am having trouble parsing links out of a page; all that happens is that the array is printed out. Here is the script.

<?php
$f = fopen("http://www.google.com", "r");
$inputStream = fread($f, 65535);
fclose($f);
// Single-quoted pattern so the double quotes around the href value and the
// "/" in "</a>" don't break the string.
if (preg_match_all('/<a.*? href="(.*?)".*?>(.*?)<\/a>/i', $inputStream, $matches)) {
    // strip_tags() expects a string, not an array, so apply it to each
    // captured link text instead of the whole $matches array.
    $matches[2] = array_map('strip_tags', $matches[2]);
    print_r($matches);
}
?>

Can someone please help me?

View 1 Replies View Related

Web Crawler

Sep 24, 2006

I have a script that parses out the links in a page; now I want to figure out how to follow those links. Here is the script:

<?php
$f = fopen("http://www.theotaku.com", "r");
while ($buf = fgets($f, 1024)) {
    // Single-quoted pattern so the inner double quotes and the "/" in "</a>"
    // don't break the string. Note: fgets() reads line by line, so tags that
    // span lines will be missed.
    preg_match_all('/<a.*? href="(.*?)".*?>(.*?)<\/a>/i', $buf, $words);

    // $words[1] holds the href values, $words[2] the link text.
    foreach ($words[1] as $j => $href) {
        $cur_word = strtolower(strip_tags($words[2][$j]));
        print "Indexing: $cur_word ($href)<br>";
    }
}
fclose($f);
?>
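
A minimal sketch of one way to follow the extracted links, offered as an assumption-laden starting point rather than a finished crawler: keep a queue of URLs, fetch each one only once, and only queue absolute http(s) links.

<?php
// Hypothetical sketch: follow discovered links, fetching each URL only once
// and stopping after a small limit.
$visited = array();
$queue = array("http://www.theotaku.com");

while (!empty($queue) && count($visited) < 50) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);   // requires allow_url_fopen
    if ($html === false) {
        continue;
    }

    if (preg_match_all('/<a.*? href="(.*?)"/i', $html, $m)) {
        foreach ($m[1] as $href) {
            // Only queue absolute http(s) links in this simple sketch.
            if (preg_match('#^https?://#i', $href) && !isset($visited[$href])) {
                $queue[] = $href;
            }
        }
    }
    print "Crawled: $url<br>";
}
?>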

View 1 Replies View Related

Php Based Crawler

Oct 11, 2006

The problem is, I'm trying to make a central portal so that all of my friends' blogs' recent posts can be seen on it, making it easy to see who posted what.

The process needs to be that when I add a URL, the crawler keeps checking on the URLs; if a new post is made, it has to appear on my central portal with the title and description.

So is there a way to do this, or any script out there that currently does this?
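
One hedged approach, assuming the blogs expose RSS feeds (the feed URLs below are placeholders): poll each feed on a schedule and compare the newest item against whatever was stored last time.

<?php
// Minimal sketch: read each blog's RSS feed and show the latest post.
// The feed URLs are placeholders for the friends' blogs.
$feeds = array(
    "http://example-blog-one.com/feed",
    "http://example-blog-two.com/rss.xml",
);

foreach ($feeds as $feedUrl) {
    $rss = @simplexml_load_file($feedUrl);   // requires allow_url_fopen
    if ($rss === false || !isset($rss->channel->item)) {
        continue;
    }
    $latest = $rss->channel->item[0];
    // A real portal would compare this against the last stored post per feed
    // (for example in MySQL) to detect genuinely new entries.
    echo "<h3>" . htmlspecialchars((string) $latest->title) . "</h3>";
    echo "<p>" . htmlspecialchars((string) $latest->description) . "</p>";
    echo "<a href='" . htmlspecialchars((string) $latest->link) . "'>Read more</a><br>";
}
?>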

View 1 Replies View Related

Image Crawler

Mar 3, 2006

How do I script an image crawler? I'm developing on Windows with PHP 4. Is it true that images can only be manipulated easily with PHP 5?
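
As a rough sketch of the crawling half (independent of which PHP version handles the image manipulation), collecting image URLs is mostly a matter of pulling the src attributes out of img tags; the URL below is a placeholder:

<?php
// Minimal sketch: collect the image URLs referenced by one page.
$html = @file_get_contents("http://www.example.com");   // placeholder URL
if ($html !== false && preg_match_all('/<img[^>]+src="([^"]+)"/i', $html, $m)) {
    foreach ($m[1] as $src) {
        echo htmlspecialchars($src) . "<br>";
        // copy($src, basename($src)); // optionally download each image
    }
}
?>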

View 1 Replies View Related

Crawler Identifier

Jun 26, 2006

I am running a website with functions that collect information about users' preferences on that website. But crawlers often come to my site and my scripts insert records about their visits. Is there a quick and easy way to identify crawlers so I can ignore their visits?
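
A hedged sketch of one quick check: match the user agent against a (necessarily incomplete) list of common bot names before inserting the record.

<?php
// Minimal sketch: skip logging when the user agent looks like a known bot.
// The list is incomplete by nature and needs occasional maintenance.
function looksLikeCrawler($userAgent) {
    $bots = 'googlebot|bingbot|msnbot|slurp|yandex|baiduspider|crawler|spider|bot';
    return (bool) preg_match("/($bots)/i", $userAgent);
}

$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!looksLikeCrawler($agent)) {
    // insert the preference record here
}
?>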

View 3 Replies View Related

Keep Crawler In One Domain?

May 21, 2011

I'm writing a simple PHP crawler, essentially a class which recursively crawls a website by detecting link tags and going deeper. The problem is that I would like to contain it within the domain it's crawling, otherwise it will start following links to other domains and set off a never-ending chain reaction. My idea was to put together a regex that would dissect the bare domain name of the website (e.g. domain.com) and check every link against it. The regex itself would have to be quite long, since I would have to include all TLDs in it.

UPDATE: parse_url is not a solution on its own; it can only give the HOST name, not the DOMAIN name.
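
A hedged sketch of the comparison, assuming a simplified notion of "domain" (the last two host labels); a robust version would need the public suffix list to handle multi-part TLDs such as .co.uk.

<?php
// Minimal sketch: keep only links whose registrable domain matches the start URL.
// Simplification: take the last two labels of the host, which is wrong for
// multi-part suffixes like .co.uk, where the public suffix list is needed.
function baseDomain($url) {
    $host = parse_url($url, PHP_URL_HOST);
    if ($host === null || $host === false) {
        return null;
    }
    $labels = explode('.', $host);
    return implode('.', array_slice($labels, -2));
}

$start = "http://www.domain.com/";
$link  = "http://blog.domain.com/post/1";
if (baseDomain($link) !== null && baseDomain($link) === baseDomain($start)) {
    // same domain: safe to follow
}
?>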

View 1 Replies View Related

Parse A Url For Crawler?

Sep 6, 2010

I am writing a small crawler that extracts from some 5 to 10 sites. While getting the links, I am getting some URLs like this: ../test/index.html. If it were /test/index.html, we could simply prepend the base URL to get http://www.example.com/test/index.html.
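
A hedged sketch of resolving such relative links against the page they were found on (the function name is just for illustration):

<?php
// Minimal sketch: turn a relative href into an absolute URL.
function resolveUrl($base, $href) {
    if (preg_match('#^https?://#i', $href)) {
        return $href;                                // already absolute
    }
    $parts  = parse_url($base);
    $scheme = $parts['scheme'] . '://' . $parts['host'];
    if ($href !== '' && $href[0] === '/') {
        return $scheme . $href;                      // root-relative
    }
    // Directory-relative: resolve against the base path, collapsing "../".
    $dir = isset($parts['path']) ? dirname($parts['path']) : '/';
    $segments = array();
    foreach (explode('/', $dir . '/' . $href) as $seg) {
        if ($seg === '' || $seg === '.') continue;
        if ($seg === '..') { array_pop($segments); continue; }
        $segments[] = $seg;
    }
    return $scheme . '/' . implode('/', $segments);
}

echo resolveUrl("http://www.example.com/a/b/page.html", "../test/index.html");
// prints http://www.example.com/a/test/index.html
?>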

View 3 Replies View Related

What Is The Working Of Web Crawler?

Aug 17, 2010

Will a web crawler crawl the web and create a database of the web, or will it just create a searchable index of the web? If it creates an index, who exactly gathers the data of the web pages and stores it in the database?

View 1 Replies View Related

PHP: BOT, Web, Crawler, Spider ?

Sep 5, 2007

I am looking for how to make one of these, but my searches on Google and other search engines have turned up nothing. I'm trying to make a bot that is capable of logging into a site, storing cookies and moving between two pages, to keep me logged in

while I'm not actually on the page. This is not going to be used for 'cheating' purposes of any kind; I merely want to extend my logged-in time on a site, which is actually my school page.
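
A hedged sketch of the cookie-handling part with cURL; the login URL and form field names are placeholders for whatever the target site actually uses:

<?php
// Minimal sketch: log in once, keep the session cookies in a jar file, then
// revisit pages with the same jar so the session stays alive.
$cookieJar = '/tmp/bot_cookies.txt';

$ch = curl_init('http://www.example.com/login.php');   // placeholder URL
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'username' => 'myuser',        // placeholder field names
    'password' => 'mypass',
)));
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);   // write cookies here
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);  // and send them back
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);

// Later requests reuse the same jar, which is what keeps you "logged in".
$ch = curl_init('http://www.example.com/profile.php');
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
curl_close($ch);
?>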

View 1 Replies View Related

Checking If Referrer Is Web Crawler

Jun 28, 2006

I have a book affiliate website. Whenever a visitor clicks on one
of the books, a script adds one to a field in a mysql database and then
takes the visitor to the shopping basket on the book website.

I have noticed that the book links are getting lots of hits. At first, I
was pleased about the potential income this might mean - but then it
occurred to me that many of these hits are web crawlers (this was
confirmed by Webalizer).

Any suggestions for ways of checking whether the link is being "clicked" by a
web crawler, so that I don't increment the field in the SQL database?

I've checked HTTP_REFERER but it seems to be empty for what I assume
are crawled clicks.

View 4 Replies View Related

Visiting Other Sites (crawler)

Dec 16, 2001

Is it possible to visit other sites in PHP, from within the code? I need this for a crawler for my search engine.

View 2 Replies View Related

Website Crawler And Indexer

May 11, 2007

I am trying to create a crawler and indexer for my site and its search page. What I want to know is: is there an easy way for me to extract each link from a page, or is it possible to do this with a PHP function? I am doing it this way because I am going to have a crawler that logs all the links within my site, and then an indexer will go along and index each page and its contents.

View 3 Replies View Related

Make A Simple Crawler?

Feb 22, 2010

I have a web page with a bunch of links. I want to write a script which would dump all the data from those linked pages into a local file.

Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.
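
A hedged sketch, assuming plain file_get_contents is available (allow_url_fopen enabled): fetch each linked page and append it to one local file. The start URL and output filename are placeholders.

<?php
// Minimal sketch: fetch every absolute link found on one page and dump the
// responses into a single local file.
$startUrl = "http://www.example.com/";     // placeholder
$outFile  = "dump.txt";

$html = @file_get_contents($startUrl);
if ($html !== false &&
    preg_match_all('/<a[^>]+href="(https?:\/\/[^"]+)"/i', $html, $m)) {
    foreach (array_unique($m[1]) as $link) {
        $page = @file_get_contents($link);
        if ($page === false) {
            continue;                      // skip pages that fail to load
        }
        file_put_contents($outFile, "==== $link ====\n" . $page . "\n", FILE_APPEND);
        sleep(1);                          // be polite to the remote server
    }
}
?>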

View 6 Replies View Related

Web Crawler For Competitive Pricing

Jan 18, 2011

I am thinking of writing an application that will pseudo-track competing websites to ensure that our prices stay competitive, etc. I looked at possibly using the Google Shopping Search API, but I felt that it could be lacking in flexibility, and not all of our competitors are fully listed or updated regularly. My question is: where is a good place to start with a PHP-based web crawler? I obviously want a crawler that is respectful (even to our competitors), so it should obey robots.txt and throttle its requests. (To be fair, I think I am even going to host this on a third-party server and have it crawl our websites as well, to show no bias.) I looked around via Google and I couldn't find any mature packages -- only some poorly written SourceForge scripts that haven't been maintained in over a year, despite being labeled as beta or alpha.
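
As a starting point while no mature package is settled on, a hedged sketch of the "respectful" behaviour: a deliberately simplified robots.txt check (a real crawler should honour per-user-agent sections, wildcards and Crawl-delay) plus a fixed pause between requests. The competitor URL and paths are placeholders.

<?php
// Minimal sketch: a very simplified robots.txt check plus throttling.
// Only blanket "Disallow:" lines are honoured here; robots.txt is re-fetched
// on every call purely for brevity.
function isPathAllowed($baseUrl, $path) {
    $robots = @file_get_contents(rtrim($baseUrl, '/') . '/robots.txt');
    if ($robots === false) {
        return true;                       // no robots.txt: assume allowed
    }
    foreach (preg_split('/\r?\n/', $robots) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)) {
            if ($m[1] !== '' && strpos($path, $m[1]) === 0) {
                return false;
            }
        }
    }
    return true;
}

$base = "http://www.example-competitor.com";   // placeholder
foreach (array("/products/page1", "/products/page2") as $path) {
    if (!isPathAllowed($base, $path)) {
        continue;
    }
    $html = @file_get_contents($base . $path);
    // ... extract prices from $html here ...
    sleep(5);                              // throttle between requests
}
?>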

View 1 Replies View Related

Get Data From Crawler To Site

Jun 15, 2009

What is the best way to get data from an external crawler into my database and onto my site? I work in a LAMP environment. Is a web service a good idea? The crawler runs every 15 minutes.
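
A hedged sketch of the web-service option: the crawler POSTs its results as JSON to a small PHP endpoint, which inserts them into MySQL. Credentials, table and field names are placeholders, and the 'key' parameter is only a stand-in for real authentication.

<?php
// Minimal sketch of a receiving endpoint (e.g. ingest.php) that the external
// crawler calls every 15 minutes with a JSON array of results.
if (!isset($_GET['key']) || $_GET['key'] !== 'shared-secret') {
    header('HTTP/1.1 403 Forbidden');
    exit('forbidden');
}

$items = json_decode(file_get_contents('php://input'), true);
if (!is_array($items)) {
    header('HTTP/1.1 400 Bad Request');
    exit('bad request');
}

$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO crawled_items (url, title, fetched_at) VALUES (?, ?, NOW())');
foreach ($items as $item) {
    $stmt->execute(array($item['url'], $item['title']));
}
echo 'ok';
?>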

View 1 Replies View Related

Create A Web Crawler In Python?

Jul 29, 2009

I created a web crawler in Python that gets data from specific parts of websites and stores it in a MySQL database, which is later displayed on my website. However, when I display the data on my website it appears with weird characters, like this: "After many years of theft, there�s still more to steal" and "Here�s how to reclaim forests" - notice the question mark in the triangle. When I used the function mb_detect_encoding, it told me the data is ASCII, yet the default collation is latin1_swedish_ci; but when I save the data in the database I override the default and use UTF-8 instead. Please tell me what could be wrong.
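
One common cause, offered only as a hedged guess, is that the MySQL connection charset and the page's declared charset disagree. A sketch of pinning both to UTF-8 on the PHP display side (credentials and table name are placeholders):

<?php
// Minimal sketch: make the database connection and the HTML output agree on
// UTF-8 so curly quotes and other non-ASCII characters survive intact.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');   // placeholder credentials
$db->set_charset('utf8');                 // connection charset, not just collation

header('Content-Type: text/html; charset=utf-8');

$result = $db->query('SELECT title FROM articles');      // placeholder table
while ($row = $result->fetch_assoc()) {
    echo htmlspecialchars($row['title'], ENT_QUOTES, 'UTF-8') . '<br>';
}
?>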

View 1 Replies View Related

Build A Web Crawler Such As MLBot?

Jun 8, 2010

I am looking to build a web crawler such as MLBot. It must recognise robots.txt and the ROBOTS meta tag. That said, when a site such as WordPress shows visitor stats, it lists the crawler (e.g. MLBot [URL]....). So how can I build a crawler that will be listed as HackAliveBot or HackAliveCrawler and will recognise robots.txt and the ROBOTS meta tag?

View 1 Replies View Related

Teaching A Crawler To Identify A Blog

Jul 17, 2005

I am currently trying to teach a web crawler how to identify blogs,
that is I am trying to determine a fairly inclusive set of criteria
that will help my crawler to identify them.

I have noticed that many Blogs include

div class=blogsomething (a format class conveniently named blog)

xml tags

and/or php code.

I do know that a CMS (content management system) is used for several
blogs; does anyone else have any suggestions to help me determine
criteria?

I am aware that any criteria are subjective, especially when
considering sites such as Slashdot, which has been around longer than
blogs...
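
A hedged sketch of one such heuristic check; the markers are illustrative only, not a definitive rule set.

<?php
// Minimal sketch: score a fetched page against a few common blog markers.
function looksLikeBlog($html) {
    $score = 0;
    // Generator meta tag emitted by common blog software.
    if (preg_match('/<meta[^>]+name=["\']generator["\'][^>]+content=["\'][^"\']*(wordpress|blogger|movable type|typepad)/i', $html)) {
        $score += 2;
    }
    // Class names that contain "blog", as noted above.
    if (preg_match('/class=["\'][^"\']*blog/i', $html)) {
        $score += 1;
    }
    // Feed autodiscovery links are very common on blogs.
    if (preg_match('/(rss|atom)\+xml/i', $html)) {
        $score += 1;
    }
    return $score >= 2;
}
?>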

View 2 Replies View Related

Crawler For Ajax Based Websites?

May 20, 2011

Maybe this is going to sound naive, but is there anything even remotely close to a PHP crawler for AJAX-based websites?

View 2 Replies View Related

Creating A Simple Site Crawler

Apr 21, 2009

Basically, for my final year project I am making a webcomic site. But I wanted a feature that tells you when comics hosted on external servers (such as Explosm, Penny Arcade, etc.) are updated.

I know this can be done; it's done on [URL]. I've tried looking into the technology needed, and as far as I can see, I need cURL.

I was just wondering if anyone could point me in the right direction to a tutorial or pseudo-code, or say whether this is possible to implement (at least for a few major comics) by a week on Wednesday.

And if anyone is interested, my site is [URL], still heavily under construction.
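
A hedged sketch of one cURL-based approach: issue a HEAD request and compare the Last-Modified time against the value stored from the previous check (not every server sends it, in which case hashing the page body is a fallback). The comic URL is a placeholder.

<?php
// Minimal sketch: detect whether an external comic page changed since the
// last check by comparing its Last-Modified timestamp.
function lastModified($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FILETIME, true);
    curl_exec($ch);
    $time = curl_getinfo($ch, CURLINFO_FILETIME);    // -1 if not provided
    curl_close($ch);
    return $time;
}

$comicUrl   = "http://www.example-comic.com/";       // placeholder
$storedTime = 1240000000;                            // value saved last time

$current = lastModified($comicUrl);
if ($current > 0 && $current > $storedTime) {
    echo "Comic updated!";
    // store $current for the next check
}
?>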

View 3 Replies View Related

Simple Crawler To Echo Links?

Jul 6, 2011

I wanted to make a simple crawler in PHP that would let me get the links in a web page, echo their URLs, and crawl to other pages to do the same under a certain domain. Would using cURL be necessary here? Also, how would one specify the depth of the crawler? I have this so far:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach( $dom->getElementsByTagName('a') as $node ) {
[code]....
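
A hedged sketch of adding a depth limit to the DOMDocument approach above; cURL is not strictly required if allow_url_fopen is enabled. The depth is simply passed down one level at a time.

<?php
// Minimal sketch: recursive crawl with an explicit depth limit, staying on
// the same host as the starting URL.
function crawl($url, $depth, &$seen) {
    if ($depth <= 0 || isset($seen[$url])) {
        return;
    }
    $seen[$url] = true;
    echo $url . "<br>";

    $html = @file_get_contents($url);
    if ($html === false) {
        return;
    }

    $dom = new DOMDocument;
    @$dom->loadHTML($html);                    // tolerate messy real-world HTML
    foreach ($dom->getElementsByTagName('a') as $node) {
        $href = $node->getAttribute('href');
        // Follow only absolute links on the same host in this simple version.
        if (strpos($href, 'http') === 0 &&
            parse_url($href, PHP_URL_HOST) === parse_url($url, PHP_URL_HOST)) {
            crawl($href, $depth - 1, $seen);
        }
    }
}

$seen = array();
crawl("http://www.example.com/", 2, $seen);    // crawl two levels deep
?>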

View 1 Replies View Related

Mp3 Link Crawler For Dynamic Links?

Mar 18, 2010

I am writing a crawler that will go around a specific set of websites and crawl all the MP3 links into the database. I don't want to download the files, just crawl the links, index them and be able to search them, using PHP, like some sites do [URL]....

View 1 Replies View Related

Crawler Index/crawl Session?

Mar 11, 2011

I am new to PHP and want to know: if I store data in a PHP session on a page, will crawlers crawl the data in the sessions? Will a crawler still crawl the rest of the page?

View 1 Replies View Related

Site Crawler Died While It's Running?

Jun 27, 2011

I wrote a site crawler that collects links and images to create a site map, but it gets killed while running! This is not my whole class:

class pageCrawler {
    .......
    private $links = array();

    public function __construct($url) {
        ignore_user_abort(true);
        set_time_limit(0);
        register_shutdown_function(array($this, 'callRegisteredShutdown'));
        $urlParts = parse_url($url);   // $urlParts was used below but never set
        $this->host = $urlParts['host'];

[Code]...

That is the general shape of my class. It works, but then it suddenly crashes. I set set_time_limit(0) so it can run forever, but my process doesn't finish cleanly because my shutdown function doesn't execute!

View 1 Replies View Related

Searching Flash Crawler Script?

Jul 16, 2009

I am searching for a Flash crawler script in PHP, but I have not found one in the last two days.

View 1 Replies View Related

Crawler Script Suddenly End With No Error?

Oct 24, 2009

I have written a web crawler script. It visits a large number of URLs with cURL.

After around 2-3 minutes of running, it will just stop, with no error output or notices.

I have these settings:

set_time_limit(0);
ini_set('display_errors', 1);
error_reporting(E_ALL | E_STRICT);
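
With display_errors on and nothing printed, one common (though by no means certain) culprit is memory exhaustion or another fatal error that never reaches the screen. A sketch of extra settings that can help surface what happened, not a guaranteed fix:

<?php
// Sketch of settings that may reveal why a long-running crawler stops.
ini_set('memory_limit', '512M');          // exhausted memory is a frequent silent killer
ini_set('log_errors', 1);
ini_set('error_log', dirname(__FILE__) . '/crawler_errors.log');

function logFatalOnShutdown() {
    $e = error_get_last();
    if ($e !== null) {
        // Fatal errors bypass normal handlers but are still visible here.
        file_put_contents(dirname(__FILE__) . '/crawler_errors.log',
            print_r($e, true), FILE_APPEND);
    }
}
register_shutdown_function('logFatalOnShutdown');
?>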

View 6 Replies View Related

Showing Crawler Agent Display Name In Forums Etc.?

Oct 6, 2006

My crawler works, but it does not show its agent display name in forums and so on; it shows up only as 'Unnamed Spider', 'Unknown Spider', 'Unknown Crawler', etc.

View 2 Replies View Related

Write A Web Crawler For Specific User Agent?

May 14, 2011

I need to write a web crawler, and want to be able to crawl using a known user agent. For example, I want my crawler to act as an iPhone to crawl the mobile site of a website, then crawl again using a Mozilla PC agent, etc.

That way, I'll be able to crawl every "type" of site (mobile & PC). However, I also want to be able to set my crawler's user agent, so webmasters also see in their stats that it's a crawler that visited their whole website, not real users.

So my question is, do you guys know how to set a mobile agent + a crawler agent at the same time, in PHP?
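
A hedged sketch: with cURL the user agent is a free-form string, so one option is to send the device signature and an explicit crawler identifier together. The crawler name and info URL below are placeholders.

<?php
// Minimal sketch: crawl the same URL with different user agents, each one
// carrying both a device signature and a crawler identifier.
function fetchAs($url, $userAgent) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$url = "http://www.example.com/";   // placeholder

$mobileHtml  = fetchAs($url,
    "Mozilla/5.0 (iPhone; CPU iPhone OS 4_0 like Mac OS X) MyCrawler/1.0 (+http://example.com/bot)");
$desktopHtml = fetchAs($url,
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20100101 Firefox/4.0 MyCrawler/1.0 (+http://example.com/bot)");
?>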

View 3 Replies View Related

Crawler Coding: Determine If Pages Have Been Crawled?

Aug 27, 2010

I am working on a crawler in PHP that expects m URLs, at each of which it finds a set of n links to n internal pages that are crawled for data. Links may be added to or removed from the set of n links. I need to keep track of the links/pages so that I know which have been crawled, which ones are removed and which ones are new. How should I go about keeping track of which m and n pages are crawled, so that the next crawl fetches new URLs, re-checks still-existing URLs and ignores obsolete URLs?
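
A hedged sketch of one bookkeeping scheme, with placeholder table and column names: stamp every URL with the id of the crawl run that last saw it. After a run, anything not stamped with the current run id is obsolete, and anything first seen during the run is new.

<?php
// Minimal sketch: track crawled URLs per run so new, still-existing and
// obsolete pages can be told apart. The url column is assumed to have a
// UNIQUE index so ON DUPLICATE KEY UPDATE applies.
$pdo   = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
$runId = time();                            // identifier for this crawl run

function recordUrl($pdo, $url, $runId) {
    // Insert the URL if unseen, otherwise just stamp it with this run.
    $stmt = $pdo->prepare(
        'INSERT INTO pages (url, first_seen_run, last_seen_run)
         VALUES (:url, :first_run, :last_run)
         ON DUPLICATE KEY UPDATE last_seen_run = :update_run');
    $stmt->execute(array(
        ':url'        => $url,
        ':first_run'  => $runId,
        ':last_run'   => $runId,
        ':update_run' => $runId,
    ));
}

// ... call recordUrl() for every page discovered during the crawl ...

// Pages not stamped with this run id were removed from the source sites.
$obsolete = $pdo->prepare('SELECT url FROM pages WHERE last_seen_run < ?');
$obsolete->execute(array($runId));

// Pages first seen in this run are new.
$new = $pdo->prepare('SELECT url FROM pages WHERE first_seen_run = ?');
$new->execute(array($runId));
?>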

View 1 Replies View Related

Crawler - Rebuild Safari Web Clip Functionality?

Apr 6, 2010

Is there a way to rebuild Mac OS X Snow Leopard's Dashboard widget 'Web Clip' on a PHP website? Something like a crawler or scraper. I thought about using file_get_contents to get the page content into the page, but how do I select a section of the external page? And does this work with session/login content as well?
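
A hedged sketch of the "select a section" part using DOMXPath; the URL and the id of the wanted element are placeholders. Content behind a login would additionally need the session cookies to be sent with the request (for example via cURL's cookie options).

<?php
// Minimal sketch: fetch an external page and cut out one section of it.
$html = @file_get_contents('http://www.example.com/page');   // placeholder URL
if ($html !== false) {
    $dom = new DOMDocument;
    @$dom->loadHTML($html);                     // tolerate messy markup
    $xpath = new DOMXPath($dom);

    // Select the element that should become the "web clip".
    $nodes = $xpath->query('//div[@id="content"]');          // placeholder id
    if ($nodes->length > 0) {
        // saveHTML() with a node argument needs PHP 5.3.6+; on older versions
        // the node can be imported into a fresh DOMDocument first.
        echo $dom->saveHTML($nodes->item(0));
    }
}
?>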

View 1 Replies View Related

Hyperlink - Web Crawler Links/page Logic?

Dec 11, 2008

I'm writing a basic crawler that simply caches pages with PHP. All it does is use file_get_contents to get the contents of a web page and a regex to get all the links out (<a href="URL">DESCRIPTION</a>). At the moment it returns:

Array {
[url] => URL
[desc] => DESCRIPTION
}

The problem I'm having is figuring out the logic behind determining whether a page link is local, or working out whether it may be in a completely different local directory. It could be any number of combinations, e.g. href="../folder/folder2/blah/page.html" or href="google.com" or href="page.html" - the possibilities are endless.

View 3 Replies View Related

Web Crawler - Find External Links And Get Data?

Aug 15, 2010

Possible Duplicate: Finding and Printing all Links within a DIV. I'm trying to make a mini crawler. When I specify a site, it does file_get_contents() and then gets the data I want, which I've already done. Now I want to add code that enables it to find any external links on the site it is on and get the data. Basically, instead of me specifying a site, it just follows external links and gets the data if available. Here is what I have:

<?php
$link = strip_tags($_GET['s']);
[code]....

View 2 Replies View Related

Make A Crawler To Fetch Particular Web Page's Content?

Jan 3, 2008

I am trying to make a crawler that crawls a web page and retrieves the stock information from Google, but I can't get it to work.

View 5 Replies View Related

Crawler Mandatory Agecheck Page In Drupal?

Aug 19, 2009

We have a big community website built in Drupal, where the site has a mandatory age check before you can access the content of the website. It checks for a cookie to be present; if not, you get redirected to the age-check page. Now we believe crawlers get stuck on this part: they get redirected to the age check and never get to crawl the full website. What would be the best way to deal with something like this?

One of the issues with crawlers is also that when someone in the community posts something to his wall on Facebook, Facebook crawls the page back to fetch images and a description (which are specified in meta tags), but Facebook also gets redirected to the age-check page. Would a user-agent check work if I add the Facebook crawler? If so, would anyone know the Facebook crawler's exact name? The solution below is one that we also came across on the net; if adding the Facebook crawler to that list works, then it would solve all the problems we are having with this age-check page.

View 2 Replies View Related

Web Crawler - Doesn't Update Page Until Finishing Loading

May 7, 2011

I want to write a crawler script with PHP, and it is necessary to show the pages it is indexing, live. However, PHP doesn't update the page in real time: sometimes it writes a few echoes together and waits until loading finishes, and sometimes nothing appears on the page until loading finishes. Here is an example of what I'm talking about:

<?php
echo '1<br>';
sleep(2);
echo '2<br>';
sleep(2);
echo '3<br>';
sleep(2);
echo '4<br>';
?>

I tried this on WAMP and LAMP and the results were the same. Is there any way to show the echoes in real time?

Note: I found an online crawler which has this feature: [URL]
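
A hedged sketch of the usual workaround: explicitly flush PHP's output buffers after each echo. Server-side buffering (gzip, FastCGI, proxies) can still delay output, so it may also need web-server configuration.

<?php
// Sketch: push each progress line to the browser as soon as it is echoed.
@ini_set('zlib.output_compression', false);   // must happen before any output
while (ob_get_level() > 0) {
    ob_end_flush();                      // drop any existing output buffers
}
ob_implicit_flush(true);

for ($i = 1; $i <= 4; $i++) {
    echo $i . '<br>';
    echo str_repeat(' ', 4096);          // some browsers wait for a minimum amount of data
    flush();
    sleep(2);
}
?>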

View 2 Replies View Related

Redirection Affects Way Crawler Or Robot Views Website?

Aug 25, 2010

For example, if in my index.php I have something like:

<?php
header('Location: /mypublicsite/index.php');
?>

What do the crawlers and/or robots get? Just a blank page? Or do they actually arrive at /mypublicsite/index.php?

View 4 Replies View Related

Web Crawler - Search Certain Sites (remote) For Certain Type Of Files

Jul 11, 2007

I want to search certain (remote) sites for certain types of files. I don't know where to start.
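
A hedged sketch of one building block: extract the links from a fetched page and keep only those whose extension matches the file types of interest. The URL and extension list are just examples.

<?php
// Minimal sketch: collect links to particular file types from one page.
$wanted = array('pdf', 'mp3', 'zip');     // example extensions
$html   = @file_get_contents('http://www.example.com/');   // placeholder URL

$found = array();
if ($html !== false && preg_match_all('/href="([^"]+)"/i', $html, $m)) {
    foreach ($m[1] as $href) {
        $path = parse_url($href, PHP_URL_PATH);
        $ext  = $path ? strtolower(pathinfo($path, PATHINFO_EXTENSION)) : '';
        if (in_array($ext, $wanted)) {
            $found[] = $href;
        }
    }
}
print_r($found);
?>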

View 4 Replies View Related

Crawler - Generate A List Of All The Pages Contained In A Website Programmatically?

Jan 28, 2010

How is it possible to generate a list of all the pages of a given website programmatically using PHP?

What I'm basically trying to achieve is to generate something like an sitemap, in nested unordered list with links for all the pages contained in a website.

View 2 Replies View Related

Protect Files In Uploaded Folder In Website From Google Crawler?

Feb 6, 2011

Should I protect the files in the upload folder of my website from the Google crawler? I upload pictures that are meant to be bought, as on eBay, and my programming language is PHP.

View 3 Replies View Related

Build A Crawler Which Can Read Data From Around 20 Daily Deal Websites?

Dec 28, 2010

Specifically, how do I build a crawler which can read data from around 20 daily deal websites and display it on the clone site?

View 1 Replies View Related

Custom - Local Or Server Based Crawler That Detects The Version Of A Script

Aug 18, 2010

I have custom scripts that were written around the transition from PHP 4 to PHP 5, and I am looking for an easy way to find all of the PHP 4 versions so that I can rewrite them to work with PHP 5.3. Is there a tool that can do this on my local machine or on the server?

View 1 Replies View Related

Crawler Detection / Regular Readers To Be Shown An HTML Sitemap On The Page

Jul 24, 2009

I'm trying to write a sitemap.php which acts differently depending on who is looking.

I want to redirect crawlers to my sitemap.xml, as that will be the most updated page and will contain all the info they need, but I want my regular readers to be shown an HTML sitemap on the PHP page.

This will all be controlled from within the PHP header, and I've found this code on the web which by the looks of it should work, but it isn't.

function getIsCrawler($userAgent) {
    $crawlers = 'firefox|Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
        'AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|' .
        'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    $isCrawler = (preg_match("/$crawlers/i", $userAgent) > 0);
code....

It looks pretty simple, and as you can see I've added firefox to the agent list, but sure enough I'm not being redirected..
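
For completeness, a hypothetical sketch of how such a function is usually wired up; the elided part of the snippet above presumably returns $isCrawler and the caller performs the redirect, so this is only an assumption about the missing code:

<?php
// Hypothetical wiring: the function returns the result and the caller
// redirects before any other output is sent.
function getIsCrawler($userAgent) {
    $crawlers = 'firefox|Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
        'AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|' .
        'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    return preg_match("/$crawlers/i", $userAgent) > 0;
}

$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (getIsCrawler($agent)) {
    header('Location: sitemap.xml');   // header() must be called before any output
    exit;
}
// otherwise render the HTML sitemap for human readers
?>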

View 3 Replies View Related

Identifying Web Crawler / Search Engine Bots Like Google, Yahoo Using Basic Functions?

Nov 3, 2010

How do I determine whether a visitor is a search engine bot, e.g. Googlebot, or not?

View 1 Replies View Related
