PHP: BOT, Web, Crawler, Spider ?

Sep 5, 2007

I am looking for how to make one of these, but my searches on Google and other search engines have turned up nothing, and I have yet to produce the results I need. I'm trying to make a bot that is capable of logging into a site, storing cookies and moving between two pages, to keep me logged in while I'm not actually on the page. This is not going to be used for 'cheating' purposes of any kind. I merely want to fake my logged-in time on a site, which is actually my school page.
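For what it's worth, one common way to approach this is the cURL extension with a cookie jar; here is a rough sketch (the login URL and the form field names are placeholders and would need to match the real login form):

<?php
// Rough sketch: log in with cURL, keep the session cookie in a jar file,
// then revisit pages so the session stays alive. The login URL and the
// field names (user/pass) are placeholders for the real form.
$jar = "/tmp/cookies.txt";

$ch = curl_init("http://www.example.com/login.php");
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, "user=myname&pass=mypassword");
curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);    // store cookies here
curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);   // send them back on later requests
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);

// Later (for example from a cron job), fetch the two pages using the same jar.
foreach (array("http://www.example.com/page1.php", "http://www.example.com/page2.php") as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);
}
?>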


PHP Based Web Crawler Or Java Based Web Crawler?

Jul 27, 2010

I have some doubts about PHP-based web crawlers. Can one run like a Java thread-based crawler? I ask because in Java a thread can be executed again and again, and I don't think PHP has anything like a thread function. Which web crawler will be more useful, a PHP-based one or a Java-based one?
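PHP doesn't have threads in the Java sense, but for a crawler the usual substitute is curl_multi, which fetches several pages concurrently in one process; a rough sketch (the URLs are placeholders):

<?php
// Rough sketch: fetch several URLs in parallel with curl_multi,
// PHP's usual substitute for thread-per-request crawling.
$urls = array("http://www.example.com/", "http://www.example.org/");

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all transfers until they finish.
$running = 0;
do {
    curl_multi_exec($mh, $running);
} while ($running > 0);

foreach ($handles as $url => $ch) {
    echo $url . ": " . strlen(curl_multi_getcontent($ch)) . " bytes\n";
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>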

View 2 Replies View Related

Like A Spider Or Bot

Mar 6, 2000

I want to get all the emails from a site. Is it possible to do that in PHP3 with MySQL?

View 2 Replies View Related

Spider?

Sep 25, 1999

I programmed a little link database. Now I want to build a script which takes the URLs out of the DB and tests whether each site is still available or not. Any ideas how I could do that?
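One simple way to test each stored URL, sketched here with get_headers() (available since PHP 5; on older PHP you would use fsockopen instead):

<?php
// Rough sketch: fetch just the response headers and treat 2xx/3xx as "alive".
function site_is_up($url) {
    $headers = @get_headers($url);          // false if the host is unreachable
    if ($headers === false) {
        return false;
    }
    // First header line looks like "HTTP/1.1 200 OK".
    return (bool) preg_match('#^HTTP/\S+\s+[23]\d\d#', $headers[0]);
}

var_dump(site_is_up("http://www.example.com/"));
?>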

View 1 Replies View Related

How Do You Know That Something Is NOT A Spider?

Jul 21, 2005

I would like spiders to find a particular PHP page, but within the page an email is sent to advise me it's been accessed.

I'd like to NOT get the email every time the page is crawled, so it was suggested I list all the bots I know.

I thought it would be simpler to send the email only when the UA starts with "Mozilla". Is it that simple, or are there other starts to the UA string for browsers?

View 1 Replies View Related

Spider A Url

Sep 1, 2007

I'm looking for a simple PHP script to spider a URL and get information on the links from that page. Does anyone have any ideas of where to look for such a script?
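A minimal sketch of one way to pull the links out of a single page with DOMDocument (PHP 5) rather than regex; the URL is a placeholder:

<?php
// Rough sketch: fetch a page and list every link (href + anchor text).
$url  = "http://www.example.com/";
$html = file_get_contents($url);            // needs allow_url_fopen

$doc = new DOMDocument();
@$doc->loadHTML($html);                     // suppress warnings from messy markup

foreach ($doc->getElementsByTagName('a') as $a) {
    echo $a->getAttribute('href') . " => " . trim($a->textContent) . "\n";
}
?>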

View 3 Replies View Related

PHP Spider/Bot Detection

May 7, 2002

I was just wondering if anyone had any PHP code that could detect a bot/spider crawling your site, similar to browser detection.
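A common (if imperfect) approach is to match the User-Agent string against known crawler keywords; a rough sketch, with an illustrative rather than exhaustive list:

<?php
// Rough sketch: crude bot detection by User-Agent keyword.
// The keyword list is illustrative, not exhaustive, and UAs can be faked.
function looks_like_bot() {
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
    $keywords = array('bot', 'crawl', 'spider', 'slurp', 'archiver');
    foreach ($keywords as $word) {
        if (strpos($ua, $word) !== false) {
            return true;
        }
    }
    return $ua === '';   // many crawlers send no UA at all
}
?>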

View 1 Replies View Related

Where To Get 'website Spider' Bot For PHP?

Feb 5, 2005

Hi, I have a somewhat weird request, but I have a legitimate reason. My new CMS is having performance problems, and we are trying to work around the issue by implementing a bot or spider that grabs the HTML of the site every 30 seconds.

I was wondering if you could kindly tell me where to get such a bot. The program will be installed on another server with a cron job to run every 15 seconds.

The 'website performance' check services out there are limited to 2-minute intervals. I need something that hits the target site more frequently.
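Rather than an external service, a tiny PHP script run from cron can do this; a rough sketch (the target URL is a placeholder, and since cron itself only fires once per minute, sub-minute intervals are handled by sleeping inside the script):

<?php
// Rough sketch, meant to be run from cron: fetch the page twice per minute.
$target = "http://www.example.com/";        // placeholder URL

for ($i = 0; $i < 2; $i++) {
    $start = microtime(true);
    $html  = @file_get_contents($target);
    $ms    = round((microtime(true) - $start) * 1000);
    file_put_contents("warmup.log", date("c") . " $ms ms, " . strlen($html) . " bytes\n", FILE_APPEND);
    sleep(30);
}
?>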

View 1 Replies View Related

URL Contains Sessionids, Spider In SE?

Apr 11, 2003

I have an article directory, and each of the articles submitted has session IDs in its URL, for example,

http://www.listbuildersuccess.com/i...m?cat_id=4&id=1

http://www.listbuildersuccess.com/i...m?cat_id=4&id=2

and so on. What can I do so that every article submitted to our article directory can be indexed and spidered by search engines?
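If the session IDs are coming from PHP's own trans-sid feature, turning that off (and relying on cookies) usually removes the duplicate, crawler-hostile URLs; a hedged sketch of the relevant settings, which can also live in php.ini or .htaccess:

<?php
// Rough sketch: stop PHP from appending session IDs to URLs so every
// article has one stable, crawlable address.
ini_set('session.use_trans_sid', 0);     // don't rewrite URLs with the SID
ini_set('session.use_only_cookies', 1);  // ignore SIDs passed in the URL
session_start();
?>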

View 14 Replies View Related

Spider-Friendly URLs

Jan 2, 2002

I know I've seen tutorials on this kind of thing before, but now that I'm ready to do it I can't find one. I know it has something to do with editing the conf file... can anyone point me in the right direction?

I figured out how to change the title tag dynamically, so now search engines would see different tags if they could just spider pages such as http://321webmaster.com/index.php?cat_id=3&subcat_id=79
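The usual trick is an Apache mod_rewrite rule that maps a static-looking URL onto the real query string; a hedged sketch for a URL pattern like the one above (the friendly URL format here is made up, and the PHP page keeps reading cat_id and subcat_id as before):

RewriteEngine On
# Map /category-3/subcategory-79.html onto the real dynamic page.
RewriteRule ^category-([0-9]+)/subcategory-([0-9]+)\.html$ index.php?cat_id=$1&subcat_id=$2 [L]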

View 14 Replies View Related

PHP Spider Trap

Mar 7, 2004

I recently set up my own spider trap after reading about it here. I finally got sick of site-suckers driving up my bandwidth to the point that I had to upgrade my hosting package twice.

Anyway, I don't use Perl much, so I decided to make a PHP trap. It's working nicely, and I just wanted to post it here in case anyone wants to use it.

*Notes:

View 1 Replies View Related

Php To Spider A Website

Jul 17, 2005

I am looking for a script that I can use to spider a website and then pull the images... I know how to do it for a single page, but I would like to be able to do this for the entire site.

View 4 Replies View Related

Spider Problem

Oct 28, 2004

I have built a site with PHP-Nuke and I have a problem with spiders indexing it. Google indexes only my first page. Code:

View 1 Replies View Related

Quick Spider

Sep 23, 2004

I'd like to be able to quickly spider the referring page from which a specific page on my site is accessed, using PHP. I'd then like to extract the title and description from that referring page and display them. Is there a script available that would do that, or most of it?

View 1 Replies View Related

What Are The Limitations Of PHP As A Web Spider?

Aug 8, 2006

What are the limitations of PHP as a web spider? Most PHP spiders I have seen can usually index around 100,000 pages. Can a PHP spider be designed to index millions or even billions of pages? And what are the limitations of MySQL for storing that much information?

View 2 Replies View Related

PHP Spider Script

Sep 7, 2006

I'm just wondering, out of curiosity, how spider/crawler scripts are made. Just the basic setup is all I'm wondering about (and also, how do you get one to "follow" links?). I already searched a couple of times on Google and found some material, but some of it just didn't have the info I was looking for.
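In outline the basic setup is usually: fetch a page, extract its links, add the unseen ones to a queue, and repeat. A very rough sketch (seed URL and page limit are placeholders):

<?php
// Rough sketch of the basic crawl loop: a queue of URLs to visit,
// a list of URLs already seen, and a link extractor.
$queue = array("http://www.example.com/");   // seed page (placeholder)
$seen  = array();
$limit = 50;                                  // stop after this many pages

while ($queue && count($seen) < $limit) {
    $url = array_shift($queue);
    if (isset($seen[$url])) {
        continue;
    }
    $seen[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) {
        continue;
    }
    echo "Crawled: $url\n";

    // "Following" links just means feeding them back into the queue.
    if (preg_match_all('/<a[^>]*href="(http[^"]+)"/i', $html, $m)) {
        foreach ($m[1] as $link) {
            if (!isset($seen[$link])) {
                $queue[] = $link;
            }
        }
    }
}
?>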

View 2 Replies View Related

Detecting A Spider

Jul 24, 2005

I have made a visitor counter for a site that adds a record to the database each time a page is viewed. To make it a little more accurate, I was wondering if there is a way to detect whether the page is being viewed by a search engine spider instead of a human. That way I could use a condition to skip the database update when the visitor is a spider, or mark the record "Spider".

View 2 Replies View Related

Simple Spider

Oct 4, 2007

I reposted this from "Regex within PHP" because I feel this is a PHP problem, not a regex problem. What I am trying to do is start at a pre-defined page, find all the links on that page, run the spider on all of the pages that were found, and return the results from those spidered pages. Code:

View 4 Replies View Related

Server Software Spider?

Sep 21, 2001

What I want to be able to do is spider lots of web sites and return the type of server software they use (like Netcraft does). I want it to be automated, so it might have to follow links on pages in order to get to other servers.
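The server software is reported in the Server response header, so the detection part can be as small as this sketch using PHP 5's get_headers(); the automated link-following would sit on top of it:

<?php
// Rough sketch: read the "Server:" response header for a list of sites.
$sites = array("http://www.example.com/", "http://www.example.org/");

foreach ($sites as $site) {
    // The second argument returns the headers as an associative array.
    $headers = @get_headers($site, 1);
    $server  = ($headers && isset($headers['Server'])) ? $headers['Server'] : 'unknown';
    if (is_array($server)) {    // can be an array if the header repeats across redirects
        $server = end($server);
    }
    echo "$site => $server\n";
}
?>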

View 1 Replies View Related

Spider Trap Clarification

Jul 28, 2005

In the PHP spider trap solution the following code is added to the .htaccess file:

SetEnvIf Request_URI "^(/403.*.htm|/robots.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

What exactly is the string inside the SetEnvIf meant to be doing?

It looks to me like: if the user is requesting a file called "403<#*$!>.htm" or "robots.txt", set the env to allowsome. I'm kind of confused because it doesn't look like regular regex to me (the grouping and the forward slashes look odd to me).

Before I ask a load (more?) of silly questions, am I reading this correctly?

View 1 Replies View Related

Web Spider/domain Copy

Sep 7, 2006

I work for a company which has just switched over to a new web system. The old system is VERY unstable and the database is completely unreadable, yet they want a backup of the old system before they take it offline. I figure that the easiest thing to do would be to launch a spider (or crawler, some people are picky about terminology) that will go through the entire domain and copy the content to flat files. I just need a snapshot of the domain, but I can't find any software to do it in Linux, so I have turned to PHP.

View 1 Replies View Related

I Need A Spider Simulator For My Site

Mar 25, 2005

Does anyone know where I can get a spider simulator like the Sim Spider on WebmasterWorld? I would like to offer my visitors something like that, so they could enter their URL in the form field and have it output a list of pages found on their site, or on the site they want to spider.

View 1 Replies View Related

Trying To Make A FTP Spider... With Loop

Jul 2, 2006

I've been trying to code an FTP spider to function as a search engine for the FTP servers on our network.

The code writes down all of the files, with the FTP server name in front, in text files. It's about to work; there's just one little problem...

When I want to spider a directory with directories inside of it, the spider only writes down the first directory in the list. However, when there are only files inside a directory, the script writes them down perfectly...

A week or two ago I almost had this working, but because of the end of school I had to postpone the coding, and now I'm kind of lost in my own code...

<?php
$ftp  = $_GET["ftp"];
$dir  = $_GET["dir"];
$ftp_host     = $ftp;
$ftp_user     = "anonymous";
$ftp_password = "anonymous";

echo "Connecting to $ftp_host via FTP...<BR>";
flush();
ob_flush();

$conn  = ftp_connect($ftp_host);
$login = ftp_login($conn, $ftp_user, $ftp_password);
$mode  = ftp_pasv($conn, TRUE);

if ((!$conn) || (!$login) || (!$mode)) {
    die("FTP connection has failed!");
}
echo "Login Ok.<BR>";
flush();
ob_flush();

// Parse the raw LIST output into an array of items (type 1 = directory).
function itemize_dir($contents) {
    $dir_list = array();
    foreach ($contents as $file) {
        if (preg_match("/([-dl][rwxstST-]+).* ([0-9]*) ([a-zA-Z0-9]+).* ([a-zA-Z0-9]+).* ([0-9]*) ([a-zA-Z]+[0-9: ]*[0-9])[ ]+(([0-9]{2}:[0-9]{2})|[0-9]{4}) (.+)/", $file, $regs)) {
            $tmp_array = array();
            $tmp_array['line']   = $regs[0];
            $tmp_array['type']   = (int) strpos("-dl", $regs[1][0]); // 0 = file, 1 = dir, 2 = link
            $tmp_array['rights'] = $regs[1];
            $tmp_array['number'] = $regs[2];
            $tmp_array['user']   = $regs[3];
            $tmp_array['group']  = $regs[4];
            $tmp_array['size']   = $regs[5];
            $tmp_array['date']   = date("m-d", strtotime($regs[6]));
            $tmp_array['time']   = $regs[7];
            $tmp_array['name']   = $regs[9];
            $dir_list[] = $tmp_array;
        }
    }
    return $dir_list;
}

// Recursively walk a directory: log every entry, and descend into every subdirectory.
function spider($ftp, $dir, $conn) {
    $buff  = ftp_rawlist($conn, $dir);
    $items = itemize_dir($buff);

    foreach ($items as $item) {
        $file = $ftp . $dir . $item['name'];

        // Append this entry to the index file.
        $fh = fopen('ftp.txt', 'a') or die("can't open file");
        fwrite($fh, $file . "\n");
        fclose($fh);

        echo $file;
        echo "<BR>";

        // If the item is a directory, spider it too instead of returning
        // after the first one (returning here was what made the original
        // version stop at the first subdirectory in the list).
        if ($item['type'] == 1) {
            spider($ftp, $dir . $item['name'] . "/", $conn);
        }
    }
}

spider($ftp, '/', $conn);
?>



And I do specify the ftp server in the variable: ?ftp=

View 3 Replies View Related

.htaccess Https And A Spider

Nov 17, 2005

I need to write a program that runs from a web server, logs into a site that uses .htaccess on a secure server, passes a string to the script on the secure server, spiders all the results for that string, then inserts all the results into a MySQL database.

Is there a way for PHP to log into a web site that uses .htaccess?

Is there a way to write a spider that will spider the results and populate them into a MySQL database?
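If the .htaccess protection is ordinary HTTP Basic authentication, cURL can log in by sending the credentials with each request; a rough sketch (the URL and credentials are placeholders):

<?php
// Rough sketch: fetch a page behind HTTP Basic auth (.htaccess/.htpasswd)
// and hand the HTML on for parsing / MySQL inserts.
$ch = curl_init("https://secure.example.com/search.php?q=somestring");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ch, CURLOPT_USERPWD, "username:password");   // placeholder credentials
$html = curl_exec($ch);
curl_close($ch);

// ... parse $html for result links here, then INSERT the rows into MySQL.
?>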

View 1 Replies View Related

Web Spider / Bot - Automatically Scroll

Nov 23, 2010

I need it so that you submit your URL, then it automatically scans it for the description and keywords and puts them in the database. My HTML code:

Code: <form action="submit_url.php" method="get">
<input type="text" name="url" value="url" />
<input type="submit" name="submit" value="submit" />
</form>
php code so far:
<html>
<head>
[Code]....

All it does is read the website; if you put 'echo $url' at the bottom, it just reads and prints the web page.
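For the description and keywords specifically, PHP's built-in get_meta_tags() does most of the work; a rough sketch of the submit_url.php side (the database, table and column names are made up):

<?php
// Rough sketch for submit_url.php: read the meta tags of the submitted URL
// and store them. DB/table/column names are placeholders.
$url  = $_GET['url'];
$meta = @get_meta_tags($url);               // returns name => content pairs

$description = isset($meta['description']) ? $meta['description'] : '';
$keywords    = isset($meta['keywords'])    ? $meta['keywords']    : '';

$db   = mysqli_connect("localhost", "user", "pass", "sitedb");
$stmt = mysqli_prepare($db, "INSERT INTO urls (url, description, keywords) VALUES (?, ?, ?)");
mysqli_stmt_bind_param($stmt, "sss", $url, $description, $keywords);
mysqli_stmt_execute($stmt);
?>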

View 6 Replies View Related

Spider Site To Check For Updates

Aug 31, 2007

I am looking to create a spider, preferably in PHP (if this cannot be done in PHP, then any other language), to check an entire website for updates. I want something set up that checks a site and tells me when an update has been made. That is mostly what I need: just to know when, and which files, are updated.
I would also like to be able to then scrape the site and compare a page that has changed to one that I define. I have a site that I monitor and a global site that is very slightly different. I can match the pages up and compare them to one another, so as to be aware of a change on the global site that needs to be made on my site.
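One lightweight way to notice changes is to store a hash of each page between runs and compare it on the next run; a rough sketch (the page list and the storage file are placeholders):

<?php
// Rough sketch: keep an md5 of each page between runs and report changes.
$pages = array("http://www.example.com/", "http://www.example.com/about.html");
$store = "hashes.txt";

$old = file_exists($store) ? unserialize(file_get_contents($store)) : array();
$new = array();

foreach ($pages as $page) {
    $new[$page] = md5(@file_get_contents($page));
    if (isset($old[$page]) && $old[$page] !== $new[$page]) {
        echo "Changed: $page\n";
    }
}
file_put_contents($store, serialize($new));
?>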

View 1 Replies View Related

Local Spider & Search Recommendations?

Sep 3, 2005

I'm looking to have my own (prebuilt and free) spider and search index on my site...I'm long overdue and I'm looking for suggestions!

- PHP / MySQL
- Admin Control Panel
- Ability to configure the spider to crawl at automated times or to run a manual scan.
- Ability to have the results on my own pages instead of some styled-up page that's not mine and doesn't look like my site.

The first thing I came across is PHP Dig but I don't want to miss any gems if they make this one look like a rock.

View 1 Replies View Related

Can Search Engines Spider PHP Pages?

Feb 17, 2004

Does anyone know if the search engines can spider PHP pages?

View 6 Replies View Related

Spider To Respect The Robots.txt Protocol?

Aug 20, 2006

When you use fopen() to read a web page, how do you parse the text of the page and insert it into a database? How do you follow links? And how do you set up your PHP spider to respect the robots.txt protocol?
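Respecting robots.txt mostly means fetching it once per host and skipping any URL whose path falls under a Disallow rule for your user agent. A rough, deliberately simplified sketch (it only honours the "*" agent and ignores wildcard patterns):

<?php
// Rough sketch: very simplified robots.txt check. Only the rules under
// "User-agent: *" are honoured, and wildcard patterns are not handled.
function robots_allows($baseUrl, $path) {
    $robots = @file_get_contents(rtrim($baseUrl, '/') . "/robots.txt");
    if ($robots === false) {
        return true;                      // no robots.txt: assume allowed
    }
    $applies = false;
    foreach (explode("\n", $robots) as $line) {
        $line = trim($line);
        if (stripos($line, 'User-agent:') === 0) {
            $applies = (trim(substr($line, 11)) === '*');
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return false;
            }
        }
    }
    return true;
}

var_dump(robots_allows("http://www.example.com", "/private/page.html"));
?>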

View 1 Replies View Related

Execute A Spider / Scraper But Without It Timing Out

Feb 25, 2009

I need to scrape pages for info at varying intervals, which means calling the bot at those intervals to load a link from the database and scrape the page the link points to. The problem is loading the bot. If I load it with JavaScript (like an Ajax call), the browser will throw up an error saying that the page is taking too long to respond, yadda yadda yadda, plus I will have to keep the page open.

If I do it from within PHP I could probably extend the execution time to however long is needed, but then if it does throw an error I don't have access to kill the process, and nothing is displayed in the browser until the PHP execution is completed, right? I was wondering if anyone had any tricks to get around this: the scraper executing by itself at various intervals without me needing to watch it the whole time.
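The usual way around both problems is to take the browser out of the loop entirely: run the scraper from cron via the PHP command line, where the time limit can be lifted; a rough sketch:

<?php
// Rough sketch of a scraper meant to be started by cron (php scrape.php),
// not by a browser, so nothing times out on the client side.
set_time_limit(0);          // no execution time limit
ignore_user_abort(true);    // keep running even if whatever started it disconnects

// ... load the next link from the database, fetch it, scrape it, save results ...
// Progress can go to a log file instead of the browser:
file_put_contents("scrape.log", date("c") . " run started\n", FILE_APPEND);
?>

A crontab line along the lines of */15 * * * * php /path/to/scrape.php (path and interval are placeholders) would then fire it at the desired interval with no browser involved.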

View 4 Replies View Related

Spider That Will Grab Links Off Of A Page

Mar 25, 2007

I am writing a simple spider that will grab links off of a page. So far I have no problem grabbing all the links off the first seed page, but I am stuck on how to get it to follow those links to the next pages and grab additional links perpetually. Code:

View 5 Replies View Related

Web Crawler PHP

Nov 2, 2006

I am supposed to construct a page that searches specific websites to extract information, like those sites where you can rent a car, for example. There is a form on the site where the user selects some fields (for instance departure and drop-off date), then the data are submitted to another page that searches 2-3 sites and finds which cars are available on those dates.

I wanted to ask if there are ready-made scripts to do that, and if not, some hints on how to start. I am familiar with PHP forms and data extraction from MySQL databases, but when it comes to extracting data from other sites, I have no clue how to begin or how to deal with it...

View 2 Replies View Related

PHP Web Crawler

Sep 12, 2006

I am working on a PHP web crawler and am having trouble parsing links out of a page; all that happens is that an array is printed out. Here is the script:

<?php
// Fetch the page (requires allow_url_fopen).
$f = fopen("http://www.google.com", "r");
$inputStream = fread($f, 65535);
fclose($f);

// Match anchor tags; $matches[1] holds the URLs, $matches[2] the link text.
if (preg_match_all('/<a[^>]*href="(.*?)"[^>]*>(.*?)<\/a>/i', $inputStream, $matches)) {
    $links = $matches[1];
    $text  = array_map('strip_tags', $matches[2]);
    print_r($links);
    print_r($text);
}
?>

Can someone please help me?

View 1 Replies View Related

Web Crawler

Sep 24, 2006

I have a script that parses the links out of a page; now I want to figure out how to follow those links. Here is the script:

<?php
$f = fopen("http://www.theotaku.com", "r");

// Read the page line by line and collect the links found on each line.
// (Note: anchors that span more than one line will be missed with fgets.)
while ($buf = fgets($f, 1024)) {
    // $words[1] holds the URLs, $words[2] the link text.
    preg_match_all('/<a[^>]*href="(.*?)"[^>]*>(.*?)<\/a>/i', $buf, $words);

    foreach ($words as $group) {
        foreach ($group as $word) {
            $cur_word = strtolower($word);
            print "Indexing: $cur_word<br>";
        }
    }
}
fclose($f);
?>

View 1 Replies View Related

Search Engine Spider Simulator Script, I Am Looking For One.

Apr 19, 2004

I am looking for a Search Engine Spider Simulator Script like

http://www.searchengineworld.com/cgi-bin/sim_spider.cgi

It really doesn't matter whether it's PHP or CGI, I would just like one for my visitors to use. I have searched for one and have not seen any, for sale or free. Is there one on the web?

View 1 Replies View Related

How Many Dynamic Elements On A Page Can The SE Spider Before Giving Up?

May 4, 2007

I'm having a site built using PHP and MySQL to target individual searches. So my question is: how many dynamic elements on a page will a search engine spider before giving up? Will mod_rewrite help with this issue? Are there any other work-arounds?

View 1 Replies View Related

Php Based Crawler

Oct 11, 2006

The problem is, I'm trying to make a central portal so that the recent posts from all of my friends' blogs can be seen on it, so that it's easy to see who posted what.

The process needs to be that when I add a URL, the crawler then keeps checking the URLs. If a new post is made, it has to appear on my central portal with its title and description.

So is there a way to do this, or any script out there that currently does this?
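One common approach, if the blogs expose RSS or Atom feeds, is to poll those feeds rather than crawl the HTML; a rough sketch (the feed URLs below are placeholders, real blogs usually expose something like /feed or /rss.xml):

<?php
// Rough sketch: poll each blog's RSS feed and show its latest post.
$feeds = array(
    "http://example-blog-one.com/feed",
    "http://example-blog-two.com/feed",
);

foreach ($feeds as $feed) {
    $rss = @simplexml_load_file($feed);   // suppress warnings if a feed is down
    if ($rss === false) {
        continue;
    }
    // RSS 2.0 keeps its items under channel->item.
    $latest = $rss->channel->item[0];
    echo $latest->title . " - " . $latest->description . "<br>";
}
?>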

View 1 Replies View Related

Image Crawler

Mar 3, 2006

How do I script an image crawler? I'm developing on Windows with PHP4. Is it true that images can only be manipulated easily with PHP5?

View 1 Replies View Related

Crawler Identifier

Jun 26, 2006

I am running a website with specific functions which collect information about users' preferences on that site. But crawlers often come to my site and my scripts insert records about their visits. Is there a quick and easy way to identify crawlers so that I can ignore crawler data?

View 3 Replies View Related

Keep Crawler In One Domain?

May 21, 2011

I'm writing a simple PHP crawler, essentially a class which recursively crawls a website by detecting link tags and going deeper. The problem is that I would like to contain it within the domain it's crawling, otherwise it will start following links to other domains and set off a never-ending chain reaction. My idea was to scribble a regex that would dissect the bare domain name of the website (e.g. domain.com) and check every link against it. The regex itself would have to be quite long, since I would have to include all TLDs in it.

UPDATE: parse_url is not a solution - it can only give the HOST name, not the DOMAIN name.
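A rough sketch of a host-based check without a giant TLD regex: compare the last two host labels of each link against the seed. This treats subdomains of the seed as in-scope, but it is only an approximation; multi-part TLDs such as .co.uk would need a proper public-suffix list:

<?php
// Rough sketch: decide whether a link belongs to the same "bare" domain
// as the seed URL. Comparing the last two host labels is an approximation.
function bare_domain($url) {
    $host   = parse_url($url, PHP_URL_HOST);
    $labels = explode('.', $host);
    return implode('.', array_slice($labels, -2));
}

function same_domain($seed, $link) {
    return bare_domain($seed) === bare_domain($link);
}

var_dump(same_domain('http://www.domain.com/', 'http://blog.domain.com/post/1')); // true
var_dump(same_domain('http://www.domain.com/', 'http://other.com/'));             // false
?>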

View 1 Replies View Related

Parse A Url For Crawler?

Sep 6, 2010

I am writing a small crawler that extracts data from some 5 to 10 sites. While collecting the links I am getting some URLs like ../test/index.html. If a link were /test/index.html we could simply prepend the base URL to get http://www.example.com/test/index.html.
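A rough sketch of one way to resolve such links by hand against the URL of the page they were found on (edge cases like query strings and fragments are not handled):

<?php
// Rough sketch: resolve a link found on a page against the URL of the
// page it came from. Handles root-relative, "./" and "../" style links.
function resolve_url($base, $link) {
    // Already absolute?
    if (parse_url($link, PHP_URL_SCHEME) !== null) {
        return $link;
    }
    $parts  = parse_url($base);
    $scheme = $parts['scheme'];
    $host   = $parts['host'];
    $path   = isset($parts['path']) ? $parts['path'] : '/';

    if ($link[0] === '/') {
        // Root-relative link.
        $path = $link;
    } else {
        // Relative to the directory of the base page.
        $dir  = substr($path, 0, strrpos($path, '/') + 1);
        $path = $dir . $link;
    }

    // Collapse "./" and "../" segments.
    $segments = array();
    foreach (explode('/', $path) as $seg) {
        if ($seg === '' || $seg === '.') {
            continue;
        }
        if ($seg === '..') {
            array_pop($segments);
        } else {
            $segments[] = $seg;
        }
    }
    return $scheme . '://' . $host . '/' . implode('/', $segments);
}

echo resolve_url('http://www.example.com/dir/page.html', '../test/index.html');
// http://www.example.com/test/index.html
?>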

View 3 Replies View Related

What Is The Working Of Web Crawler?

Aug 17, 2010

Will a web crawler crawl the web and create a database of the web, or will it just create a searchable index of the web? If it creates an index, who exactly gathers the data from the web pages and stores it in the database?

View 1 Replies View Related

Multiple Servers - We Want Google To Be Able To Spider The Main Server (www).

Jun 3, 2006

My client expects an enormous amount of traffic so we're going to set up multiple servers at hosting companies around the country. The main server (www.mysite.com) will randomly redirect people to one of the other servers (www1, www2, etc) where they will do their shopping. This will spread the load around.

My problem is that we want Google to be able to spider the main server (www). I can easily put robots.txt files on the other servers so that Google doesn't spider, for example, www47, but how can I make sure that Google is able to traverse the "www" machine while real customers are sent to the mirror machines? Code:

View 2 Replies View Related

How Do You Spider A Page And Search For A Specific Anchor Text

Aug 28, 2004

I've scoured the net for some simple code, and my regex experience is nil; I find the darn things totally confusing. Does anyone have a snippet of code that will grab a URL, skip the headers, pull out all the links in the body, and provide the anchor text for each link?

View 1 Replies View Related

Checking If Referrer Is Web Crawler

Jun 28, 2006

I have a book affiliate website. Whenever a visitor clicks on one of the books, a script adds one to a field in a MySQL database and then takes the visitor to the shopping basket on the book website.

I have noticed that the book links are getting lots of hits. At first, I was pleased about the potential income this might mean - but then it occurred to me that many of these hits are web crawlers (this was confirmed by Webalizer).

Any suggestions for ways of checking whether the link is being "clicked" by a web crawler, so that I can avoid incrementing the field in the SQL database?

I've checked HTTP_REFERER, but it seems to be empty for what I assume are crawled clicks.

View 4 Replies View Related
