I am looking for how to make one of these, but my searches on Google and other search engines have turned up nothing useful. I'm trying to make a bot that can log into a site, store cookies, and move between two pages to keep me logged in while I'm not actually on the page.
This is not going to be used for 'cheating' purposes of any kind. I merely want to fake my logged-in time on a site, which is actually my school page.
I have some doubts about PHP-based web crawlers: can one run like a thread-based Java crawler? I ask because in Java a thread can be executed again and again, and I don't think PHP has anything like a thread function. Can you please say which web crawler would be more useful, a PHP-based one or a Java-based one?
Hi, I have a somewhat weird request, but I have a legit reason. My new CMS is having performance problems, and we are trying to resolve the issue by implementing a bot or spider that grabs the HTML of the site every 30s.
I was wondering if you guys could kindly tell me where to get such bots. The program will be installed on another server with a cron job to run every 15s.
The 'website performance' check services out there are limited to a 2-minute interval. I need something that hits the target site in question more frequently.
I know I've seen tutorials on this kind of thing before, but now that I'm ready to do it I can't find one. I know it has something to do with editing the conf file... Can anyone point me in the right direction?
I figured out how to change the title tag dynamically, so search engines would now see different tags, if only they could spider my pages such as http://321webmaster.com/index.php?cat_id=3&subcat_id=79
I'd like to be able to quickly spider the referring page from which a specific page on my site is accessed, using PHP. I'd then like to be able to extract the title and description from the referring page, and display them. Is there a script available that would do that, or most of it?
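One possible starting point for the extraction step, sketched in Python rather than PHP purely for illustration. It assumes the referring page's HTML has already been downloaded (the referrer URL normally arrives in the HTTP `Referer` request header, which a visitor's browser may omit). The class name, function name, and sample markup are made up for this example.

```python
from html.parser import HTMLParser


class TitleDescriptionParser(HTMLParser):
    """Collects the <title> text and the meta description of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears between <title> and </title> is kept.
        if self._in_title:
            self.title += data


def title_and_description(html):
    """Returns (title, description); either may be an empty string."""
    parser = TitleDescriptionParser()
    parser.feed(html)
    return parser.title.strip(), parser.description


# Hypothetical page source standing in for the downloaded referring page.
sample = """<html><head><title> Example page </title>
<meta name="Description" content="A short summary."></head><body></body></html>"""
print(title_and_description(sample))
```

Using an HTML parser instead of a regular expression keeps the sketch tolerant of attribute order and letter case in the meta tag.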
What are the limitations of PHP as a web spider? Most PHP spiders I have seen can usually index around 100,000 pages. Can a PHP spider be designed to index millions or even billions of pages? What are some of the limitations of MySQL for storing that information?
I'm just wondering out of curiosity, how are spider/crawler scripts made? Just the basic setup and stuff is all I'm wondering (and also how do you get it to "follow" links?). I already searched a couple of times on Google and found some stuff, but some of it just didn't have the info I was looking for.
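The basic setup is usually a queue of URLs still to visit plus a set of URLs already seen: fetch a page, pull out its links, and queue the ones not seen before. Here is a minimal sketch of that loop in Python (chosen only for illustration; the same structure carries over to PHP). To keep it runnable offline, the "site" is a made-up dictionary of three pages, and `fetch` is any function that returns a page's HTML or `None`.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links against the page URL and drop #fragments.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl; returns the URLs fetched successfully, in order."""
    queue = deque([seed])
    seen = {seed}          # every URL ever queued, so nothing is queued twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:   # page missing or not retrievable
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site held in memory so the sketch runs offline.
FAKE_SITE = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="/">home</a> <a href="b.html#top">B</a>',
    "http://example.test/b.html": '<a href="/missing.html">gone</a>',
}

print(crawl("http://example.test/", FAKE_SITE.get))
```

The `seen` set is what stops the spider from looping forever when pages link back to each other, and `max_pages` is a simple safety limit.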
I have made a visitor counter for a site that adds a record to the database each time a page is viewed. To make it a little more accurate, I was wondering if there is a way to detect if the page is being viewed by a search engine spider instead of a human? That way I could use a condition to not execute the database update if the visitor is a spider... or mark the record "Spider".
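A common approach is to look at the `User-Agent` request header, since well-behaved crawlers identify themselves there. The sketch below (in Python, for illustration only) checks for a few substrings; the list is an assumption and far from complete, and a User-Agent can be spoofed, so this is a heuristic and not proof. Treating a missing User-Agent as a crawler is also just an assumption made for this example.

```python
import re

# Illustrative, incomplete list of substrings often found in crawler User-Agents.
BOT_PATTERN = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)


def looks_like_crawler(user_agent):
    """Heuristic: True when the User-Agent string resembles a crawler's."""
    if not user_agent:
        return True  # assumption: browsers practically always send a User-Agent
    return bool(BOT_PATTERN.search(user_agent))


googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
firefox = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0"

for ua in (googlebot, firefox):
    print(looks_like_crawler(ua), ua)
```

In the counter, the result could either skip the database update or be stored as a "Spider" flag on the record, as described above.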
I reposted this from "Regex within PHP" because I feel this is a PHP problem, not a regex one. What I am trying to do is start at a pre-defined page, find all the links on that page, run the spider on all of the pages that were found, and return the results from the pages that were spidered. Code:
What I want to be able to do is spider lots of web sites and return the type of server software they use (like Netcraft). I want it to be automated, so it might have to follow links on pages in order to get to other servers.
In the PHP spider trap solution the following code is added to the .htaccess file:
SetEnvIf Request_URI "^(/403.*.htm|/robots.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
What exactly is the string inside the SetEnvIf meant to be doing?
It looks to me like "if the user is requesting a file whose name starts with 403 and ends in .htm, or a file called robots.txt, set the env to allowsome". I'm kind of confused because it doesn't look like regular regex to me (the grouping and the forward slashes look odd to me).
Before I ask a load (more?) of silly questions, am I reading this correctly?
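That reading can be checked by running the same expression against a few request paths. Below is a small sketch in Python, under two assumptions: that the separator between the two alternatives is the ordinary alternation bar `|`, and that Python's `re` module treats these basic constructs the same way Apache's regex engine does. The forward slashes are simply literal characters of the URI (there are no delimiters here), and the parentheses group the two alternatives so that `^` and `$` apply to both. The test paths are invented.

```python
import re

# The expression from the SetEnvIf directive, alternatives separated by "|".
pattern = re.compile(r"^(/403.*.htm|/robots.txt)$")

paths = ["/403.htm", "/403-forbidden.htm", "/robots.txt", "/index.htm", "/robotsxtxt"]

# Map each path to whether the directive's expression would match it.
results = {path: bool(pattern.search(path)) for path in paths}
for path, matched in results.items():
    print(path, matched)
```

Note the last path: because the dots are not escaped, each one matches any single character, so the expression is slightly looser than "403-something.htm or robots.txt".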
I work for a company which has just switched over to a new web system. The old system is VERY unstable and the database is completely unreadable, yet they want a backup of the old system before they take it offline. I figure that the easiest thing to do would be to launch a spider (or crawler, some people are picky about terminology) that will go through the entire domain and copy the content to flat files. I just need a snapshot of the domain, but I can't find any software to do it in Linux, so I have turned to PHP.
Does anyone know if I can get a spider simulator like the Sim Spider on WebmasterWorld? I would like to offer my visitors something like that, so they could put their URL in the http field and have it output a list of pages found on their site or the site they want to spider.
I've been trying to code an FTP spider, to function as a search engine for the FTP servers on our network.
The code writes all of the files, with the FTP server name in front, to text files. It almost works; there is just one little problem...
When I want to spider a directory with directories inside of it I have a small problem. The spider only writes down the first dir in the list. However, when there are only files inside of a directory that I want to spider the script writes them down perfectly...
A week or two ago I had gotten this almost working, however because of the end of school and such I had to postpone the coding, now I'm kinda lost in my own code...
I need to write a program that runs from a web server, logs into a site that uses .htaccess on a secure server, passes a string to the script on the secure server, spiders all the results for that string, and then inserts all the results into a MySQL database.
Is there a way for PHP to log into a web site that uses .htaccess?
Is there a way to write a spider that will spider the results and load them into a MySQL database?
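On the login question: if "uses .htaccess" here means HTTP Basic authentication configured through .htaccess (an assumption, though it is the usual case), then logging in amounts to sending an `Authorization` header with every request. A minimal sketch in Python, for illustration only; the URL, username, and password are made up, and the request is built but not sent.

```python
import base64
import urllib.request


def basic_auth_request(url, username, password):
    """Builds (but does not send) a request carrying HTTP Basic credentials."""
    # Basic auth is "username:password" encoded as base64.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical secure-server URL and credentials.
req = basic_auth_request("https://secure.example.test/search?q=widgets", "alice", "s3cret")
print(req.get_header("Authorization"))
# Sending it would be: urllib.request.urlopen(req).read()
```

Since Basic credentials are only encoded, not encrypted, they should only be sent over HTTPS, which matches the "secure server" described above.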
I am looking to create a spider, preferably in PHP (if this cannot be done in PHP, then in any other language), to check an entire website for updates. I want to have something set up that checks a site whenever an update is made. That is mostly what I need, just so I know when and which files are updated. I would also like to be able to then scrape the site and compare the page that has changed to one that I define. I have a site that I monitor and a global site that is very slightly different. I can match the pages up to compare them to one another, so that I am aware of a change on the global site and can make the change on my site.
I'm looking to have my own (prebuilt and free) spider and search index on my site...I'm long overdue and I'm looking for suggestions!
- PHP / MySQL
- Admin control panel
- Ability to configure the spider to crawl at automated times or to run a manual scan
- Ability to have results on my own pages instead of some styled-up page that's not mine and doesn't look like my site
The first thing I came across is PHP Dig but I don't want to miss any gems if they make this one look like a rock.
When you use fopen() to read a web page, how do you parse the text of the page and insert it into a database? How do you follow links? How do you set up your PHP spider to respect the robots.txt protocol?
If I do it from within PHP I could probably extend the execution time to however long is needed, but then if it does throw an error I don't have the access to kill the process, and nothing is displayed in the browser until the PHP execution is completed, right? I was wondering if anyone had any tricks to get around this? I'd like the scraper to execute by itself at various intervals without me needing to watch it the whole time.
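For the robots.txt part, the idea is to download the site's /robots.txt once, parse it, and ask it about every URL before fetching. Here is a small sketch using Python's standard `urllib.robotparser` (Python is used only for illustration). The robots.txt content and the spider name `ExampleSpider` are made up, and the rules are parsed from a string so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it would be downloaded from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_text, user_agent):
    """Returns a function that says whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


allowed = make_checker(ROBOTS_TXT, "ExampleSpider")
print(allowed("http://example.test/index.html"))
print(allowed("http://example.test/private/report.html"))
```

In a crawl loop, a URL would only be queued or fetched when `allowed(url)` is true.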
I am writing a simple spider that will grab links off of a page. So far I have no problem grabbing all the links off the first seed page. But I am stuck on how I can get it to follow those links to the next page and grab additional links perpetually. Code:
I am supposed to construct a page that searches in specific websites to extract information, like those sites from where you can rent a car for example. There is a form in the site where the user selects some fields (for instance departure and drop-off date), then the data are submitted to the other page that searches 2-3 sites and finds which cars are available on those dates.
I wanted to ask if there are ready-made scripts to do that, and if not, for some hints on how to start. I am familiar with PHP forms and data extraction from MySQL databases, but when it comes to extracting data from other sites, I have no clue how to begin and deal with it...
I am having a site built using PHP and MySQL to target individual searches. So my question is: how many dynamic elements on a page will a search engine spider before giving up? Will mod_rewrite rules help with this issue? Are there any other workarounds?
I am running a website with specific functions that collect information about users' preferences on that website. But crawlers often come to my site, and my scripts insert records about their visits. Is there a quick and easy way to identify crawlers so I can ignore their visits?
I'm writing a simple PHP crawler, essentially a class which recursively crawls the website by detecting link tags and going deeper. The problem is that I would like to contain it within the domain it's crawling; otherwise it will start following links to other domains and set off a never-ending chain reaction. My idea was to put together a regex that would extract the bare domain name of the website (e.g. domain.com) and check every link against it. The regex itself would have to be quite long, since I would have to include all TLDs in it.
UPDATE: parse_url is not a solution. It can only give the HOST name, not the DOMAIN name.
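One way around the all-TLDs regex, offered as a sketch and not as the only answer: since the crawler starts from a known site, the bare domain can be supplied once by hand, and each link's host only has to equal it or end with a dot followed by it. The example below is in Python for illustration; the function name is made up. Deriving the bare domain automatically from an arbitrary host is a different problem that generally needs the Public Suffix List, which this sketch does not attempt.

```python
from urllib.parse import urlsplit


def same_site(url, site_domain):
    """True when the URL's host is site_domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    site_domain = site_domain.lower()
    # The leading dot prevents "notdomain.com" from passing as "domain.com".
    return host == site_domain or host.endswith("." + site_domain)


links = [
    "http://www.domain.com/page",
    "http://blog.domain.com:8080/x",
    "http://notdomain.com/",
    "http://domain.com.evil.test/",
    "mailto:someone@domain.com",
]
for link in links:
    print(same_site(link, "domain.com"), link)
```

The crawler would then follow a link only when `same_site(link, "domain.com")` is true, which keeps it inside the site without enumerating TLDs.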