Google, Google, Go away
Since my journal is of a personal nature - I don’t really want it indexed on Google. I had previously posted about a way to “temporarily” remove your site from their index - unfortunately, it DOES time out! I’ve done this a number of times, and… well… it’s BAAACKK.
If you can’t beat ‘em - pretend you don’t exist, and maybe they’ll go away.
I’ve now added code to my blog so that if you get to it from a google search you get a “file not found” page. For those of you wishing to hide from Google - here’s how I did it:
At the top of (every) page (obviously - this is done as an include) - BEFORE any <html> tags I have this:
<?php
// bail out with a fake "not found" page when the visitor arrives from a Google search
$itsagoogle = 'google.';
$ref = getenv("HTTP_REFERER");
if (($ref) and (strstr($ref, $itsagoogle)) ) {
    print('<head><title>File Not Found</title></head><body><H1>File Not Found</h1><p>The requested URL was not found on this server.</p></BODY>');
    exit;
}
?>
That’s it! Now if you do a search in google, and my site comes up - you’ll get that generic “file not found page”. I’m all for simple solutions!!!
(Standard disclaimer: This requires that you can run php on your page and server)
update: For those of you who used this code before 11:00pm on 5/18, please note I made a slight change so that ANY google referrer would be blocked (i.e. the original script didn’t work if they came from a www.google.ca search) but now it’s fixed…
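The check above looks for "google." anywhere in the referrer, and the page it prints still goes out with a normal "200 OK" status. Here is a minimal sketch of a slightly stricter variant, assuming PHP 7.1 or later. The function names isGoogleReferrer and sendNotFound are made up for illustration and are not part of the original script. It looks only at the referrer's host name and sends a real 404 status line.

```php
<?php
// Hypothetical helper: true when the referrer's host is a Google domain
// (google.com, www.google.ca, images.google.co.uk, ...).
function isGoogleReferrer(string $referrer): bool
{
    $host = parse_url($referrer, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // empty or unparsable referrer
    }
    // Match "google." at the start of the host or right after a dot.
    return preg_match('/(^|\.)google\./i', $host) === 1;
}

// Hypothetical helper: send a 404 status line, then the same page body as above.
function sendNotFound(): void
{
    header('HTTP/1.0 404 Not Found');
    print('<head><title>File Not Found</title></head><body><h1>File Not Found</h1><p>The requested URL was not found on this server.</p></body>');
}

// Usage at the very top of a page (left as a comment so the sketch has no side effects):
// if (isGoogleReferrer((string) getenv('HTTP_REFERER'))) { sendNotFound(); exit; }
```

Matching on the host means a referrer such as http://example.com/google.html is let through, which the plain substring check would have blocked.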
Update 3/16/03: Ron had a hack elsewhere that would work here if you’d like to add more rejected referrers in addition to Google. With his hack, here’s how it goes: (this should be one of the first things on your page)
May 18th, 2002 at 4:26 pm
Thank you Jenn! I’ve had this same problem and am so tired of resubmitting to have my site not listed. This is going to be great!
May 18th, 2002 at 5:26 pm
Just so you know - this is still only a “superficial fix”, as anyone intent on getting to your site can just manually type in your url, hit enter (then hit refresh if necessary) and they’re on. But, of course, they’d have to KNOW to do that! LOL!
May 18th, 2002 at 6:52 pm
That’s a good little trick! Especially if you have a blog listed on the Australian NineMSN blog directory. I don’t know who wrote those descriptions, but they’re most unflattering!
Hopefully now that I’ve changed domains and added meta info, I won’t get spidered at all. *crosses fingers*
May 18th, 2002 at 7:00 pm
It should also be noted that a better solution (IMHO) is to never be indexed in the first place. How? Most (google included) index-bots follow the robots.txt standard.
Info on Google’s bot can be found here: http://www.google.com/bot.html
Info on the robots.txt standard can be found here:
http://www.robotstxt.com
In theory, there are three or four pages on my site which should not be indexed by search engines, and after almost 1,000 search referrals, they’ve never been hit. One of them is the page of “interesting search referrals” so this does work for google, if done correctly.
May 18th, 2002 at 7:16 pm
Pete - I’ve done that… but I still get indexed
May 18th, 2002 at 7:19 pm
Here’s the link on this site
May 18th, 2002 at 8:09 pm
Google ignored my robots.txt, too, and I know of at least one other person whose robots.txt it ignored as well. Most annoying!
May 18th, 2002 at 8:22 pm
This is a great little trick Jennifer!! It could also be used if some particular person or site was linking to you and you didn’t want the traffic, you could shut them out and take up considerably less bandwidth. That doesn’t happen all too often, but I’m sure it happens sometimes (I know it’s happened to me before) so perhaps this would be a thought for that as well.
As far as google goes, I must be a lucky one. On my posh site, I just put a noindex, nofollow meta tag - I haven’t been crawled by google on Posh for months now.
May 18th, 2002 at 8:53 pm
this would work great for blocking that stupid “portal of evil” site or whatever it’s called. wish i had known about this a while back.
i finally got google to stop indexing me by using this text in my robots.txt file:
User-agent: *
Disallow: /

User-agent: DittoSpyder
User-agent: Googlebot
Disallow: /
May 20th, 2002 at 11:26 am
I’m curious. Can you have TOO many parts of text in your .htaccess code? Will one override the other and etc. or possibly cause the blocking text to not work completely?
May 20th, 2002 at 1:52 pm
I have only a few lines in my .htaccess, and now only the text above in my robots.txt file… don’t think that’s the problem.
I once read somewhere that some of the “keyword to link” associations that Google makes are based on how often throughout the net a particular (keyword) is used as the link to a page…
So in this case, my name: I leave a lot of comments on people’s blogs - and on that page, my name is linked to my site. Therefore if you do a search for “Jennifer”, Google knows that “Jennifer” is very often linked to “www.scriptygoddess.com”…
The article I read talked about how you could use that to play a prank on someone. Let’s say your friend’s site is http://www.joe.com. If on your page, everywhere you used the (keyword) “jerk” you linked to your friend’s site, and you asked a ton of friends to do the same - Google doesn’t even have to spider http://www.joe.com… from its spidering of OTHER pages, it draws the association of jerk and http://www.joe.com… so you do a search on the keyword “jerk” and up will pop your friend’s site…
Wish I could find that article…
May 20th, 2002 at 1:55 pm
…more proof that THAT is what’s going on here… You’ll notice any search that returns my site, the “cached” option is not available… it’s because Google IS NOT spidering my site, but that doesn’t fix my problem of not coming up on searches.
May 21st, 2002 at 3:20 am
Interesting piece of script. I recently discovered that several people found me by typing my name into google and hitting the “I feel lucky” option. I don’t mind being found, but that is a little *too* easy, since there are people who I’d really rather didn’t know I have a weblog.
May 22nd, 2002 at 1:53 am
A question/suggestion: How about sniffing out the IP of Google’s crawler (which I believe has ‘googlebot’ in it) instead of the referer? This would make sure that the Google cache gets a ‘bogus’ copy (for those nefarious types, like myself, who sometimes look at the cached copy of a site). Also, to mask that they used Google, they could simply copy the URL into the clipboard and paste it into a different browser (another favourite technique of mine when I want to mask what I’m searching for).
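For what it's worth, the crawler announces itself in the User-Agent header rather than in its IP address, and Google's crawler includes the word "Googlebot" there. A minimal sketch of that check follows, assuming PHP 7 or later. The function name isGooglebot is made up for illustration, and the header can be faked by anyone, so treat it as a hint rather than proof.

```php
<?php
// Hypothetical helper: true when the User-Agent string mentions Googlebot.
function isGooglebot(string $userAgent): bool
{
    // stripos() is case-insensitive, so "googlebot" and "Googlebot" both match.
    return stripos($userAgent, 'googlebot') !== false;
}

// Usage at the very top of a page (left as a comment so the sketch has no side effects):
// $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// if (isGooglebot($agent)) { header('HTTP/1.0 404 Not Found'); exit; }
```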
May 22nd, 2002 at 3:15 am
Jennifer, that google trick you were talking about is called a google bomb and the article is here.
May 22nd, 2002 at 5:45 am
Robert - Yup that’s the article!! Thanks for the link!
Richard - re: crawler… see the comments above: My site isn’t actually being crawled, and there isn’t a cached version of my site available on Google. As for people copying and pasting the URL… if they’re going to that extreme to see my site, then fine. I think most people will see the “page not found” and move along. I’m not trying to block everyone, just random people hitting my site, who aren’t really interested in blogs in the first place.
May 22nd, 2002 at 11:15 am
There’s a nice little tutorial on using robots.txt at http://www.searchengineworld.com/robots/robots_tutorial.htm
May 24th, 2002 at 10:40 pm
Hi,
I was looking for MT hacks in google and found your site here
http://www.google.com/search?hl=en&lr=&ie=UTF8&oe=UTF8&q=moveable+type+hacks
Thought you might like to know..
Selena
May 24th, 2002 at 10:44 pm
Selena - that brings up scriptygoddess. That’s okay. I was hiding another site. That’s where I’m using the code.
June 27th, 2002 at 10:59 am
jenn, i modified your code to look for the term “search” within the referrer link… this has enabled me to block out many other search engines, such as hotbot, msn, and altavista.
thanks for the code!!!
<?php
// "search" shows up in the referrer of most search engines' result pages
$itsasearch = 'search';
$ref = getenv("HTTP_REFERER");
if (($ref) and (strstr($ref, $itsasearch)) ) {
    print('<head><title>File Not Found</title></head><body><H1>File Not Found</h1><p>The requested URL was not found on this server.</p></BODY>');
    exit;
}
?>
July 26th, 2002 at 3:53 pm
If the robots.txt wasn’t working right for you, the other possibility is to use the META tags for such things. Here’s a site I found that has some good data on it: http://www.ceebanff.ca/help/tags/
November 4th, 2002 at 11:50 am
I found a link directly on Google’s site on how to remove content from their indexes. There are various methods, but basically:
“If you want to prevent all robots from indexing individual pages on your site, then you can place the following meta tag element into the page’s HTML code:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
If you want to allow other robots to index individual pages on your site, preventing only Google’s robots from indexing the pages, use the following tag:
<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">
More information on this standard meta tag element is available here: http://www.robotstxt.org/wc/exclusion.html#meta.”
Word for word from that link, which I can’t remember where I found on the Google site, but which I fortunately saved to my hard drive the last time.
November 4th, 2002 at 11:51 am
Oh. Here’s the link: http://www.google.com/remove.html
November 4th, 2002 at 2:01 pm
Elisa - the only problem is that I was still getting listed in Google searches, even after doing everything they said on the page…
Read through the comments above for the explanation why…
January 13th, 2003 at 9:15 pm
I modified Jennifer’s script by adding additional conditionals to check for more than one search engine, thusly:
<?php
$google = 'google.';
$altavista = 'altavista.';
$ref = getenv("HTTP_REFERER");
$goaway = '<head><title>404 Not Found</title></head><body><h1>File Not Found</h1><p>The requested URL was not found on this server.</p></body>';
if (($ref) and (strstr($ref, $google)) ) {
    print($goaway);
    exit;
} elseif (($ref) and (strstr($ref, $altavista)) ) {
    print($goaway);
    exit;
}
?>
This can be expanded infinitely, although Jess’ substitution of “search” for “google” in Jennifer’s original script may be the most effective method to combat the bots (aside from not getting indexed in the first place).
February 24th, 2003 at 7:26 pm
I think i’m going to have to add this to my trackback pages because that’s all google wants to index it seems.
March 1st, 2003 at 1:25 pm
Fantastic!!! I have removed myself from google so many times….I can’t even bear the thought of doing it anymore! I have my robots.txt set up and meta tags to turn them away but no luck - this is amazing!
Now that I don’t work in the clin lab anymore, I don’t really give a crap if anyone can find my site but I can’t deal with the pervs that hit on Brittany’s site…..that’s why that one is password protected now.
Thank you so much for this
March 5th, 2003 at 5:02 pm
Thank you, thank you, thank you for this script!
March 6th, 2003 at 6:24 pm
Can I use these tags with Blogger Pro?
March 17th, 2003 at 9:52 am
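A faster method (this code must be placed before all HTML code (top of the file)):
<?php
$blocked = array("google", "search");
$ref = getenv("HTTP_REFERER");
foreach ($blocked as $term) {
    // the referrer is a full URL, so look for each blocked term inside it
    if ($ref && stripos($ref, $term) !== false) {
        header("HTTP/1.0 404 Not Found");
        exit;
    }
}
?>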
March 18th, 2003 at 12:37 pm
This is WONDERFUL! It works great on my site. I do have a question, though, what would I need to do to get the site in question here (http://www.geekgrrl.com/archives/001891.php) to quit indexing me. Is there a way to do it using this script?
March 18th, 2003 at 12:47 pm
I think the only way to do that is using .htaccess: see here
March 19th, 2003 at 9:40 am
this is great…i have put in the revised script. however, some of the code is showing up on my comment popup pages. if you go to my site and click on the comment link, you’ll see what i mean. it’s happening in comment preview as well. odd.
March 19th, 2003 at 10:24 am
You cannot put this code on typical pop-up comments because they are CGI pages and therefore cannot process PHP code. As far as I know - those pages are being generated dynamically - so I don’t think Google or other search engines can index them - so it shouldn’t be a problem.
March 19th, 2003 at 10:37 am
as usual, you’re marvelous, jennifer…i had forgotten they were dynamically generated. thanks so much!
May 12th, 2003 at 1:20 pm
slight modification to the code above so you can also tack on IPs to “hide” from as well…
<?php
function isBadReferrer($ref, $ip) {
    if (
        (strstr($ref, "google.")) or
        (strstr($ref, "aolsearch.aol.com")) or
        (strstr($ref, "search.yahoo.com")) or
        (strstr($ref, "search.msn.com")) or
        (strstr($ref, "hotbot.com")) or
        (strstr($ip, "123.456.7890"))
        /*
        add more like the above line to add more “rejected” referrals. The "123.456.7890" is a dummy ip to show you how you enter in IPs…
        */
    )
    {
        return true;
    } else {
        return false;
    }
}
$ref = getenv("HTTP_REFERER");
// behind a proxy the visitor's address arrives in X-Forwarded-For
if (!empty($_SERVER['HTTP_X_FORWARDED_FOR'])) {
    $ip = $_SERVER['HTTP_X_FORWARDED_FOR'];
} else {
    $ip = $_SERVER['REMOTE_ADDR'];
}
// cast the referrer to a string so the IP check still runs when no referrer is sent
if (isBadReferrer((string) $ref, $ip)) {
    print('<head><title>File Not Found</title></head><body><H1>File Not Found</h1><p>The requested URL was not found on this server.</p></BODY>');
    exit;
}
?>
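One thing to watch with the IP part: strstr() does a substring match, so an entry like "1.2.3.4" would also catch "11.2.3.45". Here is a minimal sketch of an exact comparison instead, assuming PHP 7 or later. The function name isBlockedIp is made up for illustration, and 192.0.2.1 is just a placeholder address from the range reserved for documentation.

```php
<?php
// Hypothetical helper: true only when $ip is a well-formed address
// that appears exactly in the block list.
function isBlockedIp(string $ip, array $blockedIps): bool
{
    // Reject anything that is not a valid IPv4 or IPv6 address.
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return false;
    }
    // Strict comparison: "192.0.2.1" does not match "192.0.2.10".
    return in_array($ip, $blockedIps, true);
}

// Placeholder block list; swap in the addresses you actually want to hide from.
$blockedIps = array('192.0.2.1');
```

Keep in mind that X-Forwarded-For is supplied by the client or a proxy and can list several addresses, so REMOTE_ADDR is the only value the server sees first-hand.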
July 10th, 2003 at 7:00 pm
I’ve been using this code and it’s been working fine, but I’ve been wondering how I could get it to be even more foolproof. In doing a search to find out how the code Kim posted above would work, I came up with this solution.
In the code Jennifer posted, replace:
print('<head><title>File Not Found</title></head><body><H1>File Not Found</h1><p>The requested URL was not found on this server.</p></BODY>')
with
header("Location: http://www.google.com/")
so your whole code snippet will look something like:
<?php
function isBadReferrer($ref)
{
    if (
        (strstr($ref, "google.")) or
        (strstr($ref, "aolsearch.aol.com")) or
        (strstr($ref, "search.yahoo.com")) or
        (strstr($ref, "search.msn.com")) or
        (strstr($ref, "hotbot.com"))
        //add more like the above line to add more “rejected” referrals
    )
    {
        return true;
    }
    else
    {
        return false;
    }
}
$ref = getenv("HTTP_REFERER");
if (($ref) and (isBadReferrer($ref) )) {
    header("Location: http://www.google.com/");
    exit;
}
?>
This code will redirect anyone who comes to your page via google or any of the other listed search engines to the site you specify in the header() call (in this case I’ve made it http://www.google.com, but it can be anywhere you want). It works a charm - anyone who comes to my site via the “blocked” search engines is redirected back to the google home page.
September 23rd, 2003 at 7:30 am
I was just curious if you could tack this code at the bottom of your cookiecheck.php if you have your blog skinned???
October 6th, 2003 at 8:44 pm
I can’t seem to get the code with the IP addresses working.. it keeps telling me I have a parse error…
February 12th, 2004 at 7:28 pm
hi, i’ve tried entering the codes onto my blogspot.com page. when i use the last one listed on this page, it has zero effect; when i use the first one listed on this page, it prints the “error” message and my blog, no matter if i enter the address directly or arrive via a search engine.
right now, i have the “redirect to http://www.google.com” code on my page.
all help is appreciated… thanks, *M
February 12th, 2004 at 7:33 pm
I’m reasonably certain that if you’re hosting your site on blogspot - they don’t let you run server side scripts like PHP (which is what you need to do with this script)… May I recommend getting your own hosting account with Blogomania?
March 10th, 2004 at 8:23 pm
Some nice links on your frontpage there!
Oh, by the way, I found this entry through a google search on “hide from google”…
March 10th, 2004 at 8:32 pm
Oh, you were probably speaking about another blog. Anyway, still nice links.
April 10th, 2004 at 11:20 pm
Is there a way you could post that script in a way that I could paste it into my LiveJournal’s code (on the S2 system)? Thanks.
April 19th, 2004 at 8:59 pm
I found and came to this site from Google. Script does not work.
I don’t have any program/firewall/whatever that would prevent sending the referrer (tested it on another site).
April 19th, 2004 at 9:03 pm
Actually it does work - I’m not using that script on THIS site.
I should note that there are A LOT of scripts posted on this site. VERY FEW of them are actually being used here. Mostly I’m just sharing information/scripts for other people