Difference between revisions of "Search engines"

From YobiWiki

Revision as of 23:54, 18 January 2007

How to remove a page cached by Google?

Ten years ago you could find plenty of sites explaining "how to get registered by search engines", but now the problem is reversed. Engines sometimes index stuff you didn't really want indexed and, worse, they provide a cached copy, so even if you remove it from your website it remains in the engine's cache, e.g. the Google cache :-(

Google provides a tool to remove cached versions.

For this to work, your server first has to reply with a 404 error when someone requests the page. Then you submit your request to Google and wait for, oh, a week or more...

With Apache you can force a 404 error even for pages that really exist, e.g. a page of your wiki, since the wiki will never send a 404 by itself. Example:

<IfModule mod_alias.c>
# Redirect is a mod_alias directive; with status 404 no target URL is given
Redirect 404 /mywiki/index.php/MyPersonalData
</IfModule>
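
Before submitting the removal request it is worth checking that the server really answers 404 for the page. Here is a minimal sketch in Python, assuming the standard `urllib` module is good enough for the job. The helper names `fetch_status` and `ready_for_removal` and the host `www.example.org` are made up for illustration; only the path comes from the Apache example.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for url (HEAD request)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx answers; the code is what we want.
        return err.code


def ready_for_removal(status: int) -> bool:
    """True when the page answers 404, as the removal procedure requires."""
    return status == 404


if __name__ == "__main__":
    # Hypothetical host; replace it with your own site.
    url = "http://www.example.org/mywiki/index.php/MyPersonalData"
    status = fetch_status(url)
    print(url, "->", status, "(ready)" if ready_for_removal(status) else "(not ready)")
```

The network call sits behind the `__main__` guard, so the two helpers can be imported and reused without touching the network.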

I saw hits from mc-out-f136.google.com, dc-out-f136.google.com and hs-out-f136.google.com every 6 hours (some hits were missing) for 8 days, 16 hits in total.

The first client string was easily recognisable: "googlebot-urlconsole". All the other ones showed up as "Java/1.5.0_04".

In the end, their cached copy is removed for 6 months, but the URL is still listed in the results.

How to avoid indexing of part of your website

See http://www.robotstxt.org/

But remember that this is purely voluntary: a slurper that does not care about robots.txt will simply ignore it.
Your robots.txt is also a nice way to tell everybody that you have something to hide ;-)
E.g. http://www.ibm.com/robots.txt
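
To make this concrete, here is a small sketch of what such a rule looks like and how a well-behaved crawler would read it, using Python's standard `urllib.robotparser`. The robots.txt content, the crawler name `ExampleBot` and the helper `allowed` are assumptions for illustration; the disallowed path is borrowed from the Apache example. Note that the rule spells out the exact path you want hidden, which is the "something to hide" problem in action.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it would be served from /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /mywiki/index.php/MyPersonalData
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if a crawler that honours robots.txt may fetch path."""
    return parser.can_fetch(agent, path)


print(allowed("/mywiki/index.php/MyPersonalData"))  # False
print(allowed("/mywiki/index.php/Main_Page"))       # True
```

This only models a crawler that chooses to obey the file; nothing here stops one that does not.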