I, robot: How do search engine spiders and robots work?

December 25, 2009 by The Big SEO  
Filed under robots.txt

Some internet surfers still hold on to the mistaken belief that actual people visit each and every website and then enter it into the search engine’s database. Imagine if this were true! With billions of web pages available on the internet, many of them offering fresh content, it would take thousands of people to do the work done by search engine spiders and robots – and even then they would not be as efficient or as thorough.

Search engine spiders and robots are pieces of code or software that have only one aim – seek content on the internet and within each and every individual web page out there. These tools have a very important role in how effectively search engines operate.

Search engine spiders and robots visit websites and gather the information they need to determine the nature and content of each website, then add that data to the search engine’s index. They follow links from one website to another so that they can keep gathering information continuously. The ultimate goal of search engine spiders and robots is to compile a comprehensive and valuable database that can deliver the most relevant results to visitors’ search queries.

But how exactly do search engine spiders and robots work?

The whole process begins when a web page is submitted to a search engine. The submitted URL is added to the queue of websites that the search engine spider will visit. Submission is optional, though, because most spiders can find a web page on their own if other websites link to it. This is why it is a good idea to build reciprocal links with other websites: getting links from sites that cover the same topic as yours enhances your website’s link popularity.

When the search engine spider visits the website, it first checks whether a robots.txt file exists. The file tells the robot which areas of the site are off limits to its probe – like certain directories that are of no use to search engines. Most search engine bots look for this text file, so it is a good idea to put one in place even if it is blank.
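As a rough illustration of the check described above, the sketch below uses Python's standard urllib.robotparser module to decide whether a crawler may request a URL. The rules, the crawler name ExampleBot and the example.com URLs are made-up placeholders, not taken from any real site or search engine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the root of the site it is about to visit.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/notes.html"):
        print(url, allowed(ROBOTS_TXT, "ExampleBot", url))
```

With this parser, a blank robots.txt allows everything, which matches the "allow unless disallowed" default that robots follow.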

The robots list and store all of the links found on a page and they follow each link to its destination website or page.
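The link-listing step can be sketched in a few lines of Python. This is only a minimal illustration using the standard html.parser module, not how any particular search engine does it; the class name LinkCollector, the function extract_links and the example.com addresses are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs so they can be queued.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> list[str]:
    """Return all link targets found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A crawler would add each returned URL to its queue of pages to visit, skipping the ones it has already seen.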

The robots then send all of this information to the search engine, which compiles the data received from all the bots and builds the search engine database. This is the part of the process where search engine engineers come in: they write the algorithms used to evaluate and score the information that the bots have compiled. Once the information has been added to the search engine database, it becomes available to visitors making search queries.
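To make the idea of a search engine database a little more concrete, here is a toy sketch of an inverted index: a table that maps each word to the pages containing it. Real search engines add scoring and ranking algorithms on top, which this sketch leaves out entirely; the function names and the sample pages are invented for the illustration.

```python
import re
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the URLs that contain every word of the query (no ranking)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

For example, after indexing two pages that both mention "spiders", a query for "spiders" returns both URLs, while a query that adds a word found on only one page narrows the result to that page.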

The Truth About Robots – Robot Travel

December 8, 2009 by The Big SEO  
Filed under robots.txt

If there is one thing we have learned about robots, it is that there is
absolutely no pattern to them. Most robots are stupid and wander randomly.
For example, 50% of robot hits to my sites ask for the robots.txt page and
then go away, never asking for anything else. Then they come back a week
later, ask for the same thing and go away again. This happens over
and over again for months. You will never figure it out. What are
they doing? If they wanted to see if the Web site was really a Web site,
they could just ping it. This would be much faster and much more efficient.
They seldom visit another page and, if they do, they ask for one other page
every visit or so. Some come in and issue rapid-fire requests for every
page in the Web site. How rude! You have to quit worrying so much about
robots. It can take 6 months before they request enough pages to do you any
good. We really quit thinking about them a long time ago. Build a lot of
pages correctly and, if you have reciprocal links to them, the robots will
find them someday.

Try this: Go to AltaVista and type “link:YourSite.com” into the search box
(leave off the www). This will list the links pointing to your Web site.
Try link:crownjewels.com and you get 136 links to it. Think about this now:
The robots say to themselves, “Here is a site that must be popular, or why
would so many Web sites SIMILAR to it have its link on their pages?” Remember
that only SIMILAR sites with SIMILAR THEMES would probably have a link to
your site. The robots give more importance to this than to you submitting
your link to them. Wouldn’t you?

Go to heavily trafficked sites matching your Web site’s Themes and use AltaVista
to find out how many reciprocal links they have. This will prove to you
we are right.

Search engines are nothing more than a measure of reciprocal links to your
site. The problem is, you are constantly having to fight for your positioning
in the search query listings. Forget about that. Leave the fighting to people
who are able to spend 24 hours a day trying to trick everybody. Quit trying
to compete with the large organizations pouring millions into their marketing.
Completely forget about Search Engines after submitting to them and go after
the reciprocal links. The Search Engines will then believe you are a heavily
visited site because you will be. You will now be getting the traffic you
so richly deserve.

Search engine visitors to your site are oftentimes not qualified visitors.
Too many visitors pop into your home page for 2 seconds and then leave.
You know how it is. We all do it when we are using the search engines. Either
it wasn’t the information we were looking for, or the site had a huge graphic
on a stupid portal page that took forever to load. These visitors
shouldn’t even count, but they get counted as 12-18 hits in your server
logs. Hits are requests to the server. One page request can incur a lot
of hits: the request for the page itself plus one for each graphic, and each
counts as a hit.
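The difference between hits and page views is easy to see with a small sketch. It assumes you already have the list of requested paths from your server log; the function name summarize, the sample paths and the choice of which file extensions count as "pages" are all assumptions made for this example.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumption: only paths with one of these extensions count as a page view.
PAGE_EXTENSIONS = {"", ".html", ".htm"}


def summarize(requested_paths: list[str]) -> Counter:
    """Count total hits and how many of those hits were page requests."""
    totals = Counter()
    for path in requested_paths:
        totals["hits"] += 1
        if PurePosixPath(path).suffix.lower() in PAGE_EXTENSIONS:
            totals["page_views"] += 1
    return totals
```

A single visit that loads /index.html plus three graphics would show up as 4 hits but only 1 page view.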

Reciprocal links bring in qualified visitors. These are visitors who were
already on a Web site which had matching Themes to yours. They already have
a good idea of what type of site you are. They will come into your site
and actually stay awhile. These visitors should count as double credit,
they are so good.

We know which type of visitor we would rather have.

How do you get people to WANT to put your link on their Web sites? Why would
a similar site put a link to your site on theirs? Simple, you have similar
Themes. You are similar, but not competition.

There is one very important lesson to be learned from this crazy robot behavior.
You need to make the navigation in your Web site so easy that a visitor
can find any page within 2 clicks of your home page. One way of doing this
is installing hidden DotLinks. DotLinks are little periods that are linked
to other pages and are not really noticeable on your page, because each one
looks like an ordinary period. Although they are not easily seen by the
human eye, they are links that a robot can follow in your Web site. When you
do this, robots can find your pages faster and more easily.
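If you want to check the 2-click rule on your own site, one possible approach is a breadth-first search over your internal links. The sketch below assumes you have already written down which pages each page links to; the function names click_depths and too_deep and the sample site map are hypothetical.

```python
from collections import deque


def click_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Breadth-first search: fewest clicks needed to reach each page from home."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def too_deep(links: dict[str, list[str]], home: str, max_clicks: int = 2) -> list[str]:
    """List pages that are unreachable or more than max_clicks from home."""
    depths = click_depths(links, home)
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    # Pages missing from depths were never reached, so treat them as too deep.
    return sorted(p for p in all_pages if depths.get(p, max_clicks + 1) > max_clicks)
```

Any page this reports is one that a visitor, or a robot following links, would need more than 2 clicks to find, or could not reach at all.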


Search Engine Spiders And Your Robots.txt File

February 25, 2009 by The Big SEO  
Filed under robots.txt

In this article we will discuss search engine spiders and what they do. You will also learn how to create a robots.txt file and why you might need one.

Search engine spiders are automated software programs that crawl the Web looking for pages to feed to search engines. They are also called crawlers, robots and bots. Spiders are among the most useful programs on the internet and a key part of how the search engines operate. Spiders allow your site to be found by the millions of people who use search engines. Feed the spiders right and they will tell the search engines about your site.

How Spiders Work

A search engine is an index to the Internet: it points to relevant web sites depending on your search. Search engines need a tool that is able to visit websites, navigate them, decide what each website is about and add that data to the search engine.

Spiders are essentially programs that “crawl” sites and report back to their boss their findings. Their purpose in life is to make it easy for your site to get listed in search engines.

Spiders work by finding links to web sites, visiting those web sites, going through the content of a web site and then reporting the content of the site back to the database of the search engine they work for. From there, the information is added to the search engine, and the site then shows up in search results.

The robots.txt file

By defining a few rules, you can tell robots not to crawl certain directories or files within your site. Web sites do not absolutely have to have a robots.txt file; they can get along just fine without one. Most spiders look for a robots.txt file as soon as they arrive on your site. Take a look at your site statistics. If your statistics have a “files not found” section, you may see many entries where spiders failed to find a robots.txt file on your site.

The default behavior is to allow all unless you have a Disallow for that resource. If you wish to exclude some of your pages from search engine indexing, this is the tool approved by the search engines. Creating a robots.txt file that guides spiders is simple.

If you want to allow the spiders to crawl your site but exclude directories of your choice, copy and paste the following into a blank txt file:

User-agent: *
Disallow: /directory1/
Disallow: /directory2/
Disallow: /directory3/

To exclude files of your choice, type in the path to the files you want to exclude:

User-agent: *
Disallow: /directory1/page1.html
Disallow: /directory2/page2.html
Disallow: /directory3/page3.html

To exclude all the search engine spiders from your entire web site, copy and paste the following into the txt file:

User-agent: *
Disallow: /

To keep a specific search engine spider from crawling your site, replace Name_of_Robot with that spider’s user-agent name:

User-agent: Name_of_Robot
Disallow: /

To allow a single robot and exclude all other robots:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
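If you would like to test rules like these before uploading them, one option is to run them through Python's standard urllib.robotparser module. The sketch below feeds it the "allow a single robot" rules; the helper name may_crawl, the robot name SomeOtherBot and the example.com URL are placeholders, and other parsers may differ from this one in edge cases.

```python
from urllib.robotparser import RobotFileParser

# One record for Googlebot (empty Disallow = nothing is off limits),
# then one record for every other robot (Disallow: / = everything is).
RULES = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def may_crawl(user_agent: str, url: str = "http://example.com/index.html") -> bool:
    """Return True if the rules above let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, may_crawl(agent))
```

Here Googlebot is allowed to fetch the page and any other robot is turned away.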

There can only be one robots.txt on a site, and you may not have blank lines within a record. Once you have it the way you want, save the file as “robots.txt”. Upload the file to the root directory of your site, that is, the directory where your home page or index page is. Put the robots.txt file right alongside the index file.