Chapter 4. Notes for Search Engine beginners

Table of Contents

Introducing Search Engine Robots & Spiders
How Robots Follow Links to Find Pages
Breadth-First Crawling
Depth-First Crawling
Other Indexing Issues
Spidering Depth
Server Load
Robots and private data
Password-Protected Pages
Encrypted Data
Communicating With Robots
User Agents in Web Server Logs
Robots.txt
The indexing process
Details on Following Links and Crawling Sites
The Site Map Solution and its Limitations
Frames and Framesets
JavaScript Pages
JavaScript Menus and Lists
Redirecting Links
Other Web Interfaces
Graphic Text
Plug-ins
Acrobat
Java
ActiveX
Non-Text File Formats
Dealing with Dynamic Data
Cookies
Dynamic Web Applications and Black Holes
Detecting Duplicate Pages
Updating Indexes
Update Schedules
Locating changed pages

This chapter is intended to give general information regarding the way search engines collect information to fill the search index. Also discussed are potential problems that spiders may have with certain types of content and how to overcome those problems.

Note

The information contained here is not specific to searchbox and applies to any other enterprise-class search engine platform.

Introducing Search Engine Robots & Spiders

Search engine robots are programs that follow links on Web sites, gathering data for search engine indexes. Other names for these programs are spider or crawler (because they follow links on the Web), worm, wanderer, gatherer, and so on. The spider used by searchbox is of a special type because it crawls not only Web sites but also local and remote file systems, FTP sites, and many other types of sources.

You may be wondering why the robots crawl your site at all -- why not just search the files when someone types a word and hits a button? The reason is efficiency: thirty years of Information Retrieval research have found that storing data in indexes reduces the load on the server, allows very large collections to be searched by many people at the same time, and lets the search engine list the pages in order by the relevance of their text to the search terms.

Note: The other way search indexers may gather data is to read files from specified folders. This is faster than using a robot, and it can use the file system information to check whether a file has been changed since the last index. However, robots can access multiple servers without having to mount volumes on a network, and they are also less likely to index private or obsolete data, because they follow links rather than indexing everything in the folders.

This introduction explains how robots gather data for search engines, how they follow links, how to control robot indexing and updating, and more.

If you're running a search engine indexing robot, you may notice trouble with certain sites. Sometimes the robot never gets anywhere, stops after the first page, or misses a whole section of the site. This is usually because the Web page creators made links that are easy for humans using the Mozilla or Microsoft IE browsers to follow, but are somehow invisible to the robots.

How Robots Follow Links to Find Pages

Robots start at a given page, usually the main page of a site, read the text of the page just like a Web browser, and follow the links to other pages. If you think of a site as a web, the robot starts in the center, and follows links from strand to strand until it has reached every one.

Breadth-First Crawling

The idea of breadth-first indexing is to retrieve all the pages around the starting point before following links further away from the start. This is the most common way that robots follow links. If your robot is indexing several hosts, this approach distributes the load quickly, so that no one server must respond constantly. It's also easier for robot writers to implement parallel processing for this system.

Figure 4.1. Breadth-First Crawling scheme

Breadth-First Crawling scheme

In this diagram the breadth-first strategy for a simple graph is shown by a numbered sequence of visited nodes.
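The breadth-first strategy can be illustrated with a small sketch. This is not the searchbox implementation; the `site` dictionary is a hypothetical link graph in which each page name maps to the pages it links to.

```python
from collections import deque

def breadth_first_order(links, start):
    """Return pages in the order a breadth-first robot would visit them."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        # Queue every not-yet-seen link on this page before going deeper.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                visited.append(target)
                queue.append(target)
    return visited

# Hypothetical site: "home" links to "a" and "b", and so on.
site = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": ["home"],
}

print(breadth_first_order(site, "home"))
# ['home', 'a', 'b', 'a1', 'a2', 'b1']
```

All pages one link away from the start are visited before any page two links away, and the `seen` set keeps the link from "b1" back to "home" from causing a loop.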

Depth-First Crawling

The alternate approach, depth-first indexing, follows all the links from the first link on the starting page, then the first link on the second page, and so on. Once it has indexed the first link on each page, it goes on to the second and subsequent links, and follows them. Some unsophisticated robots use this method, as it can be easier to code.

Figure 4.2. Depth-First Crawling

Depth-First Crawling

In this diagram the depth-first strategy for a simple graph is shown by a numbered sequence of visited nodes.
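For comparison, a depth-first traversal can be sketched in a few lines. As before, this is only an illustration, and the `site` dictionary is a hypothetical link graph, not data from a real site.

```python
def depth_first_order(links, start):
    """Return pages in the order a depth-first robot would visit them."""
    visited = []
    seen = set()

    def visit(page):
        seen.add(page)
        visited.append(page)
        # Follow the first link all the way down before trying the next one.
        for target in links.get(page, []):
            if target not in seen:
                visit(target)

    visit(start)
    return visited

# Hypothetical site: "home" links to "a" and "b", and so on.
site = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": ["home"],
}

print(depth_first_order(site, "home"))
# ['home', 'a', 'a1', 'a2', 'b', 'b1']
```

The recursion makes this version short, which matches the remark that the method can be easier to code, but a very deep site could exceed the interpreter's recursion limit.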

Other Indexing Issues

Spidering Depth

Another issue with robots is how deep they will go into a site. In the example above of depth-first searching, the starting point is level 0, and the grays indicate an additional three levels of linking. For some sites, the most important information is near the starting point and the pages deep in the site are less relevant to the topic. For other sites, the first few levels contain mainly lists of links, and the detailed content is deeper in. In the second case, make sure the robot will index the detailed pages, as they are valuable to those who come to search the site. Some Web-wide robots will only index the top few levels of Web sites, to save space.

Server Load

Search engine robots, like Web browsers, may use multiple connections to read data from the Web server. However, this can overwhelm servers, forcing them to respond to robot requests to the detriment of human site visitors. When monitoring the server or analyzing logs, the Web administrator may notice many requests for pages from the same IP address. Many search engine robots allow administrators to throttle down the robot, so it does not request pages from the server at top speed. This is particularly important when the robot is limited to a single host and the server is busy with user interactions.
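One way to throttle a robot is to enforce a minimum delay between requests to the same host. The sketch below assumes a single-threaded robot; the class name `HostThrottle` and its parameters are hypothetical and do not correspond to any searchbox setting.

```python
import time

class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self.clock = clock      # injectable so the logic can be tested
        self.sleep = sleep
        self.last_request = {}  # host -> time of the previous request

    def wait(self, host):
        """Pause if the host was contacted too recently; return seconds waited."""
        waited = 0.0
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.min_delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self.last_request[host] = self.clock()
        return waited

# Usage: call wait() before every request the robot sends.
throttle = HostThrottle(min_delay_seconds=2.0)
throttle.wait("www.domain.com")  # first request to this host: no delay
```

Because the delay is tracked per host, a robot indexing several servers can keep working on other hosts while one of them is resting.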

Overloaded servers, especially those with large files that do not change very often, or which pay per byte, may prefer robots that use the HTTP HEAD method or a conditional GET (a GET request that carries an If-Modified-Since header). These requests allow a client, in this case the robot, to get meta information about a page, in particular the date it was last modified. This means that the server only sends pages when they have been changed, rather than sending every page when the robot traverses the site -- which can reduce the load on the server considerably.
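A conditional GET can be sketched with Python's standard library. The helper names below are hypothetical, and the example only builds the request and interprets the status code; it does not contact a server.

```python
from email.utils import formatdate
from urllib.request import Request

def build_conditional_request(url, last_indexed_epoch):
    """Build a GET request asking for the page only if it changed since last time."""
    request = Request(url, method="GET")
    # If-Modified-Since expects an HTTP date expressed in GMT.
    request.add_header("If-Modified-Since",
                       formatdate(last_indexed_epoch, usegmt=True))
    return request

def needs_reindex(status_code):
    """A 304 Not Modified response means the indexed copy is still current."""
    return status_code != 304

request = build_conditional_request("http://www.domain.com/page.html", 0)
print(request.header_items())
```

A server that supports the header answers 304 with no body when the page is unchanged, so only the headers cross the network.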

Robots and private data

Password-Protected Pages

If a site has a private, password-protected section, the robot may not be able to index the data stored there. In most cases, that is the right behavior, as a search on private data could reveal damaging or embarrassing information. In other situations, you may want to allow the robot access to the private section. Some local search engine robots, such as those used for sites or intranets, can store user names and passwords, and index those private sections.

Encrypted Data

The SSL (Secure Sockets Layer) HTTPS protocol ensures that private data is encrypted and authenticated during transit through the Web. In addition to uploading credit card numbers, Web servers can use this system to encrypt banking information, patient records, student grades, and other private data that should be kept confidential. Browsers contain special code to check the authenticity of a page and decrypt the text.

To index and search the private data, a search indexing robot must either include the SSL decryption code, access the pages without encryption (which can cause security problems), use a backend database rather than a search engine, or index using the file system rather than a robot.

Note that the data is not encrypted in the search engine index, so search administrators should treat the index as sensitive and confidential, and implement stringent security measures.

Communicating With Robots

You'd think it would be hard to communicate with robots -- they are computer programs written by someone else, after all. But the designers of the robots have set up some ingenious ways for Webmasters to identify robots, track them and tell them where they are welcome or unwanted.

User Agents in Web Server Logs

Search engine robots will identify themselves when they request a page from a server. The Web HTTP communication protocol includes some information called a header, which contains data about the capabilities of the client (a Web browser or robot program), what kinds of information the request is for, and so on. One of these header fields is User-Agent, which stores the name of the client (the program requesting the information), whether it's a browser or a robot. If the server monitor or log system stores this information, the Web server administrator can see what the robot has done.

For example, if the searchbox robot is crawling a web site, the text Mozilla/4.0 (compatible; focuseekbot) will be in the log. If a Web administrator has a problem with this robot, they can look up the information and contact the search administrator.

The best search engine indexers allow the search administrator to set the text of this data, so it includes contact information, such as an email address or a URL to a Web page with an explanation of the robot and a feedback form. The default searchbox User Agent is "focuseekbot," but search administrators can add their contact Web site and email address by editing the configuration file. Customizing this information improves communication with the administrators of the sites indexed, allowing them to contact the owner of the spider if they have any questions.
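A Web administrator could count robot visits with a short script like the following. It assumes the server writes the common "combined" log format, in which the User-Agent is the last quoted field on each line; the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Assumption: combined log format, User-Agent is the last quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')

def count_user_agents(log_lines):
    """Count requests per User-Agent string."""
    counts = Counter()
    for line in log_lines:
        match = LAST_QUOTED_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Invented sample lines, for illustration only.
sample_log = [
    '192.0.2.10 - - [10/Oct/2005:13:55:36 +0000] "GET /index.html HTTP/1.0" '
    '200 2326 "-" "Mozilla/4.0 (compatible; focuseekbot)"',
    '192.0.2.10 - - [10/Oct/2005:13:55:38 +0000] "GET /about.html HTTP/1.0" '
    '200 1204 "-" "Mozilla/4.0 (compatible; focuseekbot)"',
    '198.51.100.7 - - [10/Oct/2005:13:56:01 +0000] "GET /index.html HTTP/1.0" '
    '200 2326 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

for agent, hits in count_user_agents(sample_log).most_common():
    print(hits, agent)
```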

Robots.txt

Robots should also check a special file in the root of each server called robots.txt, which is, as you may guess, a plain text file (not HTML). Robots.txt implements the Robots Exclusion Protocol, which allows the Web site administrator to define what parts of the site are off-limits to specific robot user agent names. Web administrators can disallow access to cgi, private and temporary directories, for example, because they do not want pages in those areas indexed.

searchbox fully supports robots.txt according to the robots.txt standard. Administrators can customize the user agent to meet their needs.

There is only one robots.txt file per Web server, and it must be located in the root directory, for example http://www.domain.com/robots.txt. The syntax of this file is obscure to most of us: it tells robots not to look at pages that have certain paths in their URLs (that's why it's called an "exclusion" protocol). Each section includes the name of the user agent (robot) and the paths it may not follow. There is no way to allow a specific directory, or to specify a kind of file. A robot may reasonably access any directory path in a URL that is not explicitly disallowed in this file.

This protocol is a form of communication between Web site administrators and robots; it does nothing to prevent a robot from accessing a file. There are other ways in which certain domains can be excluded from sites, if necessary. Your search indexing robot should honor these settings in most cases; contact the Webmaster for changes. The only exception is if an unresponsive Webmaster excludes all robots from the entire domain, and the person who is responsible for part of that domain would like to be indexed.
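To make the file format concrete, the sketch below feeds a small, made-up robots.txt to Python's standard `urllib.robotparser` module and asks which URLs each robot may fetch. The directory names and the second robot name ("otherbot") are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: one section for focuseekbot, one for everyone else.
ROBOTS_TXT = """\
User-agent: focuseekbot
Disallow: /cgi-bin/
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, url):
    """Return True if the robots.txt above lets this agent fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("focuseekbot", "http://www.domain.com/private/report.html"))  # False
print(allowed("focuseekbot", "http://www.domain.com/index.html"))           # True
print(allowed("otherbot", "http://www.domain.com/tmp/scratch.html"))        # False
```

A robot that finds a section with its own name follows only that section, so in this example focuseekbot is not bound by the `/tmp/` rule that applies to all other robots.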

The indexing process

Indexing is the process of extracting useful data from the source, in this case a Web page, and storing it in a file for later retrieval by the search engine. Indexing starts by extracting individual words from the text of a page. Simple search indexers just read every word that's not in a tag, while others look for words by using specific algorithms. searchbox uses a totally different approach to indexing called Rendering. This approach also extracts some layout information from the page and, depending on the analysis modules installed, provides semantic indexing as well.

Details on Following Links and Crawling Sites

For many sites, following links is easy and the robot spider logic is quite simple. So what is a link? A link is a legitimate URL, usually in the form <A HREF="page.html"> (for a file in the same directory on the same server) or <A HREF="http://www.domain.com/page.html"> (for files on other servers), and so on. Seems fairly straightforward, so even a dumb robot can handle these links.
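A simple robot's link handling can be sketched with the standard library: collect every HREF from anchor tags and resolve relative links against the URL of the page being read. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the target of every <A HREF="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into absolute URLs.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor.links

page = ('<A HREF="page.html">Local page</A> '
        '<A HREF="http://www.domain.com/other.html">Other page</A>')
print(extract_links(page, "http://www.domain.com/docs/index.html"))
# ['http://www.domain.com/docs/page.html', 'http://www.domain.com/other.html']
```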

However, some sites are hard to crawl. JavaScript, frames, image maps and other features added to HTML will confuse robots which can only follow links in text areas of the page. When you encounter a site which your robot can't crawl, investigation will reveal which of the many techniques the Web page creator has used to implement linking. This section will help you identify the code and offer suggestions for improvements.

Note

Visually-impaired people, and the browsers designed to help them surf the Web, have many of the same problems locating links, and need to follow links from pages rather than refer to a site map page. When sites are designed with accessibility in mind, robot spiders benefit as well. For more information, see the W3C Web Content Accessibility Guidelines.

The Site Map Solution and its Limitations

One solution for these problems is to create and maintain a site map or page list, with links to all pages on your site. Then you could just use that as the starting page, and have the robot follow all links from there. There are three main problems with this plan:

  • The larger the site, the more complex and confusing the site map page will become. This is somewhat of a problem with robot starting pages, but a more serious problem for humans trying to understand the site. Most useful maps for site visitors are carefully designed to convey information and organization using text, colors, shapes and/or spatial relationships. They do not link to every page or attempt to be complete, because that would be too confusing.

  • It is extremely difficult to keep track of all changes on a site. In most cases, it will be easier for content creators and site designers to make robot-readable links on their pages, rather than trying to update the main site map page every time they add or change a link.

  • To generate the site map automatically, a software program can gather the pages by following links or traversing folders. If it tries to follow links, it will have the same difficulties as a search robot indexer. If it uses the local file system to traverse folders, the administrator must be constantly careful about not adding files before they are public, removing obsolete files and excluding secure areas. These are difficult tasks and require a great deal of attention.

If possible, request that Web page creators provide alternate navigation on the same page whenever they use one of the special navigation systems described below.

Frames and Framesets

Frames are widely used for Web site navigation, but they are not easy for robots to navigate. In most cases, the first page loaded is the frameset page, which links all the pages together. Many automated and graphical HTML authoring tools do this automatically, so page creators may not even know that this frameset page exists! Robots do know, and they need some help navigating frames.

Some robots will simply not follow links inside the <FRAMESET> tag. Web page creators should always include links within the <NOFRAMES> tag section to provide alternate access. If you are running a robot like that, and there are no helpful links in NOFRAMES, try to locate a site map or text listing of links for this site.

In addition to problems locating framed pages, both Web-wide and local search engines display those pages individually, rather than loading the frameset first. Many of these pages were never meant to be seen separately, so they don't have any navigation features. This puts the searcher into a dead end, with no way to find the home page of the site, much less related pages. Search administrators should encourage all content creators to add links to the home page and, if possible, to the parent page or frameset page.

JavaScript Pages

JavaScript is a programming language that allows page creators to manipulate elements of Web browsers. For example, JavaScript can open a new window, switch images when a mouse passes over a button to produce a "rollover" effect, validate data in forms, and so on. JavaScript can also create HTML text within a browser window, and that can cause many problems for robot spidering and indexing text.

JavaScript can write text in HTML files, using the document.write or document.writeln command. Unless a robot contains a JavaScript engine, it's hard for them to recognize the links within these scripts. A robot can scan the script looking for "A HREF", ".htm", ".html", and "http://", but few do. In addition, many scripts create URLs dynamically, putting elements together on the fly, and cannot be detected programmatically unless a client includes a full JavaScript interpreter. searchbox does not include a JavaScript interpreter, so it cannot recognize or follow these links.
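The scanning heuristic mentioned above can be sketched as follows. This is an illustration of the idea, not something searchbox does; the script text is invented, and the function simply looks for quoted strings that start with "http://" or end in ".htm" or ".html".

```python
import re

# Quoted runs that contain no quotes, whitespace or angle brackets.
QUOTED_TOKEN = re.compile(r"""["']([^"'\s<>]+)["']""")

def guess_links_in_script(script):
    """Return quoted strings in a script that look like URLs or HTML file names."""
    candidates = []
    for text in QUOTED_TOKEN.findall(script):
        lowered = text.lower()
        if lowered.startswith("http://") or lowered.endswith((".htm", ".html")):
            candidates.append(text)
    return candidates

# Invented script: one literal link and one URL assembled at run time.
SCRIPT = """
document.write('<a href="products.html">Products</a>');
var section = "news";
window.location = "http://www.domain.com/" + section + "/";
"""

print(guess_links_in_script(SCRIPT))
# ['products.html', 'http://www.domain.com/']
```

The literal link is found, but the URL assembled from `section` is only partly recovered: the scan sees the base address and never learns that the script actually goes to the "news" directory.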

JavaScript Menus and Lists

Many sites use JavaScript menus and scrolling lists as navigation systems. This allows users to select an item and go to the associated page. Unfortunately, most robots can't easily follow these paths, because the browser's scripting system puts together the "href" and the file path after a user has selected an item. While a robot could just try all items starting with "http://" or ending with "htm" or "html", most do not. In addition, some of these menus are built on the fly by scripts that use customer IDs and other special codes, which are not available to robots.

Again, the solution is to request that the Web page creators use the <NOSCRIPT> tag, duplicate their popup menus and scrolling lists as bullet lists, and thus allow the robots to follow the links.

Redirecting Links

Redirect links provide a path from an old or inappropriate URL to a working URL. Redirects are a great thing for Web site creators, because they do not break bookmarks, old search engine indexes, or other links. When a request comes in for an old page, the redirect code tells the browser to get the new URL instead. However, they can cause problems for robots following these links.

Many Web servers recognize a standard redirect format, although they may require a special file type or setting. This includes an HTTP version line and another line with the target URL in the standard format. When the server gets a request for the original page, it locates the redirect file and works with the browser or other HTTP client to send the target page defined in the redirect. Most browsers and robots accept the target in place of the original URL, and most search indexers will store the target as the URL in the index. However, some robots may store the original URL as an empty file, or match the contents of the target page with the original URL. If the redirect file is ever removed, the search engine may lose track of the target file. The searchbox spider correctly accepts the target in place of the original URL, storing the target as the document in the index.
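The idea of indexing a page under its redirect target can be sketched as a lookup that follows redirects until it reaches a URL that does not redirect. The `redirects` mapping and the hop limit are hypothetical; a real robot would learn each target from the server's response.

```python
def resolve_redirects(url, redirects, max_hops=5):
    """Follow redirects and return the final URL to store in the index."""
    seen = {url}
    hops = 0
    while url in redirects:
        if hops == max_hops:
            raise ValueError("too many redirects")
        url = redirects[url]
        if url in seen:
            raise ValueError("redirect loop")
        seen.add(url)
        hops += 1
    return url

# Hypothetical redirect table: old URL -> new URL.
redirects = {
    "http://www.domain.com/old.html": "http://www.domain.com/moved.html",
    "http://www.domain.com/moved.html": "http://www.domain.com/new.html",
}

print(resolve_redirects("http://www.domain.com/old.html", redirects))
# http://www.domain.com/new.html
```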

Other Web Interfaces

Other Web interfaces, such as Java, and plug-ins such as Macromedia Flash, do not generate HTML Web pages. They are not part of the browser, although the page creator might not be aware of that. If you encounter one of these pages, request that the page creator provide an alternate HTML version. Some applications, such as Macromedia's AfterShock, will do this automatically (if the page creator uses the options to list all text and URLs).

The Web used to have a simple interface: just HTML and graphics, as rendered by a browser. These formats are too simple for many of the interactive and graphical features that programmers have wanted, so they have invented other ways of interacting with end-users, within the context of the Web browser.

Graphic Text

Some designers like to control the look of their text on the page and generate GIFs or JPEG files with text. Although a human can read these words, a robot cannot read or index them.

Plug-ins

Web designers use Plug-Ins to add features that are not available in the browsers. The most important of these are the Shockwave and Flash animation programs from Macromedia. These display within the browser window, so end users and even designers may not be aware that they are not HTML and are therefore invisible to robot spiders.

Acrobat

Adobe Acrobat displays documents in a printable format with all layout and design intact, either as a standalone application or as a browser plug-in. Most of these documents were generated from word processing or layout programs, and they contain both the visual interface and the document text, while others were created by scanning, so they don't have any machine-readable text. Some search engine robots can follow links to PDF files, reading and indexing any available text, while others cannot.

Java

Java applets are supported by the major browsers, and can display within the browser windows, but again, they are invisible to robots. In general, these applets should be used only for optional interactive features, such as mortgage rate calculation. Unfortunately, some sites use Java to generate site maps, which are unusable by robot indexers. The case is different for Java applications on the server: those written to the Java Servlet API, for example, can generate straightforward HTML pages, which are perfectly compatible with robots.

ActiveX

Microsoft Internet Explorer browsers have an additional interface, connecting the browser to the Windows operating system. ActiveX interacts directly with the end user and sends back information to the server. None of this data is accessible to any Web robot. Any site using this format should also have a simple HTML version for cross-platform compatibility, so the robot should be able to read and follow links in that version.

Non-Text File Formats

Some Web sites serve files in formats other than HTML and text, with file name extensions such as ".doc" (Microsoft Word), ".xls" (Microsoft Excel), and so on. When a site visitor clicks on a link to one of these pages, the browser downloads the file to the local hard drive, then works with the operating system to launch an application to read that file. Some search indexing robots have compatibility modules that can translate those formats to text, so they can index the files. In that case, the robot will follow links with the operative file name extension, receive the file information, convert the text, and store the data in the index. searchbox has a modular parsing architecture that includes, by default, parsers for the most common document formats.

Dealing with Dynamic Data

Dynamic data is text and HTML generated by a server application, such as a database or content-management system. Sometimes, this data is inserted into a normal HTML page, and no browser, robot or other client will ever know where it came from. In other cases, the page is created in response to a user entry in a form, and may only be applicable to a specific person at a specific time, rather than a static page. For example, a sales person might look up their commission for the quarter-to-date, or a reporter might check the votes counted for a candidate. Neither of these pages is appropriate for search indexing, but dynamically-generated catalog entries and informational pages are.

Cookies

Cookies are stored user identification in a format set up by browsers. They are generally required for shopping baskets and other interactive processes. Active Server Pages have cookies built in, but many never use them. Some search engines can recognize stored cookie information; searchbox does too.

Dynamic Web Applications and Black Holes

Another form of dynamic data is generated in real time by systems such as Web Calendars. These can automatically create new year pages and new day pages almost endlessly -- they do not expect people to follow the "Next year" link or to click on every day's link. Robots will continue to request these URLs, even five or ten years into the future, although there is no information to index. They can't tell that they are simply having a mechanical conversation with an automated system.

Some search engine robots can limit the number of hops from the Root URL. The limit reduces the likelihood that the robot will get into some other kind of endless loop, such as a long-term calendar or other dynamic data source, that may increment a date rather than add a directory layer. So the robot might follow links in the calendar for 100 months but no more.

In addition, a limit to the number of directories (delimited by slashes) avoids the situation where a file can accidentally link to itself, such as an alias, shortcut or symbolic link loop. For example, it might look like this:

www.domain.com/test/test/test/test/test/test/

The right solution to this problem is to always set the crawler's maximum depth parameter.
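Both limits described in this section, the number of hops from the root URL and the number of directory levels in the path, can be checked before a URL is queued. The function name, parameter names and default values below are hypothetical and are not searchbox configuration keys.

```python
from urllib.parse import urlparse

def within_limits(url, hops_from_root, max_hops=10, max_path_depth=5):
    """Return True if the URL is shallow enough to be worth crawling."""
    if hops_from_root > max_hops:
        return False
    # Count the directory levels, ignoring empty pieces around slashes.
    segments = [part for part in urlparse(url).path.split("/") if part]
    return len(segments) <= max_path_depth

print(within_limits("http://www.domain.com/products/list.html", hops_from_root=2))
# True
print(within_limits("http://www.domain.com/test/test/test/test/test/test/",
                    hops_from_root=2))
# False
```

The hop limit stops the calendar case, where every "Next year" page is one more link away, while the path limit stops the symbolic link loop, where the URL keeps growing.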

Detecting Duplicate Pages

A search indexing robot may reach the same page by following links from several other pages, but it should only index it once. It must keep a list of the URLs it has encountered before, but in some cases the URL is different but the pages are the same. Search indexers are usually programmed to recognize when two pages are the same using fuzzy algorithms based on the actual content rather than the URL.
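One common family of content-based techniques compares overlapping word sequences ("shingles") between two pages. The sketch below is an example of that general idea, not the algorithm used by searchbox or any particular engine; the shingle size and threshold are arbitrary assumptions.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(text_a, text_b):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b)

def is_duplicate(text_a, text_b, threshold=0.9):
    return similarity(text_a, text_b) >= threshold

page_a = "Search engine robots follow links on Web sites"
page_b = "search  engine robots\nfollow links on web sites"  # same words, new layout
page_c = "Cookies are stored user identification in a format set up by browsers"

print(is_duplicate(page_a, page_b))  # True
print(is_duplicate(page_a, page_c))  # False
```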

Updating Indexes

As sites change, a search engine robot must return periodically to update the index. Searchers justifiably hate to work with an index that does not match the current state of the data exactly, so the index updating system should be synchronized to the site content updates.

Update Schedules

The simplest updating scheme allows a search administrator or Webmaster to have the robot check for new data when they publish new content on the site. This simply tells the search indexing robot to start following links and reading pages again.

Locating changed pages

If the Web server reports the page modification date properly, some robots will only retrieve changed pages, and receive just a notification when a page is unchanged. However, some servers do not report the modification date correctly. Other robots just retrieve every page and re-index it, or compare it to the contents of the index and update if the page has changed.

Unfortunately, the dates returned by Web servers are almost always wrong, so the default approach used by searchbox is to re-crawl pages every time and re-index them only if they have changed.
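A re-crawl that re-indexes only changed pages can be sketched by storing a fingerprint of each page's content and comparing it on the next visit. This illustrates the general approach and is an assumption about how such a check might be written, not a description of searchbox internals.

```python
import hashlib

def content_fingerprint(body: bytes) -> str:
    """Return a digest that changes whenever the page content changes."""
    return hashlib.sha256(body).hexdigest()

def pages_to_reindex(fetched, stored_fingerprints):
    """Return URLs whose content differs from the last crawl, updating the store."""
    changed = []
    for url, body in fetched.items():
        fingerprint = content_fingerprint(body)
        if stored_fingerprints.get(url) != fingerprint:
            stored_fingerprints[url] = fingerprint
            changed.append(url)
    return changed

store = {}
first_crawl = {"/index.html": b"Welcome", "/about.html": b"About us"}
second_crawl = {"/index.html": b"Welcome", "/about.html": b"About us, updated"}

print(pages_to_reindex(first_crawl, store))   # ['/index.html', '/about.html']
print(pages_to_reindex(second_crawl, store))  # ['/about.html']
```

Every page is still downloaded on each pass, so this saves indexing work rather than network traffic.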