Chapter 8. Gathering

Table of Contents

Creating a new Source
Adding a new Seed to a Source
Web site
FTP site
Gopher site
Usenet site
Filesystem
Mailbox
WebDav share
SMB share
ODBC
Other
Configuring the Gathering Depth Limit
robots.txt checking activation
Gathering "side metadata"
Collecting metadata embedded in HTML documents
Configuring the authentication method
Basic authentication
Cookie authentication
SSL Certificate authentication
Excluding portions of HTML from gathering
Configuration of a Fetching Plugin
Activation of a Fetching Plugin for a Source
Creating a custom gathering filter
Creating a new Archive
Checking status of current gathering activity
Gathering control
Manual reprocessing of documents
Resetting the content of an Archive
Accessing to gathering logs
Exporting gathering logs
Configuring Gathering limits
Scheduling automatic Gathering
Configuring the Garbage Collector
Making a query on Archive
Showing a document from Archive
Text FFF view
Metadata FFF view
Raw FFF view
Manual add/remove documents to/from an Archive
Getting the ID of a document

This chapter describes in detail all use cases regarding searchbox gathering. The involved basic concepts are:

Creating a new Source

Select the Sources Tab of the left pane

Figure 8.1. The Sources Tab

The Sources Tab

and press the New button at the bottom of the same pane. A Source labelled as (new) is shown.

Figure 8.2. A new unconfigured Source

A new unconfigured Source

Assign a Name and a Description to the newly created Source and press the Apply button to confirm provided information.

Figure 8.3. Setting Name and Description of a new Source

Setting Name and Description of a new Source

Note

The ID is a progressive number that is assigned to every object created in searchbox. Even if you remove an object (the Source in this case), its ID will not be used again. So don't worry if the ID is greater than the number of objects in your searchbox configuration.

Now that the Source is created you need to add at least one Seed to it.

Adding a new Seed to a Source

Just after the creation of a Source the list of Seeds it is composed of is empty.

Figure 8.4. The Seed box

The Seed box

Clicking the Add... button shows the following window.

Figure 8.5. The Seed window

The Seed window

The left side of the window lists the types of Seed that can be added to the current Source.

Warning

Problems can arise if you mix, in the same Source configuration, different types of Seed that require authentication (the Control Panel makes no consistency check). searchbox lets you specify only one set of credentials for each Source, so you cannot put different authentication methods and/or sets of credentials into the same Source configuration. In such cases a single-seed Source must be used.

Here is a detailed description of the configuration of each available type of Seed.

Web site

For a Seed related to a Web site accessible through the HTTP protocol, the only information searchbox needs is the URL of the page where the gathering process must begin and whether the site needs secure access.

Figure 8.6. Web site Seed

Web site Seed

Checking the Secure checkbox changes the URL prefix from http:// to https://.

FTP site

For an FTP Seed the only parameter needed is its URL.

Figure 8.7. FTP site Seed

FTP site Seed

Gopher site

For Gopher sites just the URL is needed.

Figure 8.8. Gopher site Seed

Gopher site Seed

For more information about the Gopher protocol you can see: http://en.wikipedia.org/wiki/Gopher_protocol

Usenet site

In this case both server name and the name of the newsgroup we want to gather from are needed.

Figure 8.9. Usenet site Seed

Usenet site Seed

Filesystem

Documents must reside in the local filesystem or in a filesystem remotely mounted on the server where searchbox is running. The complete path must be specified.

Note

Remotely mounted folders are guaranteed to work only on Linux, OS X and Windows 2000. On Windows XP the SMB protocol must be used.

Figure 8.10. Filesystem Seed

Filesystem Seed

Mailbox

You must specify the complete address of the incoming mail server and its type. POP3/S and IMAP/S are the SSL versions of the POP3 and IMAP servers.

Figure 8.11. Mailbox Seed

Mailbox Seed

For all types of servers the messages are not removed from the original location and are not marked as read.

WebDav share

The address of the remote server, including the full path, is required. Check the Secure checkbox if the server requires secure access.

Figure 8.12. WebDav share Seed

WebDav share Seed

SMB share

You must specify the server that publishes a folder using the SMB protocol and the full path where the gathering agent must start its job.

Figure 8.13. SMB share Seed

SMB share Seed

ODBC

An ODBC seed is made of three parts:

  • The ODBC connection string. It is usually a System DSN[10]. Note that your database might impose access restrictions that prevent searchbox from accessing it even if you can access the database itself from your desktop. For example, the database might be configured to refuse queries coming from the searchbox server computer.

  • The query. This is basically any SQL query, with the additional requirement that exactly one of the special strings --!!!PKW!!!-- or --!!!PKA!!!-- must be present. These special strings are described below.

  • The keys. This is a subset of the names of the columns of the result set returned by the query, whose values uniquely identify a row in the result set itself. For example, if the query involves a single table you can use the table primary key. You must specify the order of the fields[11] and whether each field contains a string or a numeric value.

Figure 8.14. ODBC Seed

ODBC Seed

As stated above, the query must contain exactly one of --!!!PKW!!!-- or --!!!PKA!!!--. searchbox will expand the placeholder with values taken from the key fields to access specific rows in the database. The two strings are expanded in nearly the same way, but --!!!PKW!!!-- begins with an SQL where keyword, while --!!!PKA!!!-- begins with and. Thus you must use the latter when you have placed a where clause in the query (typically you append --!!!PKA!!!-- after your clause), while the former is useful when you don't have any explicit where clause in your query. For example:

select * from tableA --!!!PKW!!!--

but

select * from tableB, tableC
  where tableB.id = tableC.id
  --!!!PKA!!!--

If you specify neither the --!!!PKW!!!-- nor the --!!!PKA!!!-- placeholder, the Control Panel will complain.
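The placeholder mechanism can be illustrated with a minimal Python sketch. It is not the searchbox implementation: the function name expand_placeholder, the shape of the key list and the exact text of the generated condition are assumptions made only to show why --!!!PKW!!!-- suits queries without a where clause and --!!!PKA!!!-- suits queries that already have one.

```python
PKW = "--!!!PKW!!!--"
PKA = "--!!!PKA!!!--"


def expand_placeholder(query, key_values):
    """Expand the single PKW/PKA placeholder of `query` (illustrative only).

    `key_values` is an ordered list of (column, value, is_string) tuples,
    mirroring the "keys" part of an ODBC seed. The generated SQL text is an
    assumption, not the text searchbox actually produces.
    """
    pkw_count = query.count(PKW)
    pka_count = query.count(PKA)
    if pkw_count + pka_count != 1:
        raise ValueError("query must contain exactly one of PKW or PKA")

    conditions = []
    for column, value, is_string in key_values:
        if is_string:
            # Quote string keys, doubling embedded single quotes.
            literal = "'" + str(value).replace("'", "''") + "'"
        else:
            literal = str(value)
        conditions.append(f"{column} = {literal}")
    clause = " and ".join(conditions)

    if pkw_count:
        return query.replace(PKW, "where " + clause)
    return query.replace(PKA, "and " + clause)


print(expand_placeholder("select * from tableA --!!!PKW!!!--",
                         [("id", 7, False)]))
```

Building SQL by string concatenation is done here only to make the expansion visible; code that talks to a real database should prefer parameterized queries.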

Figure 8.15. ODBC Seed: no placeholder warning

ODBC Seed: no placeholder warning

In order to successfully crawl an ODBC source, adding a seed is not enough: you also have to specify a set of rules to let searchbox turn the ODBC result set into documents. These rules are specified in an ODBC plugin configuration. When you add an ODBC seed to a source you must then enable the suitable plugin configuration on the source.

Figure 8.16. ODBC Seed: enabling the odbc plugin

ODBC Seed: enabling the odbc plugin

If you don't, searchbox complains.

Figure 8.17. ODBC Seed: no plugin warning

ODBC Seed: no plugin warning

Other

This type of seed is reserved for all protocols that are available neither natively in searchbox nor in one of its bundled plugins.

Figure 8.18. "other" Seed

"other" Seed

Please see the documentation of the fetching plugin you have to use to properly configure this seed.

All Seeds specified will be listed in the URL box of the Info section of the Source configuration tab.

Figure 8.19. The list of active Seeds

The list of active Seeds

Pressing the Apply... button or any other widget of the Control Panel interface makes the following warning message pop up:

Figure 8.20. No configured filters warning

No configured filters warning

For a Source to gather contents from the specified Seeds, at least one inclusion filter must be specified for each of them. If you do not specify any, the Control Panel will automatically add the default one. The default inclusion filter includes all subsections starting from the specified Seed entry.

Experienced users who need to perform selective gathering will have to edit the filters manually.

Figure 8.21. Source configured with default inclusion filters

Source configured with default inclusion filters

Configuring the Gathering Depth Limit

Every document gathered starting from a specific seed is characterized by a discrete distance from that starting point. This distance, the Depth Level, is the number of links that the gatherer (the spider, for the web) had to follow to reach the document. The Depth Limit is the maximum Depth Level that the gatherer is allowed to reach.

Figure 8.22. Depth Limit

Depth Limit

The above picture shows a graph with nodes labelled with their Depth Level calculated from the Seed (labelled as 0).

Even if the Depth Limit concept seems correct only for tree data structures (because they have a "root" node) it is used for graphs too as in the case of the Web.
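The labelling of nodes with their Depth Level, and the effect of a Depth Limit on a graph with cycles, can be sketched as a breadth-first visit. This is only an illustration under stated assumptions: the function name gather_depths and the dictionary used to represent links are hypothetical and do not reflect how the searchbox gatherer is implemented.

```python
from collections import deque


def gather_depths(seed, links, depth_limit=None):
    """Return {node: depth level} for every node reached from `seed`.

    `links` maps a node to the nodes it links to. With `depth_limit=None`
    there is no limit; otherwise links leaving a node that already sits at
    the limit are not followed.
    """
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth_limit is not None and depth[node] >= depth_limit:
            continue  # do not expand nodes at the limit
        for target in links.get(node, []):
            if target not in depth:  # already-seen nodes keep their depth
                depth[target] = depth[node] + 1
                queue.append(target)
    return depth


# A small graph with a cycle back to the seed.
links = {"seed": ["a", "b"], "a": ["c"], "c": ["seed", "d"]}
print(gather_depths("seed", links, depth_limit=2))
```

Because each node keeps the depth at which it was first reached, the cycle from c back to the seed does not change any label, which is why the concept also works for graphs that are not trees.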

The following table describes how the Depth Limit concept is used by each type of Seed available in searchbox:

Seed type | How the Depth Limit is applied
Web site | Number of hyperlinks followed by the gatherer, calculated from the Seed. Hyperlinks are followed in accordance with the inclusion/exclusion filters configured for that Seed.
FTP site | Number of directory levels, calculated from the path specified in the URL. All documents contained in each directory will be gathered in accordance with the inclusion/exclusion filters configured for that Seed.
Gopher site | Same as FTP
Usenet news | Message index at depth = 0. Message content at depth = 1 (all messages are at the same level; threads are not considered)
Filesystem | Same as FTP
Mailbox (POP) | Message index at depth = 0. Message content at depth = 1 (the POP protocol has no folders, so only the Inbox will be considered)
Mailbox (IMAP) | Message index at depth = 0. Message content at depth = 1 (at this moment IMAP gathering does not support folders, so only messages in the Inbox will be gathered)
WebDav share | Same as FTP
SMB share | Same as FTP
ODBC database | Content at depth = 1.

Note

In all cases, if a gathered document contains hyperlinks they will be followed in accordance with the inclusion/exclusion filters configured for that Seed and the Depth Limit parameter of the Source.

The Depth Limit parameter can be specified in the Option Tab of the Source configuration section.

Figure 8.23. Depth Limit parameter

Depth Limit parameter

By default this option is unchecked, which means "no Depth Limit".

Note

In order to avoid gathering a huge amount of useless documents especially from Web sources you should choose this parameter carefully.

robots.txt checking activation

This option is checked by default and forces searchbox to respect the robots.txt protocol.

Figure 8.24. robots.txt parameter

robots.txt parameter

If you want to override the default for this option, please be sure that the owner of the Source is informed. Ignoring the robots.txt directives can be sufficient reason to be banned from accessing a Web source.
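To get a feeling for what respecting robots.txt means in practice, the following sketch uses the robots.txt parser of the Python standard library to decide whether a URL may be fetched. It only illustrates the protocol: the helper name allowed_by_robots, the user agent string and the example rules are assumptions, and nothing here describes the internal searchbox check.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines, user_agent, url):
    """Return True if `url` may be fetched by `user_agent`.

    `robots_lines` is the content of a robots.txt file, one line per item.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt published by the remote site.
rules = ["User-agent: *", "Disallow: /private/"]

print(allowed_by_robots(rules, "example-gatherer",
                        "http://www.example.com/index.html"))
print(allowed_by_robots(rules, "example-gatherer",
                        "http://www.example.com/private/report.html"))
```

A URL rejected by this kind of check corresponds to the Blocked status reported in the gathering logs.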

Gathering "side metadata"

Figure 8.25. Has side-by-side metadata option

Has side-by-side metadata option

In some cases it is very useful to gather explicit metadata about a document directly from the Source, even if it cannot be embedded into the document itself. For these cases searchbox can manage a proprietary data format that can be adopted when we can control how the source publishes such information.

The basic idea is to associate with each document of a Source a set of metadata contained in another, XML-formatted document (the resource document) stored in the same location as the original one and with the same name but with a special extension.

Given a document named mydocument.pdf, the associated resource document will be mydocument.pdf.sbm.

If this option is checked, for each file of the source, the gathering agent will check if the corresponding resource file is present and will merge the metadata contained in it with the original document.

The resource document has the following structure:

<?xml version="1.0"?>
<metainfo>
<meta type="documentwide" sliceid="slicename" key="name1">value1</meta>
<meta type="documentwide" sliceid="slicename" key="name2">value2</meta>
...
</metainfo>

where:

  • type can only be “documentwide” at this moment. Other values will be possible in future releases of searchbox

  • sliceid is the name of the index slice where we want to store the metadata

  • key is the metadata name (e.g. “year”)

  • valueX is the metadata value (e.g. 2003)

Moreover the following optional attributes can be used in a meta tag:

Tokenized and normalized attributes

  • tokenized Specifies whether the metadata value should be split into words as normal text is. Allowed values are 0 (don't split, the default) or 1 (split).

  • normalized Specifies whether the metadata value should be case and utf8 normalized as normal text is. Allowed values are 0 (don't touch it, the default) or 1 (normalize it).
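If you control how the source publishes its documents, resource documents can be generated by a small script. The following Python sketch builds one with the structure described above. The helper names, the slice name centralNorm and the sample metadata are assumptions chosen for illustration; check the slice names available in your own installation.

```python
import xml.etree.ElementTree as ET


def side_metadata_path(document_path):
    """Name of the resource document that sits next to `document_path`."""
    return document_path + ".sbm"


def build_side_metadata(sliceid, metadata, tokenized=False, normalized=False):
    """Return the XML text of a resource document.

    `metadata` maps metadata names to values; every entry becomes one
    <meta type="documentwide"> element stored in the slice `sliceid`.
    """
    root = ET.Element("metainfo")
    for key, value in metadata.items():
        meta = ET.SubElement(root, "meta", {
            "type": "documentwide",
            "sliceid": sliceid,
            "key": key,
            "tokenized": "1" if tokenized else "0",
            "normalized": "1" if normalized else "0",
        })
        meta.text = str(value)
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


xml_text = build_side_metadata("centralNorm", {"year": 2003, "author": "Rossi"})
print(side_metadata_path("mydocument.pdf"))
print(xml_text)
```

Writing xml_text to the path returned by side_metadata_path places the resource document where the gathering agent looks for it when the option is checked.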

Collecting metadata embedded in HTML documents

searchbox lets you put special tags in your HTML documents that it can use to extract metadata for your documents. This feature is disabled by default. To enable it you must set the USE_SEARCHBOX_METAS parameter in your searchbox configuration file as detailed in the section called “Handling of <meta name="searchbox-xxxx"> tags in HTML documents”.

You specify metadata by placing suitably formatted <META> tags in your HTML document <HEAD> section.

The full format is:

<META name="searchbox-NT-SliceName-MetaKey" content="MetaValue" />

where:

NT: Specifies optional processing for the metadata value. Can assume the following values: NT itself (the value will be normalized and tokenized), N (the value will be normalized), T (the value will be tokenized) or empty (the value will be used as is). See Tokenized and normalized attributes for details on the meaning of these operations.
SliceName: The name of the slice the metadata will be put in. See Table 8.1, “Sliceids” for possible values.
MetaKey: The metadata key.
MetaValue: The metadata value.

For example the following HTML excerpt:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="searchbox-NT-centralNorm-metakey0" content="metavalue0" />
    <meta name="searchbox--centralNorm-metakey1" content="metavalue1" />
    <!-- more stuff follows ... -->

will add two metadata: a normalized, tokenized metadata with key metakey0 and value metavalue0 and another, literal one, with key metakey1 and value metavalue1. Both metadata will be placed in the centralNorm slice.
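The naming convention of these tags can be checked with a short Python sketch that splits the name attribute into its parts. It is an illustration only: the function parse_meta_name and the class SearchboxMetaCollector are hypothetical, and details such as case sensitivity of the real searchbox parser are not covered.

```python
from html.parser import HTMLParser

PREFIX = "searchbox-"


def parse_meta_name(name):
    """Split "searchbox-NT-SliceName-MetaKey" into its components.

    Returns None when the name does not follow the convention.
    """
    if not name.startswith(PREFIX):
        return None
    parts = name[len(PREFIX):].split("-", 2)
    if len(parts) != 3:
        return None
    flags, slice_name, key = parts
    if flags not in ("", "N", "T", "NT"):
        return None
    return {
        "normalized": "N" in flags,
        "tokenized": "T" in flags,
        "slice": slice_name,
        "key": key,
    }


class SearchboxMetaCollector(HTMLParser):
    """Collect every searchbox-style <meta> tag found in a document."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        parsed = parse_meta_name(attributes.get("name") or "")
        if parsed is not None:
            parsed["value"] = attributes.get("content")
            self.found.append(parsed)


collector = SearchboxMetaCollector()
collector.feed('<head>'
               '<meta name="searchbox-NT-centralNorm-metakey0" content="metavalue0" />'
               '<meta name="searchbox--centralNorm-metakey1" content="metavalue1" />'
               '</head>')
for item in collector.found:
    print(item)
```

Note how the empty processing field of the second tag shows up as two consecutive hyphens in the name.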

Configuring the authentication method

Some web sources grant access only to authenticated users. This happens both for sources that require a username and password and for sources that use cookies. The searchbox gathering agent is able to simulate the way a real user explores a web site using a standard web browser, so such Sources can usually be gathered easily once configured.

This option lets you choose between three different types of authentication methods for a Source.

Figure 8.26. Authentication methods

Authentication methods

Basic authentication

Mainly used for Intranet site crawling. It does not need any further configuration apart from the username/password information provided in the configuration of the Archive associated with the Source.

Cookie authentication

In this mode searchbox can acquire authentication cookies from a server by simulating what a user would do with his/her browser (filling in a form and clicking the "submit" button). All that searchbox needs in order to log into a web site is to perform a POST action on the right form, so a configuration with all variable names and their values is needed.

Let's imagine that the authentication is performed through the page http://somesite.com/login.php: the login name would be a variable called username and the password a variable called password. In this case searchbox will POST these variables to another URL (e.g. http://somesite.com/authcheck.php) and will store all the cookies that are returned. A problem that can occur is that the page http://somesite.com/authcheck.php is a redirect to another page. In this case the POST action must refer to the destination page.

Some sites generate cookies in the login page and in other pages between the home page and the login page itself. In this case searchbox must be configured to manage this situation. Such pages must be specified in the Cookie Pre URLs fields.

Clicking on the Details... button the following window appears:

Figure 8.27. Cookie authentication details

Cookie authentication details

  • The URL field must be filled with the URL of the page where a normal user authenticates.

  • The Action parameter specifies whether data are passed with a GET or POST command and the maximum age the cookie can have before a new login is required.

  • In the Username and Password fields the names of the parameters for username and password must be specified (e.g. USERID and PASSWORD). The values of these parameters have to be specified in the configuration of the Archive associated with the Source.

  • In the Other parameters section more parameter/value pairs can be specified if needed by the above Action.

  • In the Cookie Pre URLs section you must list the pages between the home page and the login page that generate cookies.
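The sequence of requests implied by these settings can be sketched with the Python standard library. This is a minimal illustration, not the searchbox gathering agent: the function names are hypothetical, the parameter names USERID and PASSWORD are the examples used above, and the credentials are placeholders.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(url, action, credentials, other_params=None):
    """Build the request that submits the login form.

    `credentials` maps the configured parameter names to their values,
    `other_params` holds any additional parameter/value pairs.
    """
    fields = dict(credentials)
    fields.update(other_params or {})
    encoded = urllib.parse.urlencode(fields)
    if action.upper() == "POST":
        return urllib.request.Request(url, data=encoded.encode("ascii"),
                                      method="POST")
    return urllib.request.Request(url + "?" + encoded, method="GET")


def login(pre_urls, login_request):
    """Visit the Cookie Pre URLs, then submit the form.

    Returns an opener whose cookie jar holds every cookie received, so
    later fetches through it are sent with the authentication cookies.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    for url in pre_urls:
        opener.open(url).close()
    opener.open(login_request).close()
    return opener


request = build_login_request(
    "http://somesite.com/authcheck.php", "POST",
    {"USERID": "alice", "PASSWORD": "secret"})
print(request.get_method(), request.full_url, request.data)
```

Only build_login_request is exercised here; calling login requires a reachable server, for example login(["http://somesite.com/"], request).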

SSL Certificate authentication

Same as Basic authentication, but with a digital certificate, with or without additional username/password information.

Excluding portions of HTML from gathering

This is a specific feature of searchbox for HTML documents. In the Gathering Tab of each Source it is possible to specify two regular expressions, one for the "start ignoring stuff" marker and the other for the "stop ignoring stuff" marker.

Figure 8.28. "Exclude text between" option

"Exclude text between" option

All the text between the specified markers will not be passed to the rendering module.
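The effect of the two markers can be tried out with a few lines of Python. The sketch below is an approximation under stated assumptions: the function name exclude_between is hypothetical, the markers themselves are removed together with the text between them, and a start marker without a matching stop marker is left untouched. The actual searchbox behaviour in these corner cases may differ.

```python
import re


def exclude_between(html, start_pattern, stop_pattern):
    """Remove every region that begins with a match of `start_pattern`
    and ends with the next match of `stop_pattern`."""
    region = re.compile("(?:%s).*?(?:%s)" % (start_pattern, stop_pattern),
                        re.DOTALL)
    return region.sub("", html)


page = "<p>Article text</p><!--skip--><div>menu</div><!--/skip--><p>More text</p>"
print(exclude_between(page, r"<!--skip-->", r"<!--/skip-->"))
```

The non-greedy `.*?` makes each start marker pair with the nearest stop marker, so several excluded regions in the same page are handled independently.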

Configuration of a Fetching Plugin

Select the Plugins Tab of the left panel.

Figure 8.29. The Plugin status panel

The Plugin status panel

Select the Fetching item and click on New button.

There are two different ways to create a new plugin, New plugin or Inherit from existing plugin, as shown in the following window:

Figure 8.30. The new plugin window

The new plugin window

If you choose the New plugin option, select one of the available types. In the following example the odbcplugin is selected.

Figure 8.31. A new odbc type plugin

A new odbc type plugin

The list of available plugin types may vary depending on your searchbox Engine installation.

After clicking on the Ok button the Plugin configuration panel appears.

Figure 8.32. The plugin configuration panel

The plugin configuration panel

In order to fully configure the plugin the following actions are required:

  • Fill the Name: field.

  • Fill the Description: field.

  • Edit all required parameters listed in the Parameters box. The number and type of parameters depend on the type of plugin you choose. Please see the plugin documentation for a detailed description of all parameters.

As soon as the configuration is completed the new plugin is listed in the left panel under its category (under fetching plugins in this case).

Figure 8.33. A freshly configured plugin

A freshly configured plugin

Note

During this configuration process searchbox cannot make any consistency check because DLLs are not aware of their final role in the plugin chain. A plugin can discover its role only when its configuration parameters are set.

Activation of a Fetching Plugin for a Source

A Fetching Plugin is used to add to a specific Source a new fetching protocol not natively managed by searchbox. In order to configure a Source to use such type of plugin it must be previously installed and configured from the Plugins Tab of the left panel of Control Panel (see Configuration of a Fetching Plugin).

Creating a custom gathering filter

Crawl filters define the behaviour of the gathering agents through a list of constraints on the syntax of the URLs that must be included in or excluded from the gathering action. For this purpose two types of filters exist: inclusion and exclusion filters.

In the Crawling Tab of a Source a default inclusion filter can be added by checking a specific checkbox. The purpose of this default filter is to limit the action of the gathering agent to the pages contained in the original domain of the seed page. For instance with the seed:

www.repubblica.it

the automatically generated inclusion filter is:

^http://www\.repubblica\.it/.*$

where:

^    beginning of string
$    end of string
\.   the "." character
.*   an unspecified number of characters

Such filter is able to gather all reachable pages from the home page of Italian newspaper "La Repubblica".
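The way the default filter is derived from the seed can be reproduced with a few lines of Python. This is a sketch under assumptions: the function name default_inclusion_filter is hypothetical, and Python's re module is used only as a convenient stand-in for the regular expression engine of searchbox, which for a simple pattern like this one behaves the same way.

```python
import re


def default_inclusion_filter(seed_host):
    """Build the default inclusion filter for a web seed.

    Every character of the host name that is special in a regular
    expression (the dots, here) is escaped so that it matches literally.
    """
    return "^http://" + re.escape(seed_host) + "/.*$"


pattern = default_inclusion_filter("www.repubblica.it")
print(pattern)
print(bool(re.search(pattern, "http://www.repubblica.it/indici/cronaca/cronaca.htm")))
print(bool(re.search(pattern, "http://www.example.com/")))
```

Escaping the dots matters: without it, the pattern would also accept host names in which any other character takes the place of a dot.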

Other examples:

^http://www\.repubblica\.it/indici/cronaca/cronaca\.htm$

This filter matches exactly one URL, the .htm index page of a subsection of the "La Repubblica" web site, so only that page is gathered.

^http://www\.repubblica\.it/200[0-9]/[a-zA-Z]/sezioni/cronaca/.*$

The crawler gathers all the pages contained in a subsection of the "La Repubblica" web site with a parametric path. In this specific case the site is divided by year, so the regular expression 200[0-9] is used to cover all years between 2000 and 2009. The same applies to the next directory level, which is described by a single alphabetic character.

^http://www\.agronotizie\.com/(/)?sezioni/articolo\.cfm\?(C|c)odice=[0-9]+&codcanale=[0-9]+&codargomento=.*$

This is a more complex case, used to gather dynamic pages. In this example the regular expression (/)? takes into account that either a single or a double slash can be present, while the (C|c)odice expression accounts for the parameter name being written with or without a capital letter.
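Before saving a custom filter it can be convenient to test it against a few sample URLs. The sketch below does this in Python. The function should_gather and the sample URLs are hypothetical, and the rule used to combine the two filter types (a URL is gathered when it matches at least one inclusion filter and no exclusion filter) is an assumption made for the example.

```python
import re


def should_gather(url, inclusion_filters, exclusion_filters=()):
    """Return True when `url` matches at least one inclusion filter
    and none of the exclusion filters."""
    included = any(re.search(p, url) for p in inclusion_filters)
    excluded = any(re.search(p, url) for p in exclusion_filters)
    return included and not excluded


yearly = r"^http://www\.repubblica\.it/200[0-9]/[a-zA-Z]/sezioni/cronaca/.*$"
dynamic = (r"^http://www\.agronotizie\.com/(/)?sezioni/articolo\.cfm\?(C|c)odice="
           r"[0-9]+&codcanale=[0-9]+&codargomento=.*$")

samples = [
    "http://www.repubblica.it/2005/b/sezioni/cronaca/page.html",
    "http://www.repubblica.it/2010/b/sezioni/cronaca/page.html",
    "http://www.agronotizie.com//sezioni/articolo.cfm?Codice=12&codcanale=3&codargomento=7",
]
for url in samples:
    print(url, should_gather(url, [yearly, dynamic]))
```

The second sample is rejected because 2010 falls outside the range covered by 200[0-9].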

A detailed discussion of regular expressions is beyond the scope of this manual. A good reference for this topic can be found in the manual of any programming language or tool with regular expression support, such as Perl, Python, Grep or Lex. Alternatively you can read the Wikipedia regular expressions page: http://en.wikipedia.org/wiki/Regular_expressions. searchbox implements the POSIX modern (extended) regular expressions variant described on that page.

Creating a new Archive

Select the Archive Tab of the left pane

Figure 8.34. The Archive Tab

The Archive Tab

and press the New button at the bottom of the same pane. A pop-up window with the list of active Sources is shown.

Figure 8.35. List of active sources

List of active sources

To create an Archive you need to choose the Source to gather from. If the Source has already been configured it is shown in this list and can be chosen.

Pressing the Ok button a new Archive, named by default after the corresponding Source, is shown.

Figure 8.36. A new unconfigured Archive

A new unconfigured Archive

To finish the basic configuration of an Archive you have to choose the depth and the type of the document cache.

In the Cache depth section, by default only the last gathered version of a document is stored: once a new one is gathered the older one is thrown away. Selecting the All document versions option, different versions of the same document are kept and are thrown away only in accordance with the garbage collection configuration.

In the Cache type section there are three possible choices:

  • Full. It is the default setting. The entire original document is stored into the Archive

  • Context only. Only the XML converted document (FFF) is stored

  • None. Only the minimal required information about the document is stored

Checking status of current gathering activity

From the Info Tab of the Archive configuration panel it is possible to obtain some summary information about the current gathering process:

Figure 8.37. Gathering infos

Gathering infos

  • Status: can be Idle or Running

  • Archive size: the number of documents in the archive and their total disk usage in KBytes.

  • Last gathering statistics: if the Status is Running, the number of documents gathered so far; if the Status is Idle, the total number of documents gathered by the last gathering activity. In both cases the number of errors that occurred during the current or last gathering activity is also reported.

  • Last gathering start: the date and time (GMT) when the last gathering activity started

  • Last gathering end: the date and time (GMT) when the last gathering activity stopped

  • Next gathering start: the date and time (GMT) when the next gathering activity will start

Note

If you need to follow the crawling activity live, a manual refresh of the Control Panel information can be forced using the F5 key or from the View menu. Normally the Control Panel is synchronized with the engine every 5 minutes or every time the Apply button is pressed.

Gathering control

For testing purposes, or if you need to run gathering only once, the Start and Stop buttons are available in the Info Tab of the Archive configuration panel.

Figure 8.38. Start/Stop Gathering buttons

Start/Stop Gathering buttons

If you stop a scheduled gathering activity it will start again according to its schedule.

Manual reprocessing of documents

When the plugin chain associated with a Source changes and you need to apply the new configuration to the documents already stored in the Archive, you need to perform a reprocess operation. This operation is equivalent to a Gathering, except that documents are fetched not from the original Source but from the local Archive.

Figure 8.39. The reprocessing button

The reprocessing button

Clicking the Reprocessing documents button immediately starts a gathering from the local archive.

Warning

The reprocessing procedure is available only if the cache type for the Archive is Full.

Resetting the content of an Archive

To completely remove the content of an Archive without deleting the Archive itself, use the Reset button in the Archive Maintenance section of the Info Tab of the Archive configuration panel.

Figure 8.40. The Reset button

The Reset button

After clicking the above button a confirmation alert will be displayed before resetting the Archive.

Accessing to gathering logs

In the Info Tab of the Archive configuration panel the View gathering logs... button

Figure 8.41. The View gathering logs button

The View gathering logs button

shows a pop-up window that reports a detailed live log of the gathering process.

Figure 8.42. The Gathering Logs windows

The Gathering Logs windows

Each row of the log is structured as:

Time. The date and time of the gathering action

URL. The URL of the gathered document

Status. One of the following:

Status code | Description | Possible actions
Ok | The document has been gathered and rendered correctly |
Redirect | This URL is a redirect to another URL (HTTP only) |
Index | This URL is a directory that only contains pointers to other documents (common when gathering file systems). This index document is not rendered or stored. |
Not found | This URL has no document associated |
Rendering error | The document has been correctly gathered but cannot be rendered because of errors (see the Description field for details) | Check the plugin chain. If you developed a custom plugin it is possible that it does not work properly. If any active plugin fails, the entire processing of the document is aborted. Another possibility is that the document you are trying to fetch is in a format that cannot be parsed or that the mimetype returned by the server is wrong.
Bad redirect | This URL is a redirect to another URL that is invalid or can generate a loop |
Blocked | The gathering of this URL is blocked by filters or by the robots.txt of the remote site (if the check is enabled in searchbox) | If you think that this URL should be fetched, there is probably an error in the gathering filters configuration for this Source.
Network error | The URL gathering failed due to a generic network error | Check your network connection and, if it is working, try to access the remote server with software other than searchbox (e.g. a browser if the Source is a web site)
Engine error | This is a generic searchbox error due to a component that failed to accomplish its job in some way during the document processing chain. This error does not affect the overall functionality of the platform, except that the current document is not correctly processed. |
Unchanged | The document already exists in searchbox and has not changed since the last time it was gathered |
Document limit reached | searchbox has reached the maximum number of documents it can manage (see license limits). | Free some space in your searchbox Engine installation by deleting some Archives or by setting a stricter garbage collection parameter. You can also increase the limits of your searchbox Engine license.
Authentication required | The gathering agent needs to be authenticated in order to fetch this URL |
Internal error | An unknown unrecoverable internal error has occurred. | This kind of error is due to a software bug or an unknown situation. Please produce a bug report and send it to support@focuseek.com.
Remote server error | The remote server returned an error code |
Plugin error | One plugin (gathering, parsing, rendering) returned an error code. The processing of the current document is aborted. |
Cannot fetch | The fetching protocol required by the Source is not supported by the current searchbox installation. |
Processing error | One component of the document processing pipeline reported an error. |
Timeout error | One component of the document processing pipeline timed out |

Description. More details about the status generated by specific errors.

Exporting gathering logs

Once the on-line log download is stopped (it is active by default when the window is opened) the Export... button becomes active.

Figure 8.43. Gathering Logs export

Gathering Logs export

A Choose Export File window lets you choose the location where the full log is saved in plain text format.

Configuring Gathering limits

In almost every situation it is strongly recommended that you carefully configure the gathering parameters from the Options Tab of the Archive configuration panel.

Figure 8.44. Gathering limits configuration

Gathering limits configuration

The first parameter is called HTTP User Agent: it is the string that the web crawler presents to the HTTP server to describe itself. This string should be descriptive enough to let the webmaster of the remote site recognize the owner of the spider. Some people prefer to use the User Agent of a popular browser like Firefox or Internet Explorer, so that the spidering appears to have been done by a human user.

The second parameter is called Throttling and represents the minimum number of seconds that must pass between two fetching actions performed for the current Archive. Good web citizens behave nicely with other sites and limit the load they put on remote servers, so searchbox limits its own page download speed to a user-configured setting with a default value of 20 seconds. If you are gathering from a file system or from a partner site you can decrease this limit down to 0 seconds.
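The idea behind throttling, a minimum pause between two consecutive fetches, can be sketched in Python as follows. The class name Throttle and its parameters are hypothetical; only the default of 20 seconds comes from the text above. The clock and sleep functions can be replaced, which makes the behaviour easy to try without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive fetches."""

    def __init__(self, min_interval_seconds=20.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self.clock = clock
        self.sleep = sleep
        self.last_fetch = None  # time of the previous fetch, if any

    def wait(self):
        """Block until the next fetch is allowed, then record its time."""
        now = self.clock()
        if self.last_fetch is not None:
            remaining = self.min_interval - (now - self.last_fetch)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_fetch = now


# With an interval of 0 seconds the gatherer never pauses.
throttle = Throttle(min_interval_seconds=0)
for url in ["file:///docs/a.txt", "file:///docs/b.txt"]:
    throttle.wait()
    print("fetching", url)
```

A gatherer would call wait() immediately before each fetch for the Archive.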

The three other parameters (by default they are ignored) limit:

  • the maximum number of documents that an Archive can gather during a single gathering cycle. In the above example the gathering cycle ends after 100 documents or after one hour.

  • the maximum duration of a spidering cycle. Sometimes, due to the slowness of the network, the actual spidering time is longer than the scheduled one; in these cases such an upper bound should be configured.

  • the maximum size of the fetching queue. When searchbox fetches a document it also expands all its outlinks and puts their URLs into a queue. Current statistics about the Web say that an average HTTP document has 5 outlinks, so the fetching queue can become huge in very few spidering steps (e.g. 3 levels of link expansion produce 5 × 5 × 5 = 125 new entries in the fetching queue). This parameter tells searchbox to stop fetching new documents if the fetching queue is out of bounds.
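The growth of the fetching queue is easy to estimate. The following Python lines compute how many new URLs each level of link expansion contributes when every document has the same number of outlinks and no URL is met twice; both are simplifying assumptions, and the function name is hypothetical.

```python
def new_entries_per_level(outlinks_per_document, levels):
    """Number of new queue entries produced at each expansion level."""
    return [outlinks_per_document ** level for level in range(1, levels + 1)]


print(new_entries_per_level(5, 3))
print(sum(new_entries_per_level(5, 6)))
```

With 5 outlinks per document, six levels of expansion already add up to almost twenty thousand queued URLs, which is why an upper bound on the queue size is useful.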

Scheduling automatic Gathering

The Gathering process is "by definition" a cyclic activity. The gathering agent periodically visits the Source and puts into the Archive only new documents or new versions of existing ones.

Several options are available from the Options Tab of the Archive configuration panel:

Figure 8.45. Gathering periodicity: hourly

The hours option configures a gathering cycle of n hours (8 hours in the above example).

Figure 8.46. Gathering periodicity: daily

The days option configures a gathering cycle of n days starting from a specific hour. In the above example the gathering starts at midnight of every day.

Figure 8.47. Gathering periodicity: weekly

The week option configures a weekly gathering cycle starting at a specific hour of a specific day of the week. In the above example the gathering starts every week on Monday at midnight.

Figure 8.48. Gathering periodicity: monthly

The month option configures a monthly gathering cycle starting from a specific hour of a specific day of the month.
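To make the four periodicities concrete, the sketch below computes the next start time for each of them. It is only an illustration under stated assumptions: the function names are hypothetical, times are naive local datetimes, and the way the monthly option treats a day that does not exist in a given month (here that month is skipped) is a guess, not documented searchbox behavior.

```python
from datetime import datetime, timedelta


def next_hourly(last_start, n_hours):
    """Hours option: the next cycle starts n hours after the previous one."""
    return last_start + timedelta(hours=n_hours)


def next_daily(last_start, n_days, hour):
    """Days option: n days after the previous start, at the configured hour."""
    day = last_start.replace(hour=hour, minute=0, second=0, microsecond=0)
    return day + timedelta(days=n_days)


def next_weekly(now, weekday, hour):
    """Week option: first moment after `now` on `weekday` (0 = Monday) at `hour`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def next_monthly(now, day, hour):
    """Month option: first moment after `now` on day `day` of a month at `hour`."""
    if not (1 <= day <= 31 and 0 <= hour <= 23):
        raise ValueError("day must be in 1..31 and hour in 0..23")
    year, month = now.year, now.month
    while True:
        try:
            candidate = datetime(year, month, day, hour)
        except ValueError:
            candidate = None  # this month has no such day (assumption: skip it)
        if candidate is not None and candidate > now:
            return candidate
        month += 1
        if month > 12:
            year, month = year + 1, 1
```

With the weekly example above (Monday at midnight), next_weekly(datetime(2024, 1, 3, 10, 30), 0, 0) returns Monday 8 January 2024 at 00:00.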

Configuring the Garbage Collector

From the Options Tab of the Archive configuration panel it is possible to configure how searchbox discards documents from Archives.

The mechanism is called Garbage Collecting: it is an automated process that periodically removes from the Archive the documents that exceed the configured limits.

Figure 8.49. The Garbage Collector configuration

Three parameters can be configured:

  • the freshness of archived documents. Documents older than the specified amount of time will be deleted by the garbage collection agent.

  • the frequency of garbage collection. The garbage collection process has an impact on the performance of the system, so this parameter must be configured carefully, taking into account the gathering flux, if you do not want the archive size to increase indefinitely.

  • the number of documents deleted at every garbage collection cycle. This parameter must be set carefully to prevent the disk space used by searchbox from increasing indefinitely. It must be related to the number of active sources and their crawling speed.
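How the freshness and the per-cycle deletion limit interact can be sketched as a single garbage collection cycle. The function below is a hypothetical illustration, not the searchbox agent: it assumes the Archive is a plain dictionary mapping document IDs to gathering timestamps, and it assumes the oldest expired documents are removed first.

```python
from datetime import datetime, timedelta


def garbage_collect(documents, now, max_age, max_deleted):
    """Run one garbage collection cycle on `documents` (ID -> gathering time).

    Deletes at most `max_deleted` documents older than `max_age`,
    oldest first, and returns the list of deleted IDs.
    """
    expired = [doc_id for doc_id, gathered in documents.items()
               if now - gathered > max_age]
    expired.sort(key=lambda doc_id: documents[doc_id])  # oldest first
    deleted = expired[:max_deleted]
    for doc_id in deleted:
        del documents[doc_id]
    return deleted
```

If each cycle deletes fewer documents than the Sources gather in the same period, expired documents accumulate from cycle to cycle. This is why the per-cycle limit has to be tuned against the crawling speed.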

Making a query on Archive

The Control Panel lets you browse the content of an Archive using many parameters. The following picture shows the result of a simple query.

Figure 8.50. Browsing Archive content

The text box at the top of the window accepts the user query. The syntax is the standard searchbox one (see the “The query syntax” section).

The next row of widgets specifies how the query results must be shown:

Results: the number of results shown after pressing the Search button. To show more query results, click the Search button again or increase this value.

Sort: the ranking algorithm used to sort the query results. Possible choices are:

  • Standard. The default ranking mechanism. Lets searchbox decide the best ranking algorithm.

  • Relevance. At the moment exactly the same as Standard. Subject to change in a future version.

  • Score. It is based only on the intrinsic score assigned to the document.

  • Newer. Results are ordered by timestamp. Newer ones are at the top.

  • Older. Results are ordered by timestamp. Older ones are at the top.

Set date... Lets you set the time interval within which the documents in the query results must fall. The time refers to the date of gathering.

Weights... Lets you modify the weight of each slice of the index in order to better control the ranking mechanism (see the section about rendering). The following table describes all available slices with their default values.

Table 8.1. Sliceids

Number   Name               Weight  Description
1        author             1       Author
2        keyword            1       Keywords associated to the document
3        abstract           1       Abstract
4        invisible          1       Almost invisible text
5        marginalNorm       1       Plain text with marginal relevance
6        marginalEmph       2       Emphasized text with marginal relevance
7        marginalLink       1       Links from marginal relevance text
8        -                  -       -
9        marginalHeader     3       Header of a marginal relevance text
10       centralNorm        2       Plain text with central relevance
11       centralEmph        4       Emphasized text with central relevance
12       centralLink        2       Links from central relevance text
13       -                  -       -
14       centralHeader      6       Header of central relevance text
15       title              12      Title
16 - 31  custom0 - customF  12      custom slice

The rank of a document is calculated following the rule:

The Rank of a document is the absolute weight of the document multiplied by the weights of those slices that contain at least one term specified in the query string.
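The rule can be illustrated with a short sketch. The rank function and its arguments are hypothetical names, and the code is a literal reading of the sentence above rather than the engine's actual ranking code; the slice weights used in the example are the defaults listed in Table 8.1.

```python
from math import prod


def rank(document_weight, slice_weights, matching_slices):
    """Absolute document weight times the weight of every slice that
    contains at least one query term."""
    return document_weight * prod(slice_weights[name] for name in matching_slices)


# Default weights of three slices, taken from Table 8.1.
weights = {"title": 12, "centralNorm": 2, "centralHeader": 6}

# A document with absolute weight 1.5 matching the query in its title
# and in its central plain text: 1.5 * 12 * 2.
score = rank(1.5, weights, ["title", "centralNorm"])
```

Because the weights are multiplied, a match in a heavily weighted slice such as title dominates matches in slices of weight 1.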

Figure 8.51. Query Weights

Values can be changed with the following conventions:

Value  Effect
-1     Set to the default value
0      Does not consider this slice in the rank calculation
n      Set to n the weight of the slice
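The three conventions can be read as a small resolution step applied to each slice before ranking. The sketch below is a hypothetical illustration: the function name, the dictionary layout and the choice of dropping ignored slices from the result are assumptions, and only a subset of the default weights of Table 8.1 is used.

```python
# Subset of the default slice weights from Table 8.1.
DEFAULT_WEIGHTS = {"author": 1, "keyword": 1, "abstract": 1, "title": 12}


def effective_weights(defaults, overrides):
    """Apply the value conventions to every slice.

    -1 (or no override) keeps the default, 0 removes the slice from the
    rank calculation, any other value n becomes the slice weight.
    """
    effective = {}
    for name, default in defaults.items():
        value = overrides.get(name, -1)
        if value == -1:
            effective[name] = default
        elif value == 0:
            continue  # slice is not considered in the rank calculation
        else:
            effective[name] = value
    return effective
```

For example, effective_weights(DEFAULT_WEIGHTS, {"title": 0, "author": 5}) keeps keyword and abstract at their defaults, raises author to 5 and leaves title out of the calculation.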

When you select one of the query results, the Context box shows a short extract of the document related to the query terms.

Figure 8.52. The Context Box

Showing a document from Archive

From the Browsing Tab of the Archive configuration it is possible to show any document listed as the result of a query in three possible ways:

  • Cached. The document as saved into the cache.

  • Live. The document shown live from the original source.

  • FFF. The internal searchbox XML format of the document.

Figure 8.53. Show document buttons

In order to access any cached document you need to give the viewer your searchbox credentials (your username and password). Operating systems usually let you store the username and password in a keychain, so that you have to type this information the first time only.

Figure 8.54. Authentication for searchbox cache

The FFF is the XML format that searchbox uses internally to store the extracted content and metadata of all documents. Viewing the FFF is useful for debugging purposes.

Please note the address of the page shown in the web browser (in the case of an HTML document):

Figure 8.55. Cached document address

It is a URL that points to the platform where the searchbox Engine is installed (localhost in this case).

Three different views of the FFF are available.

Text FFF view

Figure 8.56. Text view of FFF

The full text extracted from the FFF of the selected document. This text is distributed over different slices by the rendering process, so by selecting and deselecting the appropriate checkboxes on the left side of the window it is possible to see what is contained in each slice. By default all slices are shown.

Metadata FFF view

Figure 8.57. Metadata view of FFF

Both the standard and the template metadata contained in the FFF representation of the selected document.

Raw FFF view

Figure 8.58. Raw view of FFF

The raw XML of the internal FFF representation of the selected document.

Manual add/remove documents to/from an Archive

In the Browsing Tab of the Archive the Add... and Remove buttons are used to manually add and remove documents from the current Archive.

Figure 8.59. Add and Remove buttons

The following window appears when you click the Add... button:

Figure 8.60. Manual add parameters

File: the complete path of the document that must be added to the Archive. It is filled in automatically when you browse the local file system with the ... button.

URL: the URL assigned to the document. If it is not specified the path of the file in the local file system is taken.

MIME: the MIME type assigned to the document. It can be one of the following types:

  • text/html

  • text/plain

  • text/rtf

  • application/pdf

  • application/msword

  • message/rfc822

  • application/vnd.focuseek-fff

  • custom

When the custom option is selected, the text box under the MIME field becomes ready to accept the custom string.
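The way the three fields combine can be summarized in a short sketch. The function below is purely illustrative and is not a searchbox API: its name, its arguments and the returned dictionary are hypothetical, while the list of MIME types is the one given above.

```python
KNOWN_MIME_TYPES = (
    "text/html",
    "text/plain",
    "text/rtf",
    "application/pdf",
    "application/msword",
    "message/rfc822",
    "application/vnd.focuseek-fff",
)


def manual_add_params(file_path, mime, url=None, custom_mime=None):
    """Resolve the File, URL and MIME fields of the Add... window."""
    if not url:
        url = file_path  # no URL given: the local file path is used instead
    if mime == "custom":
        if not custom_mime:
            raise ValueError("the custom option requires a MIME string")
        mime = custom_mime
    elif mime not in KNOWN_MIME_TYPES:
        raise ValueError(f"unsupported MIME type: {mime}")
    return {"file": file_path, "url": url, "mime": mime}
```

For example, manual_add_params("/tmp/report.pdf", "application/pdf") yields a document whose URL is the file path itself, because no URL was provided.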

Getting the ID of a document

The Copy ID button in the Browsing Tab of the Archive configuration panel lets you copy to the clipboard the ID of the document currently selected in the list of results.

Figure 8.61. Copy ID button


[10] If you use a User DSN you must define the DSN as the user searchbox runs as. Refer to the section called “searchbox process identity” for details.

[11] The order of the fields influences the order searchbox will use to fetch the rows from the database. Also note that URLs differing only in the order of the key are still considered different URLs, i.e. they are crawled twice.