This chapter describes in detail all use cases regarding searchbox gathering. The involved basic concepts are:
Seeds
Sources
Archives
Fetching plugins
Select the Sources Tab of the left pane and press the button at the bottom of the same pane. A Source labelled (new) is shown.
Assign a Name and a Description to the newly created Source and press the button to confirm the provided information.
The ID is a progressive number assigned to every object created in searchbox. Even if you remove an object (the Source in this case) its ID will not be used again, so don't worry if the ID is greater than the number of objects in your searchbox configuration.
Now that the Source is created you need to add at least one Seed to it.
Just after the creation of a Source its list of Seeds is empty.
Clicking on the button, the following window is shown.
The left side of the window lists the types of Seed that can be added to the current Source.
Some problems can arise if you mix, in the same Source configuration, different types of Seed that require authentication (the Control Panel makes no consistency check). searchbox lets you specify only one set of credentials for each Source, so you cannot put different authentication methods and/or sets of credentials into the same Source configuration. In such cases a single-seed Source must be used.
Here is a detailed description of how to configure each available type of Seed.
For a Seed related to a Web Site accessible through the HTTP protocol, the only information searchbox needs is the URL of the page where the gathering process must begin and whether the site needs secure access.
Checking the checkbox changes the URL prefix from http:// to https://.
For Gopher sites just the URL is needed.
For more information about the Gopher protocol you can see: http://en.wikipedia.org/wiki/Gopher_protocol
For Usenet news both the server name and the name of the newsgroup to gather from are needed.
Documents must reside in the local filesystem or in a filesystem remotely mounted on the server where searchbox is running. The complete path must be specified.
Remote folders mounted locally are guaranteed to work only on Linux, OSX and Windows 2000. For Windows XP the SMB protocol must be used.
The complete address of the incoming mail server and its type are required. Some of the available types are the SSL versions of the plain servers.
For all types of servers the messages are not removed from the original location and are not marked as read.
The address of the remote server, complete with the full path, is required. Check the Secure checkbox if the server requires secure access.
This seed needs the name of the server that publishes a folder using the SAMBA protocol and the full path where the gathering agent must start its job.
An ODBC seed is made of three parts:
The ODBC connection string. It is usually a System DSN[10]. Note that your database might impose access restrictions that prevent searchbox from accessing it even if you can access the database itself from your desktop. E.g. the database might be configured to refuse queries coming from the searchbox server computer.
The query. This is basically any SQL query with the additional requirement that exactly one of the special strings --!!!PKW!!!-- or --!!!PKA!!!-- must be present. These special strings are described below.
The keys. This is a subset of the names of the columns of the result set returned by query whose values uniquely identify a row in the result set itself. For example if query involves a single table you can use the table primary key. You must specify the order of the fields[11] and whether each field contains a string or a numeric value.
As stated above, query must contain exactly one of --!!!PKW!!!-- or --!!!PKA!!!--. searchbox will expand these with values taken from the keys fields to access specific rows in the database. The two strings are expanded in nearly the same way, but --!!!PKW!!!-- begins with an SQL where keyword, while --!!!PKA!!!-- begins with and. Thus you must use the latter when you have placed a where clause in query (typically you append --!!!PKA!!!-- after your clause), while the former is useful when you don't have any explicit where clause in your query. For example:
select * from tableA --!!!PKW!!!--
but
select * from tableB, tableC where tableB.id = tableC.id --!!!PKA!!!--
If you specify neither the --!!!PKW!!!-- nor the --!!!PKA!!!-- placeholder, the Control Panel will complain.
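The "exactly one placeholder" rule can be checked before a query is typed into the Control Panel. The following is a minimal sketch of such a check; the function name check_odbc_query is hypothetical and is not part of searchbox, and the sketch only counts the placeholders without reproducing the way searchbox expands them.

```python
# Hypothetical helper, not part of searchbox: it only verifies the
# "exactly one of PKW / PKA" rule described in the text above.
PKW = "--!!!PKW!!!--"  # expands to a clause starting with "where"
PKA = "--!!!PKA!!!--"  # expands to a clause starting with "and"


def check_odbc_query(query: str) -> str:
    """Return the placeholder found in the query, or raise ValueError."""
    count = query.count(PKW) + query.count(PKA)
    if count != 1:
        raise ValueError(
            "expected exactly one placeholder, found %d" % count
        )
    return PKW if PKW in query else PKA


if __name__ == "__main__":
    print(check_odbc_query("select * from tableA --!!!PKW!!!--"))
```

A query with an explicit where clause should make the function return the --!!!PKA!!!-- string, while a query with no placeholder at all, or with both, raises an error.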
In order to successfully crawl an ODBC source, adding a seed is not enough: you also have to specify a set of rules to let searchbox turn the ODBC result set into documents. These rules are specified in an ODBC plugin configuration. When you add an ODBC seed to a source you must then enable the suitable plugin configuration on the source.
If you don't, searchbox complains.
This type of seed is reserved for all protocols that are not natively available in searchbox nor in one of its bundled plugins.
Please see the documentation of the fetching plugin you have to use to properly configure this seed.
All Seeds specified will be listed in the URL box of the Info section of the Source configuration tab.
Pressing the button or any other widget of the Control Panel interface, the following warning message will pop up.
In order to let a Source gather contents from the specified Seeds, at least one inclusion filter must be specified for each of them. If you do not specify any, the Control Panel will automatically add the default one. The default inclusion filter includes all subsections starting from the specified Seed entry.
Experienced users who need to perform a selective gathering will have to edit the filters manually.
Every document gathered starting from a specific Seed is characterized by a discrete distance from the starting point. This distance is the number of links that the gatherer (the spider, for the web) has followed to reach the document, and is called the Depth Level of the document. The Depth Limit is the maximum Depth Level the gatherer is allowed to reach.
The above picture shows a graph with nodes labelled with their Depth Level calculated from the Seed (labelled as 0).
Even if the Depth Limit concept seems correct only for tree data structures (because they have a "root" node) it is used for graphs too as in the case of the Web.
The following table describes how the Depth Limit concept is used by each type of Seed available in searchbox:
| Web site | Number of hyperlinks followed by the gatherer, calculated from the Seed. Hyperlinks are followed in accordance with the inclusion/exclusion filters configured for that Seed. |
| FTP site | Number of directory levels calculated from the path specified into the URL. All documents contained in each directory will be gathered in accordance to inclusion/exclusion filters configured for that Seed. |
| Gopher site | Same as FTP |
| Usenet news | Message index at depth = 0. Messages content at depth = 1 (all messages are at the same level. Threads are not considered) |
| Filesystem | Same as FTP |
| Mailbox (POP) | Message index at depth = 0. Messages content at depth = 1 (the POP protocol has no folders so only the Inbox will be considered) |
| Mailbox (IMAP) | Message index at depth = 0. Messages content at depth = 1 (at this moment the IMAP gathering does not support folders so only messages of the Inbox will be gathered) |
| WebDav share | Same as FTP |
| SMB share | Same as FTP |
| ODBC database | Content at depth = 1. |
In all cases, if a gathered document contains hyperlinks they will be followed in accordance with the inclusion/exclusion filters configured for that Seed and the Depth Limit parameter of the Source.
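The way a Depth Level is assigned on a graph (not only on a tree) can be illustrated with a breadth-first visit. The following is a minimal sketch and not the searchbox gatherer itself: the function name depth_levels and the dictionary used to describe the links are assumptions made for the example.

```python
from collections import deque


def depth_levels(links, seed, depth_limit=None):
    """Breadth-first visit returning the Depth Level of each reached node.

    links       -- dict mapping a node to the list of nodes it links to
    seed        -- starting node (Depth Level 0)
    depth_limit -- maximum Depth Level to reach, None means no limit
    """
    levels = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        # Nodes sitting at the limit are kept but their links are not followed.
        if depth_limit is not None and levels[node] >= depth_limit:
            continue
        for target in links.get(node, []):
            if target not in levels:  # first visit gives the shortest distance
                levels[target] = levels[node] + 1
                queue.append(target)
    return levels


if __name__ == "__main__":
    graph = {"seed": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["seed", "d"]}
    print(depth_levels(graph, "seed"))
    print(depth_levels(graph, "seed", depth_limit=1))
```

Because each node keeps the level of its first visit, a link that points back to the Seed (as from c in the example) does not change any level, which is why the concept also works for graphs with cycles.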
The Depth Limit parameter can be specified in the Option Tab of the Source configuration section.
By default this option is unchecked, which means "no Depth Limit".
In order to avoid gathering a huge amount of useless documents, especially from Web sources, you should choose this parameter carefully.
This option is checked by default and forces searchbox to respect the robots.txt protocol.
If you want to override the default of this option, please be sure that the owner of the Source is informed. Ignoring the robots.txt directives can be a sufficient reason to be banned from accessing a Web source.
In some cases it is very useful to gather explicit metadata about a document directly from the Source, even if the metadata cannot be embedded into the document itself. For these cases searchbox can manage a proprietary data format that can be adopted when you can control how the source publishes such information.
The basic idea is to associate with each document of a Source a set of metadata contained in another XML formatted document (the resource document), placed in the same location as the original one and with the same name but with a special extension.
Given a document named mydocument.pdf, the associated resource document will be mydocument.pdf.sbm
If this option is checked, for each file of the source, the gathering agent will check if the corresponding resource file is present and will merge the metadata contained in it with the original document.
The resource document has the following structure:
<?xml version="1.0"?>
<metainfo>
  <meta type="documentwide" sliceid="slicename" key="name1">value1</meta>
  <meta type="documentwide" sliceid="slicename" key="name2">value2</meta>
  ...
</metainfo>
where:
type can be only "documentwide" at this moment. Other values will be possible in future releases of searchbox
sliceid is the name of the index slice where we want to store the metadata
key is the metadata name (e.g. "year")
valueX is the metadata value (e.g. 2003)
Moreover the following optional attributes can be used in a meta tag:
Tokenized and normalized attributes
tokenized Specifies whether the metadata value should be split into words as normal text is. Allowed values are 0 (don't split, the default) or 1 (split).
normalized Specifies whether the metadata value should be case and UTF-8 normalized as normal text is. Allowed values are 0 (don't touch it, the default) or 1 (normalize it).
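If you control how the source is published, resource documents can be generated by a script. The following is a minimal sketch that builds the structure described above with the Python standard library; the function names build_resource_document and resource_name are hypothetical, and the sketch assumes that all metadata of a document go into the same slice.

```python
import xml.etree.ElementTree as ET


def resource_name(document_name):
    """Name of the resource document associated with a document."""
    return document_name + ".sbm"


def build_resource_document(metadata, sliceid, tokenized=False, normalized=False):
    """Return the XML text of a resource document.

    metadata -- dict mapping metadata keys to values
    sliceid  -- name of the index slice (assumed the same for every key)
    """
    root = ET.Element("metainfo")
    for key, value in metadata.items():
        attributes = {"type": "documentwide", "sliceid": sliceid, "key": key}
        # The optional attributes are written only when they differ
        # from their default value (0).
        if tokenized:
            attributes["tokenized"] = "1"
        if normalized:
            attributes["normalized"] = "1"
        meta = ET.SubElement(root, "meta", attributes)
        meta.text = str(value)
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(resource_name("mydocument.pdf"))
    print(build_resource_document({"year": 2003}, "centralNorm"))
```

The text returned by build_resource_document is meant to be saved next to the original document under the name returned by resource_name.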
searchbox lets you put in your HTML documents special tags that it can use to extract metadata for your documents. This feature is disabled by default. To enable it you must set the USE_SEARCHBOX_METAS parameter in your searchbox configuration file as detailed in the section called “Handling of <meta name="searchbox-xxxx"> tags in HTML documents”.
You specify metadata by placing suitably formatted <META> tags in your HTML document <HEAD> section.
The full format is:
<META name="searchbox-NT-SliceName-MetaKey" content="MetaValue" />
where:
| NT | Specifies optional processing for the metadata value. Can assume the following values: NT itself (the value will be normalized and tokenized), N (the value will be normalized), T (the value will be tokenized) or empty (the value will be used as is). See Tokenized and normalized attributes for details on the meaning of these operations. |
| SliceName | The name of the slice the metadata will be put in. See Table 8.1, “Sliceids” for possible values. |
| MetaKey | The metadata key. |
| MetaValue | The metadata value. |
For example the following HTML excerpt:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="searchbox-NT-centralNorm-metakey0" content="metavalue0" />
<meta name="searchbox--centralNorm-metakey1" content="metavalue1" />
<!-- more stuff follows ... -->
will add two metadata: a normalized, tokenized metadata with key metakey0 and value metavalue0 and another, literal one, with key metakey1 and value metavalue1. Both metadata will be placed in the centralNorm slice.
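The structure of the name attribute can be made explicit by splitting it into its four parts. The following is a minimal sketch, not the parser used by searchbox: the function name parse_searchbox_meta is hypothetical, and the sketch assumes that slice names never contain a hyphen (which holds for all the names listed in Table 8.1).

```python
def parse_searchbox_meta(name):
    """Split a name such as 'searchbox-NT-centralNorm-metakey0'.

    Returns a dict describing the metadata, or None when the name
    does not follow the searchbox-NT-SliceName-MetaKey format.
    """
    # At most 3 splits, so that hyphens inside the key are preserved.
    parts = name.split("-", 3)
    if len(parts) != 4 or parts[0] != "searchbox":
        return None
    _, flags, slice_name, key = parts
    if flags not in ("", "N", "T", "NT"):
        return None
    return {
        "normalized": "N" in flags,
        "tokenized": "T" in flags,
        "slice": slice_name,
        "key": key,
    }


if __name__ == "__main__":
    print(parse_searchbox_meta("searchbox-NT-centralNorm-metakey0"))
    print(parse_searchbox_meta("searchbox--centralNorm-metakey1"))
```

Note how the empty processing field of the second tag shows up as two consecutive hyphens after searchbox.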
Some web sources grant access only to authenticated users. This happens both for sources that require a username and password and for sources that use cookies. The searchbox gathering agent is able to simulate the way a real user explores a web site using a standard web browser, so such Sources can usually be gathered easily once configured.
This option lets you choose between three different types of authentication methods for a Source.
Mainly used for Intranet site crawling. It does not need any further configuration apart from the username/password information provided in the configuration of the Archive associated with the Source.
With this modality searchbox can acquire authentication cookies from a server by simulating what a user would do with his/her browser (filling in a form and clicking the "submit" button). All that searchbox needs to log into a web site is to perform a POST action on the right form, so a configuration with all the variable names and their values is needed.
Let's imagine that the authentication is performed through the page http://somesite.com/login.php: the login name would be a variable called username and the password a variable called password. In this case searchbox will perform a POST action of these variables to another URL (e.g. http://somesite.com/authcheck.php) and will store all the cookies that are provided. A problem that can occur is that the page http://somesite.com/authcheck.php is a redirect to another page. In this case the POST action must refer to the destination page.
Some sites generate cookies in the login page and in other pages between the home page and the login page itself. In this case searchbox must be configured to manage this situation. Such pages must be specified in the Cookie Pre URLs fields.
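The sequence of requests involved in a cookie based login can be sketched with the Python standard library. This is only an illustration of the HTTP exchange and not the code used by searchbox: the function name build_login_requests is hypothetical, and the sketch builds the requests without sending them.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_requests(action_url, fields, pre_urls=()):
    """Return the requests of a login, in the order they should be sent.

    action_url -- URL that receives the form data (the redirect
                  destination, if the form target is a redirect)
    fields     -- dict of form variable names and values
    pre_urls   -- pages to visit first because they generate cookies
    """
    requests = [Request(url) for url in pre_urls]  # plain GET requests
    body = urlencode(fields).encode("ascii")
    requests.append(
        Request(
            action_url,
            data=body,  # a request with a body is sent as POST
            headers={"Content-Type": "application/x-www-form-urlencoded"},
        )
    )
    return requests


if __name__ == "__main__":
    plan = build_login_requests(
        "http://somesite.com/authcheck.php",
        {"username": "alice", "password": "secret"},
        pre_urls=["http://somesite.com/login.php"],
    )
    for request in plan:
        print(request.get_method(), request.full_url)
```

To actually collect the cookies, the requests could be sent in order through an opener created with urllib.request.build_opener(urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())), so that cookies set by the first pages are sent back with the following requests. The values alice and secret are placeholders.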
Clicking on the button, the following window appears.
The URL field must be filled with the URL of the page where a normal user authenticates.
The Action parameter specifies which HTTP command is used to pass the data, together with the maximum age the cookie can have before a new login is required.
In the Username and Password fields the names of the parameters for username and password must be specified (e.g. USERID and PASSWORD). The values of these parameters have to be specified in the configuration of the Archive associated with the Source.
In the Other parameters section further parameter/value pairs can be specified if needed by the above Action.
In the Cookie Pre URLs section the pages between the home page and the login page that generate cookies must be listed.
This is a specific feature of searchbox for HTML documents. In the Gathering Tab of each Source it is possible to specify two regular expressions, one for the "start ignoring stuff" marker and the other for the "stop ignoring stuff" marker.
All the text between the specified markers will not be passed to the rendering module.
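The effect of the two markers can be tried out on a sample page before configuring the Source. The following is a minimal sketch written with the Python re module, which is not the regular expression engine used by searchbox; the function name strip_ignored and the noindex comments are assumptions made for the example, as is the choice of removing the markers together with the text between them.

```python
import re


def strip_ignored(html, start_pattern, stop_pattern):
    """Remove every region delimited by the start and stop markers.

    The region ends at the first stop marker found after a start marker;
    the markers themselves are removed too (an assumption of this sketch).
    """
    region = re.compile(
        "(?:%s).*?(?:%s)" % (start_pattern, stop_pattern),
        re.DOTALL,  # let the ignored region span several lines
    )
    return region.sub("", html)


if __name__ == "__main__":
    page = "<p>keep</p><!-- noindex -->menu<!-- /noindex --><p>also</p>"
    print(strip_ignored(page, r"<!--\s*noindex\s*-->", r"<!--\s*/noindex\s*-->"))
```

HTML comments are a convenient choice for the markers because they do not change the way the page is shown in a browser.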
Select the Sources Tab of the left panel.
Select the item and click on the button.
There are two different ways to create a new plugin: New plugin or Inherit from existing plugin, as shown in the following window:
If you choose the New plugin option, select one of the available types. The following example shows one of them selected.
The list of available types of plugin may vary depending on your searchbox Engine installation.
After clicking on the button the Plugin configuration panel appears.
In order to fully configure the plugin the following actions are required:
Fill the Name: field.
Fill the Description: field.
Edit all required parameters listed in the Parameters box. The number and type of parameters depend on the type of plugin you choose. Please see the plugin documentation for a detailed description of all parameters.
As soon as the configuration is completed the new plugin is listed in the left panel under its category (under fetching plugins in this case).
During this configuration process searchbox cannot make any consistency check because DLLs are not aware of their final role in the plugins chain. A plugin can discover its role only when its configuration parameters are set.
A Fetching Plugin is used to add to a specific Source a new fetching protocol not natively managed by searchbox. Before a Source can be configured to use such a plugin, the plugin must be installed and configured from the Plugins Tab of the left panel of the Control Panel (see Configuration of a Fetching Plugin).
The crawl filters define the behaviour of the gathering agent through a list of constraints on the syntax of the URLs that must be included in or excluded from the gathering action. For this purpose two types of filters exist: inclusion and exclusion filters.
In the Crawling Tab of a Source a default inclusion filter can be added by checking a specific checkbox. The purpose of this default filter is to limit the action of the gathering agent to the pages contained in the original domain of the seed page. For instance with the seed:
www.repubblica.it
the automatically generated inclusion filter is:
^http://www\.repubblica\.it/.*$
where:
| ^ | beginning of string |
| $ | end of string |
| \. | the "." character |
| .* | an unspecified number of characters |
Such a filter lets the crawler gather all the pages reachable from the home page of the Italian newspaper "La Repubblica".
Other examples:
^http://www\.repubblica\.it/indici/cronaca/cronaca\.htm$
This filter includes a single page only: the .htm index page of a subsection of the "La Repubblica" web site.
^http://www\.repubblica\.it/200[0-9]/[a-zA-Z]/sezioni/cronaca/.*$
The crawler gathers all the pages contained in a subsection of the "La Repubblica" web site with a parametric path. In this specific case there is a division by years, so the regular expression 200[0-9] has been used to cover all years between 2000 and 2009. The same applies to the next directory level, which is described by a single alphabetic character.
^http://www\.agronotizie\.com/(/)?sezioni/articolo\.cfm\?(C|c)odice=[0-9]+&codcanale=[0-9]+&codargomento=.*$
This is a more complex case that is used to gather dynamic pages. In the example the regular expression (/)? takes into account that either a single or a double slash can be present, while the (C|c)odice expression takes into account that the parameter name can begin with a capital letter or not.
A detailed discussion of the use of regular expressions is beyond the scope of this manual. A good reference for this topic can be found in the manual of any programming language or tool with regular expression support, like Perl, Python, Grep, Lex, etc. Alternatively you can read the Wikipedia regular expressions page: http://en.wikipedia.org/wiki/Regular_expressions. searchbox implements the POSIX modern (extended) regular expressions variant described on that page.
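Filters can be tried out on a few sample URLs before they are saved in the Control Panel. The following is a minimal sketch that uses the Python re module, whose syntax agrees with POSIX extended regular expressions for the simple patterns shown above but is not the engine used by searchbox. The function names are hypothetical, the seed is assumed to be written with its http:// prefix, and giving exclusion filters precedence over inclusion filters is an assumption of the sketch.

```python
import re
from urllib.parse import urlsplit


def default_inclusion_filter(seed_url):
    """Build a filter that accepts every page of the seed's host."""
    host = urlsplit(seed_url).netloc
    return "^http://" + re.escape(host) + "/.*$"


def is_included(url, inclusion, exclusion=()):
    """True when the URL matches an inclusion and no exclusion filter."""
    if any(re.search(pattern, url) for pattern in exclusion):
        return False
    return any(re.search(pattern, url) for pattern in inclusion)


if __name__ == "__main__":
    default = default_inclusion_filter("http://www.repubblica.it/")
    print(default)
    by_year = r"^http://www\.repubblica\.it/200[0-9]/[a-zA-Z]/sezioni/cronaca/.*$"
    for url in (
        "http://www.repubblica.it/2005/b/sezioni/cronaca/page.html",
        "http://www.repubblica.it/2010/b/sezioni/cronaca/page.html",
    ):
        print(url, is_included(url, [by_year]))
```

The second URL of the example is rejected because 2010 does not match 200[0-9]: a reminder that a parametric filter must be reviewed when the structure of the site changes.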
Select the Sources Tab of the left pane and press the button at the bottom of the same pane. A pop-up window with the list of active Sources is shown.
To create an Archive you need to choose the Source to gather from. If the Source has already been configured it is shown in this list and can be chosen.
Pressing the Ok button, a new Archive, named by default after the corresponding Source, is shown.
To finish the basic configuration of an Archive you have to choose the depth and the type of the document cache.
In the Cache depth section, by default only the last gathered version of a document is stored: once a new version is gathered the older one is thrown away. Selecting the All document versions option, the different versions of the same document are kept and are thrown away only in accordance with the garbage collection configuration.
In the Cache type section there are three possible choices:
Full. It is the default setting. The entire original document is stored in the Archive.
Context only. Only the XML converted document (FFF) is stored.
None. Only the minimal required information about the document is stored.
From the Info Tab of the Archive configuration panel it is possible to obtain some summary information about the current gathering process:
Status: can be Idle or Running
Archive size: the number of documents in the archive and their total disk occupation in KBytes.
Last gathering statistics: the number of documents gathered so far if the Status is Running, or the total number of documents gathered by the last gathering activity if the Status is Idle, together with the number of errors that occurred during the current or last gathering activity.
Last gathering start: the date and time (GMT) when the last gathering activity started.
Last gathering end: the date and time (GMT) when the last gathering activity stopped.
Next gathering start: the date and time (GMT) when the next gathering activity will start.
If you need to follow the crawling activity live, you can force a manual refresh of the Control Panel information with the F5 key or from the menu. Normally the Control Panel is synchronized with the engine every 5 minutes or every time the button is pressed.
For testing purposes, or if you need to run a gathering only once, dedicated buttons are available in the Info Tab of the Archive configuration panel.
If you stop a scheduled gathering activity it will start again according to its scheduling.
When the plugin chain associated with a Source changes and you need to update the documents already stored in the Archive with the new configuration, you need to perform a reprocess operation. This operation is equivalent to a gathering, except that documents are not fetched from the original Source but from the local Archive.
Clicking the button, an immediate gathering from the local archive starts.
The reprocessing procedure is available only if the cache type for the Archive is Full.
To completely remove the content of an Archive without deleting the Archive itself, use the button in the Archive Maintenance section of the Info Tab of the Archive configuration panel.
After clicking the above button a confirmation alert will be displayed before resetting the Archive.
In the Info Tab of the Archive configuration panel the button shows a pop-up window that reports a detailed live log of the gathering process.
Each row of the log is structured as:
Time. The date and time of the gathering action
URL. The URL of the gathered document
Status. One of the following:
| Status code | Description | Possible actions |
|---|---|---|
| Ok | The document has been gathered and rendered correctly | |
| Redirect | This URL is a redirect to another URL (HTTP only) | |
| Index | This URL is a directory that only contains pointers to other documents (common in gathering of file systems). This index document is not rendered and stored. | |
| Not found | This URL has no document associated | |
| Rendering error | The document has been correctly gathered but cannot be rendered because of errors (see Description field for details) | Check the plugin chain. If you developed a custom plugin it is possible that it does not work properly. If any active plugin fails the entire processing of the document is aborted. Another possibility is that the document you are trying to fetch is in a format that cannot be parsed or the mimetype returned by the server is wrong. |
| Bad redirect | This URL is a redirect to another URL that is invalid or can generate a loop | |
| Blocked | The gathering of this URL is blocked by filters or robots.txt of the remote site (if the control is enabled in searchbox) | If you think that this URL should be fetched probably there is an error in the gathering filters configuration for this Source. |
| Network error | The URL gathering failed due to a generic network error | Check your network connection and, if it is working, try to access the remote server with software other than searchbox (e.g. a browser if the Source is a web site) |
| Engine error | This is a generic searchbox error due to a component that failed to accomplish its job in some way during the document processing chain. This error does not affect the overall functionality of the platform apart that the current document is not correctly processed. | |
| Unchanged | The document already exists in searchbox and has not changed since the last time it was gathered | |
| Document limit reached | searchbox has reached the maximum number of documents it can manage (see license limits). | Free some space in your searchbox Engine installation deleting some Archives or setting a more strict garbage collection parameter. Also you can increase limits of your searchbox Engine license. |
| Authentication required | The gathering agent needs to be authenticated in order to fetch this URL | |
| Internal error | An unknown unrecoverable internal error has occurred. | This kind of error is due to a software bug or an unknown situation. Please produce a bug report and send it to support@focuseek.com. |
| Remote server error | The remote server returned an error code | |
| Plugin error | One plugin (gathering, parsing, rendering) returned an error code. The processing of the current document is aborted. | |
| Cannot fetch | The fetching protocol required by the Source is not supported by the current searchbox installation. | |
| Processing error | One component of the document processing pipeline reported an error. | |
| Timeout error | One component of the document processing pipeline timed out | |
Description. More details about the status generated by specific errors.
When the on-line log download is stopped (it is active by default when the window is opened) the button becomes active.
A Choose Export File window lets you choose the location where the full log is saved in plain text format.
In almost every situation it is strongly recommended that you carefully configure the gathering parameters from the Options Tab of the Archive configuration panel.
The first parameter is called HTTP User Agent and it is the string that the web crawler presents to the HTTP server to describe itself. This string should be descriptive enough to let the webmaster of the remote site recognize the owner of the spider. Some users prefer to use the same User Agent as a popular browser like Firefox or Internet Explorer, so that the spidering appears to be done by a human user.
The second parameter is called Throttling and represents the minimum number of seconds that must pass between two fetching actions performed for the current Archive. Good web citizens behave nicely with other sites and limit the load they put on remote servers, so searchbox limits its page download speed to a user-configured setting with a default value of 20 seconds. If you are gathering from a file system or from a partner site you can decrease this limit down to 0 seconds.
The three other parameters (by default they are ignored) limit:
the maximum number of documents that an Archive can gather during a single gathering cycle. In the above example the gathering cycle ends after 100 documents or after one hour.
the maximum duration of a spidering cycle. Sometimes, due to the slowness of the network, the actual spidering time is longer than the scheduled one; in these cases such an upper bound should be configured.
the maximum size of the fetching queue. When searchbox fetches a document it also expands all its outlinks and puts their URLs into a queue. Current statistics about the Web say that an average HTTP document has 5 outlinks, so the fetching queue can become huge in very few spidering steps (e.g. the third spidering step alone adds 5 × 5 × 5 = 125 new entries to the fetching queue). This parameter tells searchbox to stop fetching new documents if the fetching queue is out of bounds.
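The growth of the fetching queue can be estimated with a few lines of arithmetic, which may help in choosing a sensible value for the queue limit. The following is a minimal sketch under a simplifying assumption: every document has the same number of outlinks and no URL is found twice, so the figures are an upper bound. The function names are hypothetical.

```python
def queue_growth(outlinks_per_document, steps):
    """New queue entries produced at each spidering step (upper bound)."""
    return [outlinks_per_document ** step for step in range(1, steps + 1)]


def steps_until_limit(outlinks_per_document, max_queue_size):
    """First spidering step whose new entries exceed the queue limit."""
    if outlinks_per_document < 2:
        raise ValueError("the estimate needs at least 2 outlinks per document")
    step, new_entries = 0, 1
    while new_entries <= max_queue_size:
        new_entries *= outlinks_per_document
        step += 1
    return step


if __name__ == "__main__":
    print(queue_growth(5, 3))
    print(steps_until_limit(5, 100000))
```

With 5 outlinks per document the new entries are 5, 25 and 125 in the first three steps, which matches the figure given above.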
The Gathering process is "by definition" a cyclic activity. The gathering agent periodically visits the Source and puts into the Archive only new documents or new versions of existing ones.
Several options are available in the Option Tab of the Archive configuration panel:
The option configures a gathering cycle of n hours (8 hours in the above example).
The option configures a gathering cycle of n days starting from a specific hour. In the above example the gathering starts at midnight of every day.
The option configures a weekly gathering cycle starting from a specific hour of a specific day of the week. In the above example the gathering starts every week on Monday at midnight.
The option configures a monthly gathering cycle starting from a specific hour of a specific day of the month.
From the Option Tab of the Archive configuration panel it is possible to configure how searchbox throws documents out of Archives.
The mechanism is called Garbage Collecting and it is an automated process that periodically removes from the Archive the documents that exceed some limits.
Three parameters can be configured:
the freshness of archived documents. Documents older than the specified amount of time will be deleted by the garbage collection agent.
the frequency of the garbage collection action. The garbage collection process has an impact on the performance of the system, so this parameter must be configured carefully, taking into account the gathering flux, if you don't want the archive size to increase indefinitely.
the number of documents deleted at every garbage collection cycle. This parameter must be set carefully to prevent the disk space used by searchbox from increasing indefinitely. It must be related to the number of active sources and their crawling speed.
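The interaction between the freshness and the number of documents deleted per cycle can be illustrated with a small simulation. The following is a minimal sketch and not the searchbox garbage collector: the function name select_for_collection is hypothetical, and deleting the oldest documents first is an assumption, since this manual does not state the order.

```python
from datetime import datetime, timedelta


def select_for_collection(documents, now, max_age, batch_size):
    """Choose the documents removed by one garbage collection cycle.

    documents  -- dict mapping a document id to its gathering time
    now        -- time at which the cycle runs
    max_age    -- freshness limit (timedelta)
    batch_size -- maximum number of documents deleted in one cycle
    """
    expired = [
        (gathered, doc_id)
        for doc_id, gathered in documents.items()
        if now - gathered > max_age
    ]
    expired.sort()  # oldest first (assumption of this sketch)
    return [doc_id for _, doc_id in expired[:batch_size]]


if __name__ == "__main__":
    archive = {
        "a": datetime(2024, 1, 1),
        "b": datetime(2024, 1, 9),
        "c": datetime(2023, 12, 25),
    }
    now = datetime(2024, 1, 10)
    print(select_for_collection(archive, now, timedelta(days=7), 1))
```

If more documents expire between two cycles than a single cycle is allowed to delete, the backlog of expired documents keeps growing: this is why the three parameters must be chosen together.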
The Control Panel lets you browse the content of an Archive using many parameters. In the following picture the result of a simple query is shown.
The text box at the top of the window accepts the user query. The syntax is the standard one of searchbox (See “The query syntax” section).
The next row of widgets specifies how query results must be shown:
Results: the number of results shown after pressing the button. To show more query results click the button again or increase this value.
Sort: the ranking algorithm used to sort the query results. Possible choices are:
. The default ranking mechanism. It lets searchbox decide the best ranking algorithm.
. At this moment exactly the same as Standard. Subject to change in a future version.
. It is based only on the intrinsic score assigned to the document.
. Results are ordered by timestamp. Newer ones are at the top.
. Results are ordered by timestamp. Older ones are at the top.
Lets you set a time interval within which the documents of the query results must fall. The time refers to the date of gathering.
Modify the weight of each slice of the index in order to better control the ranking mechanism (see the section about rendering). The following table describes all available slices with their default values.
Table 8.1. Sliceids
| Number | Name | Weight | Description |
|---|---|---|---|
| 1 | author | 1 | Author |
| 2 | keyword | 1 | Keywords associated to the document |
| 3 | abstract | 1 | Abstract |
| 4 | invisible | 1 | Almost invisible text |
| 5 | marginalNorm | 1 | Plain text with marginal relevance |
| 6 | marginalEmph | 2 | Emphasized text with marginal relevance |
| 7 | marginalLink | 1 | Links from marginal relevance text |
| 8 | - | - | - |
| 9 | marginalHeader | 3 | Header of a marginal relevance text |
| 10 | centralNorm | 2 | Plain text with central relevance |
| 11 | centralEmph | 4 | Emphasized text with central relevance |
| 12 | centralLink | 2 | Links from central relevance text |
| 13 | - | - | - |
| 14 | centralHeader | 6 | Header of central relevance text |
| 15 | title | 12 | Title |
| 16 - 31 | custom0 - customF | 12 | custom slice |
The rank of a document is calculated according to the following rule:
The Rank of a document is the absolute weight of the document multiplied by the weights of those slices that contain at least one term specified in the query string.
Values can be changed with the following conventions:
| Value | Effect |
|---|---|
| -1 | Set to the default value |
| 0 | Do not consider this slice in the rank calculation |
| n | Set the weight of the slice to n |
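The rule and the conventions above can be tried out with a small calculation. The following is a minimal sketch and not the ranking code of searchbox: the function names are hypothetical, the default weights are copied from Table 8.1 (custom slices are left out), and the sketch assumes that "multiplied by the weights" means the product of the weights of all matching slices.

```python
# Default weights copied from Table 8.1 (custom slices omitted).
DEFAULT_WEIGHTS = {
    "author": 1, "keyword": 1, "abstract": 1, "invisible": 1,
    "marginalNorm": 1, "marginalEmph": 2, "marginalLink": 1,
    "marginalHeader": 3, "centralNorm": 2, "centralEmph": 4,
    "centralLink": 2, "centralHeader": 6, "title": 12,
}


def effective_weight(slice_name, overrides):
    """Apply the conventions: -1 restores the default, n sets the weight."""
    value = overrides.get(slice_name, -1)
    return DEFAULT_WEIGHTS[slice_name] if value == -1 else value


def rank(document_weight, matching_slices, overrides=None):
    """Rank of a document whose matching_slices contain a query term."""
    overrides = overrides or {}
    result = document_weight
    for slice_name in matching_slices:
        weight = effective_weight(slice_name, overrides)
        if weight == 0:
            continue  # 0 means the slice is not considered at all
        result *= weight
    return result


if __name__ == "__main__":
    print(rank(1.0, ["title", "centralNorm"]))
    print(rank(1.0, ["title", "centralNorm"], {"title": 0}))
```

In the example a document matching the query in both the title and the plain central text gets the product of the two default weights, while setting the title weight to 0 leaves only the contribution of centralNorm.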
Selecting one of the query results in the Context box a short extract of the document related to the query terms is shown.
From the Browsing Tab of the Archive configuration it is possible to show any document listed as the result of a query in three possible ways:
. The document as saved into the cache
. The document shown live from the original source
. The internal searchbox XML format of the documents.
In order to access any cached document you need to give the viewer your authentication for searchbox (your username/password). Operating systems usually let you store the username/password in a keychain so that typing this information is needed the first time only.
The FFF is the XML format that searchbox uses internally to store the extracted content and metadata of all documents. Viewing the FFF is useful for debugging purposes.
Please note the address of the page shown in the web browser (in the case of an HTML document).
It is a URL that points to the platform where searchbox Engine is installed (localhost in this case).
Three different views of the FFF are available.
The full text extracted from the FFF of the selected document. This text is distributed over different slices by the rendering process, so by selecting and deselecting the appropriate checkboxes on the left side of the window you can see what is contained in each slice. By default all slices are shown.
Both standard and template metadata contained in the FFF representation of the selected document.
In the Browsing Tab of the Archive two buttons are used to manually add and remove documents from the current Archive.
The following window appears when you click on the button.
File: the complete path of the document that must be added to the Archive. It is filled in automatically when you browse the local file system with the button.
URL: the URL assigned to the document. If it is not specified the path of the file in the local file system is taken.
MIME: the MIMETYPE assigned to the document. It can be one of the following types:
Selecting the custom option, the textbox under the MIME field becomes ready to accept the custom string.
The Copy ID button in the Browsing Tab of the Archive configuration panel lets you copy to the clipboard the ID of the document currently selected in the list of results.
[10] If you use a User DSN you must define the DSN as the user searchbox runs as. Refer to the section called “searchbox process identity” for details.
[11] The order of the fields influences the order searchbox will use to fetch the rows from the db. Also note that URLs differing only in the order of the key are considered different urls anyway, i.e. they are crawled two times.