Chapter 13. Engine administration

Table of Contents

Import/Export
Windows
Unixes
Reindex
Windows
Unixes
Reset
Windows
Unixes
Optimize the index
Windows
Unixes
Check Status
Crash recovery
Datadir change
Global configuration parameters
searchbox process identity
Program and data location
Pidfile location (unix only)
Number of processes
Platform access
Proxy access
Notifications
Logging
Default User agent
Index optimization
Index flushes
DB Sync level
Internal HTTP Web Services
Enterprise Search Portal
Worker timeout
Minimum disk space
Global crawl control
Handling of <meta name="searchbox-xxxx"> tags in HTML documents

This chapter describes the basic searchbox Engine administration procedures.

All basic administration procedures must be performed from the command line interface by a system administrator with administrator permissions. The main searchboxd executable must be invoked with specific switches to perform the various actions. The procedure differs slightly between Windows and Unix systems.

Under Windows the searchboxd process runs as a system service, so it must be invoked through the Windows Service Manager.

Under Unixes, searchboxd is a command that can be invoked from a standard terminal. Under Red Hat Linux, for instance, the searchboxd command can be found in:

Import/Export

Sometimes you want to copy just the configuration between two searchbox installations. To do this, export the configuration from one system and then import it on the other.

Note

The import operation implies a reset.

Note

This is an uninterruptible operation.

The operations required to perform an export and/or an import differ depending on the operating system searchbox runs on.

Windows

To perform an export, open Windows Service Manager and open the property sheet for Searchbox Engine service. Stop the service if it is running by pressing the Stop button. Enter /export dir in the Start parameters text field and press the Start button. The searchbox service will stop automatically after the export is finished.

dir is a directory that will contain one file for each source, archive, collection, watch, plugin etc. configured on the system. The directory will be created by searchboxd and must not exist before the export.

To perform an import, open Windows Service Manager and open the property sheet for Searchbox Engine service. Stop the service if it is running by pressing the Stop button. Enter /import dir in the Start parameters text field and press the Start button. The searchbox service will stop automatically after the import is finished.

dir is a directory created by a previous export. Enclose it in double quotes if the path contains spaces.

Unixes

To perform an export, first stop searchboxd, then run:

# searchboxd --export=dir

Where dir is a directory that will contain one file for each source, archive, collection, watch, plugin etc. configured on the system. The directory will be created by searchboxd and must not exist before the export.

To perform an import, first stop searchboxd, then run:

# searchboxd --import=dir

Where dir is a directory created by a previous --export.
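Because the export directory must not exist beforehand, a wrapper script can check for it before invoking searchboxd. The following is a minimal Python sketch and not part of searchbox: the helper name build_export_command is hypothetical, and only the --export=dir switch comes from this chapter. It assumes searchboxd has already been stopped.

```python
import os

def build_export_command(export_dir):
    """Return the argument list for an export, refusing an existing target.

    Hypothetical helper: only the --export=dir switch is taken from the text.
    """
    if os.path.exists(export_dir):
        # searchboxd creates the directory itself, so it must not exist yet.
        raise FileExistsError(f"export directory already exists: {export_dir}")
    return ["searchboxd", f"--export={export_dir}"]
```

The returned list could then be handed to subprocess.run() by a script running with administrator permissions.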

Reindex

On some rare occasions (basically after a system crash or to migrate to a newer version of searchbox) you may want to rebuild searchbox indexes based on the documents stored in searchbox.

Note

Indexes for archives with DocumentCache = NOCACHE cannot be regenerated in this way. They will be empty after a reindex operation.

Note

Depending on the amount of data the reindex operation can be really time consuming.

Note

This is an uninterruptible operation.

The operations required to perform a reindex differ depending on the operating system searchbox runs on.

Windows

To perform a reindex, open Windows Service Manager and open the property sheet for Searchbox Server service. Stop the service if it is running by pressing the Stop button. Enter /reindex in the Start parameters text field and press the Start button. The searchbox service will stop automatically after the reindex is finished.

Unixes

To rebuild the indexes, first stop searchboxd, then run:

# searchboxd --reindex

Reset

The reset operation lets you remove data and indexes from searchbox while retaining your configuration i.e. sources, archives, collections and watches, as well as plugin configurations, users and so on.

Note

Obviously all the documents and the indexes will be lost.

Note

This is an uninterruptible operation.

The operations required to perform a reset differ depending on the operating system searchbox runs on.

Windows

To perform a reset, open Windows Service Manager and open the property sheet for Searchbox Server service. Stop the service if it is running by pressing the Stop button. Enter /reset in the Start parameters text field and press the Start button. The searchbox service will stop automatically after the reset is finished.

Unixes

First stop searchboxd, then run:

# searchboxd --reset

Optimize the index

searchbox is an "always on" system, which means that it usually performs all its internal optimization tasks concurrently with its normal service operations. An optimization can also be started manually, as follows.

Windows

To perform an optimization, open Windows Service Manager and open the property sheet for Searchbox Server service. Stop the service if it is running by pressing the Stop button. Enter /optimize in the Start parameters text field and press the Start button. The searchbox service will stop automatically after the optimization is finished.

Unixes

Run the following command:

# searchboxd --optimize

Check Status

On both Windows and Unixes, some information about searchbox status can be obtained from the File-Status... menu choice of the Control Panel.

Figure 13.1. The Server status window of Control Panel


The Current version parameter is the version of the searchbox Engine that the Control Panel is currently logged on to. This number is formatted as major.minor[.revision[.build]].
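As an illustration of the major.minor[.revision[.build]] format, the following Python sketch parses such a string so that two versions can be compared. It assumes every component is a plain decimal number, which the text does not state explicitly; parse_version is a hypothetical helper, not a searchbox function.

```python
import re

# major.minor are mandatory; revision and build are optional and nested.
VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+)(?:\.(\d+))?)?$")

def parse_version(text):
    """Return (major, minor, revision, build); missing parts become 0."""
    match = VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a major.minor[.revision[.build]] version: {text!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())
```

Tuples returned by this helper compare in the natural order, so parse_version("2.10") sorts after parse_version("2.9.1").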

The Available update parameter warns you if a new version of the Engine is available for download. This parameter is effective only if the automated check for new releases is activated in the focuseek.cfg configuration file.

The Status parameter is the current status of the searchbox Engine. It can be:

  • running. The searchbox Engine is currently running.

  • crawlstopped. The crawling activity of the searchbox Engine is stopped because of a global configuration setting. For more information, see the focuseek.cfg configuration file.

  • lowdiskspace. The crawling activity is currently stopped because the volume where the datadir is located does not have enough free space for the searchbox Engine to work properly. See the MINADDFREEDISK and MINFREEDISK parameters in the focuseek.cfg configuration file.

The remaining information concerns the Server load and the Activation key currently used by the searchbox Engine.

Key                             Value
number of documents             the number of documents stored in all searchbox archives
size of documents (MB)          the size of documents stored in all searchbox archives
number of archives              the number of archives
free disk spaces (MB)           the free disk space available in the volume where the datadir is located
OS type                         the operating system compatible with the current activation key
platform version                the searchbox Engine version compatible with the current activation key
expiration time                 when the current activation key expires
maximum number of documents     the maximum number of documents allowed by the current activation key
maximum size of documents (MB)  the maximum size of documents allowed by the current activation key
maximum number of archives      the maximum number of archives allowed by the current activation key
maximum number of users         the maximum number of users allowed by the current activation key
custom plugin type              Yes, if the current activation key supports installation of custom plugins
custom plugins                  Yes, if the current activation key supports configuration of custom plugins
historicization                 Yes, if the historicization feature is enabled
API                             Yes, if the SOAP API is available

More detailed status information can be obtained under Unixes by running the following command:

# searchboxd --status

Possible outputs are:

  • searchboxd not running: searchbox is not running

  • searchboxd(<pid>) starting: searchbox is starting; this is only a transitory state; soon it will be “running”[21]

  • searchboxd(<pid>) running: searchbox is running[21]

  • searchboxd(<pid>) stopping: searchbox is stopping; this is only a transitory state; soon it will be “not running”[21]

  • searchboxd(<pid>) uninterruptible operation: searchbox is performing an uninterruptible and possibly time-consuming operation, such as rebuilding the whole index. While such an operation is running, searchboxd cannot be stopped; it stops on its own when the operation is finished. Uninterruptible operations must be started explicitly by running searchboxd with a special switch, as described in the previous sections.
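A monitoring script can turn these output lines into structured values. The Python sketch below assumes that the output consists of exactly one of the strings listed above, with <pid> replaced by a decimal process id; parse_status is a hypothetical helper name and the exact whitespace of the real output is not confirmed by this chapter.

```python
import re

STATUS_RE = re.compile(
    r"^searchboxd\((?P<pid>\d+)\) "
    r"(?P<state>starting|running|stopping|uninterruptible operation)$"
)

def parse_status(line):
    """Return (pid, state) for one line of `searchboxd --status` output.

    pid is None when searchboxd is not running.
    """
    line = line.strip()
    if line == "searchboxd not running":
        return None, "not running"
    match = STATUS_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized status line: {line!r}")
    return int(match.group("pid")), match.group("state")
```

A script could use the returned state to wait until an uninterruptible operation has finished before starting the next maintenance step.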

Under Windows equivalent information can be obtained from the Services Administrative Tool.

Crash recovery

While we have extensively tested searchbox in a variety of conditions, software is never perfect and 100% bug free. We appreciate and encourage your feedback on searchbox issues, but we'd like to also suggest a few tips that might help you get back up to speed faster. Should searchbox crash, you can try the following corrective actions, in order of preference:

  • Try stopping and restarting the platform. This is the simplest corrective action and could resolve temporary problems such as component crashes caused by, for example, very low memory conditions. This action should not cause any data loss.

  • On Unix systems, if searchboxd crashes or the power is lost, a file might be left behind that prevents searchboxd from starting or stopping correctly. If you are absolutely sure that searchboxd is not running[22], look for the /opt/searchbox/data/runpid file. If it is there and, for the second time, you are absolutely certain that searchboxd is not running, remove it.

  • If you suspect that the index might be corrupt, for example because you get wrong query results or searchbox crashes while performing queries although other operations complete correctly, you may try to rebuild the index; see the "Reindex" section above. Be aware that rebuilding the index from scratch might be very time consuming.

  • Stop the platform and try to export and then import the configuration; see the system-specific sections above for details. Then restart the platform. You will lose all the indexes and all the cached documents, but you will retain your configuration data.
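The stale-file check in the second step can be sketched in Python as follows. This is only an illustration of the double check described above, not a searchbox tool: both function names are hypothetical, and the process list is expected to come from ps(1) (for example the output of ps -e -o comm=) rather than from searchboxd --status.

```python
import os

def searchboxd_is_listed(ps_output):
    """Return True if any line of `ps` output names a searchboxd process."""
    return any("searchboxd" in line for line in ps_output.splitlines())

def remove_stale_runfile(ps_output, runfile="/opt/searchbox/data/runpid"):
    """Remove the leftover file only when ps shows no searchboxd process.

    Returns True if a file was removed, False otherwise.
    """
    if searchboxd_is_listed(ps_output):
        return False  # searchboxd may still be alive: leave the file alone
    if os.path.exists(runfile):
        os.remove(runfile)
        return True
    return False
```

The substring test deliberately errs on the side of caution: any process line that mentions searchboxd keeps the file in place.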

Datadir change

All data generated by your searchbox installation are contained in the so-called datadir. The datadir is a folder created during the first launch of searchbox in the default location specified in the searchbox.cfg configuration file (see "Program and data location" in the next section). The datadir contains the physical state of your installed searchbox, so it is possible to have multiple datadirs and switch from one to another simply by shutting down searchbox and changing the DATADIR parameter in the focuseek.cfg configuration file.

Note

searchbox can manage only one datadir at a time.

Warning

If the datadir comes from another searchbox installation on a different server, make sure that its file permissions match those of a datadir generated by default (permissions are platform dependent).

Global configuration parameters

Some other global searchbox parameters are stored in a global configuration file called focuseek.cfg, which is usually placed in the config directory inside the program installation directory: c:\Program Files\focuseek\searchbox\config\focuseek.cfg on Windows, /opt/searchbox/lib/platform/config/searchbox.cfg on Linux and /Applications/searchbox/lib/platform/searchbox.cfg on OS X. On Linux and OS X its location can be changed by editing the searchbox_env file, located under /opt/searchbox/etc/ on Linux and under /Applications/searchbox/etc/searchbox.cfg on OS X.

Configuration file parameters affect the whole searchbox platform and are only used on startup. You can change the parameters at any time but you'll have to restart searchbox to enable them.

The following sections document the configuration file options. Not all options are documented: the undocumented options relate to internal platform coordination or to experimental features and should not be changed.

searchbox process identity

The User and Group configuration parameters determine the process identity. These are Unix-only options. searchbox needs to be run as root on startup, but for security purposes it switches to the specified user and group after the initial startup phase.

On Windows you configure the process identity through the Service Control Manager.

Program and data location

DEPLOYDIR and DATADIR point to the program installation location and the data file location. You don't usually need to change these, as they are set by the installation application.

Pidfile location (unix only)

searchbox records that it's running in a file called the pidfile whose location is specified in the PIDFILE parameter. It is set to /var/run/searchbox.pid and you should not change it unless you are fiddling with system startup/shutdown.

Number of processes

WORKERS is the number of processes that concurrently perform the gathering/parsing/rendering chain. The default value is 5, but it can be modified according to the size of your pool of sources and the available amount of RAM and CPU.

Platform access

These directives let you control how to access searchbox from external applications:

  • HTTP_PORT - This parameter has the form address:port. The searchbox integrated web server listens on the specified IP address and port. The specified IP address should be associated with a network interface card in the system. The special value * can be used as the IP address to have searchbox listen on all network interface cards. The default port is 2200. For example, 127.0.0.1:2200 opens searchbox only to applications running on the local computer, while *:2200 opens searchbox to requests from any network interface card.

  • HTTP_USER - This parameter holds the builtin user name. This user has administrator rights over all the searchbox objects.

  • HTTP_MD5PASS - This parameter holds the MD5 hash of the password for the user named after HTTP_USER.
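A value for HTTP_MD5PASS can be computed with any MD5 tool. The Python sketch below assumes that searchbox expects the plain, unsalted hexadecimal digest of the UTF-8 encoded password; this chapter does not specify the encoding or the digest format, so treat both as assumptions. md5_hex is a hypothetical helper name.

```python
import hashlib

def md5_hex(password):
    """Return the hexadecimal MD5 digest of a password (UTF-8 encoded)."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()
```

For example, md5_hex("hello") returns 5d41402abc4b2a76b9719d911017c592.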

Proxy access

If you're running searchbox from behind a proxy, the following configuration directives let you set the necessary parameters:

  • HTTP_PROXY - Use this parameter to specify the proxy server that searchbox should use to access the "external" Internet. This parameter should have the form host:port, for example proxy:80.

  • HTTP_PROXYUSERPWD - If the proxy has access control enabled, use this option to specify the credentials in the form user:pass.

  • HTTP_NOPROXY - This parameter lets you distinguish local servers from the "external" Internet. Local servers will be accessed without going through the proxy. You should specify local servers as a list of hostnames separated by spaces, such as www.our.net our.server.net. The specified hostnames are intended as suffixes, so our.net would match www.our.net ftp.our.net and all other third level servers.
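The suffix rule of HTTP_NOPROXY can be illustrated with a few lines of Python. The sketch assumes a plain, case-insensitive string suffix comparison, which is the simplest reading of the description above; under that reading a host such as your.net would also match the suffix our.net. Whether searchbox compares whole domain labels instead is not stated here. bypasses_proxy is a hypothetical helper name.

```python
def bypasses_proxy(hostname, http_noproxy):
    """Return True if hostname matches a suffix listed in HTTP_NOPROXY."""
    hostname = hostname.lower()
    for suffix in http_noproxy.lower().split():
        if hostname.endswith(suffix):
            return True  # local server: access it without the proxy
    return False
```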

Notifications

searchbox can send notifications by email or instant messages using external facilities:

  • SMTPServer - Outgoing mail server host:port

  • SMTPSS - yes if we should try SMTP/SSL

  • SMTPUser - SMTP authentication user

  • SMTPPasswd - SMTP authentication password

  • SMTPSecureAuth - If yes don't use SMTP insecure authentication methods

  • SMTPFrom - SMTP From address

  • SMTPReplyTo - SMTP Reply-to address

  • JabberServer - Jabber server, in the form host:port; the port is optional and defaults to 5222

  • JabberID - Jabber ID searchbox will use (name@domain or name@domain/resource); the default resource is searchbox

  • JabberPasswd - Jabber password for JabberID

  • JabberSASL - yes if the server uses SASL auth (most don't)

  • JabberNotificationResultsPerMessage - Number of notification results in each jabber message

  • JabberNotificationMessageDelay - Delay in seconds between two notification jabber messages

Logging

LOGCFG determines how much information will be logged to the files inside the log directory in the searchbox data directory. It is a space-separated list of facility:level pairs. Facility names are:

API               API calls
FFF               FFF handling
HTTPSERVER        internal HTTP operations
IPCMARSHAL        ipc marshallers
IPCSOCK           ipc sockets
FETCHEDDOCS       fetched documents
KEY               Key checks
LOG               log system itself
PLUGINS           Messages from plugins
PLUGINSYSTEM      The plugin framework
WORKER            other worker activities
WORKERFETCHERS    fetchers
WORKERPARSERS     parsers
WORKERSCHEDULING  scheduling of workers

Possible level values, in decreasing order of verbosity, are BORING, DEBUG, INFO, WARNING, ERROR, CRITICAL. The default for all facilities is INFO, and it is generally not necessary to change it, except possibly for debugging purposes.
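To make the facility:level format concrete, the following Python sketch splits a LOGCFG value into its pairs and checks each level against the list above. It treats facility names as opaque strings and accepts levels in any letter case; both choices are assumptions, and parse_logcfg is a hypothetical helper rather than part of searchbox.

```python
# Levels in decreasing order of verbosity, as listed in this section.
LEVELS = ["BORING", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]

def parse_logcfg(value):
    """Parse a LOGCFG string into a {facility: level} dictionary."""
    config = {}
    for pair in value.split():
        facility, sep, level = pair.partition(":")
        if not sep or not facility or level.upper() not in LEVELS:
            raise ValueError(f"bad facility:level pair: {pair!r}")
        config[facility] = level.upper()
    return config
```

For instance, the value API:DEBUG WORKERFETCHERS:WARNING parses into two entries, one per facility.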

Default User agent

focuseek searchbox is a well-behaved web spider and implements the standard for robots exclusion (http://www.robotstxt.org/wc/norobots.html), usually referred to as the robots.txt standard. The robots.txt standard lets webmasters disable spidering of portions of their site by placing a suitably formatted robots.txt file in the root directory of their site. The standard requires each spider to identify itself with a user agent string. This string is also used to keep some robots from spidering the site while still allowing others to access it. See the standard for the details.

You can specify the user agent used by searchbox with the UserAgent parameter in the configuration file. It defaults to focuseekbot. This parameter can be overridden by the specific configuration of each Archive (see the Control Panel documentation).

The UserAgent parameter is also used to identify the spider in the HTTP User-Agent request header. focuseek searchbox uses the following HTTP user agent:

Mozilla/4.0 (compatible; agent)

Where agent is the value of the UserAgent parameter in the configuration file. For further details on the HTTP user agent see for example http://www.mozilla.org/build/user-agent-strings.html
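The effect of a robots.txt file on a given user agent can be tried out with Python's standard urllib.robotparser module. The sketch below is only an illustration: it uses Python's parser, not the one built into searchbox, and the robots.txt content and example.com URLs are invented for the example. It also shows how the HTTP User-Agent value is derived from the UserAgent parameter.

```python
from urllib.robotparser import RobotFileParser

def http_user_agent(agent):
    """Build the HTTP User-Agent header value described above."""
    return f"Mozilla/4.0 (compatible; {agent})"

# Example robots.txt a webmaster might publish (hypothetical content).
ROBOTS_TXT = """\
User-agent: focuseekbot
Disallow: /private/

User-agent: *
Disallow:
"""

def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets the given agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With this file, focuseekbot is kept out of /private/ while other robots remain unrestricted.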

Index optimization

Index optimization occurs daily. Since it is a very disk- and CPU-intensive operation, you can choose the hour at which it occurs. The OptimizeAt parameter holds the GMT hour at which index optimization will occur. You can disable index optimization by setting the OptimizeAt parameter to -1.

When the optimization is enabled some other parameters can be tuned:

  • MERGEFACTOR - A number that determines how often the indexer merges index segments when adding documents. Merging takes time, so a low merge factor (resulting in many merges per number of documents added) slows down indexing. However, a low value keeps the number of index segments low, which speeds up searching and causes the indexer to use fewer file handles.

  • MINMERGEDOCS - (default value is 10) Determines how often segments are created (how many documents that are added before they are «assembled» into a segment).

  • MAXMERGEDOCS - The maximum number of documents that can be stored in an index segment. When a segment reaches this value it will not grow any larger. As a result there may be more segments than expected from looking at the merge factor.

  • MAXCLAUSECOUNT - The maximum number of terms allowed in AND/OR expressions in queries

  • MAXRAM - The maximum amount of RAM that the indexing service process can use

  • MAXFIELDLENGTH - The maximum number of tokens used to generate the index entry for a document. So for full text indexing, only this number of tokens will be considered.

  • MAXWORDLENGTH - The maximum acceptable length for tokens used to generate the index entry for a document. So for full text indexing, long words will be truncated to this value.
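The combined effect of MAXFIELDLENGTH and MAXWORDLENGTH can be pictured with the following Python sketch. It splits text on whitespace, which is an assumption made purely for illustration; the tokenizer actually used by the indexer is not described in this chapter, and limit_tokens is a hypothetical helper name.

```python
def limit_tokens(text, max_field_length, max_word_length):
    """Apply MAXFIELDLENGTH and MAXWORDLENGTH style limits to a text.

    Whitespace tokenization is an assumption made for illustration.
    """
    # Only the first max_field_length tokens are considered for indexing.
    tokens = text.split()[:max_field_length]
    # Each remaining token is truncated to max_word_length characters.
    return [token[:max_word_length] for token in tokens]
```

With a field limit of 3 and a word limit of 4, the text "alpha beta gamma delta" would contribute the tokens alph, beta and gamm.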

Index flushes

For performance reasons, documents are not added to the index one at a time, but in batches. However, an index flush is forced from time to time. The FlushLatency parameter holds the number of seconds between index flushes.

DB Sync level

The SyncLevel parameter determines how often the internal database flushes its data to disk.

SyncLevel value  Description
0                No sync, DB will be corrupted after a power failure
1                Normal sync (default), very low probability of corruption after power failure
2                Full sync, maximum safety but slower performance

Internal HTTP Web Services

searchbox has its own internal HTTP server, used to implement the Web Service interface. The following parameters determine its behaviour:

  • HTTP_PORT - The HTTP port number used by searchbox to expose itself as Web Service (configurable from the Control Panel)

  • HTTP_USER - The administrator username

  • HTTP_MD5PASS - The MD5 hash of the administrator password (configurable from the Control Panel)

  • HTTP_THREADS - Number of available threads

    Warning

    HTTP_THREADS has a large impact on the memory footprint, because a query process is created for each thread.

  • HTTP_URL_PREFIX - The public address for the HTTP portal server. Defaults to http://<hostname>:<HTTP_PORT>/

Enterprise Search Portal

If you want to disable access to searchbox through the internal Enterprise Search Portal, set the PORTAL_ENABLED parameter to NO.

Worker timeout

The Worker is the searchbox component responsible for all processing activities on every single document. Sometimes a worker process hangs because it cannot process a document. The WORKER_TIMEOUT parameter sets the number of seconds that searchbox waits before killing a hung process.

Minimum disk space

Gathering a large document set can lead to disk space exhaustion. Searchbox will check for free disk space and halt the gathering process when disk free space falls below the value set in the MINADDFREEDISK parameter (in MB).
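The same kind of threshold check can be reproduced outside searchbox, for example in a monitoring script. The Python sketch below compares the free space of the volume holding a directory with a minimum expressed in MB. It assumes that 1 MB means 1024 * 1024 bytes, which this chapter does not specify; both function names are hypothetical.

```python
import shutil

def gathering_allowed(free_bytes, min_free_mb):
    """Return True while free space is at or above the configured minimum."""
    return free_bytes >= min_free_mb * 1024 * 1024

def check_datadir(datadir, min_free_mb):
    """Check the volume that holds datadir against a MINADDFREEDISK-like value."""
    return gathering_allowed(shutil.disk_usage(datadir).free, min_free_mb)
```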

Global crawl control

You can allow the gathering process to run only during a certain period of the day. For example, you could allow gathering only during the night. To do this, you can change the CRAWLSTART and CRAWLSTOP parameters, which hold the start and stop hours (GMT). You can completely stop the gathering by setting one of these parameters to -1.

Note that crawls are put in a sleep state outside this time interval and resume from the point where they left off. This is different from stopping and restarting a crawl: in that case gathering starts again from the seeds.
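The time window defined by CRAWLSTART and CRAWLSTOP can be sketched in Python as follows. The sketch assumes that the start hour is inclusive, the stop hour is exclusive, and that a window such as 22 to 6 wraps past midnight; none of these details is confirmed by this chapter, and crawl_allowed is a hypothetical helper name.

```python
def crawl_allowed(hour_gmt, crawl_start, crawl_stop):
    """Return True if gathering may run at the given GMT hour (0-23)."""
    if crawl_start == -1 or crawl_stop == -1:
        return False  # gathering is completely stopped
    if crawl_start <= crawl_stop:
        return crawl_start <= hour_gmt < crawl_stop
    # The window wraps past midnight, for example 22 -> 6.
    return hour_gmt >= crawl_start or hour_gmt < crawl_stop
```

The current GMT hour can be obtained with datetime.now(timezone.utc).hour from the standard datetime module.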

Handling of <meta name="searchbox-xxxx"> tags in HTML documents

searchbox can process suitably formatted metadata placed in HTML files. To enable this feature you must set the USE_SEARCHBOX_METAS parameter to 1. The default value is 0. See the section called “Collecting metadata embedded in HTML documents” for further explanations.



[21] <pid> is the process id of the running searchbox

[22] Beware: do not trust searchboxd --status to report the correct status when doing this kind of maintenance. Use ps(1) and look for searchboxd in the list of running processes.